The conference revolved around four clusters:
- How quickly can you get the data into your system (ingest)
- How fast can you show the results
- It's all about presentation (charts)
- Big Data doesn't mean Hadoop
How Quickly Can You Get Data
The presentation that left me mesmerized was the one on Spark! I can't wait to use it. It's a very compelling product, and it's now backed by Cloudera. With Spark you can do the following:
- Get a compute engine for Hadoop data - no need to reinvent the wheel
- Speed up! Up to 100x faster than the MapReduce engine (in memory)
- Sophistication: get access to a library of advanced, ready-made algorithms
- A big community behind it; one of the most popular Big Data open-source projects (followed by Hadoop)
- Learning from the big guys - Yahoo!, Conviva, and Cloudera are using it
Not to mention that it comes integrated with an analytics suite (Shark), large-scale graph processing (Bagel), and real-time analysis (Spark Streaming). This is nice because rather than juggling Hive, Hadoop, Mahout, and Storm, I only have to learn one programming paradigm.
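To make the "one programming paradigm" point concrete, here's a toy, pure-Python sketch of the classic map/reduce word count - the kind of job Spark expresses in a few functional lines. The function names (map_phase, reduce_phase) are mine for illustration, not Spark's API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every line
    return chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

def reduce_phase(pairs):
    # Reducer: sum the counts per key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big ideas", "big data"]
result = reduce_phase(map_phase(lines))
print(result)  # {'big': 3, 'data': 2, 'ideas': 1}
```

In Spark the same shape appears as chained transformations (flatMap, map, reduceByKey) over a distributed dataset, which is why learning the one paradigm covers so many use cases.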
How Fast Can You Show The Results
Twitter explained how they monitor millions of time series (5,700+ tweets per second). The presentation was superb. I found out that the stack they're using, named "Observability", is composed of Finagle, Cassandra, and a query language and execution engine based on Scala. Although it's still a work in progress, the stack is about three years old. I hope they open-source the stack so I can get more context on how they monitor a large distributed system.
Another very interesting product was Google's BigQuery. This was one of those presentations that we (my team and I) stumbled upon by accident. It showed how to use Google's toolkit - Freebase, Maps, and BigQuery - to do analytics.
It's All About Context, Results, or Charts
Another company that impressed me was Trifacta. With their tool you can clean data, see the model (graph), and iterate again whenever you spot patterns (or don't). The tool is targeted at data scientists, data wranglers, and data analysts. It's a great tool to mine data, but most importantly, you can clean the data and show the results with relative ease.
IPython: This rekindled my interest in Python. IPython's notebooks are great for data scientists. You can get code, text, and graphics all in one page, so it's the perfect tool to show quick results. Not that Python wasn't already a popular language for data scientists: the NumPy library provides a solid MATLAB-like matrix data structure, with efficient matrix and vector operations, and the ecosystem offers other great libraries like SciPy and Pandas.
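NumPy's appeal is that an operation like a matrix-vector product becomes one vectorized call instead of explicit loops. As a rough illustration of what NumPy replaces under the hood (with optimized C), here is the naive pure-Python loop version - no NumPy required to run it:

```python
def matvec(A, x):
    # Naive matrix-vector product: each output entry is the dot product
    # of one row of A with x. NumPy does this as a single `A @ x` call.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1, 2],
     [3, 4]]
x = [5, 6]
print(matvec(A, x))  # [1*5 + 2*6, 3*5 + 4*6] = [17, 39]
```

In a notebook, the NumPy version of this (plus a plot of the result) fits on the same page as the explanation, which is exactly the "code, text, and graphics in one place" appeal.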
Big Data != Hadoop
Two topics that opened my eyes were Mesos and YARN. Mesos, which Twitter uses to manage its clusters, is similar to YARN (Yet Another Resource Negotiator). Hadoop 2.0, with YARN, is becoming more of an environment and operating system, not just a MapReduce engine. With YARN, the JobTracker is gone; the ResourceManager (RM) takes over its scheduling duties. The RM is a scheduler - it allocates resources based on a pluggable scheduling algorithm. Since per-application masters handle the monitoring of their own applications, the RM limits itself strictly to arbitrating available resources.
One of our favorites (mine and two of my buddies') was the Netflix Data Platform talk by Kurt Brown. A different and great presentation: rather than dwelling on the technology side, they explained how their culture is intertwined with their technology stack and decisions. For example, they talked about the reasons for using "the cloud". There are the obvious ones: it's cheaper, much more flexible (growth, a better place to do tests/spikes), and having multiple data centers is definitely a plus. Also, Amazon and RackSpace have great services such as SQL, EMR, and S3. But the main reason is "focus". They are focused on getting movies and growing their audience rather than on the "plumbing". They expressed their commitment to open-source software (OSS), and mentioned the great talent they can attract and how they can "manage their own destiny" by following these principles and using these tools.
Netflix explained their philosophy and how it's the "soul" of their decisions (technical and business). For example, they keep keyboards, mice, and other peripherals in vending machines (they are free), so that everyone knows to "act in Netflix's best interest". Furthermore, every decision or project needs to answer a basic question: "what value are you adding?". They apply the rule "accept that things will break", and because of this, they build safety nets around their systems. Again, it was a very nice and interesting presentation.
I really enjoyed the conference. I also just purchased the videos, which I highly recommend! Over the next few months, I'm going to try to learn some of these tools and present them at the Miami JVM Meetup. Hopefully I'll get to see you there, or better yet, at Strata 2015. If you're going to either of these events, let's meet up and share a beer...or two, and discuss Big Data. I promise that my eyes will get dilated.