The Hadoop Summit for 2013 has just concluded in San Jose. A few themes seemed to recur throughout the two-day summit, which drew over 2,500 people. The overall story is the continued progress in taking Hadoop out of the experimental and fringe-case environment and moving it into the enterprise with all the appropriate controls, function and management. A related objective is to have 50% of the world’s data on Hadoop within five years.
The latest Hadoop release, 2.0, is known as YARN (Yet Another Resource Negotiator). To be a little more precise, Hadoop itself is still below version 1.0 at release 0.23, but MapReduce is now version 2.0, or MRv2. The new MRv2 release addresses some of MapReduce’s long-known problems, such as security and scheduling limitations. Hadoop’s JobTracker, which combined resource management and job scheduling, has been re-engineered to provide more control, with a global ResourceManager and an application-focused ApplicationMaster. The new YARN APIs are backward compatible with the previous version after a recompile. Good news, of course. You can get more details of the new Hadoop release at the Apache site, hadoop.apache.org.
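To make that backward-compatibility point concrete, here is the kind of job the claim covers: a standard word count written against the org.apache.hadoop.mapreduce API. Nothing in the job code refers to YARN itself, so on an MRv2 cluster it should only need a recompile and resubmission. This is a generic sketch, not code shown at the summit; the input and output paths are whatever you pass on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The canonical MapReduce word count, written against the newer
// org.apache.hadoop.mapreduce API. The job logic is unaware of whether the
// cluster schedules it with the old JobTracker or with YARN's
// ResourceManager/ApplicationMaster split.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts for each word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```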
The other themes at the Hadoop Summit included in-memory computing, DataLakes, the 80% rule, and the role of open source in commercial products.
Hadoop has traditionally been a batch system, but enterprise applications demand interactive capability, and Hadoop is moving in that direction. It doesn’t stop there, though. The step beyond interactive processing is stream processing with in-memory computing. In-memory computing is becoming more popular as the cost of memory plummets and people increasingly look for “real-time” response from MapReduce-related products like Hadoop. The leading player in in-memory computing is SAP’s HANA, but there are several alternatives. In-memory processing provides blazing speed, but at higher cost than a traditional paging database that moves data in and out of rotating disk drives. Performance can be enhanced by the use of flash memory, but that may still not be enough. In-memory typically has the best performance, and several vendors showing at the conference, including Qubole, Kognitio (which predates Hadoop by quite a bit), and DataTorrent, were touting the benefits of their in-memory solutions. They provide a great performance boost, if that’s what your application needs.
DataLakes came up in the kickoff as a place to put your data until you figure out what to do with it. I immediately thought of data warehouses, but this is different. In a data warehouse you usually need to create a schema and scrub the data before it goes into the warehouse so you can process it more efficiently. The idea of a DataLake is to put the data in as-is and figure out the schema as you do the processing. A number of people I spoke with are still scratching their heads about the details of how this might work, but the concept has some merit.
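Here is a toy sketch of how I read the idea, sometimes called schema-on-read: raw delimited events are stored untouched, and a schema is imposed only when a job reads them for a particular question. The “timestamp|userId|action” layout and the file-based “lake” are assumptions of mine for illustration, not anything presented at the summit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Schema-on-read sketch: the "lake" is just a file of raw lines (args[0]),
// stored exactly as they arrived, with no upfront schema or scrubbing.
public class DataLakeRead {

  public static void main(String[] args) throws IOException {
    List<String> rawLines =
        Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);

    // The schema is applied only now, for this particular question:
    // count "purchase" actions, assuming fields are timestamp|userId|action.
    // A different job could interpret the same raw lines under a different schema.
    long purchases = 0;
    for (String line : rawLines) {
      String[] fields = line.split("\\|", -1);
      if (fields.length >= 3 && "purchase".equals(fields[2])) {
        purchases++;
      }
    }
    System.out.println("purchases: " + purchases);
  }
}
```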
The 80% rule, or Pareto Principle, refers to 80% of the results coming from 20% of the work, be it customers, products or whatever. This is how I view many of the application-specific products for Big Data. Given the shortage of data scientists, creating products and platforms for people with more general skills provides 80% of the benefit of Big Data with only 20% of the skills required. I spoke with the guys at Talend, and that is clearly their approach: they offer specific solutions for a few application areas, aimed at analyst-level skills, to address the fat part of the market.
Finally, there remains tension between open source and proprietary products. There are examples of open source succeeding as a mainstream product, and Linux comes to mind as the poster child for the movement, but most open source projects are less mainstream. Commercial companies need to differentiate their products to justify their existence. The push for Hadoop to be the next real success story for open source is pretty exciting. Multiple participants I spoke with saw open source as the best way to innovate: it provides a far wider pool of talent to draw on, and has enough rigor to provide a base that other vendors can leverage for proprietary applications. The excitement at Hadoop Summit around moving this platform into the enterprise is audacious, and the promise of open source software seems to be coming true. Sometimes dreams do come true.