Do You Need a PhD to Analyze Big Data?

It has been said that you need a doctorate to adequately run Hadoop and the many attendant programs for Big Data analysis.  Is this truth or fiction? 


The Hadoop environment is complex. Hadoop itself is an open-source project from Apache, the people who bring you a wide range of Internet-related open-source software. As of May 2012, Hadoop is at release 1.0.2. There is still a lot to learn about Hadoop, and the code is not yet fully mature. As with many open-source projects, a lot of very talented people are working on it, and it will gain function and stability over time. As you might imagine, it leans heavily on Java and Unix, and a real Big Data solution based on Hadoop requires a number of supporting programs.


At Big Data University, an IBM™ education effort on Hadoop, you get FREE hands-on experience with Hadoop and some related software. In Hadoop Fundamentals I, you can get Hadoop either by downloading the code or by using Hadoop in the cloud on Amazon™. The coursework takes you on an adventure, with some great narration for each module and hands-on labs. The objective is to get you comfortable with Hadoop and Big Data frameworks.


The introductory Hadoop class covers the IBM version of Hadoop, called BigInsights, as well as HDFS, Pig, Hive, MapReduce, Jaql, and Flume. Oh, by the way, depending on your needs, you might have to write some code of your own to glue the parts together. Some assembly is required.


IBM InfoSphere BigInsights™ is the stable configuration of Hadoop that IBM supplies for the class. The software is free, and IBM can provide support for a price, of course. You get a stable Hadoop version with the IBM spin, but Hadoop is also available from other sources, such as Cloudera or the Apache project itself.


To actually build a system that does something with Hadoop, you’ll need some other components. HDFS, the Hadoop Distributed File System, is the file system underneath Hadoop. The IBM course takes you through how Hadoop makes use of a file system designed to support parallel processing. You also get an HDFS lab as part of the course.
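
To make the idea of a file system built for parallel processing a little more concrete, here is a minimal sketch in Python. It is not HDFS code. It only illustrates the general idea that a large file is cut into fixed-size blocks and each block is copied to several machines, so that many tasks can read different blocks at the same time. The function names, the node names, and the simple round-robin placement are all assumptions made for illustration; real HDFS placement is more sophisticated. The 64 MB block size and three copies per block are used here only as example values.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy, chosen only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB, 64 * MB)  # a 200 MB file, 64 MB blocks
placement = place_replicas(
    len(blocks), ["node1", "node2", "node3", "node4"], replication=3
)
```

With these example values the file ends up as four blocks, the last one shorter than the rest, and every block lives on three of the four nodes. Losing one node therefore loses no data, and up to four tasks can each work on a different block in parallel.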


Next the course gets into MapReduce: the approach to parallel processing, the logic of how data flows, and how the pieces fit together. This includes merge sort, data types, fault tolerance, and scheduling and task execution. Then you get a lab to see if you really got it.
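
The data flow in MapReduce is easier to see in a small example. The sketch below is plain Python running on one machine, not Hadoop code, and the function names are my own. It walks through the three steps for the classic word count: a map step that emits (word, 1) pairs, a shuffle-and-sort step that groups the pairs by word, and a reduce step that adds up each group.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group the values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # On a cluster, different lines would go to different mapper tasks;
    # here everything runs in a single loop on one machine.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle_and_sort(mapped))


counts = word_count(["big data is big", "data flows"])
```

Because each mapper looks at only its own line and each reducer at only its own word, the work can be spread over many machines. That independence is what the framework relies on when it schedules tasks and reruns the ones that fail.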


Now that you have the big picture on Big Data, what do you do with it? Writing a MapReduce job by hand could require a lot of Java code. High-level tools can create the MapReduce job for you and so reduce the effort and time required. That’s where Pig, Hive, and Jaql come in. Pig originated at Yahoo™ and provides a great way to handle unstructured data, since it does not require a fixed schema. For an example of unstructured data, think of analyzing Twitter comments. Hive was developed by Facebook™ and provides data warehouse functions for Big Data. Its SQL-like query language is declarative, so you specify the desired output and Hive figures out a way to get it. Jaql was developed by IBM and is another schema-optional language for data flow analysis.
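
To show what "declarative" means in practice, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. This is not Hive, and the table and column names are made up for illustration. The point is only the style: the query states which result is wanted (views per page), and the engine decides how to group and count. With Hive, a query written in this style would be turned into MapReduce jobs rather than run against a local database.

```python
import sqlite3

# In-memory database standing in for a data warehouse table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("ann", "/home"), ("bob", "/home"), ("ann", "/about")],
)

# Declarative: describe the desired output, not the steps to compute it.
rows = conn.execute(
    "SELECT page, COUNT(*) AS views "
    "FROM page_views GROUP BY page ORDER BY page"
).fetchall()
conn.close()
```

Compare this with a hand-written word count in MapReduce style: the grouping and counting logic that has to be spelled out there is reduced here to a single GROUP BY clause.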


Finally, you might want to move data into and around your Hadoop cluster. Flume was developed by Cloudera™ for this purpose. With Flume you collect, aggregate, and move data to a store such as HDFS.


Let’s return to the question in the title of this article. Do you need a PhD to run a Big Data job? Probably not. But it wouldn’t hurt. In any event, you will need to be a highly competent programmer to get good results. Hadoop solutions are best approached as an iterative process, so expect some experimentation. A combination of skills, including programming and analytics methods, would produce the best results. So if not a PhD, a really accomplished computer science graduate would seem to be the right person to deliver a Big Data solution.