Do You Need a PhD to Analyze Big Data?

It has been said that you need a doctorate to adequately run Hadoop and the many attendant programs for Big Data analysis.  Is this truth or fiction? 


The Hadoop environment is complex. Hadoop itself is an open-source program from Apache, the organization that brings you a wide range of internet-related open-source software projects. As of May 2012, Hadoop is at release 1.0.2. There is still a lot to learn about Hadoop, and the code base is still maturing. As with many open-source projects, a lot of very talented people are working on it, and it will gain more function and stability over time. As you might imagine, it has a lot of Java and Unix underpinnings, and a number of supporting programs are required to build a real Big Data solution on Hadoop.


At Big Data University, an IBM education effort on Hadoop, you get FREE hands-on experience with Hadoop and some related software. In Hadoop Fundamentals I, you can download the Hadoop code or use Hadoop in the cloud on Amazon. The coursework takes you on an adventure, with narration for each module and hands-on labs. The objective is to get you comfortable with Hadoop and Big Data frameworks.


The introductory Hadoop class gets into the IBM version of Hadoop, called BigInsights, as well as HDFS, Pig, Hive, MapReduce, Jaql, and Flume. Oh, and by the way, depending on your needs, some user-developed code may be required to glue the parts together. Some assembly is required.


IBM InfoSphere BigInsights, the distribution used for the class, packages a stable configuration of Hadoop. The software is free, and IBM can provide support for a price, of course. You get a stable Hadoop version with the IBM spin, but companies like Cloudera, as well as Apache itself, also provide Hadoop distributions.


To actually build a system that does something with Hadoop, you'll need some other components. HDFS is the file system that supports Hadoop. The IBM course takes you through how Hadoop makes use of a file system designed to support parallel processing: files are split into large blocks, and each block is replicated across nodes of the cluster. You also get an HDFS lab as part of the course.
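To make the HDFS idea concrete, here is a minimal Python sketch of the two core moves: splitting a file into fixed-size blocks (64 MB by default in Hadoop 1.x; a tiny block size is used here so the example runs instantly) and assigning each block to several nodes. The node names and round-robin placement are illustrative only; real HDFS placement is rack-aware.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    # HDFS stores a file as fixed-size blocks (64 MB by default in
    # Hadoop 1.x); the final block may be smaller than block_size
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    # Toy placement: give each block `replication` distinct nodes,
    # round-robin; real HDFS is rack-aware, this only shows the idea
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 200, block_size=64)
print(len(blocks))   # 4 blocks: 64 + 64 + 64 + 8 bytes
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
print(placement[0])  # ['node1', 'node2', 'node3']
```

Because every block lives on several nodes, a MapReduce job can run a task next to any copy of the data, which is what makes the parallel processing in the next section possible.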


Next, the course gets into MapReduce: the methodology of parallel processing, the logic of how data flows, and how the pieces fit together. This includes merge sort, data types, fault tolerance, and scheduling/task execution. Then you get a lab to see if you really got it.
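The data flow described above can be sketched in plain Python. This is not the Hadoop API, just a toy simulation of the three phases, map, shuffle/sort, and reduce, using the classic word-count example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit a (word, 1) pair for every word of every input line
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_sort(pairs):
    # Shuffle/sort: Hadoop groups all values for the same key together;
    # here a sort followed by groupby stands in for that step
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, [value for _, value in group])

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    for word, counts in grouped:
        yield (word, sum(counts))

lines = ["big data needs big tools", "hadoop handles big data"]
result = dict(reduce_phase(shuffle_sort(map_phase(lines))))
print(result["big"])   # 3
print(result["data"])  # 2
```

In real Hadoop, the map and reduce functions run on many machines in parallel, with the framework handling the shuffle, fault tolerance, and scheduling that the course covers.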


Now that you have the big picture on Big Data, what do you do with it? Creating a MapReduce job directly could require a lot of Java code. Instead, you use high-level tools to create a MapReduce job and thus reduce the effort and time required. That's where Pig, Hive, and Jaql come in. Pig originated at Yahoo and provides a great way to handle unstructured data, since there is no fixed schema; for unstructured data, think of analyzing Twitter comments. Hive was developed at Facebook and provides data warehouse functions for Big Data. Its query language is declarative: you specify the desired output, and the engine figures out how to get it. Jaql was developed by IBM and is another schema-optional language for data flow analysis.
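To show what "declarative" means here, the sketch below uses Python's built-in sqlite3 module as a stand-in for a Hive warehouse. This is not Hive, but the SQL-like query would look almost identical in HiveQL, and the point carries over: you state what you want (hashtag counts), and the engine plans how to compute it, with no hand-written map or reduce code.

```python
import sqlite3

# An in-memory SQLite table stands in for a Hive table here; the
# table and column names are made up for the example
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (user TEXT, hashtag TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?, ?)",
                 [("ann", "bigdata"), ("bob", "hadoop"),
                  ("cat", "bigdata"), ("dan", "bigdata")])

# Declarative: describe the desired output, not the execution steps
rows = conn.execute(
    "SELECT hashtag, COUNT(*) FROM tweets "
    "GROUP BY hashtag ORDER BY COUNT(*) DESC").fetchall()
print(rows)  # [('bigdata', 3), ('hadoop', 1)]
```

Under the covers, Hive compiles a query like this into MapReduce jobs, which is exactly the Java code it saves you from writing.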


Finally, you might want to move data into and around your Hadoop cluster. Flume was developed by Cloudera for this purpose. With Flume you collect, aggregate, and move data, typically streams of log events, into a store like HDFS.


Let’s return to the original question in the title of this article. Do you need a PhD to run a Big Data job? Probably not. But it wouldn’t hurt. In any event, you will need to be a highly competent programmer to get good results. Hadoop solutions are best treated as an iterative process, so there will be some experimentation. A combination of programming skill and analytics methodology will produce the best results. So if not a PhD, a well-accomplished computer science graduate would seem to be the right person to enable a Big Data solution.


About Big Data Perspectives

Erik Ottem has over 25 years of technology experience with IBM, Seagate, Gadzoox Networks and Agilent. Observations and comments about Big Data are presented for your review.

Responses to Do You Need a PhD to Analyze Big Data?

  1. Rishabh,
    I believe that in the early days of Big Data, like we see today, the tools are poor, and a skilled operator will make a significant difference in the quality of the output. I think PhDs have the potential to provide more sophisticated questions and structure appropriate analysis to uncover new perspectives on the given problem. As time progresses and tools improve, the better question will trump the ability to execute and produce an answer.
