Big Data Storage

The fundamental hardware view of Big Data, with its Open Source mindset, is for cheap commodity storage and servers.   A major tenet of the Big Data movement is to keep things inexpensive both for software and hardware.  The desire for this is understandable, with the size of Big Data implementations, costs can easily become scary. 

From a hardware view of Big Data, the baseline theory is that you might have blade servers in a rack with a few SATA hard disc drives directly attached to the servers.  These compute nodes would then be replicated to meet the needs of the job.  With the cost of terabyte and larger hard drives at a relatively low cost, multi-terabyte configurations can be implemented without the huge cost of just a few years ago.  The server configurations tend to have large amounts of main memory (maybe 24-48 GB of RAM) with multiple cores (like 4, 6, or 8).  This benefits a parallel processing job that is the typical output of a MapReduce architecture.  The ratio of storage to compute power depends on the demands of the job.

Experienced systems administrators will wonder what is being done to protect the integrity of the system.  Hadoop/MapReduce software includes clustering, redundancy and failover options are standard in this environment.  The common wisdom is that RAID architectures are unneeded since nodes are redundant in the typical Hadoop configuration.  Similarly redundant power on the servers is not required since entire nodes can be replicated, so there is no need to spend money on extra hardware.

Consideration should be paid to the kind of storage placed in the clusters since not all clusters are created equal.  Name Nodes provide a control point for multiple Data Nodes, and benefit from additional attention.  Similarly Job Tracker nodes provide control for multiple Task Tracker nodes and should get additional attention.  These high value nodes might benefit from better quality storage like higher reliability HDDs or SSD to improve uptime. 

For more complex configurations there might be a hierarchy of storage.  SATA hard drives for Data Nodes and Task Tracker Nodes.  SSD for Name Nodes and Job Tracker Nodes.  Tape still has value when used as an archive medium for these jobs since future analysis might require a longer historical picture so the data would need to be reloaded to active media like HDDs or SSDs. 

At the Hadoop Summit there were multiple traditional storage vendors that were making a case for using their non-commodity storage.  The crux of their argument is that if your data is important enough to use, it ought to get their levels of reliability and availability.  Today’s advanced storage arrays were created to solve a number of problems of direct attach storage.  Problems such as expandability, reliability, availability, throughput and backup are solved in today’s storage arrays. 

Additionally, VMware has created a case on virtualization to better share resources, like storage to manage these large clusters more efficiently. 

The bottom line is that the hardware configuration that is employed has to fit the needs of the job.  Knowing the size, complexity, importance of the job will help design an appropriate platform to execute your Hadoop job.  Some trial and error in these early days of Hadoop deployments is common, so keep an open mind!