Non-linear data processing, as found in Hadoop and other parallel processing models, works quite differently from traditional linear processing. The hardware to support parallel processing is a significant departure from traditional models, and when setting up servers, networking and storage to handle a large job, the defaults often don’t do the job. Intel has written an excellent paper on the topic named “Optimizing Hadoop Deployments”. This posting will review some of the significant findings in that paper.
General Topology
It’s common to see two or three levels of servers. Due to the large number of servers to be accommodated, they are usually rack mounted. The servers within each rack are interconnected with a 1GbE switch, and each rack switch is connected to a cluster-level switch, usually 10GbE.
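If you want HDFS to take that rack layout into account when it places block replicas, Hadoop of this era can be pointed at a rack-awareness script via the topology.script.file.name property. The sketch below is illustrative only; the host-to-rack mapping is made up.

    #!/bin/sh
    # rack-topology.sh -- hypothetical rack-awareness script.
    # Hadoop passes one or more host names/IPs as arguments and expects
    # one rack path per line in reply; the subnets below are invented.
    while [ $# -gt 0 ]; do
      case "$1" in
        10.1.1.*) echo "/rack1" ;;        # hosts on the first rack's 1GbE switch
        10.1.2.*) echo "/rack2" ;;        # hosts on the second rack's 1GbE switch
        *)        echo "/default-rack" ;;
      esac
      shift
    done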
Hardware Configurations
The study at Intel indicated that dual-socket servers are more cost effective than large-scale multi-processor platforms for parallel processing. The Hadoop cluster takes on, in software, many of the features you would otherwise pay for in an enterprise data center server, so save your money.
Big Data jobs call for a lot of storage, which translates into a large number of disk drives per server. Intel used 4-6 drives per server, but felt that a shop may want to experiment with even higher ratios, such as 12 drives per server. The I/O intensity will vary by job, so some experimenting might be appropriate. Their suggestion is for relatively modest 7,200 RPM SATA drives with capacities between 1 and 2 TB. Using RAID configurations on Hadoop servers is not recommended, since Hadoop already provides data provisioning and redundancy across the cluster.
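In practice each drive is usually given its own file system and handed to the DataNode as a separate directory, so HDFS can spread work across the spindles itself. A minimal sketch, assuming six drives mounted at /data/1 through /data/6 and the old-style dfs.data.dir property name:

    # Hypothetical JBOD layout: one file system per SATA drive, no RAID.
    mkdir -p /data/1/dfs /data/2/dfs /data/3/dfs /data/4/dfs /data/5/dfs /data/6/dfs
    # In hdfs-site.xml, point the DataNode at every directory:
    #   <property>
    #     <name>dfs.data.dir</name>
    #     <value>/data/1/dfs,/data/2/dfs,/data/3/dfs,/data/4/dfs,/data/5/dfs,/data/6/dfs</value>
    #   </property>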
Server memory is usually large, 12 GB to 24 GB, to allow a large number of map/reduce tasks to be carried out simultaneously. Intel even recommends that the memory modules be populated in multiples of six to balance the load across memory channels, and make sure ECC is turned on. Intel’s Hyper-Threading Technology also showed some nice performance increases in their tests; it has to be turned on in the BIOS.
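That memory figure is what bounds the number of task slots each node can run. The back-of-the-envelope sketch below uses illustrative numbers and old-style property names, not figures from the Intel paper:

    # A 24 GB node running 8 map slots and 4 reduce slots, reserving ~4 GB
    # for the OS, DataNode and TaskTracker, leaves roughly
    # (24 - 4) GB / 12 tasks ~= 1.6 GB of heap per task JVM.
    # mapred-site.xml (illustrative values):
    #   mapred.tasktracker.map.tasks.maximum    = 8
    #   mapred.tasktracker.reduce.tasks.maximum = 4
    #   mapred.child.java.opts                  = -Xmx1536m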
Networking, as previously mentioned, will be 1GbE at the server and 10GbE at the cluster level. Intel also suggests bonding twin 1GbE ports together to create a bigger pipe for the data, and configuring eight queues per port to balance interrupt handling across the processing cores.
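As one hedged illustration of that bonding (interface names, addresses and the bonding mode are assumptions, and your distribution’s network config files are the durable place to set this), the classic Linux tooling looks like:

    # Load the bonding driver and enslave the two 1GbE ports eth0 and eth1.
    modprobe bonding mode=802.3ad miimon=100
    ip addr add 10.1.1.21/24 dev bond0
    ip link set bond0 up
    ifenslave bond0 eth0 eth1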
Software Configurations
A recent version of Linux is suggested for good energy conservation; kernels older than 2.6.30 may use as much as 60% more power, and if you’re configuring hundreds or thousands of servers, that adds up in a hurry. Linux should also have the open file descriptor limit raised to something like 64,000 from the default of 1,024.
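One common way to raise that limit for the account running the Hadoop daemons (the user name here is an assumption) is through /etc/security/limits.conf:

    # /etc/security/limits.conf -- raise the per-process open file limit:
    #   hadoop  soft  nofile  64000
    #   hadoop  hard  nofile  64000
    # Verify from a fresh login as that user:
    ulimit -n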
Intel also suggests Sun Java 6, update 14 or later, to take advantage of optimizations like compressed ordinary object pointers.
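Compressed ordinary object pointers were not on by default on the 64-bit JVMs of that vintage, so the flag has to be passed explicitly. A sketch of enabling it for Hadoop’s task JVMs (old-style property name, illustrative heap size):

    java -version    # confirm 1.6.0_14 or later before relying on the flag
    # mapred-site.xml:
    #   <property>
    #     <name>mapred.child.java.opts</name>
    #     <value>-Xmx1536m -XX:+UseCompressedOops</value>
    #   </property>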
Hadoop configuration choices need to be reviewed too. Hadoop has several components, including its file system, HDFS. Mounting the data disks with the noatime and nodiratime attributes disables the recording of file and directory access times and improves performance. Increasing the file system read-ahead buffer from the default 256 sectors to 1,024 or 2,048 sectors also helps. Additional HDFS configuration settings are reviewed in the paper.
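A minimal sketch of both changes, assuming a data disk at /dev/sdb mounted on /data/1:

    # Drop access-time bookkeeping on the data file system (or add the
    # options to the /etc/fstab entry to make the change permanent):
    mount -o remount,noatime,nodiratime /data/1
    # Raise read-ahead from the 256-sector default to 1,024 sectors:
    blockdev --setra 1024 /dev/sdb
    blockdev --getra /dev/sdb    # verify the new setting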
The bottom line is that parallel processing platforms like Hadoop, and the Big Data jobs that run on them, require a different mindset than traditional linear, batch or OLTP configurations. You might want to go to the Intel website and download their paper “Optimizing Hadoop Deployments”.