Big Data files seem to come in two flavors: busy and lazy. This is due to the nature of the data that we generate, and its usage. Some data will be intensely used while it is fresh, and then relegated to a data parking lot for lazy data. Other data, particularly unstructured data, will be used less often because its usefulness lies in identifying trends or outliers, and it may be processed in batches rather than interactively.
Busy data benefits from more aggressive management and higher-performing systems, on the premise that busy data is more valuable and more time-sensitive. A great study on the business value of latency for managing busy data was presented at the 2009 O’Reilly Velocity Conference in San Jose (http://www.youtube.com/watch?v=bQSE51-gr2s), and it reinforces the business need to provide a low-latency environment for busy data to get the best result. The video describes how Microsoft’s Bing and Google independently came to a very similar conclusion: latency matters. It was previously thought that people couldn’t perceive differences under 200ms. These two studies show that user behavior keeps responding to faster response times (lower latency) until system response times fall below 50ms. This means your system design should deliver consistent end-user response times of about 50ms. The slowest element in a typical system design is the electro-mechanical disk drive, which suggests it is time to move to an all-flash array architecture, like those from Violin Memory and others. The business value is surprising: going from 1000ms to 50ms was found to make users more engaged and spend 2.8% more money. If your business has anything to do with ecommerce, or just getting more productivity from high-value employees, you ought to be looking at 50ms system response times.
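One practical way to act on that 50ms target is to track response-time percentiles rather than averages, since a good mean can hide the slow outliers users actually feel. Below is a minimal sketch of such a check; the function name, the nearest-rank percentile method, and the sample values are my own illustration, not from the talk.

```python
# Hypothetical sketch: check recorded end-to-end response times (in ms)
# against the ~50ms target discussed above. Sample values are invented.
BUDGET_MS = 50.0

def meets_latency_budget(samples_ms, budget_ms=BUDGET_MS, percentile=0.95):
    """Return True if the given percentile of samples is within budget."""
    ordered = sorted(samples_ms)
    # Nearest-rank style index for the requested percentile
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] <= budget_ms

# Example: a system hovering around 30-45ms with one 80ms outlier
samples = [31, 28, 44, 39, 35, 42, 80, 33, 37, 40]
print(meets_latency_budget(samples))  # → False: the p95 sample (80ms) blows the budget
```

The point of the sketch is the shape of the check, not the numbers: a single slow request in ten is enough to fail a p95 budget, which is exactly the kind of inconsistency a flash tier is meant to eliminate.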
Lazy data is a different story, of course. There is an excellent paper from the USENIX Association, a storage geek group, that bears review. “Measurement and Analysis of Large-Scale Network File System Workloads” by Andrew Leung et al., from USENIX 2008, provides some color on how files are used and how they behave over time (http://www.ssrc.ucsc.edu/Papers/leung-usenix08.pdf). As the world of Big Data takes hold, the usage patterns for files are changing. Read-to-write ratios have decreased, read-write access patterns have increased, sequential runs are longer, most bytes transferred come from larger files, and file sizes themselves are larger. Files live longer, with fewer than 50% deleted within a day of creation. Files are rarely reopened; when they are, it is usually within a minute. The authors also saw the noisy-neighbor effect, with fewer than 1% of clients accounting for 50% of file requests. 76% of files are opened by only one client. File sharing happens for only 5% of files, and 90% of that sharing is read-only. Interestingly, most file types do not have a common access pattern. It sounds like disk might be just fine for lazy data.
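Those access patterns suggest a simple tiering rule: if a file has not been touched recently, it is very unlikely to be reopened, so it can safely move to the lazy tier. Here is a minimal sketch of that rule; the one-day busy window and the function names are my own assumptions for illustration, not figures from the Leung paper.

```python
import time

# Hypothetical sketch: a busy/lazy tiering rule motivated by the
# observation that files are rarely reopened, and when they are, it is
# usually within a minute. The one-day threshold is an assumption.
BUSY_WINDOW_SECONDS = 24 * 60 * 60  # treat anything touched today as busy

def classify(last_access_epoch, now=None):
    """Return 'busy' for recently touched files, 'lazy' otherwise."""
    now = time.time() if now is None else now
    return "busy" if (now - last_access_epoch) <= BUSY_WINDOW_SECONDS else "lazy"

now = 1_000_000.0
print(classify(now - 600, now))        # accessed 10 minutes ago → busy
print(classify(now - 7 * 86400, now))  # accessed a week ago → lazy
```

In a real system the last-access timestamp would come from file metadata (e.g. `st_atime` from a stat call), and the window would be tuned against the actual reopen distribution of your workload.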
The conclusion is that busy data needs a system focused on performance, and that performance must be extreme by previous standards: down to 50ms system response time. The move to an all-flash storage environment, eliminating hard disk drives for busy data, seems indicated. The 50ms system response time is difficult to achieve in online situations where network latency is highly variable, but it does provide an indicator of why Google is spending money to put whole communities online. The other conclusion is that lazy data might be a good fit for lower-cost storage, since the intensity of usage is less and the primary goal should be cost effectiveness. Disk storage, with its sequential performance strength, is a good fit for lazy data. To keep costs down you might want to consider some sort of RAID alternative such as SWARM or erasure coding.
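The core idea behind erasure coding can be shown with its simplest case: a single XOR parity block protecting k data blocks against the loss of any one of them (the RAID-5-style scheme; production erasure codes such as Reed-Solomon tolerate multiple simultaneous losses with better space efficiency). The sketch below uses invented block contents purely to demonstrate the mechanism.

```python
# Minimal sketch of XOR parity, the simplest erasure-coding idea:
# one parity block lets us rebuild any single lost data block.

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks
parity = xor_blocks(data)            # one parity block (stored alongside)

# Simulate losing block 1, then rebuild it from the survivors + parity
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # → True
```

The cost argument follows directly: mirroring three blocks doubles the storage, while this scheme stores four blocks for three blocks of data, which is why erasure coding appeals for large volumes of lazy data.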