Managing a Flood of Data

As devices and people become more connected, the data just keeps coming. What to do with all that data is a growing problem, or an opportunity if you have the right mindset. In general, there are three things you can do with this flood of data:

  1. Discard it, or some of it (sampling)

  2. Parallelize the processing of it (e.g., massively parallel processing, or MPP, architectures)

  3. Process it in memory with massive hardware resources

Any combination of the above might make sense depending on the intent of the project, the amount and kinds of data, and of course, your budget. I find it interesting that the traditional RDBMS still has legs thanks to the move toward in-memory processing. Continually falling memory prices make this a “not crazy” alternative. Of course, it all comes back to what you want to do, and with what kind and amount of data. For instance, a relational database for satellite data may not make sense, even if you could build it.
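To make the first option concrete, here is a minimal sketch of one common way to sample a stream whose length you don't know in advance: reservoir sampling. It is only an illustration, not a recommendation for any particular system. The function name `reservoir_sample` and its parameters are my own, and the stream here is just a range of integers standing in for incoming records.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration. Only k items are held in
    memory at any time; everything else is discarded as it arrives.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for a flood of records: 100,000 integers.
sample = reservoir_sample(range(100_000), k=10, seed=42)
print(len(sample))
```

The appeal of this approach is that memory use stays fixed no matter how much data arrives. The cost is that anything not in the sample is gone for good.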

Here’s where the file system can become very interesting. It might seem ironic that unstructured data must be organized before it can be analyzed, but I think of it as farming: you cultivate what you have to get what you want. Ideally, the file system will provide a structure for the analysis that follows. There is no shortage of file systems out there, but because the flood of unstructured data is relatively recent, there might be even better file systems on the way.

There are a number of kinds of file systems available: local, remote, shared, distributed, parallel, high performance computing, network, object, archiving and security, to name some examples. Their structures can be very different. For the flood of unstructured data, parallel file systems seem to offer a way to organize this data for analytics. In many cases the individual record is of little value; the value in most unstructured data streams is in the aggregate. Users are commonly looking for trends or anomalies within a massive amount of data.
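The point about value living in the aggregate can be sketched in a few lines. The idea is that each chunk of data is summarized independently, and the small partial summaries are then merged, which is what lets the work be spread out. This is a toy under stated assumptions: the `Summary` class and the `summarize`, `merge` and `parallel_summary` functions are hypothetical names, the chunks are small non-empty in-memory lists, and Python threads stand in for what would be separate nodes or storage servers in a real system.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Summary:
    """A small, mergeable summary of one chunk of numeric records."""
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarize(chunk: Sequence[float]) -> Summary:
    # Assumes the chunk is non-empty.
    return Summary(len(chunk), sum(chunk), min(chunk), max(chunk))


def merge(a: Summary, b: Summary) -> Summary:
    # Combining two summaries never needs the original records.
    return Summary(a.count + b.count,
                   a.total + b.total,
                   min(a.minimum, b.minimum),
                   max(a.maximum, b.maximum))


def parallel_summary(chunks: List[Sequence[float]],
                     workers: int = 4) -> Summary:
    # Threads are only a stand-in for independent workers here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunks))
    result = partials[0]
    for partial in partials[1:]:
        result = merge(result, partial)
    return result


chunks = [[1.0, 2.0, 3.0], [10.0, 20.0], [4.0]]
overall = parallel_summary(chunks)
print(overall.count, overall.mean, overall.minimum, overall.maximum)
```

Because no single record matters to the result, the chunks can live anywhere and be processed in any order, which is the property a parallel layout is meant to exploit.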

An application that generates massive amounts of new data suggests that traditionally structured file systems for static data (like data warehouses) might not be able to grow as needed, since the warehouse typically takes a point-in-time view. Traditional unstructured static data like medical imaging might be appropriate depending on the application, but most analytics can’t do much with images. Dynamic data has its own challenges. Unstructured dynamic data like CAD drawings or MS Office data (text, etc.) may lend itself to a different file structure than dynamic structured data like CRM and ERP systems, where you are looking for a specific answer from the data.

Dealing with massive amounts of new data may call for a nonlinear approach to keep up with the traffic. Parallel file systems started life in the scientific high performance computing (HPC) world. IBM created a parallel file system called GPFS in the 1990s, but it was proprietary. The Network File System (NFS) brought a distributed file system to the masses and made it easier to share files through a shared name space. Sun created NFS and made it available to everyone, and it was generally adopted and enhanced. NFS has some I/O bandwidth issues, which companies like Panasas and the open-source Lustre project have tried to address. I/O bandwidth remains the primary reason to consider a parallel file system. If you have a flood of data, a parallel file system is probably still the best way to deal with it.

I expect more parallel and object file systems to emerge, offering better tools than are available today for managing the massive data flooding into our data centers. The sampling approach will increasingly lose ground, since the cost of storage continues to fall and some of the most interesting data are outliers. “Long tail” analysis, which looks for situations where the rules seem to change when events become extreme, can be very valuable. This may require analyzing all the data, since a sample may not contain enough evidence of “long tail” events that occur infrequently.
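A little arithmetic shows why sampling and long tail analysis don't mix well. The sketch below assumes a simple model that is mine, not something measured: each record is kept independently with probability `sample_fraction`, and there are `rare_events` extreme records somewhere in the data. Under that assumption, the chance that the sample contains none of them is (1 - sample_fraction) raised to the power rare_events. The function name is hypothetical.

```python
def prob_sample_misses_all(rare_events: int, sample_fraction: float) -> float:
    """Chance that a random sample contains none of the rare events.

    Assumes each record is kept independently with probability
    sample_fraction (an illustrative model, not a measured one).
    """
    if rare_events < 0:
        raise ValueError("rare_events must be non-negative")
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")
    return (1.0 - sample_fraction) ** rare_events


# Five extreme events hidden in the data, with a 1% sample kept.
print(round(prob_sample_misses_all(5, 0.01), 3))  # 0.951
```

In other words, with a 1% sample and five extreme events in the data, the sample misses every one of them about 95% of the time under this model. Keeping all the data is the only way to be sure the outliers are there to find.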

In summary, managing the flood of data is a question of identifying what you want to get from the data. That, combined with the nature of the data, will guide you to an appropriate file system. In most cases a parallel file system will be the solution, but you have to know your application. The good news is that as our sophistication grows, we will have more options for fine-tuning the systems we build, so that analyzing the data we have produces the results we want.