Why Would I Use NoSQL?

In any job, it helps to use the right tool for the task. In the Big Data universe there can be many different kinds of data: structured data in tables, text from email, tweets, Facebook, and other sources, log data from servers, and sensor data from scientific equipment. To get answers out of this variety of data, there is a corresponding variety of tools.

As always with Big Data, it helps to have the end in mind before you start. This will guide you to the sources of data you need to address your desired result, and it will also indicate the proper tool. Consider a continuum with a relational database management system (RDBMS) on one end and a Hadoop/MapReduce engine on the other. RDBMS architectures, like Oracle, have ACID (Atomicity, Consistency, Isolation, Durability), a set of properties that assures database transactions are processed reliably. That reliability is why, for critical data that must be correct and where cost is secondary, the RDBMS is the standard. For example, you want to know what amount should be on the payroll check; it has to be right. On the other end are the MapReduce solutions. Their primary concern is not the strict consistency of the RDBMS, but processing massive amounts of data in parallel in a cost-effective manner. Fewer assurances are required for this data because of the result desired; this is often the case when looking for trends or trying to find a correlation between events. MapReduce might be the right tool to see if your customer is about to leave you for another vendor.
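
To make the RDBMS end of that continuum concrete, here is a minimal sketch of an ACID-style transaction using Python's built-in sqlite3 module. The payroll table, names, and amounts are made up for illustration; the point is simply that the debit and the credit either both commit or neither does.

```python
import sqlite3

# Hypothetical payroll table; the transaction below either applies both
# updates or neither, so no half-written paycheck is ever visible.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (employee TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO payroll VALUES ('company_account', 100000.0), ('alice', 0.0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("UPDATE payroll SET balance = balance - 2500.0 "
                     "WHERE employee = 'company_account'")
        conn.execute("UPDATE payroll SET balance = balance + 2500.0 "
                     "WHERE employee = 'alice'")
except sqlite3.Error:
    # On failure the whole transaction is rolled back automatically.
    pass

print(dict(conn.execute("SELECT employee, balance FROM payroll")))
```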

The NoSQL world sits somewhere in between. While the RDBMS offers strict consistency, the NoSQL world works on eventual consistency. A two-phase commit with the use of logs is one way to get things sorted out eventually, but at any given point in time a user might read data that hasn't been updated yet. This can be adequate for jobs that need faster turnaround than MapReduce but don't justify the money to build out the expensive infrastructure of a full RDBMS. MapReduce is a batch job, meaning the processing has a definite start and stop to produce results; where batch latency isn't adequate, NoSQL offers continuous processing for lower latency. Another advantage of NoSQL, similar to MapReduce, is scalability. NoSQL provides horizontal scaling up to thousands of nodes: jobs are chopped up, as in MapReduce, and spread among a large number of servers for processing. It might be just the ticket for a Facebook update.
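
Here is a toy sketch of why eventual consistency can hand a reader stale data. It is not any particular NoSQL product's API, just an in-process simulation where a write is acknowledged by one replica and copied to the others after a delay.

```python
import time
import threading

class ToyEventualStore:
    """Toy illustration of eventual consistency: a write lands on one replica
    immediately and is copied to the other replicas after a delay."""

    def __init__(self, replicas=3, propagation_delay=0.5):
        self.replicas = [dict() for _ in range(replicas)]
        self.delay = propagation_delay

    def write(self, key, value):
        self.replicas[0][key] = value          # acknowledged right away
        def propagate():
            time.sleep(self.delay)             # replication lag
            for replica in self.replicas[1:]:
                replica[key] = value
        threading.Thread(target=propagate, daemon=True).start()

    def read(self, key, replica_index):
        # A reader routed to a lagging replica sees the old value (or nothing).
        return self.replicas[replica_index].get(key)

store = ToyEventualStore()
store.write("status", "updated")
print(store.read("status", 0))   # 'updated' from the replica that took the write
print(store.read("status", 2))   # likely None: this replica hasn't caught up yet
time.sleep(1)
print(store.read("status", 2))   # 'updated': eventually consistent
```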

One of the downsides of a NoSQL database is the potential for deadlock. A deadlock occurs when two processes are each waiting for the other to finish before proceeding, so neither can move: hence the stare-down called a deadlock. This can happen when the processes update records in a different sequence, come into conflict, and end up in a permanent wait state. There are tools to minimize the impact of this potential. The workarounds might result in someone seeing outdated data, but again, if that is acceptable for the desired result, then NoSQL could be a good fit. Eventually things get sorted out, if the system is properly designed.
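
A minimal sketch of both the problem and one common workaround, consistent lock ordering, using Python threads; the record names and the update body are placeholders.

```python
import threading

# Two hypothetical record locks. If worker A takes them as (accounts, orders)
# while worker B takes them as (orders, accounts), each can end up holding one
# lock and waiting forever for the other: a deadlock.
locks = {"accounts": threading.Lock(), "orders": threading.Lock()}

def update_records(names):
    # Workaround sketched here: always acquire locks in one global order
    # (sorted by name), so two workers can never wait on each other in a cycle.
    for name in sorted(names):
        locks[name].acquire()
    try:
        pass  # ... update the records ...
    finally:
        for name in sorted(names, reverse=True):
            locks[name].release()

a = threading.Thread(target=update_records, args=(["accounts", "orders"],))
b = threading.Thread(target=update_records, args=(["orders", "accounts"],))
a.start(); b.start(); a.join(); b.join()
print("both workers finished without deadlocking")
```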

As you can see, understanding the job at hand, the desired result, and what kinds of issues are acceptable will determine whether an RDBMS, NoSQL, or a MapReduce solution is the right fit. NoSQL options are growing all the time, which might indicate that this middle ground is finding more suitable jobs.

Managing a Flood of Data

With the increasing connectedness of devices and people, the data just keeps coming. What to do with all that data is becoming an increasing problem, or an opportunity if you have the right mindset. In general, there are three things that can be done with this flood of data:

  1. Discard it, or some of it (sampling; see the sketch after this list)

  2. Parallelize the processing of it (e.g., MPP: massively parallel processing architectures)

  3. Process it in memory, with massive hardware (HW) resources
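
Option 1 has a well-known streaming form. Below is a minimal sketch of reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample of a stream while discarding the rest; the log-line generator is just a stand-in for whatever data is flooding in.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k records from a stream of unknown
    length, without ever holding the full stream in memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)      # each record survives with probability k/(i+1)
            if j < k:
                reservoir[j] = record
    return reservoir

# Stand-in for a firehose of log lines; any iterable works.
sample = reservoir_sample((f"log line {i}" for i in range(1_000_000)), k=10, seed=42)
print(sample)
```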

Any combination of the above might make sense depending on the intent of the project, the amount and kinds of data, and, of course, your budget. I find it interesting that the traditional RDBMS still has legs thanks to the movement toward in-memory processing; continually falling memory prices make this a “not crazy” alternative. Of course, it gets back to what you want to do with what kind and amount of data. For instance, a relational database for satellite data may not make sense, even if you could build it.

Here’s where the file system can become very interesting. It might be ironic that unstructured data must be organized to be able to analyze it, but I think of it as farming. You cultivate what you have to get what you want. Ideally, the file system will provide a structure for the analysis that will follow. There doesn’t seem to be a shortage of file systems out there, but because the flood of unstructured data is relatively recent, there might be even better file systems on the way.

There are a number of file system types available: local, remote, shared, distributed, parallel, high-performance computing, network, object, and archiving- and security-oriented systems, to name some examples. Their structures can be very different. For the flood of unstructured data, parallel file systems seem to offer a way to organize this data for analytics. In many cases the individual record is of little value; the value in most unstructured data streams is in the aggregate, as users are commonly looking for trends or anomalies within a massive amount of data.
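
To make the “value is in the aggregate” point concrete, here is a small sketch, not tied to any particular parallel file system, that computes per-chunk statistics in parallel and merges them; in practice each worker would read its own stripe of the data independently, which is exactly the bandwidth a parallel file system provides.

```python
from multiprocessing import Pool

def chunk_stats(chunk):
    # Per-chunk partial results: individual records matter little;
    # the aggregate (count, sum, max) is what the analysis wants.
    return len(chunk), sum(chunk), max(chunk)

def merge(partials):
    counts, sums, maxes = zip(*partials)
    total = sum(counts)
    return {"count": total, "mean": sum(sums) / total, "max": max(maxes)}

if __name__ == "__main__":
    # Stand-in for sensor readings striped across a parallel file system.
    readings = list(range(1_000_000))
    chunks = [readings[i::8] for i in range(8)]
    with Pool(processes=8) as pool:
        print(merge(pool.map(chunk_stats, chunks)))
```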

An application with massive amounts of new data suggests that file systems traditionally structured for static data (like data warehouses) might not be able to grow as needed, since the warehouse typically takes a point-in-time view. Traditional unstructured static data like medical imaging might be appropriate, depending on the application, but most analytics can't do much with images. Dynamic data has its own challenges: unstructured dynamic data like CAD drawings or MS Office files (text, etc.) may lend itself to a different file structure than dynamic structured data like CRM and ERP systems, where you are looking for a specific answer from the data.

Dealing with massive amounts of new data may call for a nonlinear approach to keep up with the traffic. Parallel file systems started life in the scientific high-performance computing (HPC) world; IBM created a parallel file system in the 1990s called GPFS, but it was proprietary. The Network File System (NFS) brought a distributed file system to the masses, making it easier to share files through a shared name space. Sun created NFS and made it available to everyone, and it was widely adopted and enhanced. There are I/O bandwidth issues with NFS, which companies like Panasas and the open-systems-oriented Lustre have tried to address. I/O bandwidth remains the primary reason to consider a parallel file system. If you have a flood of data, it's probably still the best way to deal with it.

I expect to see more parallel and object file systems providing improved tools, beyond what is available today, to manage the massive data flooding into our data centers. The sampling approach will increasingly be diminished, since the cost of storage continues to fall and some of the most interesting data are outliers. “Long tail” analysis, finding situations where the rules seem to change when events become extreme, can be very valuable. This may require analyzing all the data, since a sample may not contain enough evidence of “long tail” events that occur infrequently.
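
A quick back-of-the-envelope calculation shows why sampling struggles with long-tail events; the stream size, event rate, and sample fraction below are made up for illustration.

```python
# Back-of-the-envelope: how likely is a 1% sample to contain at least one
# occurrence of a rare event? All numbers here are hypothetical.
stream_size = 100_000_000      # records in the full stream
event_rate = 1e-7              # a "long tail" event: ~10 occurrences total
sample_fraction = 0.01         # keep 1% of records

expected_events = stream_size * event_rate                 # about 10 in the full stream
p_miss_all = (1 - sample_fraction) ** expected_events      # approx. chance the sample has none
print(f"expected occurrences in full stream: {expected_events:.0f}")
print(f"probability a 1% sample misses them all: {p_miss_all:.0%}")   # roughly 90%
```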

In summary, managing the flood of data is a question of identifying what you want to get from the data. That, combined with the nature of the data, will guide you to an appropriate file system. In most cases a parallel file system will be the solution, but you have to know your application. The good news is that as our sophistication grows, we will have more options to fine-tune the systems we build to analyze the data we have to get the results we want.

“Active Flash” for Big Data Analytics on SSD-based Systems

FAST ’13, the USENIX Conference on File and Storage Technologies, February 12–15, 2013, San Jose, CA

If you’re not familiar with the geekfest called USENIX and their file and storage technology conference, it is a very scholarly affair. Papers are submitted on a variety of file and storage topics, and the top picks present their findings to the audience. The major component and system vendors are there along with a wide variety of academic and national labs.

Let’s review a paper about using SSDs in high-performance computing environments with a large number of nodes. See the reference at the end for the paper’s details.*

The issue is how to manage two jobs on one data set. The example in the paper is a two-step process in the high-end computing world: complex simulations are run on supercomputers, and then the results are moved to separate systems where the data is subjected to analytics. Analytics are typically done on smaller systems in batch mode. The high-end computing (HEC) systems that run the simulations are extremely expensive, and keeping them fully utilized is important. This creates a variety of issues in the data center, including the movement of data between the supercomputer and the storage farm, analytic performance, and the power required for these operations. The approach proposed is called “Active Flash”.

The floating-point hardware in HEC systems is designed for simulation, not for the typical analytic workload, which results in the data being moved to another platform for analytics. The growth in data (now moving to exabytes) is projected to increase costs to the point where just moving the data will cost about as much as the analytic processing itself. In addition, extrapolating current power costs to future systems indicates that power will become the primary design constraint on new systems: the authors expect power demands to grow 1,000X in the next decade while the available power envelope grows only 10X. Clearly, something must be done.

The authors have built a Flash Translation Layer (FTL) with data-analysis functions on the OpenSSD platform to prove their theories about an Active Flash configuration that reduces both the energy and the performance issues of analytics in an HEC environment. Their 18,000-compute-node configuration produces 30TB of data each hour. On-the-fly data analytics are performed in the staging area, avoiding the performance and energy costs of data migration. By staging area, we are talking about the controllers in the SSDs.

High-performance computing (HPC) tends to be bursty, alternating between I/O-intensive and compute-intensive activity. It's common for a short I/O burst to be followed by a longer computational period. These loads are not evenly split; indeed, I/O is usually less than 5% of overall activity. The nature of the workload creates an opportunity for the SSD controller to do some analytics. As SSD controllers move to multi-core designs, this creates more opportunity for analytic activity while the simulations are running.
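
Here is a toy, host-side sketch of the idea, not the authors' OpenSSD/FTL code: each staged block is reduced where it lands during the long compute phase, and only the small summary would ever cross the network.

```python
# Toy illustration of the Active Flash idea: reduce each staged block where
# it lands and ship only the tiny result, instead of migrating the raw data.

def reduce_block(block):
    # A simple on-the-fly analytic: count, sum, and extremes for one block.
    return {"n": len(block), "sum": sum(block), "min": min(block), "max": max(block)}

def merge(results):
    return {
        "n": sum(r["n"] for r in results),
        "sum": sum(r["sum"] for r in results),
        "min": min(r["min"] for r in results),
        "max": max(r["max"] for r in results),
    }

# Stand-in for simulation output arriving in bursts at the staging SSDs.
staged_blocks = [[float(i * b) for i in range(10_000)] for b in range(1, 9)]

# During the long compute phase between I/O bursts, each block is reduced
# "in place"; only these small summaries would leave the device.
partials = [reduce_block(block) for block in staged_blocks]
summary = merge(partials)
summary["mean"] = summary["sum"] / summary["n"]
print(summary)
```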

The model for identifying which nodes are suitable for SSDs combines capacity, performance, and write-endurance characteristics. The energy profile is similarly modeled to predict the energy cost and savings of different configurations. The authors' experimental models were tested in different configurations. The Active Flash version extends the traditional FTL with analytic functions, and the analytic function is enabled with an out-of-band command. The result is elegant and outperforms both the offline-analytics and the dedicated-analytics-node approaches.

The details and formulas are in the referenced paper, and are beyond my humble blog. But for those thinking of SSDs for Big Data, it appears the next market is to enhance the SSD controller for an Active Flash approach to analytics.

*The paper is #119, “Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machines,” by Devesh Tiwari (North Carolina State University), Simona Boboila (Northeastern University), Sudharshan S. Vazhkudai (Oak Ridge National Laboratory), Youngjae Kim (Oak Ridge National Laboratory), Xiaosong Ma (North Carolina State University), Peter J. Desnoyers (Northeastern University), and Yan Solihin (North Carolina State University).