Any job goes better when you use the right tool. In the Big Data universe there can be many different kinds of data. Structured data in tables. Text from email, tweets, Facebook, or other sources. Log data from servers. Sensor data from scientific equipment. To get answers out of this variety of data, there is a variety of tools.
As always with Big Data, it helps to have the end in mind before you start. This will guide you to the sources of data you need to reach your desired result. It will also indicate the proper tool. Consider a continuum with a relational database management system (RDBMS) on one end and a Hadoop/MapReduce engine on the other. RDBMS products, like Oracle, provide ACID (Atomicity, Consistency, Isolation, Durability), a set of properties that assure database transactions are processed reliably. This reliability is why the RDBMS is the standard for critical data that must be correct and where cost is secondary. For example, you want to know what amount should be on a payroll check. It has to be right. On the other end are the MapReduce solutions. Their primary concern is not consistency, as with the RDBMS, but processing massive amounts of data in parallel in a cost-effective manner. Fewer assurances are required for this data because of the result desired. This is often the case when looking for trends or trying to find some correlation between events. MapReduce might be the right tool to see if your customer is about to leave you for another vendor.
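To make the RDBMS end of that continuum concrete, here is a minimal sketch of the "all or nothing" behavior that atomicity promises, using Python's built-in sqlite3 module in place of a production database such as Oracle. The accounts table, the account names, and the pay helper are assumptions made up for illustration; they do not come from any real payroll system.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("company", 1000), ("employee", 0)],
    )
    conn.commit()
    return conn


def pay(conn, amount):
    """Credit the employee and debit the company as one transaction."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back every statement in it if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? "
                "WHERE name = 'employee'",
                (amount,),
            )
            # Violates the CHECK constraint if the company cannot cover it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE name = 'company'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    conn = make_db()
    print(pay(conn, 400), balances(conn))
    print(pay(conn, 5000), balances(conn))
```

In this sketch the second payment is larger than the company balance, so the debit fails and the credit that already ran is rolled back with it. The employee never ends up with money the company did not actually pay out, which is the kind of assurance the payroll example depends on.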
The NoSQL world is somewhere in between. While the RDBMS keeps data consistent at all times, much of the NoSQL world works on eventual consistency. A two-phase commit with the use of logs is one way to get things sorted out eventually, but at any given point in time, a user might get data that hasn’t been updated. This might be adequate for jobs that need faster turnaround than MapReduce, but don’t justify the money to build out the expensive infrastructure for a full RDBMS. MapReduce is a batch job, meaning that the processing has a definite start and stop to produce results. If MapReduce can’t deliver adequate latency, NoSQL provides continuous processing instead of batch processing, for lower latency. Another advantage of NoSQL, similar to MapReduce, is scalability. NoSQL can scale horizontally up to thousands of nodes. Jobs are chopped up, as in MapReduce, and spread among a large number of servers for processing. It might be just the ticket for a Facebook update.
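The idea of eventual consistency can be sketched with a toy model: one node accepts writes and records them in a log, and a second node serves reads but only catches up when it replays that log. This is an illustration under simplifying assumptions, not how any particular NoSQL product works, and the Primary and Replica classes are hypothetical names invented for the example.

```python
class Primary:
    """Accepts writes and keeps an ordered log of them."""

    def __init__(self):
        self.data = {}
        self.log = []  # every write, in the order it happened

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Serves reads from a copy that lags until sync() is called."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def read(self, key):
        return self.data.get(key)

    def sync(self):
        # Replay only the log entries this replica has not seen yet.
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)


if __name__ == "__main__":
    primary = Primary()
    replica = Replica(primary)

    primary.write("status", "at lunch")
    replica.sync()
    primary.write("status", "back at my desk")

    print(replica.read("status"))  # the replica has not caught up yet
    replica.sync()
    print(replica.read("status"))  # now it matches the primary
```

Between the second write and the second sync, a reader of the replica still sees the older status. For something like a social media update that short window is usually acceptable; for a payroll amount it would not be.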
One of the downsides of a NoSQL database is the potential for deadlock. A deadlock occurs when two processes are each waiting for the other to finish, and neither can proceed until the other does. Hence the stare-down called a deadlock. This can happen when the processes update the same records in a different sequence and end up in conflict, resulting in a permanent wait state. There are some tools to minimize the impact of this potential. The workarounds might result in someone seeing outdated data, but again, if that is acceptable for the desired result, then NoSQL could be a good fit. Eventually things get sorted out, if the system is properly designed.
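The "different sequence" cause described above can be sketched with two threads and two locks standing in for two processes and two records. One well-known way to avoid the stare-down is to always take the locks in the same agreed order, no matter which record a caller names first. That lock-ordering technique is my choice for this illustration, not necessarily one of the tools the paragraph has in mind, and the Record class and helper functions are hypothetical.

```python
import threading


class Record:
    """A hypothetical stored record guarded by its own lock."""

    def __init__(self, key, value=0):
        self.key = key
        self.value = value
        self.lock = threading.Lock()


def update_pair(first, second, delta):
    """Add delta to `first` and subtract it from `second` under both locks."""
    # Always lock in key order, whatever order the caller named the records.
    # Locking in call order instead is what allows two workers to each hold
    # one lock while waiting forever for the other.
    lower, higher = sorted((first, second), key=lambda r: r.key)
    with lower.lock:
        with higher.lock:
            first.value += delta
            second.value -= delta


def worker(first, second, rounds):
    for _ in range(rounds):
        update_pair(first, second, 1)


def run_demo(rounds=1000):
    a, b = Record("a"), Record("b")
    # The two workers name the same records in opposite order.
    t1 = threading.Thread(target=worker, args=(a, b, rounds))
    t2 = threading.Thread(target=worker, args=(b, a, rounds))
    t1.start()
    t2.start()
    t1.join(timeout=10)
    t2.join(timeout=10)
    finished = not (t1.is_alive() or t2.is_alive())
    return finished, a.value, b.value


if __name__ == "__main__":
    print(run_demo())
```

Because both workers sort the records before locking, neither can hold one lock while waiting on the other, so both run to completion and their opposite updates cancel out.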
As you can see, understanding the job at hand, the desired result, and what kinds of issues are acceptable will determine whether an RDBMS, NoSQL, or MapReduce solution will fit. NoSQL options are growing all the time, which might indicate that this middle ground is finding more suitable jobs.