OpenStack for Big Data in the Cloud

Big data storage has two broad use cases. For much of the data, cost and scale are the primary considerations and performance is secondary. For real-time analytics, performance and scale are the primary concerns and cost is secondary.

OpenStack has positioned itself as the platform for the Open Cloud and has the potential to affect how you handle your Big Data storage issues. Its storage comes in two flavors: one for block storage and one for file/object storage.

Block storage is the usual mode for traditional storage area networks and is served by OpenStack’s Cinder. File/object storage, the home of files, video, logs, etc., is served by OpenStack’s Swift.

Swift is for objects that aren’t used for transactions. Why? The data in Swift is eventually consistent, which isn’t appropriate for transaction data, but is just fine for much of the data found in static big data, such as photos, video, log data, machine data, social media feeds, backups, and archives. Readers might recognize a previous Big Data Perspectives blog discussing the differences between consistency models and their appropriate applications. A key/value store might be a good fit for eventual consistency, but your bank records need ACID-compliant consistency.

One potential issue is the need to change applications, because Swift is a new approach. Applications have to be REST API compatible, although a gateway can be placed in front of Swift to allow legacy applications to work with it. REST (representational state transfer) is an architectural style for accessing resources over the web using HTTP commands such as GET, PUT, and DELETE. Riverbed is an example of a Swift implementation.

Because it has no traditional hierarchical structure, Swift can scale out with essentially no built-in limit, but its performance is uncertain. The focus is on commodity hardware and open source software to keep the cost of storage low.
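To make the REST point above a little more concrete, here is a minimal Python sketch of what storing one object in Swift looks like at the HTTP level. Swift addresses objects as /v1/{account}/{container}/{object} and expects an auth token in the X-Auth-Token header. The endpoint, account, container, object name, and token below are all made-up placeholders, and the helper name build_swift_put is my own. The request is only built, not sent, so nothing here needs a live cluster.

```python
from urllib.parse import quote
from urllib.request import Request


def build_swift_put(endpoint, account, container, object_name, data, token):
    """Build (but do not send) an HTTP PUT that would store one object in Swift.

    All argument values used in this example are hypothetical placeholders.
    """
    # Swift object URLs follow the pattern /v1/{account}/{container}/{object}.
    # Object names may contain slashes, so "/" is left unescaped for them.
    path = "/".join([
        quote(account, safe=""),
        quote(container, safe=""),
        quote(object_name),
    ])
    url = f"{endpoint.rstrip('/')}/v1/{path}"
    headers = {
        "X-Auth-Token": token,  # token obtained from the identity service
        "Content-Type": "application/octet-stream",
    }
    return Request(url, data=data, method="PUT", headers=headers)


# Hypothetical values, for illustration only.
req = build_swift_put(
    endpoint="https://swift.example.com",
    account="AUTH_demo",
    container="logs",
    object_name="app.log",
    data=b"first log line\n",
    token="example-token",
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually send it. A legacy application that only knows how to read and write files has no way to issue a call like this, which is the gap a gateway is meant to fill.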

Cinder is for block data that could be attached to your SAN and could include transaction data in the cloud. Where performance is more important, or for transactional and database requirements, Cinder is a more appropriate choice. It does have big-time supporters such as IBM and NetApp. You can understand the major storage vendors’ dilemma, however. The whole focus of OpenStack is to use open source software and commodity hardware to bring down the cost of storage. These vendors do provide API compatibility to allow their proprietary systems to communicate with an OpenStack node.

There might be a way to bring the worlds of proprietary and open together to get the best from both. By using proprietary systems for ACID-related data (typically transactions, databases, CRM, ERP, and real-time analytics) and OpenStack for less critical data, you can put value where it is recognized and commodity where it is not.
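As a rough illustration of that split, the placement decision could be written down as a simple rule. This is only a sketch: the Dataset fields, the tier labels, and the rule itself are my own assumptions for the example, not anything OpenStack provides.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a data set to be placed on storage."""
    name: str
    needs_acid: bool   # transactions, databases, CRM, ERP
    real_time: bool    # real-time analytics, where performance comes first


def choose_tier(ds: Dataset) -> str:
    """Return an illustrative storage tier label for a data set."""
    if ds.needs_acid or ds.real_time:
        # Put value where it is recognized.
        return "proprietary"
    # Photos, logs, backups, archives: commodity is good enough.
    return "openstack"


placements = {
    ds.name: choose_tier(ds)
    for ds in [
        Dataset("erp-database", needs_acid=True, real_time=False),
        Dataset("clickstream-dashboard", needs_acid=False, real_time=True),
        Dataset("web-server-logs", needs_acid=False, real_time=False),
    ]
}
print(placements)
```

In practice the rule would also weigh cost, capacity, and data growth, but even a crude version like this makes the trade-off explicit.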