Scale-out Analytics


Analytics have already changed the world.  The business world.  The science world.  The education world. The government world. And still, most of the data we have has not been used.  An IDC research report, funded by Seagate Technology, states that 68% of enterprise data goes unused.1 If we are to extract more value from data, we need to analyze more data.

Most of today’s enterprise data centers aren’t configured to handle the petabytes of data needed for large-scale analytics. The approach that many enterprises take is to move to the cloud to get the scale and ease of use that is needed for scale-out analytics. It can work.  But it’s not cheap.

To give you an idea of how expensive cloud computing can be, consider Amazon Web Services.  AWS was responsible for about 13% of Amazon's overall revenue, but roughly 71% of its overall operating profits.2   However, you don't need to go to the cloud to get the benefits of the cloud, and you might be able to save some money.  Having a data strategy that uses both cloud and on-prem resources appropriately gives you a framework to decide which applications are best served where. This becomes more important over time: you pay to export data from the cloud, so if you change your mind down the road, it might be prohibitively expensive to move the function back on-site.
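As a rough illustration of why egress fees matter to a repatriation decision, here is a back-of-envelope sketch in Python. The per-GB rate is a hypothetical placeholder, not a quote from any provider.

```python
# Back-of-envelope egress cost model. The per-GB rate below is a
# HYPOTHETICAL placeholder -- check your provider's current pricing.
EGRESS_RATE_PER_GB = 0.09  # assumed $/GB, for illustration only

def repatriation_cost(dataset_tb):
    """Rough one-time cost to move a dataset out of the cloud."""
    return dataset_tb * 1024 * EGRESS_RATE_PER_GB

# Moving a 500 TB analytics data set back on-prem at this rate:
cost = repatriation_cost(500)
```

Even at a modest per-GB rate, the one-time cost of moving hundreds of terabytes is substantial, which is why the decision of what runs where is best made up front.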

So back to the original problem: how should an enterprise build scale-out analytic infrastructure to take advantage of the massive amount of data available but still unused?


A significant reason that enterprises haven't been executing scale-out analytics on premises is that the existing infrastructure is largely designed to accommodate the bread-and-butter databases that manage transactions, inventory and customer service. Moving to a scale-out analytics infrastructure is another matter.

One of the significant benefits that the cloud offers is a scale-out infrastructure for very large jobs. Cloud hyperscalers had to develop their own technology to provide it, and that technology was available only to their customers.  One of the critical enabling technologies for scale-out infrastructure is only now coming to the general market: NVMe-oF.

The foundational technology, NVMe, has been around for a few years, and it has fundamentally changed the way flash storage communicates.  Previous protocols were built to accommodate hard drives, which are much slower and lack the queue and command depth of NVMe.  The resulting improvement in performance has been profound. The limitation of NVMe at the device level is that the great performance stays inside the box; once the box is connected to the network, you're plugging into an older, slower and less scalable infrastructure.  NVMe-oF, or NVMe over Fabrics, creates a fabric that is both scalable and blazing fast. NVMe-oF can be built on Ethernet and IP, so there is no need to struggle with an all-new, unproven technology.  You can upgrade and expand your existing IP network to support NVMe-oF.


With the speed, scale and flexibility of NVMe-oF, you can create a cloud-like architecture on your premises and reap the benefits in performance, scale and flexibility as well as cost savings. Additionally, when you own your infrastructure on-prem, you need not worry about the future pricing decisions of a cloud vendor.  Enterprise IT operations can benefit from the flexibility of a composable disaggregated architecture built with NVMe-oF to create configurations that suit the application: lots of CPU power and ultra-fast storage for traditional database jobs; lots of GPU power and fast storage for scale-out analytics; or lots of CPU and less expensive storage for backup and archive workloads.


Strong competitive forces are driving enterprises to make the most of their data. When scaling and flexibility are no longer limitations, you have an opportunity to get more out of your data by using more data. The opportunity to discover new insights, previously unseen with smaller data sets, can improve science, government, education and business.  The ability to understand your customers more profoundly, detail your operations more precisely, and manage your marketing more effectively are game changers that get better with more data.  To get more out of your data you need a scale-out architecture, and NVMe-oF is the way to make it happen.


Creating a Data Strategy for Analytics

Powerful analytic tools are at our disposal.  These tools provide great insights into data that allow us to make better decisions.  But are we looking at the right data?  Don't better decisions depend on the quality of the data?  What is the data strategy to make sure we are considering the right data in our analytics?

Golden Copy Concept

The Golden Copy is the agreed-upon version of truth that is to be used for analytics and other important applications.  Sometimes known as the single source of truth, the Golden Copy comes from agreed-upon data sources, delivered with the appropriate timeliness, and then passes through a data curation process that often includes traditional extract/transform/load (ETL) processing. The result is data that has been vetted and can be used confidently and consistently for a multitude of purposes.
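As a sketch of the idea, the curation path into a Golden Copy can be modeled as a tiny extract/transform/load pipeline. The field names and the validation rule here are purely illustrative, not from any real system.

```python
# Minimal sketch of the curation pipeline feeding a Golden Copy:
# extract raw records, transform (validate and normalize), and load
# only the survivors. Field names are illustrative.

def extract(raw_rows):
    # In practice this reads from the agreed-upon source systems.
    return list(raw_rows)

def transform(rows):
    cleaned = []
    for row in rows:
        name = row.get("customer", "").strip().title()
        if not name:          # reject records that fail validation
            continue
        cleaned.append({"customer": name, "amount": float(row["amount"])})
    return cleaned

def load(rows, golden_copy):
    golden_copy.extend(rows)
    return golden_copy

golden = load(transform(extract([
    {"customer": "  acme corp ", "amount": "19.99"},
    {"customer": "",             "amount": "5.00"},   # fails validation
])), [])
```

The point is that every record in the Golden Copy has passed the same agreed-upon gate, so downstream consumers can trust it consistently.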

Data Ownership

The process of validating appropriate sources of data is no simple task.  This is not just an IT exercise, nor is it just a business unit (BU) process.  The BU needs to define the decisions to be made and what sources of data are needed to support those decisions. IT and business analysts may be able to help the BU identify additional data sources.  For data to become a competitive advantage, it may be necessary to combine traditional and non-traditional sources of data.  There are plenty of data sources available, but care must be taken to consider the provenance and quality of the data.  A common arrangement might be that data ownership resides with the BU, which makes the data source decisions, while IT owns the infrastructure and processes for the data.

Data Management

Data management comprises several different processes, and there are different considerations for data sources inside and outside the company.  Management typically exists for data from inside the company, such as sales data, manufacturing process data and supply chain data.  There may be other data from inside the company, often unstructured, that lacks existing management to determine whether it is suitable for the Golden Copy repository.  Examples might include sensor data, customer service feedback, process machinery data or other IoT data that needs to be adapted for more general use before it is placed in the Golden Copy repository.

There are also potential data sources from outside the company which may have very different ETL needs compared to internal data.  For instance, social media data for the marketing department may require a lot more screening than internally sourced data from the CRM application.  There needs to be a process to prepare data for the Golden Copy to make it useful.    

There can be such a thing as being too careful, however.  Remember, more data is often more insightful than better algorithms. A competitive advantage may be to incorporate new sources of data to help your analytics even if there is work to be done before it can go into the Golden Copy.

Multiple Versions of the Truth

Now you have a Golden Copy repository/data lake, and you get your first request for data that is outside the Golden Copy.  Time to talk about changing business needs and multiple versions of the truth. It is going to be necessary to modify the Golden Copy data to accommodate new requests, but consider using the same process you created to validate the Golden Copy data in the first place. Get the group of IT and BU decision makers together to make sure that as the data is created it can be trusted and curated, and then either added to the Golden Copy or used for a one-off need, as appropriate.


Using data as a competitive advantage is a great concept.  Making data a competitive asset is a lot of work.  By putting the right strategy and processes in place you can create a model that will continue to improve over time and create a foundation for a sustainable competitive advantage. As always, data analytic success is best when the desired result is firmly in mind before proceeding.  In many cases, how you source and treat the data will be critical to data analytic success.

Big Data Success Factors

Computerworld has released a study by Computing, a UK company, regarding the state of Big Data in the real world.  Big Data Review 2015 holds not a lot of surprises, but certainly a lot of confirmation that Big Data tools need careful handling to get the right results.  In the early days of Big Data there was a lot of experimentation just to understand the capability of the tools.  There will certainly continue to be plenty of experimentation, given the nature of the process, but success depends on business execution.

Computing's 2015 findings show the rapid evolution of the industry, based on a shift in responses to survey questions from just last year.  The number of survey respondents concerned about the different topics has grown significantly, suggesting that Big Data methodologies are rapidly moving into the mainstream.  Nobody wants to be left behind, of course.  In terms of tools, data warehouses and analytics databases were, broadly speaking, tied with cloud-based Big Data services, and both saw big leaps in the number of respondents considering them.

This should come as no surprise to those practicing the art of Big Data.  You probably spend most of your time importing, moving, scrubbing and preparing data for analysis.  Indeed, just finding the data that is important to your analysis can take quite a bit of work.  Garbage in / garbage out still applies.  This report takes time to understand why projects have been perceived as successful, and isn't limited to just looking at the latest tools.

For instance, 76% of respondents focused on operational data to improve efficiencies, while 24% used Big Data for external opportunities.  Why?  Pragmatic business decision making.  The fastest route to return on investment is to refine and improve operations that you already know.  Operational savings go to the bottom line as increased profit.  New sales only net what the gross margin will allow; improving the gross margin impacts all sales.  One of the frustrations business leaders have with Big Data is having to speculate about, or predict, future events when planning business spending.

The power of predictive analytics is immense.  Descriptive analytics may allow you to refine existing operations with less risk, but prescriptive analytics that model how things could be allow you to move to disruptive capabilities, with a higher risk of failure.  This is traditionally the domain of the entrepreneur: the ability to remake markets and disrupt the status quo.  The larger challenge for existing companies is how they decide to manage risk and failure.

This tension between the business decision maker and the analytics professional has been true from the start, of course.  The difference now is that decision makers have seen how analytics can improve their business, and funding for analytics is increasing based on that success.  The most visible expression of this is the democratization of Big Data with more self-service tools for business professionals.  The counter trend is reluctance by some departments to share data or cooperate with broader Big Data projects.

To be successful with Big Data projects, the survey identifies a number of factors, the top three being: 1) business buy-in, 2) knowing your data, and 3) a core understanding of the business.  The implications of just these three are important.  Business is moving more aggressively into analytics, but with a purpose.  High return-on-investment objectives will keep projects focused, and tend to lean to the operational side where a return is more likely.  Knowing your data is another critical aspect of a Big Data project.  Integrating several sources of data and preparing it for analysis is not trivial and takes a lot of time and effort.  Most survey respondents felt that 80-90% data accuracy is “good enough” for most projects; further improvements may offer diminishing returns without changing the decisions being made.  Finally, this isn't an academic exercise.  The project's success will depend on a deep understanding of the business.  A big part of analytics is deciding what problem you want to solve; a poorly formed premise or problem will lead to unsatisfactory results.

Computing’s 2015 report has some great information in it, and it highlights the changes in the industry just in the last year.  As analytics becomes more mature it will find application in more companies and more projects, and that’s good for Big Data and the economy.

See the report at:

Beyond Correlation: Process Models

Big Data is the hunt for meaning from an ocean of data.  Until tools like Hadoop and NoSQL became available, it wasn't practical to derive much visibility from unstructured data, and certainly not much meaning from social media.  Now with these tools, we can bring order to chaos and look into data more closely.  One belief regarding Big Data analysis is that we don't need to understand the cause of correlations we might find in the data.  Simply by understanding the relationships between factors that become apparent in the data, we can find useful information.  Still, correlation is not causation.  We may not understand why two factors are related, but it is still useful to understand the correlation.

To further our understanding beyond correlation, a step closer to causation can be Process Mining.  Process Mining looks beyond correlation to further refine associations in the data.  Indeed, Wil van der Aalst of Eindhoven University of Technology posits that by looking at a more structured view of data relationships, we can discover processes in the data.  From a Process Mining perspective we can identify processes, identify process bottlenecks and, by looking at what people are actually doing, potentially improve processes.  Finally, we can predict outcomes based on the processes we find by testing with real event data.

Process Mining is different from, but related to, data oriented analysis.  Both approaches start with event data.  The difference is that Process Mining starts with mapping a chain of events to create, refine and test models that fit the data.  You can then use the model suggested by this technique to look for bottlenecks with existing processes.  By testing with actual event data you see the way events are actually occurring, not how they should occur.

To test the relationships suggested by Process Mining, there are three approaches:

  • Play-Out: start from a proposed model and generate the behavior it allows. Describe the work items and the workflow, then walk through all possible iterations to understand the range of processes the model permits.
  • Play-In: start from observed behavior and infer a model. Look at the existing process as captured in event data, and create a model that fits what is actually in place.
  • Replay: with a model created by either method (Play-Out or Play-In), replay actual event-log data on the model so you can assess the strengths and weaknesses of different models. Replay shows deviations from expected results, letting you tune the model accordingly.
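The Replay idea above can be sketched in a few lines. The "model" here is just a set of allowed event sequences, a deliberately simplified stand-in for a real process model such as a Petri net; the event names are invented.

```python
# Toy Replay step: check observed event-log traces against the traces a
# proposed process model allows. A real model would be a Petri net or
# BPMN diagram; here it is simply a set of allowed event sequences.

allowed = {
    ("register", "check", "approve"),
    ("register", "check", "reject"),
}

def replay(event_log, model):
    """Return the traces that deviate from the model."""
    return [trace for trace in event_log if tuple(trace) not in model]

log = [
    ["register", "check", "approve"],
    ["register", "approve"],          # skips the check step
]
deviations = replay(log, allowed)
```

Traces returned by `replay` are exactly the conformance failures: cases where what people actually did departs from what the model says should happen.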

Classic Big Data mining starts with data.  Process Mining also starts with data.  The difference is the search for correlation in data mining, and the search for processes in Process Mining.  Data Mining might produce a simple set of Key Performance Indicators (KPIs).  Dr. van der Aalst would argue that simple KPIs will lead to problems because they are too simplistic.  Understanding how people work together to accomplish a task is more informative.  KPIs may show deviations, but the user may have no idea where a deviation comes from, or how to get the process productive again.

By using Process Mining you can identify bottlenecks or unproductive process steps through conformance checking to make sure what you think is happening, really happens.  Data Mining can be either supervised learning with labeled data or unsupervised learning with unlabeled data. Unsupervised learning often relies on cluster or pattern discovery.  Supervised learning often employs regression to analyze relationships.  There are lots of tools to assist in Data Mining, and most of the attention in the industry has focused on data mining.

A good example of a Process Mining tool is the Decision Tree.  In a decision tree, a response variable is predicted from predictor variables.  Decision trees can therefore be used to predict outcomes based on an expected process flow identified in the tree.  One of the limits of the Decision Tree is deciding how much of a good thing is practical.  Multiple iterations of the tree with actual data (Replay) will allow you to continue to refine the decision branches in the model.  Complex processes can create a very large Decision Tree, so you should decide at the outset how many levels of tree you want to allow (for ease of use) and what stopping criteria you will apply before you stop developing the model.  As with any analysis, consider what you want the result to be before you start the process.
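To make the depth and stopping-criteria discussion concrete, here is a toy depth-limited decision tree. It uses misclassification count as the split criterion, a simple stand-in for the usual Gini or entropy measures, and every name in it is illustrative.

```python
# A toy depth-limited decision tree. Growth stops at max_depth, which
# is the kind of up-front limit the text recommends deciding on before
# you start building the model.

def majority(labels):
    return max(set(labels), key=labels.count)

def build_tree(rows, labels, max_depth):
    if max_depth == 0 or len(set(labels)) == 1:
        return majority(labels)            # leaf: predict majority class
    best = None
    for f in range(len(rows[0])):          # try every feature/threshold
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            lm, rm = majority(left), majority(right)
            err = sum(l != lm for l in left) + sum(l != rm for l in right)
            if best is None or err < best[0]:
                best = (err, f, t)
    if best is None:
        return majority(labels)
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in li], [labels[i] for i in li], max_depth - 1),
            build_tree([rows[i] for i in ri], [labels[i] for i in ri], max_depth - 1))

def predict(node, row):
    while isinstance(node, tuple):         # walk down until a leaf
        f, t, lo, hi = node
        node = lo if row[f] <= t else hi
    return node

tree = build_tree([[1], [2], [8], [9]], ["a", "a", "b", "b"], max_depth=2)
```

Raising `max_depth` lets the tree fit more detail at the cost of size and interpretability, which is exactly the trade-off to settle before model building begins.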

There are free tools for both Data Mining and Process Mining.

The Process Mining website:

Check out RapidMiner for Data Mining at

For Process Mining look at ProM, an open source software tool, at

Enjoy the Process!

Big Data Preparation

Let’s face it, the big data world has a lot of unglamorous heavy lifting.  One of those less glamorous jobs is preparing the data for analysis.  Taking a bunch of unstructured data and creating some structure for further analysis takes some thought, rigorous process, and careful documentation.  Unstructured data lacks the row and column structure, which makes it hard to apply traditional analytic tools to such raw information.  Data preparation provides the structure that makes the data suitable for further analysis. 

In order to assure reproducible results, every step should be documented so that an independent researcher can follow the procedure and obtain the same result.  It is also important to detail the steps taken to prepare the data in the event of subsequent criticism, or to alter the process if the ultimate results are not satisfactory.  Because the results of the downstream analysis depend on assumptions made in the data preparation, the process must be carefully captured.

The raw data may consist of an unorganized, partial, or inconsistent collection of elements (such as free text) for input to an analytic process.  The objective is to create structured data that is suitable for further analysis, and according to Jeff Leek and company of Johns Hopkins it should have four attributes:

  • Each variable in exactly one column
  • Each observation in exactly one row
  • One table for each kind of variable
  • Multiple tables with a column ID to link data between tables

Best practice would also include descriptions at the top of the table with a plain-language description of each variable, not just a short-hand name that is easily forgotten or misidentified.
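A minimal sketch of reshaping a "wide" record into rows that follow these attributes, one variable per column and one observation per row; the column names are invented for illustration.

```python
# Reshape a "wide" record (several observations packed into one row)
# into tidy rows: one variable per column, one observation per row.
# Column names (sensor_id, temp_*) are purely illustrative.

wide = {"sensor_id": "s1", "temp_morning": 20.1, "temp_evening": 18.4}

def tidy(record, id_col, value_prefix):
    rows = []
    for key, value in record.items():
        if key.startswith(value_prefix):
            rows.append({id_col: record[id_col],
                         "time": key[len(value_prefix):],
                         "temp": value})
    return rows

rows = tidy(wide, "sensor_id", "temp_")
```

Each output row is a single observation, and the shared `sensor_id` column is the ID that would link this table to others.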

Good experimental hygiene also includes a code book and an instruction list, again with the intent of creating reproducible results.  The code book can be a text file that includes a short section on study design, with descriptions of each variable and the units in which it is measured.  Summarizing the choices made in data preparation and experimental study design will help future researchers understand your experiment and reproduce it as necessary.  The instruction list is ideally a literal script whose input is the unstructured data and whose output is the structured data.  If every step can't be captured in a computer script, then whatever documentation would allow an independent researcher to reproduce the results should be produced, to let people know exactly how you prepared the data.

Data can be captured from downloaded files.  Excel files are popular with business and science audiences, while XML files are commonly used for collecting web data.  XML is an interesting case: all the data in a file can be processed as XML, but to pull elements out selectively you'll need XPath, a related but different language.  Finally, JSON (JavaScript Object Notation) is another structured file format, somewhat akin to XML but with different structure and commands.  JSON has its own following, is commonly used, and moving data into and out of it is easily supported.
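Both formats can be handled with Python's standard library alone; note that ElementTree's `findall()` accepts only a limited XPath subset for selective extraction. The sample documents below are made up for illustration.

```python
# Pulling the same small data set out of JSON and XML using only the
# standard library. ElementTree's findall() supports a limited XPath
# subset, enough for selective extraction like this.
import json
import xml.etree.ElementTree as ET

json_doc = '{"orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 4.0}]}'
totals_from_json = [o["total"] for o in json.loads(json_doc)["orders"]]

xml_doc = ("<orders><order id='1'><total>9.5</total></order>"
           "<order id='2'><total>4.0</total></order></orders>")
root = ET.fromstring(xml_doc)
totals_from_xml = [float(t.text) for t in root.findall("./order/total")]
```

The same logical records come out of either format; the difference is purely in the extraction syntax.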

In summary, to have repeatable data preparation, you should document the process used, the variables, the structure, the files input, the files output, and the code that processes the raw data into your desired structured data.  Once the data is structured, you can scrub it to get rid of the Darth Vaders and Mickey Mouses.  Then you can start to think about your analysis.

Why Implement an All Flash Data Center?

The argument goes something like this: if flash costs more than disk, why would you spend the money on an all-flash data center?  Some might suggest that you just use flash for intense I/O applications like databases where you can justify the additional expense over disk.

What we see from Violin customers is different.  Not all have gone all-flash, but for those that have, the benefits are many.

All-flash data centers can provide new sources of revenue.  Lower operating costs.  Elimination of slow I/O workarounds.  Improved application response times.  Faster report turnaround.   Simplified operations. Lower capital costs.

As a storage subsystem manufacturer, Violin puts together the best system they can design, but they are constantly being schooled by their customers.  For instance, they had a large telecom customer that was missing some billing opportunities and redesigned its customer accounting software.  When the customer implemented it on their traditional storage system, they didn't see much benefit. They saw that the application wanted even more I/O, and brought in Violin.  As a result they found over $100 million in new revenue.  That paid for the project handsomely, of course.  This is revenue that wasn't available with traditional storage, but is captured due to Violin's low latency.

Another example of how flash storage changes the data center is the impact of low latency on servers and the software that runs on them.  Moving to a Violin All Flash Array speeds I/O so much that the traditional layers of overprovisioning and caching can be eliminated.  The result: better application performance with lower costs.  Customers have also told me they can free up people from this consolidation to redeploy on more productive efforts, since there is no need to manage the overprovisioning and caching infrastructure.

However, not all All Flash solutions are created equal.  SSD solutions are inferior to a backplane-based approach like Violin's Flash Fabric Architecture™.  Consider key operating metrics such as power and floor space.  For instance, 70 raw TB from Violin takes 3RU of space.  Common SSD-based solutions take 12RU (or more) for the same raw capacity. This density also translates into power: the Violin 70TB will take 1500W, while common SSD approaches may take over 3000W for the same capacity.  This translates into operating expense savings.  One customer recently estimated they would save 71% in operating costs with Violin over traditional storage.
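The density and power figures quoted above reduce to simple per-TB arithmetic, sketched here; the numbers are the ones from the text, not independent measurements.

```python
# Per-TB comparison using the figures quoted in the text
# (70 TB raw in 3RU/1500W vs. 12RU/3000W for an SSD-based build).
violin = {"tb": 70, "ru": 3, "watts": 1500}
ssd    = {"tb": 70, "ru": 12, "watts": 3000}

def per_tb(system):
    return {"ru_per_tb": system["ru"] / system["tb"],
            "watts_per_tb": system["watts"] / system["tb"]}
```

At these figures the SSD-based build draws twice the power per TB and takes four times the rack space, which is where the claimed operating-cost savings come from.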

Additionally, the Violin Flash Fabric Architecture provides superior performance, due to the array-wide striping of data and parallel paths for high throughput that holds up under heavy loads.  It also provides for better resiliency, since hot spots are essentially eliminated.  The result is not just a big step up over traditional disk storage, it is a significant improvement over SSD-based arrays.

Customers who have gone all-flash for active data have found they can buy the new storage and server equipment, and still have money left over.  This is in addition to any new sources of revenue realized, such as the Telecom example.  Flash is essentially free.

The last hurdle has been data services.  Some customers who have Violin installed love the performance, but were hesitant to put all their data on it because they wanted the enterprise-level availability features.  Capabilities such as synchronous and asynchronous replication, mirroring and clustering give enterprises a robust tool kit.  They can configure their data centers in a variety of ways to protect against local issues like fire, metro-area problems like hurricanes and typhoons, and regional issues with global replication.  These capabilities now exist in the Concerto 7000 All Flash Array from Violin Memory.  This allows enterprises who want transformative performance to also employ the operational capabilities they need to meet their data center design goals.

The move to the all-flash data center is upon us.

For more information go to

OpenStack for Big Data in the Cloud

Big data imposes two sets of storage requirements.  For much of the data, cost and scale are the primary considerations, with performance secondary.  For real-time analytics, performance and scale are the primary concerns, with cost secondary.

OpenStack has positioned itself as the platform for the Open Cloud and has the potential to impact your Big Data storage issues.  It comes in two flavors: one for block and one for file/object storage.

Block storage, the usual mode for traditional storage area networks, is served by OpenStack's Cinder project. File/object storage, the home of files, video, logs and the like, is served by OpenStack's Swift.

Swift is for objects that aren't used for transactions.  Why?  The data in Swift is eventually consistent, which isn't appropriate for transaction data but is just fine for much of the static data found in big data: photos, video, log data, machine data, social media feeds, backups, archives and so on.  Readers might recognize a previous Big Data Perspectives blog discussing the differences between consistency models and their appropriate applications: a key/value pair might be a good fit for eventual consistency, but your bank records should be consistent with ACID compatibility.  One potential issue is the need to change applications, because this is a new approach.  Legacy applications can work with Swift through a gateway, but applications have to be REST API compatible.  REST (representational state transfer) is a way of making services broadly accessible via HTTP-style commands.  Riverbed is an example of a Swift implementation.  Without a traditional hierarchical structure in place, Swift does provide for unlimited scalability, but with uncertain performance.  The focus is on commodity hardware and open source software to keep the cost of storage low.
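A toy sketch of what eventual consistency means in practice: a write is acknowledged by one replica and only becomes visible everywhere after an asynchronous sync step. This is a deliberate simplification for illustration, not a model of Swift's actual replication machinery.

```python
# Toy eventual-consistency illustration: a write lands on one replica
# first and propagates to the others later. Fine for photos and logs;
# not acceptable for bank balances.

class ReplicaSet:
    def __init__(self, n):
        self.replicas = [{} for _ in range(n)]

    def put(self, key, value):
        self.replicas[0][key] = value      # acknowledged immediately

    def get(self, key, replica):
        return self.replicas[replica].get(key)

    def sync(self):
        for r in self.replicas[1:]:        # background propagation
            r.update(self.replicas[0])

store = ReplicaSet(3)
store.put("photo-42", "v1")
stale = store.get("photo-42", 2)   # replica 2 hasn't seen the write yet
store.sync()
fresh = store.get("photo-42", 2)   # visible after propagation
```

The window between `put` and `sync` is the window in which different readers can see different answers, which is exactly why transactional data needs ACID semantics instead.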

Cinder is for block data that could be attached to your SAN and could include transaction data in the cloud.  Where performance is more important, or for transactional and database requirements, Cinder is the more appropriate choice.  It has big-time supporters such as IBM and NetApp.  You can understand the major storage vendors' dilemma, however: the whole focus of OpenStack is to use its software and commodity hardware to bring down the cost of storage.  The vendors provide API compatibility to allow their proprietary systems to communicate with an OpenStack node.

There might be a way to bring the worlds of proprietary and open together to get the best of both.  By using the proprietary systems for ACID-related data (typically transactions, databases, CRM, ERP and real-time analytics) and OpenStack for less critical data, there is a way to put value where it is recognized, and commodity where it is not.


Busy and Lazy Data

Big Data files seem to come in two flavors: busy and lazy.  This is due to the nature of the data that we generate, and its usage.  Some data will be intensely used while it is fresh, and then relegated to a data parking lot for lazy data.  Other data, particularly unstructured data, will be used less often because the usefulness of the data is in identifying trends or outliers and may be processed in batches, instead of in an interactive manner. 

Busy data benefits from more aggressive management and higher-performing systems, on the premise that busy data is more valuable and more time-sensitive.  There is a great study on the business value of latency for managing busy data from the 2009 O'Reilly Velocity Conference in San Jose, and it reinforces the business need to provide a low-latency environment for busy data to get the best result.  The video referenced discusses how Microsoft's Bing and Google independently came to a very similar conclusion: latency matters.  It was previously thought that people couldn't perceive differences under 200 ms.  These two studies show that people's behavior responds to faster response (lower latency) until system response times fall below 50 ms.  This means your system design should deliver consistent end-user response times at about 50 ms.  The typical slowest element in system design is the electro-mechanical disk drive, and this would indicate that it is time to go to an all-flash array architecture, like those from Violin Memory and others.  The business value is surprising: going from 1,000 ms to 50 ms was found to leave users more engaged and spending 2.8% more money.  If your business has anything to do with ecommerce, or just getting more productivity from high-value employees, you ought to be looking at 50 ms system response times.
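One practical way to apply the 50 ms target is to check a high percentile of measured response times rather than the mean, so slow outliers aren't averaged away. A minimal sketch with made-up sample data:

```python
# Check measured response times against a latency target using the
# 95th percentile, so a handful of slow outliers isn't hidden by the
# mean. The sample values below are invented.

def p95(samples_ms):
    ordered = sorted(samples_ms)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

samples = [22, 31, 35, 40, 41, 44, 46, 48, 49, 180]
meets_target = p95(samples) <= 50
```

Here the mean is comfortably under 50 ms, but the single 180 ms outlier pushes the 95th percentile over the target, which is the kind of tail behavior a busy-data system design has to engineer out.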

Lazy data is a different story, of course.  There is an excellent paper from the USENIX association, a storage geek group, that bears review: "Measurement and Analysis of Large-Scale Network File System Workloads" by Andrew Leung et al., from USENIX 2008, provides some color on how files are used and how they behave over time.  As the world of Big Data takes hold, the usage patterns for files are changing.  Read-to-write ratios have decreased, read-write access patterns have increased, the authors see longer sequential runs, most bytes transferred are from larger files, and file sizes are larger.  Files live longer, with fewer than 50% deleted within a day of creation.  Files are rarely reopened; if they are, it is usually within a minute.  The authors also saw the noisy-neighbor effect, with fewer than 1% of clients accounting for 50% of file requests.  76% of files are opened by only one client.  File sharing only happens for 5% of the files, and 90% of the sharing is read-only.  Interestingly, most file types do not have a common access pattern.  It sounds like disk might be just fine for lazy data.

The conclusion is that busy data needs a system focused on performance, and that performance must be extreme by previous standards: down to 50 ms system response time.  The move to an all-flash storage environment and the elimination of hard disk drives for busy data is indicated.  The 50 ms system response time is difficult for online situations where network latency is highly variable, but it does provide an indicator of why Google is spending money to put whole communities online.  The other conclusion is that lazy data might be a good fit for lower-cost storage, since the intensity of usage is less and the primary goal should be cost effectiveness.  Disk storage, with its sequential performance strength, is a good fit for lazy data.  To keep costs down you might want to consider a RAID alternative such as SWARM or erasure coding.

SWARM Intelligence and Big Data

Swarm technology may be thought of as insect logic.  Ants behave in the colony’s interest, but without specific guidance.  How can an animal with a brain the size of a grain of sand create a nest, find food, support a queen and expand to new areas?  Each ant is an individual point of simple intelligence, operating on rules shared by all ants.  Everything that the ant colony needs to get done, gets done.  Simplicity also extends to communication: using pheromones, ants communicate without personal contact, reading the chemicals left behind by other ants.  Ants are great at parallel operations, since there is no single point of control for all actions; yet only one ant, the queen, is empowered to reproduce.  In this way the colony is controlled, and the needs of the colony are met.

Think of a Big Data problem.  The MapReduce architecture creates multiple threads.  Simple key/value logic is applied and then shuffled to create a reduced data output that has been intelligently organized.  Each of the MapReduce nodes has simple intelligence to perform key/value matching.  But in the world of swarm intelligence, this could be done by a multitude of agents, not just a few processors in a batch job.  What if you could release multiple agents into your database of unstructured data to look for anomalies, weed out spurious data or corrupted files, organize data by individual attributes, or identify alternate routes when there are networking problems?  Perhaps swarm intelligence could comb Facebook data to find the next mentally unstable serial killer before they strike.  The beauty is that these agents can be programmed with different simple logic, keeping the cost low and the performance high.  The simple intelligence might be programmed to include file type, age, data protection profile, source, encryption, etc.
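The map/shuffle/reduce cycle described above can be sketched in a few lines of plain Python; a word count stands in for the key/value matching each node performs:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Each mapper applies simple key/value logic to one record.
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    # Group the intermediate pairs by key, as the shuffle step does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer collapses one key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

records = ["busy data", "lazy data", "busy busy"]
mapped = chain.from_iterable(map_phase(r) for r in records)
counts = reduce_phase(shuffle(mapped))
# counts == {"busy": 3, "data": 2, "lazy": 1}
```

In a real cluster, the map and reduce functions run on many nodes in parallel; the swarm idea pushes this further by making each agent's rule independent and mobile rather than part of one batch job.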

One premise is that swarm logic makes file structures unnecessary.  All information about the data is included in the tokens (the pheromones).  It would allow the integration of structured and unstructured data.  In short, swarm can change everything.  I won’t go so far as to say all file structures can go away.  ACID properties for data coherency are important for tasks that have one correct answer, like what was my profit last year, or how many widgets are in the warehouse.  For jobs like this, structured data and relational databases might be best.  But for many analysis jobs, swarm may be a great improvement.

Swarm provides something that traditional data structures don’t: file intelligence.  In today’s structures we have only limited space for a rigid, inflexible set of metadata.  With swarm we can add more intelligence into the data using tokens that combine intelligence and information, and let them loose on our unstructured data to find organization, or to sort or otherwise manipulate the data.
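As a hedged illustration of the idea, the sketch below models agents as simple rules that deposit tokens (the pheromones) on file records; the records, attribute names, and rules are all invented for the example:

```python
# Hypothetical file records; every attribute name here is illustrative only.
files = [
    {"name": "log1.txt", "type": "text",  "age_days": 400, "encrypted": False},
    {"name": "scan.jpg", "type": "image", "age_days": 2,   "encrypted": False},
    {"name": "pay.db",   "type": "db",    "age_days": 30,  "encrypted": True},
]

# Each "agent" is one simple rule plus the token it deposits when it matches.
agents = [
    (lambda f: f["age_days"] > 365,  "stale"),
    (lambda f: not f["encrypted"],   "needs-encryption-review"),
    (lambda f: f["type"] == "image", "media"),
]

# Release the agents: each file accumulates the tokens of matching rules.
for f in files:
    f["tokens"] = [token for rule, token in agents if rule(f)]
```

After the pass, `log1.txt` carries the `stale` and `needs-encryption-review` tokens while `pay.db` carries none; adding a new behavior to the swarm is just appending another one-line rule.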

Bridging the Gap Between Structured and Unstructured Data

Relational databases like structured data: tables of columns and rows in a defined schema, so everyone knows what to expect in every place.  Unstructured data, such as free-form text or raw numeric values, is a different story.  Hadoop certainly fills a need, using key/value pairs to create some structure where there is none.  To get the two worlds of structured and unstructured data to work together, there is usually a bridge of some sort.  The relational database may allow an import of key/value data so it can be incorporated into the relational schema, although usually with some work.  This handshaking between relational and key/value databases is limping along and is workable.  Under heavy loads, the networking and performance impact of moving massive quantities of data around can be taxing.  Is there a better way?

To make some sense out of unstructured data some sort of framework needs to be overlaid on the raw data to make it more like information.  This is the reason that Hadoop and similar tools are iterative.  You’re hunting for logic in randomness.  You keep looking and trying different things till something looks like a pattern. 

Besides unstructured and structured data is the in-between land of semi-structured data.  This refers to data that has some beginnings of structure.  It doesn’t have the formal discipline of the rows and columns of structured data.  It is usually schema-less or self-describing, with tags or something similar that provide a starting point for structure.  Examples of semi-structured data might include emails, XML, and similar entities that are grouped together.  Pieces can be missing, or the size and type of attributes might not be consistent, so it represents an imperfect structure, but not an entirely random one.  Hence the in-between land of semi-structured data.
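For illustration, here is a small, made-up example of what semi-structured records look like and how code has to tolerate their gaps; the field names and addresses are invented:

```python
import json

# Two email-like records: the same general shape, but fields can be
# missing or inconsistently typed -- the hallmark of semi-structured data.
raw = '''[
  {"from": "ann@example.com", "subject": "Q3 numbers", "attachments": 2},
  {"from": "bob@example.com", "body": "see attached", "attachments": "none"}
]'''

# .get() tolerates the missing pieces instead of failing on them.
subjects = [(m["from"], m.get("subject", "(no subject)"))
            for m in json.loads(raw)]
```

Note that `attachments` is a number in one record and a string in the other: the self-describing tags give you a starting point for structure, but nothing enforces consistency.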

Hadapt takes advantage of this semi-structured data with a data exchange tool that creates a structured data construct.  They use JSON as the exchange format.  This is rather clever, since JSON (JavaScript Object Notation) is a well-known data format derived from JavaScript syntax.  JSON is very good at data exchange, which can solve the semi-structured to structured problem.  By starting with semi-structured data, they get a head start on structure.  JSON is particularly well suited to key/value pairs and ordered arrays.

The semi-structured data is parsed as JSON, producing an array of data that can then be manipulated with SQL commands to complete the cycle.  Once there is a structure in place, SQL is quite comfortable with the further manipulation.  After all, it is the structured query language.  The most sophisticated tools are in the relational world, hence most efforts to make sense of unstructured or semi-structured data are to add more structure, allowing more analysis and reporting.  And after a few steps you indeed can get order out of chaos.
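Hadapt’s actual pipeline is its own, but the general pattern, parse JSON, flatten it into columns, then hand the rest to SQL, can be sketched in a few lines with Python’s standard library; the records and schema here are invented:

```python
import json
import sqlite3

# Semi-structured input: JSON records whose fields may vary per record.
raw = '''[
  {"sku": "A1", "price": 9.50, "tags": ["sale"]},
  {"sku": "B2", "price": 4.25},
  {"sku": "C3", "price": 12.00, "tags": ["new", "sale"]}
]'''

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (sku TEXT, price REAL, tag_count INTEGER)")

# Parse the JSON, flatten each record into columns, and load the table.
for rec in json.loads(raw):
    conn.execute("INSERT INTO items VALUES (?, ?, ?)",
                 (rec["sku"], rec["price"], len(rec.get("tags", []))))

# Once structure exists, plain SQL takes over.
total = conn.execute(
    "SELECT SUM(price) FROM items WHERE tag_count > 0").fetchone()[0]
# total == 21.5
```

The flattening step is where the judgment lives (here, a missing `tags` field becomes a count of zero); everything after that is ordinary relational work.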