Bridging the Gap Between Structured and Unstructured Data

Relational databases like structured data: tables of rows and columns in a defined schema, so everyone knows what to expect in every place.  Unstructured data, such as free-form text or loose numeric values, is a different story.  Hadoop fills a need here by using key-value pairs to impose some structure where there was none.  To get the two worlds of structured and unstructured data to work together, there is usually a bridge of some sort.  The relational database may allow an import of key-value data so it can be incorporated into the relational schema, although usually with some work.  This handshaking between relational and key-value stores limps along, but it is workable.  Under heavy loads, though, moving massive quantities of data across the network takes a real toll on performance.  Is there a better way?

To make sense of unstructured data, some sort of framework needs to be overlaid on the raw data to turn it into something more like information.  This is why working with Hadoop and similar tools is iterative.  You’re hunting for logic in randomness.  You keep looking and trying different things until something looks like a pattern.
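To make the idea of overlaying key-value structure on raw data concrete, here is a minimal sketch in plain Python.  It imitates the map and reduce steps but does not use Hadoop's actual API, and the log lines and the function names map_phase and reduce_phase are made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw input: free-form log lines with no declared schema.
raw_lines = [
    "login failed for user alice",
    "login ok for user bob",
    "login failed for user carol",
]


def map_phase(line):
    """Emit (key, value) pairs; here every word becomes a key with the value 1."""
    for word in line.lower().split():
        yield word, 1


def reduce_phase(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = [pair for line in raw_lines for pair in map_phase(line)]
counts = reduce_phase(pairs)

# One pass of the hunt: which words recur often enough to hint at a pattern?
frequent = {word: n for word, n in counts.items() if n >= 2}
print(frequent)
```

If the result shows nothing interesting, you change the map step (split on something else, pair up adjacent words, and so on) and run it again.  That loop is the iteration described above.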

Between unstructured and structured data lies the in-between land of semi-structured data.  This is data that has the beginnings of structure but lacks the formal discipline of the rows and columns of structured data.  It is usually schema-less or self-describing, in that it carries tags or similar markers that provide a starting point for structure.  Examples of semi-structured data include email, XML, and similar tagged formats.  Pieces can be missing, and the size and type of attributes might not be consistent, so the structure is imperfect but not entirely random.  Hence the in-between land of semi-structured data.
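A small sketch shows what "imperfect but not random" can look like in practice.  The email-like records below are invented for illustration; the point is only that the tags give you enough to profile which fields exist and how consistent they are.

```python
# Hypothetical email-like records: self-describing, but not uniform.
records = [
    {"from": "alice@example.com", "subject": "Q3 report", "size": 2048},
    {"from": "bob@example.com", "size": "3 KB"},  # no subject; size is a string
    {"from": "carol@example.com", "subject": "Hello", "cc": ["dave@example.com"]},
]

# The tags themselves are the starting point for a schema.
all_fields = sorted({field for record in records for field in record})

# For each field, count how often it appears and which value types it holds.
profile = {}
for field in all_fields:
    values = [record[field] for record in records if field in record]
    profile[field] = {
        "present": len(values),
        "types": sorted({type(value).__name__ for value in values}),
    }

for field, info in profile.items():
    print(field, info)
```

A field that is present in every record with a single type is an easy candidate for a column.  A field like size here, which shows up as both a number and a string, is where the "some work" comes in.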

Hadapt takes advantage of semi-structured data, using a data exchange format to build a structured data construct.  They use JSON as that exchange format.  This is rather clever, since JSON is derived from JavaScript's object syntax and is fairly well known.  JSON is very good at data exchange, which helps solve the semi-structured-to-structured problem.  By starting with semi-structured data, they get a head start on structure.  JSON is particularly well suited to key-value pairs and ordered arrays.
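Those two building blocks, key-value pairs and ordered arrays, map directly onto native types in most languages.  A short sketch using Python's standard json module, with a made-up document:

```python
import json

# A made-up JSON document: an object (key-value pairs) holding an array.
text = '{"user": "alice", "tags": ["admin", "ops"], "logins": 3}'

doc = json.loads(text)

# JSON objects become dicts and JSON arrays become lists, with order preserved.
print(type(doc).__name__, type(doc["tags"]).__name__)
print(doc["tags"][0])

# Serializing and parsing again gives back the same data, which is what
# makes JSON convenient as an exchange format.
round_trip = json.loads(json.dumps(doc))
print(round_trip == doc)
```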

The semi-structured data must first be parsed as JSON, which yields an array of data that can then be manipulated with SQL commands to complete the cycle.  Once there is a structure in place, SQL is quite comfortable with further manipulation.  After all, it is the Structured Query Language.  The most sophisticated tools are in the relational world, hence most efforts to make sense of unstructured or semi-structured data aim to add more structure to allow more analysis and reporting.  And after a few steps you can indeed get order out of chaos.
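The whole cycle can be sketched in a few lines.  This is not Hadapt's implementation; it is a stand-in that uses Python's json module and an in-memory SQLite database, and the events table, its columns, and the sample records are assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical semi-structured input: one JSON object per line, and the
# second record has no "ms" field.
json_lines = [
    '{"account": "alice", "event": "login", "ms": 120}',
    '{"account": "bob", "event": "login"}',
    '{"account": "alice", "event": "search", "ms": 340}',
]

# Step 1: parse the JSON and flatten each object into a fixed column order.
# dict.get returns None for a missing key, which SQLite stores as NULL.
columns = ["account", "event", "ms"]
rows = [tuple(json.loads(line).get(col) for col in columns) for line in json_lines]

# Step 2: load the rows into a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (account TEXT, event TEXT, ms INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Step 3: from here on it is ordinary SQL.
result = conn.execute(
    "SELECT account, COUNT(*), AVG(ms) FROM events "
    "GROUP BY account ORDER BY account"
).fetchall()
print(result)

conn.close()
```

The missing field does not break anything: it lands as a NULL, and AVG skips it.  That is the payoff of adding structure first, since the relational side already has well-defined answers for gaps like this.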