Big Data Preparation

Let’s face it, the big data world has a lot of unglamorous heavy lifting. One of those less glamorous jobs is preparing the data for analysis. Taking a pile of unstructured data and giving it some structure takes thought, a rigorous process, and careful documentation. Unstructured data lacks a row-and-column structure, which makes it hard to apply traditional analytic tools to the raw information. Data preparation provides the structure that makes the data suitable for further analysis.

To ensure reproducible results, every step should be documented so that an independent researcher can follow the procedure and obtain the same result. A detailed record of the preparation steps is also important in the event of subsequent criticism, or if the process has to be altered because the ultimate results are not satisfactory. Because the results of the downstream analysis depend on assumptions made in the data preparation, the process must be carefully captured.

The raw data may be an unorganized, partial, or inconsistent collection of elements (free text, for example) that serves as input to an analytic process. The objective is to create structured data that is suitable for further analysis. According to Jeff Leek and colleagues at Johns Hopkins, such data should have four attributes:

  • Each variable in exactly one column
  • Each observation of a variable in its own row
  • One table for each kind of variable
  • When there are multiple tables, an ID column that links data between them
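
As a rough illustration of the last attribute, here is a minimal Python sketch that links two small tables through a shared ID column. The table contents, the column names (patient_id, systolic_mmhg) and the join_on helper are all invented for this example; they do not come from Leek's material.

```python
# Hypothetical example data: two tidy tables, one per kind of variable.
# Every column holds one variable and every row holds one observation.
patients = [
    {"patient_id": 1, "name": "A. Smith", "birth_year": 1970},
    {"patient_id": 2, "name": "B. Jones", "birth_year": 1985},
]
visits = [
    {"visit_id": 10, "patient_id": 1, "systolic_mmhg": 128},
    {"visit_id": 11, "patient_id": 1, "systolic_mmhg": 131},
    {"visit_id": 12, "patient_id": 2, "systolic_mmhg": 117},
]


def join_on(left, right, key):
    """Attach the matching `left` row to every `right` row via the `key` column."""
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right]


# The shared patient_id column is what lets the two tables be recombined.
joined = join_on(patients, visits, "patient_id")
for row in joined:
    print(row["visit_id"], row["name"], row["systolic_mmhg"])
```

Keeping patients and visits in separate tables avoids repeating the name and birth year on every visit row, and the ID column makes the link between them explicit.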

Best practice would also include a plain-language description of each variable at the top of the table, not just a shorthand name that is easily forgotten or misidentified.

Good experimental hygiene also includes a code book and an instruction list, again with the intent of creating reproducible results. The code book can be a text file that includes a short section on study design, along with a description of each variable and the units it is measured in. Summarizing the choices made in data preparation and experimental study design will help future researchers understand your experiment and reproduce it as necessary. The instruction list would ideally be a literal computer script whose input is the unstructured data and whose output is the structured data. If every step can’t be captured in the computer script, then whatever documentation would allow an independent researcher to reproduce the results should be produced, so that people know exactly how you prepared the data.
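
To make the idea of a scripted instruction list concrete, here is a minimal Python sketch that takes a few messy raw lines and produces structured rows, with a small code book kept alongside. The raw lines, the variable names and the units are made up for illustration; a real code book would also describe the study design.

```python
import csv
import io

# Hypothetical raw input: hand-typed lines with inconsistent spacing and case.
RAW_LINES = [
    "Alice , 34, 5.5 ft",
    "bob,41 ,6.0 ft",
    "  Carol,29,5.2 ft ",
]

# Code book: a plain-language description and unit for every variable.
CODE_BOOK = {
    "name": "Respondent first name, title-cased (text)",
    "age_years": "Age at time of survey (whole years)",
    "height_ft": "Self-reported height (decimal feet)",
}


def prepare(lines):
    """Turn each raw line into one row with one variable per column."""
    rows = []
    for line in lines:
        name, age, height = (part.strip() for part in line.split(","))
        rows.append({
            "name": name.title(),
            "age_years": int(age),
            "height_ft": float(height.replace("ft", "").strip()),
        })
    return rows


def to_csv(rows):
    """Serialize the structured rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(CODE_BOOK))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = prepare(RAW_LINES)
csv_text = to_csv(rows)
print(csv_text)
for variable, description in CODE_BOOK.items():
    print(f"{variable}: {description}")
```

Because every cleaning decision (trimming spaces, title-casing names, dropping the unit suffix) lives in the script, rerunning it on the same raw lines yields the same table.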

Data can be captured from downloaded files. Excel files are popular with business and science audiences, while XML files are commonly used for data collected from the web. XML is an interesting case: a parser can read everything in an XML file, but to pull elements out selectively you’ll need XPath, a query language that is related to XML but distinct from it. Finally, JSON (JavaScript Object Notation) is another structured file format, somewhat akin to XML but with an entirely different syntax. JSON has its own following and is commonly used, and moving data into and out of JSON is well supported.
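
Here is a small sketch of both ideas using only Python's standard library. The catalog document is invented for the example, and note that xml.etree.ElementTree supports only a limited subset of XPath, so a fuller XPath engine may be needed for more complex queries.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML document, made up for this example.
XML_TEXT = """
<catalog>
  <book id="b1" genre="data"><title>Tidy Tables</title><price>30</price></book>
  <book id="b2" genre="fiction"><title>Night Train</title><price>12</price></book>
  <book id="b3" genre="data"><title>Clean Inputs</title><price>25</price></book>
</catalog>
"""

root = ET.fromstring(XML_TEXT.strip())

# XPath-style selection: titles of books whose genre attribute is "data".
data_titles = [t.text for t in root.findall("./book[@genre='data']/title")]

# Flatten every <book> element into one row with one variable per column.
records = [
    {
        "id": book.get("id"),
        "title": book.findtext("title"),
        "price": float(book.findtext("price")),
    }
    for book in root.findall("book")
]

# Move the same rows into JSON text and back out again.
json_text = json.dumps(records, indent=2)
round_tripped = json.loads(json_text)

print(data_titles)
print(json_text)
```

The path expression does the selective extraction, while json.dumps and json.loads handle the trip into and out of JSON.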

In summary, to make data preparation repeatable, you should document the process used, the variables, the structure, the input files, the output files, and the code that turns the raw data into your desired structured data. Once the data is structured, you can scrub it to get rid of the Darth Vaders and Mickey Mouses. Then you can start to think about your analysis.