Beyond Correlation: Process Models

Big Data is the hunt for meaning in an ocean of data.  Until tools like Hadoop and NoSQL databases became available, it wasn’t practical to derive much visibility from unstructured data, and certainly not much meaning from social media.  With these tools, we can bring order to chaos and look into data more closely.  One belief regarding Big Data analysis is that we don’t need to understand the cause of the correlations we find in data.  Simply by understanding the relationships between factors that become apparent in the data, we can find useful information.  Still, correlation is not causation.  We may not understand why two factors are related, but knowing that they are related is still useful.

To further our understanding beyond correlation, Process Mining can take us a step closer to causation.  Process Mining looks beyond correlation to further refine associations in the data.  Indeed, Wil van der Aalst of Eindhoven University of Technology posits that by taking a more structured view of data relationships, we can discover processes in the data.  From a Process Mining perspective we can identify processes, find process bottlenecks and, by looking at what people are actually doing, potentially improve those processes.  Finally, we can predict outcomes based on the processes we find by testing them with real event data.

Process Mining is different from, but related to, data-oriented analysis.  Both approaches start with event data.  The difference is that Process Mining starts by mapping chains of events in order to create, refine and test models that fit the data.  You can then use the model suggested by this technique to look for bottlenecks in existing processes.  By testing with actual event data you see the way events are actually occurring, not how they are supposed to occur.
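As a minimal sketch of what mapping a chain of events can look like, the Python below groups an event log by case and counts how often one activity directly follows another.  The log, the activity names and the function names are all invented for illustration; they are assumptions, not taken from any particular Process Mining tool.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case_id, activity, timestamp) rows, invented for illustration.
EVENT_LOG = [
    ("order-1", "receive", 1), ("order-1", "check", 2), ("order-1", "ship", 3),
    ("order-2", "receive", 1), ("order-2", "check", 4), ("order-2", "reject", 5),
    ("order-3", "receive", 2), ("order-3", "check", 3), ("order-3", "ship", 6),
]

def traces_by_case(event_log):
    """Group events by case and order them by timestamp to get one trace per case."""
    cases = defaultdict(list)
    for case_id, activity, timestamp in event_log:
        cases[case_id].append((timestamp, activity))
    return {case: [act for _, act in sorted(events)] for case, events in cases.items()}

def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in traces.values():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

traces = traces_by_case(EVENT_LOG)
dfg = directly_follows(traces)
for (a, b), n in sorted(dfg.items()):
    print(f"{a} -> {b}: {n}")
```

The resulting counts are a very rough first model of the process: each pair is a step that was actually observed, and the count shows how common that step is.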

There are three approaches to testing the relationships suggested by Process Mining:

  • Play-Out: Start from a proposed model, that is, a model based on what you want to be in place. Describe the work items and the workflow, then look at all the possible paths through the model to understand the range of behavior it allows.
  • Play-In: Start with what’s already there. Look at the events recorded for the existing process and infer a model from them.
  • Replay: However the model was created (proposed by hand or inferred by Play-In), replay actual event log data on it so you can assess the strengths and weaknesses of the different models. By testing a model with Replay, you can show deviations from expected results, and tune the model accordingly.
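The three approaches above can be sketched in a few lines of Python.  This is only an illustration under simplifying assumptions: the model is a plain, acyclic table of allowed next steps, and the activity names, the START and END markers and the function names are all hypothetical.

```python
# Hypothetical process model: each activity maps to the activities allowed next.
MODEL = {
    "START": ["receive"],
    "receive": ["check"],
    "check": ["ship", "reject"],
    "ship": ["END"],
    "reject": ["END"],
}

def play_out(model, node="START", prefix=()):
    """Play-Out: enumerate every trace the (acyclic) model can generate."""
    if node == "END":
        return [list(prefix)]
    result = []
    for nxt in model[node]:
        step = prefix if nxt == "END" else prefix + (nxt,)
        result.extend(play_out(model, nxt, step))
    return result

def play_in(traces):
    """Play-In: infer a model that allows exactly the moves seen in the traces."""
    model = {}
    for trace in traces:
        path = ["START", *trace, "END"]
        for a, b in zip(path, path[1:]):
            if b not in model.setdefault(a, []):
                model[a].append(b)
    return model

def replay(model, trace):
    """Replay: walk one observed trace through the model, listing moves it does not allow."""
    deviations = []
    current = "START"
    for activity in trace:
        if activity not in model.get(current, []):
            deviations.append((current, activity))
        current = activity
    if "END" not in model.get(current, []):
        deviations.append((current, "END"))
    return deviations

observed = {
    "order-1": ["receive", "check", "ship"],
    "order-2": ["receive", "ship"],  # skips the check step
}
for case, trace in observed.items():
    print(case, replay(MODEL, trace))
```

Here the replay of order-2 reports the move from receive straight to ship as a deviation, which is the kind of finding you would use either to correct the process or to tune the model.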

Classic Big Data mining starts with data.  Process Mining also starts with data.  The difference is that data mining searches for correlations, while Process Mining searches for processes.  Data Mining might produce a simple set of Key Performance Indicators (KPIs).  Dr. van der Aalst would argue that relying on simple KPIs leads to problems because they are too simplistic.  Understanding how people work together to accomplish a task is more informative.  A KPI may show a deviation, but the user may have no idea where the deviation comes from, or how to get the process productive again.
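To make the contrast concrete, here is a small Python sketch that computes one overall KPI (average case duration) and then the average wait between each pair of consecutive steps.  The event log and its hours are made up for illustration, and the function names are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical event log: (case_id, activity, hour at which the activity completed).
EVENTS = [
    ("c1", "receive", 0), ("c1", "check", 2), ("c1", "approve", 30), ("c1", "ship", 32),
    ("c2", "receive", 1), ("c2", "check", 2), ("c2", "approve", 50), ("c2", "ship", 53),
    ("c3", "receive", 3), ("c3", "check", 6), ("c3", "approve", 40), ("c3", "ship", 41),
]

def case_durations(events):
    """Overall KPI input: hours from the first to the last event of each case."""
    times = defaultdict(list)
    for case, _, hour in events:
        times[case].append(hour)
    return {case: max(hours) - min(hours) for case, hours in times.items()}

def step_waits(events):
    """Mean hours between each pair of consecutive activities, across all cases."""
    by_case = defaultdict(list)
    for case, activity, hour in events:
        by_case[case].append((hour, activity))
    waits = defaultdict(list)
    for steps in by_case.values():
        steps.sort()
        for (t0, a), (t1, b) in zip(steps, steps[1:]):
            waits[(a, b)].append(t1 - t0)
    return {pair: mean(w) for pair, w in waits.items()}

kpi = mean(case_durations(EVENTS).values())
waits = step_waits(EVENTS)
bottleneck = max(waits, key=waits.get)
print(f"average case duration: {kpi:.1f} hours")
print(f"slowest step: {bottleneck[0]} -> {bottleneck[1]}")
```

The single KPI says only that cases take a long time.  The per-step view shows that, in this invented log, nearly all of that time is spent waiting between check and approve.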

By using Process Mining you can identify bottlenecks or unproductive process steps, and use conformance checking to make sure that what you think is happening really happens.

Data Mining can be either supervised learning with labeled data or unsupervised learning with unlabeled data.  Unsupervised learning often relies on cluster or pattern discovery.  Supervised learning often employs regression or classification to analyze relationships.  There are lots of tools to assist in Data Mining, and most of the attention in the industry has focused on data mining.

Decision trees are a good example of a technique that Process Mining can use.  In a decision tree analysis, a response variable is predicted from predictor variables.  Decision trees can therefore be used to predict outcomes based on the expected process flow captured in the tree.  One of the challenges with a decision tree is deciding how much of a good thing is practical.  Multiple iterations of the decision tree with actual data (Replay) allow you to keep refining its decision branches, and complex processes can produce a large tree.  Before you start, you might want to decide how many levels of tree you will allow (for ease of use) or what kind of identity ratios you want to reach before you stop developing the model.  In other words, decide at the outset what level of success is reasonable and what you want the result to be.
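The sketch below shows one way such stopping rules might look in code.  It is a small hand-rolled decision tree in Python, not the implementation of any particular tool.  The cases, the attribute names and the parameters max_depth and min_purity are all assumptions made for illustration; min_purity is one possible reading of the ratio mentioned above, namely the share of cases in a branch that must have the same outcome before the branch stops growing.

```python
from collections import Counter

# Hypothetical cases: (attributes observed during the process, outcome).
CASES = [
    ({"amount": 100, "express": 0}, "approved"),
    ({"amount": 250, "express": 0}, "approved"),
    ({"amount": 900, "express": 0}, "rejected"),
    ({"amount": 1200, "express": 1}, "rejected"),
    ({"amount": 300, "express": 1}, "approved"),
    ({"amount": 800, "express": 1}, "rejected"),
]

def gini(rows):
    """Gini impurity of the outcome labels in rows (0.0 means all the same)."""
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())

def best_split(rows):
    """Return (weighted impurity, feature, threshold) of the best binary split, or None."""
    best = None
    for feature in rows[0][0]:
        for threshold in sorted({x[feature] for x, _ in rows}):
            left = [r for r in rows if r[0][feature] <= threshold]
            right = [r for r in rows if r[0][feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build(rows, max_depth, min_purity=1.0):
    """Grow a tree, stopping at max_depth or when a node is pure enough."""
    label, count = Counter(l for _, l in rows).most_common(1)[0]
    split = best_split(rows)
    if max_depth == 0 or count / len(rows) >= min_purity or split is None:
        return label  # leaf: predict the majority outcome
    _, feature, threshold = split
    left = [r for r in rows if r[0][feature] <= threshold]
    right = [r for r in rows if r[0][feature] > threshold]
    return (feature, threshold,
            build(left, max_depth - 1, min_purity),
            build(right, max_depth - 1, min_purity))

def predict(tree, x):
    """Follow the branches until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree

tree = build(CASES, max_depth=2)
print(tree)
print(predict(tree, {"amount": 500, "express": 0}))
```

Lowering max_depth or min_purity gives a smaller, easier to read tree at the cost of some accuracy, which is the trade-off to settle before you start.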

There are free tools for both Data Mining and Process Mining.

The Process Mining website:

Check out RapidMiner for Data Mining at

For Process Mining look at ProM, an open source software tool at

Enjoy the Process!


About Big Data Perspectives

Erik Ottem has over 25 years of technology experience with IBM, Seagate, Gadzoox Networks and Agilent. Observations and comments about Big Data are presented for your review.