Big Data Success Factors

Computerworld has released a study by Computing, a UK company, regarding the state of Big Data in the real world.  Big Data Review 2015 holds few surprises, but plenty of confirmation that Big Data tools need careful handling to get the right results.  In the early days of Big Data there was a lot of experimentation just to understand the capability of the tools.  There will certainly continue to be plenty of experimentation given the nature of the process, but success depends on business execution.

Computing’s 2015 findings show the rapid evolution of the industry, based on a shift in survey responses from just last year.  The number of respondents concerned about these topics has grown significantly, suggesting that Big Data methodology is rapidly moving into the mainstream.  Nobody wants to be left behind, of course.  In terms of tools, data warehouses and analytics databases were roughly tied with cloud-based Big Data services, and both categories saw big leaps in the number of respondents considering them.

This should come as no surprise to those practicing the art of Big Data.  You probably spend most of your time importing, moving, scrubbing and preparing data for analysis.  Indeed, just finding the data that is important to your analysis can take quite a bit of work.  Garbage in / garbage out still applies.  The report takes time to understand why projects have been perceived as successful, rather than limiting itself to looking at the latest tools.

For instance, 76% of respondents focused on operational data to improve efficiencies, while 24% used Big Data for external opportunities.  Why?  Pragmatic business decision making.  The fastest route to return on investment is to refine and improve operations that you already know.  Operational savings go to the bottom line as increased profit.  New sales only net what the gross margin will allow; improving the gross margin impacts all sales.  One of the frustrations business leaders have with Big Data is having to speculate about, or predict, future events when planning business spending.

The power of predictive analytics is immense.  Descriptive analytics may allow you to refine existing operations with less risk, while prescriptive analytics that model how things could be allow you to pursue disruptive capabilities, with a greater risk of failure.  This is traditionally the domain of the entrepreneur: the ability to remake markets and disrupt the status quo.  The larger challenge for existing companies is how they decide to manage risk and failure.

This tension between the business decision maker and the analytics professional has been true from the start, of course.  The difference now is that decision makers have seen how analytics can improve their business, and funding for analytics is increasing based on that success.  The most visible expression of this is the democratization of Big Data with more self-service tools for business professionals.  The counter trend is reluctance by some departments to share data or cooperate with broader Big Data projects.

To be successful with Big Data projects, the survey identifies a number of factors, the top three being: 1) business buy-in, 2) knowing your data, and 3) a core understanding of the business.  The implications of just these three are important.  Business is moving more aggressively into analytics, but with a purpose.  High return-on-investment objectives will keep projects focused, and tend to lean to the operational side where a return is more likely.  Knowing your data is another critical aspect of a Big Data project.  Integrating several sources of data and preparing it for analysis is not trivial and takes a lot of time and effort.  Most survey respondents felt that 80-90% data accuracy is “good enough” for most projects; the diminishing returns of further improvements may not change the decision making.  Finally, this isn’t an academic exercise.  The project’s success will depend on a deep understanding of the business.  A big part of analytics is deciding what problem you want to solve, and a poorly formed premise or problem will lead to unsatisfactory results.

Computing’s 2015 report has some great information in it, and it highlights the changes in the industry just in the last year.  As analytics becomes more mature it will find application in more companies and more projects, and that’s good for Big Data and the economy.

See the report at: http://resources.computerworld.com/ccd/assets/84539/detail


Beyond Correlation: Process Models

Big Data is the hunt for meaning in an ocean of data.  Until tools like Hadoop and NoSQL became available, it wasn’t practical to derive much visibility from unstructured data, and certainly not much meaning from social media.  Now, with these tools, we can bring order to chaos and look into data more closely.  One belief regarding Big Data analysis is that we don’t need to understand the cause of any correlation we might find in the data; simply by understanding the relationships between factors that become apparent in the data, we can find useful information.  Still, correlation is not causation.  We may not understand why two factors are related, but it is still useful to know that they are.
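As a toy illustration of that point, here is a minimal Python sketch (the numbers are invented) showing how a strong correlation can surface between two series that share a hidden driver rather than causing each other:

  import numpy as np

  # Hypothetical daily figures: two series that move together (correlation)
  # without one causing the other; the shared driver (hot weather) is absent.
  ice_cream_sales = np.array([120, 150, 180, 210, 260, 300, 310])
  pool_incidents = np.array([2, 3, 3, 4, 6, 7, 7])

  # Pearson correlation coefficient: close to 1.0 here.
  r = np.corrcoef(ice_cream_sales, pool_incidents)[0, 1]
  print(f"correlation: {r:.2f}")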

To move beyond correlation, a step closer to understanding causation can be Process Mining.  Process Mining looks beyond correlation to further refine associations in the data.  Indeed, Process Mining, per Wil van der Aalst of Eindhoven University of Technology, posits that by taking a more structured view of data relationships, we can discover processes in the data.  From a Process Mining perspective we can identify processes and their bottlenecks and, by looking at what people are actually doing, potentially improve those processes.  Finally, we can predict outcomes based on the processes we find by testing them with real event data.

Process Mining is different from, but related to, data-oriented analysis.  Both approaches start with event data.  The difference is that Process Mining starts by mapping chains of events to create, refine and test models that fit the data.  You can then use the model suggested by this technique to look for bottlenecks in existing processes.  By testing with actual event data you see the way events actually occur, not how they should occur.
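To make the idea concrete, here is a minimal Python sketch (the event log and activity names are invented, and this is not van der Aalst’s algorithm) that extracts the “directly-follows” relation many discovery techniques start from:

  from collections import Counter

  # Hypothetical event log: (case_id, activity) pairs already ordered by timestamp.
  event_log = [
      ("c1", "receive order"), ("c1", "check credit"), ("c1", "ship"),
      ("c2", "receive order"), ("c2", "check credit"), ("c2", "reject"),
      ("c3", "receive order"), ("c3", "ship"),
  ]

  # Count how often one activity directly follows another within the same case.
  # This directly-follows relation is the raw material process discovery starts from.
  by_case = {}
  for case, activity in event_log:
      by_case.setdefault(case, []).append(activity)

  follows = Counter()
  for trace in by_case.values():
      for a, b in zip(trace, trace[1:]):
          follows[(a, b)] += 1

  for (a, b), n in follows.most_common():
      print(f"{a} -> {b}: {n}")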

To test the relationships suggested by Process Mining, there are three approaches:

  • Play-Out: start from a proposed model.  Describe the work items and the workflow, then look at all possible runs of the model to understand the range of potential processes it allows.  Start with what’s already there.
  • Play-In: look at the behavior of the existing process (the event log) and infer a model from it, rather than designing the model up front.
  • Replay: with either method of creating a model (Play-Out or Play-In), replay actual event log data on the model so you can assess the strengths and weaknesses of the different models.  By testing a model with Replay, you can show deviations from expected results and tune the model accordingly (see the sketch after this list).
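As a rough illustration of Replay, the following Python sketch checks logged traces against a hand-written set of allowed transitions; the model, activities and log are all invented, and real tools such as ProM use much richer model notations like Petri nets:

  # Hypothetical model expressed as allowed transitions between activities,
  # plus start/end markers. This is only meant to illustrate the Replay idea.
  allowed = {
      ("start", "receive order"), ("receive order", "check credit"),
      ("check credit", "ship"), ("check credit", "reject"),
      ("ship", "end"), ("reject", "end"),
  }

  def replay(trace):
      """Return the transitions in a trace that the model does not allow."""
      steps = ["start"] + trace + ["end"]
      return [(a, b) for a, b in zip(steps, steps[1:]) if (a, b) not in allowed]

  log = [
      ["receive order", "check credit", "ship"],   # conforms to the model
      ["receive order", "ship"],                   # deviation: skips the credit check
  ]
  for trace in log:
      print(trace, "deviations:", replay(trace))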

Classic Big Data mining starts with data.  Process Mining also starts with data.  The difference is the search for correlation in data mining versus the search for processes in Process Mining.  Data Mining might produce a simple set of Key Performance Indicators (KPIs).  Dr. van der Aalst would argue that a simple KPI will lead to problems because it is too simplistic.  Understanding how people work together to accomplish a task is more informative.  A KPI may show deviations, but the user may have no idea where a deviation comes from, or how to get the process productive again.

By using Process Mining you can identify bottlenecks or unproductive process steps through conformance checking to make sure what you think is happening, really happens.  Data Mining can be either supervised learning with labeled data or unsupervised learning with unlabeled data. Unsupervised learning often relies on cluster or pattern discovery.  Supervised learning often employs regression to analyze relationships.  There are lots of tools to assist in Data Mining, and most of the attention in the industry has focused on data mining.
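For readers who want to see the distinction in code, here is a minimal Python sketch using scikit-learn on synthetic data: clustering as an example of unsupervised learning, and regression as an example of supervised learning (the data and parameters are illustrative only):

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 2))

  # Unsupervised: no labels, discover cluster structure in the data.
  clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

  # Supervised: labeled responses, fit a regression relating X to y.
  y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)
  model = LinearRegression().fit(X, y)

  print("cluster sizes:", np.bincount(clusters))
  print("learned coefficients:", model.coef_)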

A good example of a Process Mining tool is the Decision Tree.  In a decision tree analysis, response variables are predicted from predictor variables, so decision trees can be used to predict outcomes based on an expected process flow identified in the tree.  One of the challenges with Decision Trees is deciding how much of a good thing is practical.  Multiple iterations of the decision tree with actual data (Replay) will allow you to keep refining the decision branches in the model.  Complex processes can create a very large Decision Tree, so you should decide at the outset how many levels of tree you want to allow (for ease of use), or what stopping criteria you want to reach before you stop developing the model.  In short, consider what level of success is reasonable, and what you want the result to be, before you start the process.
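As a minimal sketch of that trade-off, assuming scikit-learn is available and using synthetic data in place of real event-derived features, capping the tree depth is one simple way to decide up front how large a tree to allow:

  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  # Synthetic data standing in for event-derived features and outcomes.
  X, y = make_classification(n_samples=500, n_features=8, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Capping max_depth decides up front how many levels of tree to allow;
  # deeper trees fit the training data better but may not generalize.
  for depth in (2, 4, 8, None):
      tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
      print(f"max_depth={depth}: test accuracy {tree.score(X_test, y_test):.2f}")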

There are free tools for both Data Mining and Process Mining.

The Process Mining website: http://www.processmining.org/

Check out RapidMiner for Data Mining at http://sourceforge.net/projects/rapidminer/.

For Process Mining, look at ProM, an open source software tool, at http://www.promtools.org/prom6/downloads/example-logs.zip

Enjoy the Process!


Big Data Preparation

Let’s face it, the big data world has a lot of unglamorous heavy lifting.  One of those less glamorous jobs is preparing the data for analysis.  Taking a bunch of unstructured data and creating some structure for further analysis takes some thought, rigorous process, and careful documentation.  Unstructured data lacks the row and column structure, which makes it hard to apply traditional analytic tools to such raw information.  Data preparation provides the structure that makes the data suitable for further analysis. 

In order to assure reproducible results, every step should be documented so that an independent researcher can follow the procedure and obtain the same result.  It is also important to detail the steps taken to prepare the data in the event of subsequent criticism, or to alter the process if the ultimate results are not satisfactory.  Because the results of the downstream analysis depend on assumptions made during data preparation, the process must be carefully captured.

The raw data may consist of an unorganized, partial, or inconsistent (for example, free text) collection of elements for input to an analytic process.  The objective is to create structured data that is suitable for further analysis, and that data should have four attributes according to Jeff Leek and colleagues at Johns Hopkins (a small sketch follows the list):

  • One variable in exactly one column
  • Each observation of that variable in its own row
  • One table for each kind of variable
  • Multiple tables with a column ID to link data between tables  
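A minimal pandas sketch of those attributes, using made-up column names and values, might look like this:

  import pandas as pd

  # Hypothetical "wide" raw extract: one column per month is really
  # one variable (sales) spread across several columns.
  raw = pd.DataFrame({
      "store_id": [1, 2],
      "jan_sales": [100, 80],
      "feb_sales": [120, 90],
  })

  # Reshape so each row is one observation (store, month) and the
  # variable (sales) lives in exactly one column.
  tidy = raw.melt(id_vars="store_id", var_name="month", value_name="sales")
  tidy["month"] = tidy["month"].str.replace("_sales", "", regex=False)

  # A separate lookup table, linked by store_id, keeps store attributes
  # out of the observation table.
  stores = pd.DataFrame({"store_id": [1, 2], "region": ["north", "south"]})
  print(tidy.merge(stores, on="store_id"))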

Best practice would also include descriptions at the top of the table with a plain-language description of each variable, not just a shorthand name that is easily forgotten or misidentified. 

Good experimental hygiene also includes a code book and an instruction list, again with the intent of creating reproducible results.  The code book can be a text file that includes a short section on study design, with descriptions of each variable and the units it is measured in.  Summarizing the choices made in data preparation and experimental design will help future researchers understand your experiment and reproduce it as necessary.  The instruction list would ideally be a script whose input is the literal unstructured data and whose output is the structured data.  If every step can’t be captured in the script, then whatever documentation would allow an independent researcher to reproduce the results should be produced, to let people know exactly how you prepared the data.

Data is often captured from downloaded files.  Excel files are popular with business and science audiences, while XML files are commonly used for collecting data from the web.  XML is an interesting case: the whole file can be processed as XML, but to pull elements out selectively you’ll need XPath, a related but different language from XML.  Finally, JSON (JavaScript Object Notation) is another structured file format somewhat akin to XML, though its structure and syntax are quite different.  JSON has its own following, is commonly used, and moving data into and out of it is easily supported.
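As a small illustration using only the Python standard library (the XML and JSON payloads below are invented, and ElementTree supports only a subset of XPath):

  import json
  import xml.etree.ElementTree as ET

  # Hypothetical XML payload; an XPath-style expression selects elements out of it.
  xml_doc = "<orders><order id='1'><total>19.99</total></order></orders>"
  root = ET.fromstring(xml_doc)
  totals = [t.text for t in root.findall(".//order/total")]

  # Hypothetical JSON payload; json.loads turns it into plain dicts and lists.
  json_doc = '{"orders": [{"id": 1, "total": 19.99}]}'
  data = json.loads(json_doc)

  print("XPath result:", totals)
  print("JSON result:", [o["total"] for o in data["orders"]])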

In summary, to have a repeatable data preparation process, you should document the procedure used, the variables, the structure, the input files, the output files, and the code that turns the raw data into your desired structured data.  Once the data is structured, you can scrub it to get rid of the Darth Vaders and Mickey Mouses.  Then you can start to think about your analysis. 


Why Implement an All Flash Data Center?

The argument goes something like this: if flash costs more than disk, why would you spend the money on an all-flash data center?  Some might suggest that you just use flash for intense I/O applications like databases where you can justify the additional expense over disk.

What we see from Violin customers is different.  Not all have gone all-flash, but for those that have, the benefits are many.

All-flash data centers can provide new sources of revenue, lower operating costs, elimination of slow I/O workarounds, improved application response times, faster report turnaround, simplified operations, and lower capital costs.

As a storage subsystem manufacturer, Violin puts together the best system it can design, but it is constantly being schooled by its customers.  For instance, a large telecom customer that was missing some billing opportunities redesigned its customer accounting software.  When the customer implemented it on traditional storage, they didn’t see much benefit: the application needed even more I/O, so they brought in Violin.  As a result they found over $100 million in new revenue, which paid for the project handsomely, of course.  This is revenue that wasn’t available with traditional storage, but is captured thanks to Violin’s low latency.

Another example of how flash storage changes the data center is the impact of low latency on servers and the software that runs on them.  Moving to a Violin All Flash Array speeds I/O so much that the traditional layers of overprovisioning and caching can be eliminated.  The result: better application performance at lower cost.  Customers have also told me this consolidation frees up people to redeploy on more productive efforts, since there is no need to manage the overprovisioning and caching infrastructure.

However, not all all-flash solutions are created equal.  SSD solutions are inferior to a backplane-based approach like Violin’s Flash Fabric Architecture™.  Consider key operating metrics such as power and floor space.  For instance, 70 raw TB from Violin takes 3RU of space, while common SSD-based solutions take 12RU (or more) for the same raw capacity.  This density advantage also translates into power: the Violin 70TB will take 1500W, while common SSD approaches may take over 3000W for the same capacity.  This translates into operating expense savings.  One customer recently estimated they would save 71% in operating costs with Violin over traditional storage.
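For readers who like to see the arithmetic, here is a tiny Python sketch using just the figures quoted above (treat them as illustrative comparisons, not vendor specifications):

  # Figures quoted above for 70 raw TB.
  violin = {"rack_units": 3, "watts": 1500}
  ssd = {"rack_units": 12, "watts": 3000}

  space_saving = 1 - violin["rack_units"] / ssd["rack_units"]
  power_saving = 1 - violin["watts"] / ssd["watts"]
  print(f"rack space saved: {space_saving:.0%}")  # 75%
  print(f"power saved: {power_saving:.0%}")       # 50%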

Additionally, the Violin Flash Fabric Architecture provides superior performance, due to array-wide striping of data and parallel paths for high throughput that hold up under heavy loads.  It also provides better resiliency, since hot spots are essentially eliminated.  The result is not just a big step up over traditional disk storage; it is a significant improvement over SSD-based arrays.

Customers who have gone all-flash for active data have found they can buy the new storage and server equipment, and still have money left over.  This is in addition to any new sources of revenue realized, such as the Telecom example.  Flash is essentially free.

The last hurdle has been data services.  Some customers who have Violin installed love the performance, but were hesitant to put all their data on it because they wanted to have the enterprise level availability features.  Capabilities such as synchronous and asynchronous replication, mirroring and clustering give enterprises a robust tool kit.  They configure their data centers in a variety of ways that will protect against local issues like fire, metro area problems like hurricanes/typhoons, and regional issues with a global replication.  These capabilities now exist in the Concerto 7000 All Flash Array from Violin Memory.  This allows enterprises who want to experience transformative performance to also employ the operational capabilities they need to meet their data center design goals.

The move to the all-flash data center is upon us.

For more information go to www.violin-memory.com


OpenStack for Big Data in the Cloud

Big Data places two kinds of demands on storage: for much of the data, cost and scale are the primary considerations, with performance secondary; for real-time analytics, performance and scale are the primary concerns, with cost secondary.

OpenStack has positioned itself as the platform for the Open Cloud and has the potential to impact your Big Data storage issues.  It comes in two flavors: one for block and one for file/object storage.

Block storage is the usual mode for traditional storage area networks and is served by OpenStack’s Cinder product.  File/object storage, the home of files, video, logs and the like, is served by OpenStack’s Swift.

Swift is for objects that aren’t used for transactions.  Why?  The data in Swift is eventually consistent, which isn’t appropriate for transaction data, but is just fine for much of the data found in static Big Data: photos, video, log data, machine data, social media feeds, backups, archives, and so on.  Readers might recognize a previous Big Data Perspectives blog discussing the differences between consistency models and their appropriate applications.  A key/value store might be a good fit for eventual consistency, but your bank records should be kept ACID consistent.  One potential issue is the need to change applications, because this is a new approach: applications have to speak Swift’s REST API, although legacy applications can work with Swift through a gateway.  REST (representational state transfer) is a way of exposing services through HTTP-style commands.  Riverbed is an example of a Swift implementation.  Without a traditional hierarchical structure in place, Swift does provide for unlimited scalability, but with uncertain performance.  The focus is on commodity hardware and open source software to keep the cost of storage low.
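To give a flavor of that REST API, here is a minimal Python sketch using the requests library; the storage URL, token, container and object names are placeholders, and a real deployment would obtain the URL and token from its authentication service:

  import requests

  # Placeholder values: a real deployment issues the storage URL and token
  # through its authentication service (e.g. Keystone).
  storage_url = "https://swift.example.com/v1/AUTH_demo"
  token = "REPLACE_WITH_AUTH_TOKEN"

  # Objects are created or overwritten with a simple HTTP PUT; reads are GETs.
  resp = requests.put(
      f"{storage_url}/logs/2015-06-01.log",  # container/object
      headers={"X-Auth-Token": token},
      data=b"machine data, social feeds, backups...",
  )
  print(resp.status_code)  # 201 Created on success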

Cinder is for block data that could be attached to your SAN, and could include transaction data in the cloud.  Where performance is more important, or for transactional and database requirements, Cinder is the more appropriate choice.  It has big-time supporters such as IBM and NetApp.  You can understand the major storage vendors’ dilemma, however: the whole focus of OpenStack is to use its software and commodity hardware to bring down the cost of storage.  The vendors do provide API compatibility to allow their proprietary systems to communicate with an OpenStack node.

There might be a way to bring the worlds of proprietary and open together and get the best of both.  By using proprietary systems for ACID-related data, typically transactions, databases, CRM, ERP and real-time analytics, and OpenStack for less critical data, there is a way to put value where it is recognized, and commodity where it is not.

 


Busy and Lazy Data

Big Data files seem to come in two flavors: busy and lazy.  This is due to the nature of the data that we generate, and its usage.  Some data will be intensely used while it is fresh, and then relegated to a data parking lot for lazy data.  Other data, particularly unstructured data, will be used less often because the usefulness of the data is in identifying trends or outliers and may be processed in batches, instead of in an interactive manner. 

Busy data benefits from more aggressive management and higher-performing systems, on the premise that busy data is more valuable and more time-sensitive.  There is a great study on the business value of latency for managing busy data from the 2009 O’Reilly Velocity Conference in San Jose (http://www.youtube.com/watch?v=bQSE51-gr2s), and it reinforces the business need to provide a low latency environment for busy data to get the best result.  The video discusses how Microsoft’s Bing and Google independently came to a very similar conclusion: latency matters.  It was previously thought that people couldn’t perceive differences under 200ms.  These two studies show that people’s behavior responds to faster response (lower latency) until system response times fall below 50ms.  This means your system design should deliver consistent end user response times of about 50ms.  The typical slowest element in system design is the electro-mechanical disk drive, which would indicate that it is time to go to an all-flash array architecture, like those from Violin Memory and others.  The business value is surprising: going from 1000ms to 50ms, users were more engaged and spent 2.8% more money.  If your business has anything to do with ecommerce, or just getting more productivity from high-value employees, you ought to be looking at 50ms system response times.

Lazy data is a different story, of course.  There is an excellent paper from the USENIX Association, a storage geek group, that bears review.  A USENIX 2008 paper by Andrew Leung et al., “Measurement and Analysis of Large-Scale Network File System Workloads”, provides some color on how files are used and how they behave over time (http://www.ssrc.ucsc.edu/Papers/leung-usenix08.pdf).  As the world of Big Data takes hold, the usage patterns for files are changing.  Read-to-write ratios have decreased, read-write access patterns have increased, sequential runs are longer, most bytes transferred are from larger files, and file sizes are larger.  Files live longer, with fewer than 50% deleted within a day of creation.  Files are rarely reopened; when they are, it is usually within a minute.  The authors also saw a noisy neighbor effect, with fewer than 1% of clients accounting for 50% of file requests.  76% of files are opened by only one client.  File sharing happens for only 5% of files, and 90% of the sharing is read-only.  Interestingly, most file types do not have a common access pattern.  It sounds like disk might be just fine for lazy data.

The conclusion is that busy data needs a system focused on performance, and that performance must be extreme by previous standards: down to 50ms system response time.  The move to an all-flash storage environment, and the elimination of hard disk drives for busy data, seems indicated.  The 50ms system response time is difficult for online situations where network latency is highly variable, but it does provide an indicator of why Google is spending money to put whole communities online.  The other conclusion is that lazy data might be a good fit for lower cost storage, since the intensity of usage is less and the primary goal should be cost effectiveness.  Disk storage, with its sequential performance strength, is a good fit for lazy data.  To keep costs down you might want to consider a RAID alternative such as SWARM or erasure coding. 
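To illustrate the idea behind such RAID alternatives, here is a toy Python sketch of single-parity protection (XOR across blocks, as in RAID-5); real erasure codes such as Reed-Solomon generalize this to survive multiple failures with configurable overhead:

  # Toy single-parity scheme: the XOR of the data blocks can rebuild any one
  # missing block. Blocks must be the same length for this simple example.
  def xor_blocks(blocks):
      out = bytearray(len(blocks[0]))
      for block in blocks:
          for i, byte in enumerate(block):
              out[i] ^= byte
      return bytes(out)

  data_blocks = [b"lazy", b"data", b"blks"]
  parity = xor_blocks(data_blocks)

  # Simulate losing one block and rebuilding it from the survivors plus parity.
  lost = data_blocks[1]
  rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
  print(rebuilt == lost)  # True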


SWARM Intelligence and Big Data

Swarm technology may be thought of as insect logic.  Ants behave in the colony’s interest, but without specific guidance.  How can an animal with a brain the size of a grain of sand create a nest, find food, support a queen and expand to new areas?  Each ant is an individual point of simple intelligence, operating with rules that are shared by all ants, and everything the ant colony needs to get done, gets done.  Simplicity also extends to ant communication: using pheromones, ants communicate with each other without personal contact, by reading the chemicals left behind by other ants.  Ants are great at parallel operations, since there is no single point of control for all actions, but only one ant, the queen, is empowered to reproduce.  In this way the colony is controlled, and the needs of the colony are met.

Think of a Big Data problem.  The MapReduce architecture creates multiple threads; simple key/value logic is applied and the results are then shuffled to create a reduced output that has been intelligently organized.  Each of the MapReduce nodes has just enough intelligence to perform key/value matching.  But in the world of swarm intelligence, this could be done by a multitude of agents, not just a few processors in a batch job.  What if you could release multiple agents into your store of unstructured data to look for anomalies, weed out spurious data or corrupted files, organize data by individual attributes, or identify alternate routes if there are networking problems?  Perhaps swarm intelligence could comb Facebook data to find the next mentally unstable serial killer before they strike.  The beauty is that these agents can be programmed with different simple logic, and in doing so the cost is kept low and the performance is kept high.  The simple intelligence might be programmed to include file type, age, data protection profile, source, encryption, etc.
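For reference, here is a minimal in-memory Python sketch of the map/shuffle/reduce pattern described above; a real Hadoop job distributes these same phases across many nodes, and the records here are invented:

  from collections import defaultdict

  # Map: emit simple key/value pairs from raw records.
  records = ["error disk full", "ok", "error timeout", "error disk full"]
  mapped = [(word, 1) for line in records for word in line.split()]

  # Shuffle: group values by key (Hadoop does this across the cluster).
  groups = defaultdict(list)
  for key, value in mapped:
      groups[key].append(value)

  # Reduce: collapse each key's values into a single result.
  reduced = {key: sum(values) for key, values in groups.items()}
  print(reduced)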

One premise is that swarm logic makes file structures unnecessary: all information about the data is included in the tokens (the pheromone).  It would allow the integration of structured and unstructured data.  In short, swarm can change everything.  I won’t go so far as to say all file structures can go away.  ACID consistency is important for tasks that have one correct answer, like what my profit was last year, or how many widgets are in the warehouse.  For jobs like this, structured data and relational databases might be best.  But for many analysis jobs, swarm may be a great improvement.

Swarm provides something that traditional data structures don’t: file intelligence.  In today’s structures we have limited space to specify a rigid set of metadata that is inflexible.  With swarm we can add more intelligence into the data using tokens with a combination of intelligence and information, and let them loose on our unstructured data to find organization, or to sort or otherwise manipulate data.  
