Big Data Success Factors

Computerworld has released a study by Computing, a UK company, regarding the state of Big Data in the real world. The Big Data Review 2015 holds few surprises, but plenty of confirmation that Big Data tools need careful handling to get the right results. In the early days of Big Data there was a lot of experimentation just to understand the capability of the tools. Plenty of experimentation will certainly continue, given the nature of the process, but success depends on business execution.

Computing’s 2015 findings show how rapidly the industry is evolving, based on the shift in survey responses from just last year. The number of respondents concerned about each topic has grown significantly, suggesting that Big Data methodologies are rapidly moving into the mainstream. Nobody wants to be left behind, of course. In terms of tools, data warehouses and analytics databases were, broadly speaking, tied with cloud-based Big Data services, and both saw big leaps in the number of respondents considering them.

This should come as no surprise to those practicing the art of Big Data. You probably spend most of your time importing, moving, scrubbing and preparing data for analysis. Indeed, just finding the data that is important to your analysis can take quite a bit of work. Garbage in / garbage out still applies. The report takes time to understand why projects have been perceived as successful, and isn’t limited to just looking at the latest tools.

For instance, 76% of respondents focused on operational data to improve efficiencies, while 24% used Big Data for external opportunities. Why? Pragmatic business decision making. The fastest route to return on investment is to refine and improve operations that you already know. Operational savings go to the bottom line as increased profit. New sales only net what the gross margin will allow; improving the gross margin impacts all sales. One frustration business leaders have with Big Data is how hard it remains to speculate about or predict future events when planning business spending.

The power of predictive analytics is immense. Descriptive analytics may allow you to refine existing operations at lower risk, while prescriptive analytics, which model how things could be, let you move toward disruptive capabilities, but with a higher risk of failure. This is traditionally the domain of the entrepreneur: the ability to remake markets and disrupt the status quo. The larger challenge for existing companies is how they decide to manage risk and failure.

This tension between the business decision maker and the analytics professional has been true from the start, of course.  The difference now is that decision makers have seen how analytics can improve their business, and funding for analytics is increasing based on that success.  The most visible expression of this is the democratization of Big Data with more self-service tools for business professionals.  The counter trend is reluctance by some departments to share data or cooperate with broader Big Data projects.

To be successful with Big Data projects, the survey identifies a number of factors, the top three being: 1) business buy-in, 2) knowing your data, and 3) a core understanding of the business. The implications of just these three are important. Business is moving more aggressively into analytics, but with a purpose. High return-on-investment objectives will keep projects focused, and tend to lean to the operational side where a return is more likely. Knowing your data is another critical aspect of a Big Data project. Integrating several sources of data and preparing them for analysis is not trivial and takes a lot of time and effort. Most survey respondents felt that 80-90% data accuracy is “good enough” for most projects; given diminishing returns, further improvement may not change the decision making at all. Finally, this isn’t an academic exercise. The project’s success will depend on a deep understanding of the business. A big part of analytics is deciding what problem you want to solve, and a poorly formed premise or problem will lead to unsatisfactory results.

Computing’s 2015 report has some great information in it, and it highlights the changes in the industry just in the last year.  As analytics becomes more mature it will find application in more companies and more projects, and that’s good for Big Data and the economy.

See the report at: http://resources.computerworld.com/ccd/assets/84539/detail


Beyond Correlation: Process Models

Big Data is the hunt for meaning in an ocean of data. Until tools like Hadoop and NoSQL became available, it wasn’t practical to derive much visibility from unstructured data, and certainly not much meaning from social media. Now, with these tools, we can bring order to chaos and look into the data more closely. One belief about Big Data analysis is that we don’t need to understand the cause of any correlation we find in the data: simply by understanding the relationships between factors that become apparent, we can find useful information. Still, correlation is not causation. We may not understand why two factors are related, but it is still useful to understand the correlation.

To move beyond correlation, a step closer to understanding causation can be Process Mining. Process Mining looks beyond correlation to further refine associations in the data. Indeed, Process Mining as described by Wil van der Aalst of Eindhoven University of Technology posits that by looking at a more structured view of data relationships, we can discover processes in the data. From a Process Mining perspective we can identify processes, find process bottlenecks and, by looking at what people are actually doing, potentially improve those processes. Finally, we can predict outcomes based on the processes we find by testing them with real event data.

Process Mining is different from, but related to, data-oriented analysis. Both approaches start with event data. The difference is that Process Mining starts by mapping a chain of events to create, refine and test models that fit the data. You can then use the model suggested by this technique to look for bottlenecks in existing processes. By testing with actual event data you see the way events actually occur, not how they should occur.

To test the relationships suggested by Process Mining, there are three approaches:

  • Play-Out: start from a proposed model and generate the behavior it allows. Describe the work items and the workflow, then look at all possible runs to understand the range of processes the model permits.
  • Play-In: start from the existing process, as recorded in event data, and infer a model. The model is discovered from what is actually happening rather than designed from what you want to be in place.
  • Replay: with a model from either method (Play-Out or Play-In), replay actual event-log data on the model so you can assess the strengths and weaknesses of different models. By testing a model with Replay, you can expose deviations from expected behavior and tune the model accordingly. (A minimal sketch of Replay follows this list.)
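
To make Replay concrete, here is a minimal sketch in Python. The process model (a simple map of which activity may follow which), the event log, and the trace-fitness figure are all invented for illustration; real process-mining tools such as ProM work with richer models (Petri nets) and more sophisticated conformance checking.

```python
# Toy conformance check: replay event-log traces against a simple process model.
# Model: each activity maps to the set of activities allowed to follow it.
model = {
    "register": {"check_stock"},
    "check_stock": {"ship", "cancel"},
    "ship": {"invoice"},
    "invoice": set(),      # end of process
    "cancel": set(),       # end of process
}

# Event log: one list of activities per observed case.
event_log = [
    ["register", "check_stock", "ship", "invoice"],
    ["register", "check_stock", "cancel"],
    ["register", "ship", "invoice"],          # skips check_stock: a deviation
]

def replay(trace, model):
    """Return the first deviation in a trace, or None if the model accepts it."""
    for current, nxt in zip(trace, trace[1:]):
        if nxt not in model.get(current, set()):
            return f"{current} -> {nxt} not allowed by model"
    return None

for trace in event_log:
    deviation = replay(trace, model)
    print(trace, "->", "fits" if deviation is None else f"deviates ({deviation})")

fitness = sum(replay(t, model) is None for t in event_log) / len(event_log)
print(f"Trace fitness: {fitness:.0%}")   # 2 of 3 traces fit the model
```

Even this crude next-step check flags the trace that skipped check_stock, which is the kind of deviation Replay is meant to surface.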

Classic Big Data mining starts with data. Process Mining also starts with data. The difference is the search for correlation in data mining versus the search for processes in Process Mining. Data Mining might produce a simple set of Key Performance Indicators (KPIs). Dr. van der Aalst would argue that simple KPIs can lead to problems because they are too simplistic. Understanding how people work together to accomplish a task is more informative. A KPI may show a deviation, but the user may have no idea where the deviation comes from, or how to get the process productive again.

By using Process Mining you can identify bottlenecks or unproductive process steps through conformance checking, making sure that what you think is happening really happens. Data Mining can be either supervised learning with labeled data or unsupervised learning with unlabeled data: unsupervised learning often relies on cluster or pattern discovery, while supervised learning often employs regression to analyze relationships. There are lots of tools to assist in Data Mining, and most of the attention in the industry has focused on data mining.

A good example of a Process Mining tool is the Decision Tree. In a decision tree analysis, response variables are predicted from predictor variables, so decision trees can be used to predict outcomes based on the expected process flow identified in the tree. One of the limits of the Decision Tree is deciding how much of a good thing is practical. Multiple iterations of the decision tree with actual data (Replay) will allow you to keep refining the decision branches in the model. You might want to decide before you start how many levels of tree you will allow (for ease of use), or what stopping criteria you want to apply before you stop developing the model. Complex processes can create a very large Decision Tree, so decide at the outset what level of detail and what level of success are reasonable, before you start the process.
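
As a rough illustration of fixing the tree’s size limits up front, here is a sketch using scikit-learn’s generic decision-tree classifier (not a process-mining-specific tool). The synthetic dataset and the depth and leaf-size limits are assumptions for the example, not recommendations.

```python
# Sketch: capping decision-tree complexity before fitting, as suggested above.
# Assumes scikit-learn is installed; the dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decide the allowed complexity at the outset: a shallow, readable tree.
tree = DecisionTreeClassifier(
    max_depth=4,            # limit the number of levels for ease of use
    min_samples_leaf=50,    # stop splitting once nodes get small
    random_state=0,
)
tree.fit(X_train, y_train)

# Run held-out data through the model to see how well the limits hold up.
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
print("held-out accuracy:", round(tree.score(X_test, y_test), 3))
```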

There are free tools for both Data Mining and Process Mining.

The Process Mining website: http://www.processmining.org/

Check out RapidMiner for Data Mining at http://sourceforge.net/projects/rapidminer/.

For Process Mining look at ProM, an open source software tool; example event logs are available at http://www.promtools.org/prom6/downloads/example-logs.zip

Enjoy the Process!


Big Data Preparation

Let’s face it, the big data world involves a lot of unglamorous heavy lifting. One of those less glamorous jobs is preparing the data for analysis. Taking a bunch of unstructured data and creating some structure for further analysis takes some thought, rigorous process, and careful documentation. Unstructured data lacks a row-and-column structure, which makes it hard to apply traditional analytic tools to the raw information. Data preparation provides the structure that makes the data suitable for further analysis.

In order to assure reproducible results, every step should be documented so that an independent researcher can follow the procedure and obtain the same result. This also makes it possible to detail the steps taken to prepare the data in the event of subsequent criticism, or to alter the process if the ultimate results are not satisfactory. Because the results of the downstream analysis depend on assumptions made during data preparation, the process must be carefully captured.

The raw data may be an unorganized, partial, or inconsistent (for example, free text) collection of elements that serve as input to an analytic process. The objective is to create structured data that is suitable for further analysis, which, according to Jeff Leek and colleagues at Johns Hopkins, should have four attributes:

  • Each variable in exactly one column
  • Each observation of that variable in its own row
  • One table for each kind of variable
  • Multiple tables linked by a column ID

Best practice also includes a description at the top of the table with a plain-language explanation of each variable, not just a short-hand name that is easily forgotten or misidentified.
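
Here is a small, hypothetical example of those four attributes using pandas; the subjects, measurements and column names are invented for illustration.

```python
# A minimal illustration of the four attributes above.
import pandas as pd

# Raw extract: one column per year is compact but not tidy.
raw = pd.DataFrame({
    "subject_id": [1, 2],
    "weight_2014": [70.1, 82.5],
    "weight_2015": [71.3, 80.9],
})

# One variable per column, one observation per row.
tidy = raw.melt(id_vars="subject_id", var_name="year", value_name="weight_kg")
tidy["year"] = tidy["year"].str.replace("weight_", "").astype(int)

# One table per kind of variable, linked back by subject_id.
subjects = pd.DataFrame({"subject_id": [1, 2], "site": ["Baltimore", "Leiden"]})

print(tidy)
print(tidy.merge(subjects, on="subject_id"))
```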

Good experimental hygiene also includes a code book and an instruction list, again with the intent of creating reproducible results. The code book can be a text file with a short section on study design, descriptions of each variable, and the units they are measured in. Summarizing the choices made in data preparation and experimental design will help future researchers understand your experiment and reproduce it as necessary. The instruction list is ideally a script whose input is the literal unstructured data and whose output is the structured data. If every step can’t be captured in a script, then whatever documentation would allow an independent researcher to reproduce the results should be provided, so people know exactly how you prepared the data.

Data is often captured from downloaded files. Excel files are popular with business and science audiences, while XML files are common for web data collection. XML is an interesting case: all the data in a file can be processed as XML, but to pull out elements selectively you’ll need XPath, a related but different language. Finally, JSON is another structured file format somewhat akin to XML, but with different structure and syntax. JavaScript Object Notation has its own following, is commonly used, and moving data into and out of JSON is well supported.
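
As a sketch, the Python standard library can handle both cases. The XML and JSON snippets below are invented for the example, and ElementTree supports only a limited subset of XPath, but the idea carries over to fuller XPath engines.

```python
# Pull elements out of XML with XPath-style queries, and read JSON.
import json
import xml.etree.ElementTree as ET

xml_doc = """
<orders>
  <order id="1001"><customer>Acme</customer><total>250.00</total></order>
  <order id="1002"><customer>Globex</customer><total>99.50</total></order>
</orders>
"""
root = ET.fromstring(xml_doc)

# Select elements with an XPath-like expression instead of walking the whole tree.
for order in root.findall("./order"):
    print(order.get("id"), order.findtext("customer"), order.findtext("total"))

json_doc = '{"orders": [{"id": 1001, "customer": "Acme", "total": 250.0}]}'
data = json.loads(json_doc)          # JSON maps cleanly onto dicts and lists
print(data["orders"][0]["customer"])
```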

In summary, to have a repeatable data preparation process, you should document the procedure used, the variables, the structure, the input files, the output files, and the code that turns the raw data into your desired structured data. Once the data is structured, you can scrub it to get rid of the Darth Vaders and Mickey Mouses. Then you can start to think about your analysis.


Why Implement an All Flash Data Center?

The argument goes something like this: if flash costs more than disk, why would you spend the money on an all-flash data center? Some might suggest that you just use flash for I/O-intensive applications like databases, where you can justify the additional expense over disk.

What we see from Violin customers is different. Not all have gone all-flash, but for those that have, the benefits are many.

All-flash data centers can provide new sources of revenue, lower operating costs, elimination of slow-I/O workarounds, improved application response times, faster report turnaround, simplified operations, and lower capital costs.

As a storage subsystem manufacturer, Violin puts together the best system it can design, but it is constantly being schooled by its customers. For instance, a large telecom customer was missing some billing opportunities and redesigned its customer accounting software. When the customer implemented the new software on its traditional storage system, it didn’t see much benefit: the application needed even more I/O, so the customer brought in Violin. As a result it found over $100 million in new revenue, which paid for the project handsomely, of course. This is revenue that wasn’t available with traditional storage, but was captured thanks to Violin’s low latency.

Another example of how flash storage changes the data center is the impact of low latency on servers and the software that runs on them. Moving to a Violin All Flash Array speeds I/O so much that the traditional layers of overprovisioning and caching can be eliminated. The result: better application performance at lower cost. Customers have also told me they can free up people from this consolidation and redeploy them on more productive efforts, since there is no need to manage the overprovisioning and caching infrastructure.

However, not all all-flash solutions are created equal. SSD-based solutions are inferior to a backplane-based approach like Violin’s Flash Fabric Architecture™. Consider key operating metrics such as power and floor space. For instance, 70 raw TB from Violin takes 3RU of space, while common SSD-based solutions take 12RU (or more) for the same raw capacity. This density also translates into power: the Violin 70TB will take 1,500W, while common SSD approaches may take over 3,000W for the same capacity. That translates into operating expense savings. One customer recently estimated they would save 71% in operating costs with Violin over traditional storage.

Additionally, the Violin Flash Fabric Architecture provides superior performance, due to the array-wide striping of data and parallel paths for high throughput that holds up under heavy loads.  It also provides for better resiliency, since hot spots are essentially eliminated.  The result is not just a big step up over traditional disk storage, it is a significant improvement over SSD-based arrays.

Customers who have gone all-flash for active data have found they can buy the new storage and server equipment, and still have money left over.  This is in addition to any new sources of revenue realized, such as the Telecom example.  Flash is essentially free.

The last hurdle has been data services. Some customers who have Violin installed love the performance, but were hesitant to put all their data on it because they wanted enterprise-level availability features. Capabilities such as synchronous and asynchronous replication, mirroring and clustering give enterprises a robust tool kit. They can configure their data centers in a variety of ways to protect against local issues like fire, metro-area problems like hurricanes and typhoons, and regional issues with global replication. These capabilities now exist in the Concerto 7000 All Flash Array from Violin Memory, which allows enterprises that want transformative performance to also employ the operational capabilities they need to meet their data center design goals.

The move to the all-flash data center is upon us.

For more information go to www.violin-memory.com


OpenStack for Big Data in the Cloud

Big Data has two storage use cases: for much of the data, cost and scale are the primary considerations and performance is secondary; for real-time analytics, performance and scale are the primary concerns and cost is secondary.

OpenStack has positioned itself as the platform for the open cloud and has the potential to address your Big Data storage issues. Its storage comes in two flavors: one for block storage and one for file/object storage.

Block storage, the usual mode for traditional storage area networks, is served by OpenStack’s Cinder project. File/object storage, the home of files, video, logs and the like, is served by OpenStack’s Swift.

Swift is for objects that aren’t used for transactions. Why? The data in Swift is eventually consistent, which isn’t appropriate for transaction data but is just fine for much of the data found in static big data: photos, video, log data, machine data, social media feeds, backups, archives and so on. Readers might recognize a previous Big Data Perspectives post discussing the differences between consistency models and their appropriate applications. A key/value pair might be a good fit for eventual consistency, but your bank records should be consistent with ACID compatibility. One potential issue is the need to change applications, because Swift is a new approach: applications have to be REST API compatible. REST (representational state transfer) is a way of making web-style access broadly usable via HTTP-type commands, and legacy applications can be connected to Swift through a gateway. Riverbed is an example of a Swift implementation. Without a traditional hierarchical structure in place, Swift provides essentially unlimited scalability, but with uncertain performance. The focus is on commodity hardware and open source software to keep the cost of storage low.
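
As a rough sketch of what "REST API compatible" means in practice, here is how an application might PUT and GET an object over Swift's HTTP interface using Python's requests library. The storage URL, container, object name and token below are placeholders; a real deployment would obtain them from the cluster's authentication service, or simply use the python-swiftclient library.

```python
# Minimal sketch of talking to Swift's REST API with plain HTTP.
import requests

STORAGE_URL = "https://swift.example.com/v1/AUTH_demo"   # placeholder endpoint
TOKEN = "replace-with-a-real-token"                      # placeholder token
headers = {"X-Auth-Token": TOKEN}

obj_url = f"{STORAGE_URL}/logs/webserver-2015-06-01.log"

# Upload (PUT) an object into the "logs" container.
with open("webserver-2015-06-01.log", "rb") as f:
    resp = requests.put(obj_url, headers=headers, data=f)
    print("upload status:", resp.status_code)            # 201 Created on success

# Read it back (GET). Eventual consistency means a read immediately after a
# write could briefly return an older copy on a heavily loaded cluster.
resp = requests.get(obj_url, headers=headers)
print("download status:", resp.status_code, "bytes:", len(resp.content))
```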

Cinder is for block data that would otherwise be attached to your SAN, and it can include transaction data in the cloud. Where performance is more important, or for transactional and database requirements, Cinder is the more appropriate choice. It has big-time supporters such as IBM and NetApp. You can understand the major storage vendors’ dilemma, however: the whole focus of OpenStack is to use its software with commodity hardware to bring down the cost of storage. The vendors do provide API compatibility to allow their proprietary systems to communicate with an OpenStack node.

There might be a way to bring the worlds of proprietary and open together and get the best of both. By using proprietary systems for ACID-related data, typically transactions, databases, CRM, ERP and real-time analytics, and OpenStack for less critical data, there is a way to put value where it is recognized and commodity where it is not.

 


Busy and Lazy Data

Big Data files seem to come in two flavors: busy and lazy. This is due to the nature of the data that we generate, and its usage. Some data will be intensely used while it is fresh, and then relegated to a data parking lot for lazy data. Other data, particularly unstructured data, will be used less often, because its usefulness lies in identifying trends or outliers, and it may be processed in batches instead of interactively.

Busy data benefits from more aggressive management and higher performing systems, on the premise that busy data is more valuable and more time-sensitive. There is a great study on the business value of latency for busy data from the 2009 O’Reilly Velocity Conference in San Jose (http://www.youtube.com/watch?v=bQSE51-gr2s), and it reinforces the business need to provide a low latency environment for busy data to get the best result. The video discusses how Microsoft’s Bing and Google independently came to a very similar conclusion: latency matters. It was previously thought that people couldn’t perceive differences under 200ms. These two studies show that people’s behavior responds to faster response (lower latency) until system response times fall below 50ms. This means your system design should deliver consistent end-user response times of about 50ms. The slowest element in a typical system design is the electro-mechanical disk drive, which suggests it is time to go to an all-flash array architecture, like those from Violin Memory and others. The business value is surprising: going from 1000ms to 50ms was found to leave users more engaged and spending 2.8% more money. If your business has anything to do with ecommerce, or just getting more productivity from high-value employees, you ought to be looking at 50ms system response times.

Lazy data is a different story, of course. There is an excellent paper from the USENIX association, a storage geek group, that bears review. The 2008 USENIX paper by Andrew Leung et al., “Measurement and Analysis of Large-Scale Network File System Workloads”, provides some color on how files are used and how they behave over time (http://www.ssrc.ucsc.edu/Papers/leung-usenix08.pdf). As the world of Big Data takes hold, the usage patterns for files are changing. Read-to-write ratios have decreased, read-write access patterns have increased, the authors see longer sequential runs, most bytes transferred come from larger files, and file sizes are larger. Files live longer, with fewer than 50% deleted within a day of creation. Files are rarely reopened; when they are, it is usually within a minute. The authors also saw a noisy-neighbor effect, with fewer than 1% of clients accounting for 50% of file requests. 76% of files are opened by only one client. File sharing happens for only 5% of files, and 90% of that sharing is read-only. Interestingly, most file types do not have a common access pattern. It sounds like disk might be just fine for lazy data.

The conclusion is that busy data needs a system focused on performance, and that performance must be extreme by past standards: down to 50ms system response time. The move to an all-flash storage environment and the elimination of hard disk drives for busy data seems indicated. The 50ms system response time is difficult for online situations where network latency is highly variable, but it does hint at why Google is spending money to put whole communities online. The other conclusion is that lazy data might be a good fit for lower-cost storage, since the intensity of usage is lower and the primary goal should be cost effectiveness. Disk storage, with its sequential performance strength, is a good fit for lazy data. To keep costs down you might want to consider a RAID alternative such as SWARM or erasure coding.


SWARM Intelligence and Big Data

Swarm technology may be thought of as insect logic. Ants behave in the colony’s interest, but without specific guidance. How can an animal with a brain the size of a grain of sand create a nest, find food, support a queen and expand to new areas? Each ant is an individual point of simple intelligence, with rules that are shared by all ants, and everything the colony needs to get done, gets done. Simplicity also extends to communication: by using pheromones, ants communicate with each other without personal contact, simply by reading the chemicals left behind by another ant. Ants are great at parallel operations, since there is no single point of control for all actions, but only one ant is empowered to reproduce: the queen. In this way the colony is controlled, and the needs of the colony are met.

Think of a Big Data problem. The MapReduce architecture creates multiple threads; simple key/value logic is applied and then shuffled to create a reduced output that has been intelligently organized. Each of the MapReduce nodes has simple intelligence to perform key/value matching. But in the world of swarm intelligence, this could be done by a multitude of agents, not just a few processors in a batch job. What if you could release multiple agents into your database of unstructured data to look for anomalies, weed out spurious data or corrupted files, organize data by individual attributes, or identify alternate routes when there are networking problems? Perhaps swarm intelligence could comb Facebook data to find the next mentally unstable serial killer before they strike. The beauty is that these agents can be programmed with different simple logic, and in doing so the cost is kept low and the performance high. The simple intelligence might be programmed around file type, age, data protection profile, source, encryption, and so on; a toy sketch of the idea follows.
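
Here is that toy sketch in Python: a handful of agents, each with one simple rule, sweep the same pool of records and deposit "pheromone" tags. The records and rules are invented, and a real swarm system would run the agents concurrently, but it shows how simple, independent logic plus shared tokens can start to organize data.

```python
# Toy illustration of swarm-style agents tagging records.
records = [
    {"id": 1, "filetype": "log", "size_mb": 2,    "age_days": 400},
    {"id": 2, "filetype": "jpg", "size_mb": 4500, "age_days": 3},
    {"id": 3, "filetype": "csv", "size_mb": 0,    "age_days": 10},
]

# Each "ant" is just a predicate plus the tag (pheromone) it deposits.
agents = [
    ("stale",     lambda r: r["age_days"] > 365),
    ("oversized", lambda r: r["size_mb"] > 1000),
    ("empty",     lambda r: r["size_mb"] == 0),
]

pheromones = {r["id"]: [] for r in records}
for name, rule in agents:                      # agents act independently...
    for r in records:
        if rule(r):
            pheromones[r["id"]].append(name)   # ...and communicate only via tags

for r in records:
    print(r["id"], r["filetype"], "tags:", pheromones[r["id"]] or ["clean"])
```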

One premise is that swarm logic makes file structures unnecessary: all information about the data is included in the tokens (the pheromone). It would allow the integration of structured and unstructured data. In short, swarm could change everything. I won’t go so far as to say all file structures can go away. ACID guarantees for data coherency are important for tasks that have one correct answer, like what my profit was last year, or how many widgets are in the warehouse. For jobs like this, structured data and relational databases might be best. But for many analysis jobs, swarm may be a great improvement.

Swarm provides something that traditional data structures don’t: file intelligence. In today’s structures we have limited space for a rigid, inflexible set of metadata. With swarm we can add more intelligence into the data using tokens that combine intelligence and information, and let them loose on our unstructured data to find organization, or to sort or otherwise manipulate the data.


Bridging the Gap Between Structured and Unstructured Data

Relational databases like structured data: tables of columns and rows in a defined schema, so everyone knows what to expect in every place. Unstructured data, such as free text or raw numeric values, is a different story. Hadoop certainly fills a need by using key/value pairs to create some structure where there is none. To get the two worlds of structured and unstructured data to work together, there is usually a bridge of some sort. The relational database may allow an import of key/value data so it can be incorporated into the relational schema, although usually with some work. This handshaking between relational and key/value databases limps along and is workable, but under heavy loads the networking and performance impact of moving massive quantities of data around can be taxing. Is there a better way?

To make some sense out of unstructured data, some sort of framework needs to be overlaid on the raw data to make it more like information. This is why Hadoop and similar tools are iterative: you’re hunting for logic in randomness, and you keep looking and trying different things until something looks like a pattern.

Between unstructured and structured data is the in-between land of semi-structured data. This refers to data that has the beginnings of structure. It doesn’t have the formal discipline of the rows and columns of structured data; it is usually schema-less or self-describing, in that it has tags or something similar that provide a starting point for structure. Examples of semi-structured data include emails, XML, and similar entities that are grouped together. Pieces can be missing, or the size and type of attributes might not be consistent, so it represents an imperfect structure, but not an entirely random one. Hence the in-between land of semi-structured data.

Hadapt takes advantage of this semi-structured data with a data exchange tool that creates a structured data construct. They use JSON as the exchange format. This is rather clever, since JSON is a JavaScript-derived format that is fairly well known. JSON is very good at data exchange, which helps solve the semi-structured-to-structured problem, and by starting with semi-structured data they get a head start on structure. JSON is particularly well suited to key/value pairs and ordered arrays.

The semi-structured data is parsed as JSON, which yields an array of data that can then be manipulated with SQL commands to complete the cycle. Once a structure is in place, SQL is quite comfortable with further manipulation; after all, it is the Structured Query Language. The most sophisticated tools live in the relational world, hence most efforts to make sense of unstructured or semi-structured data involve adding more structure to allow more analysis and reporting. And after a few steps you can indeed get order out of chaos.
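
A minimal sketch of that path using only the Python standard library; the JSON records and the events table are invented, and this is a generic illustration of the JSON-to-SQL flow rather than a description of Hadapt's implementation.

```python
# Semi-structured JSON in, relational rows and SQL out.
import json
import sqlite3

raw = """[
  {"user": "alice", "event": "click",    "props": {"page": "/home"}},
  {"user": "bob",   "event": "purchase", "props": {"page": "/cart", "amount": 42.5}},
  {"user": "alice", "event": "purchase", "props": {"amount": 19.0}}
]"""

rows = [
    (r["user"], r["event"], r.get("props", {}).get("amount"))  # missing fields become NULL
    for r in json.loads(raw)
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, event TEXT, amount REAL)")
db.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Once the structure exists, ordinary SQL takes over.
for user, total in db.execute(
    "SELECT user, SUM(amount) FROM events WHERE event = 'purchase' GROUP BY user"
):
    print(user, total)
```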

 

 


The Problem with Machine Learning

Machine learning is widely perceived as getting its start with chess. When the skills of the program exceeded the skills of the programmer, the logic went, you had created machine learning: the machine now had capabilities the programmer didn’t. Of course, this is something of a fiction. Massive calculation capability alone doesn’t really mean profound learning is occurring, though it might reveal learning.

In the meantime machine learning has done some really interesting things, such as driving a car, recognizing faces, identifying spam mail, and even letting robots vacuum the house.

The core problem for machine learning is structure. Machine learning does well in structured environments, but not so well in unstructured ones. Driving a car is a great example: it went poorly until the problems the program couldn’t identify, such as subtleties in the driving surface or the significance of certain objects, were defined in a way the program could deal with. The underlying problem is structuring a problem so that the program can analyze and resolve it.

Consider the way people learn, and the way that machines learn. A child starts learning in an unstructured way. It takes a long time for a child to learn how to speak or walk, and each child learns at their own rate, based on their environment and unique genetic makeup. Only once they have an unstructured basis for learning do we add a structured learning environment: school. With machines we provide the structured environment first, and then hope they can learn the subtleties of a complex world.

There is an excellent paper by Pedro Domingos of the University of Washington looking at the growth areas of machine learning. One observation is that when humans make a new discovery they can create language to describe the new concept. These concepts are also comparable in the human mind: people can see the comparison between situations and apply techniques from one area to another. An example of this kind of human learning is physicists creating mathematical models for the financial industry to drive high-speed computer trading.

Structured machine learning models are making progress on some interesting problems, like those mentioned previously. The approaches mostly add more layers of complexity to the way problems are analyzed and resolved. Indeed, the future of machine learning lies not in the volume of data, but in the complexity of the issues to be studied. The world is a complex place, and human understanding of it combines literal and intuitive approaches; the intuitive is the ability to reach across domains of knowledge, extend understanding to new areas, and make qualitative judgments.

Because machines are literal, due to the nature of machine structure, programming has likewise been very literal. Machine learning models are getting far more sophisticated in terms of complexity. Still, the challenge of creating a structured tool (machine learning) to tackle an unstructured world may be a problem that will never be entirely resolved until we restructure the machine.

Perhaps we create a machine with a base level of “instinct”, enough to allow it to perceive the world around it, and then let it learn. If we were to create a machine that spends a few years observing the world and building its own basis for understanding the universe, what kind of intelligence would result? Would we be able to control it? Would it provide any useful service to us?


Hadoop Summit 2013

The Hadoop Summit for 2013 has just concluded in San Jose. A few themes recurred throughout the two-day summit, which drew over 2,500 people. The overall story is the continued effort to take Hadoop out of the experimental, fringe-case environment and move it into the enterprise with all the appropriate controls, functions and management. A related objective is to have 50% of the world’s data on Hadoop within five years.

The latest Hadoop release, 2.0, is known as YARN (Yet Another Resource Negotiator). To be a little more precise, Hadoop itself is still at less than 1.0, at release 0.23, but MapReduce is now version 2.0, or MRv2. The new MRv2 release addresses some of MapReduce’s long-known problems, such as security and scheduling limitations. Hadoop’s JobTracker (resource management and job scheduling) has been re-engineered to provide more control, with a global resource manager and a per-application “application master”. The new YARN APIs are backwards compatible with the previous version after a recompile. Good news, of course. You can get more details of the new Hadoop release at the Apache site, hadoop.apache.org

The other themes in the Hadoop Summit included in-memory computing, DataLakes, 80% rule, and the role of open source in a commercial product.

Hadoop has traditionally been a batch system, but enterprise applications demand interactive capability, and Hadoop is moving in that direction. It doesn’t stop there: the step beyond interactive capability is stream processing with in-memory computing. In-memory computing is becoming more popular as the cost of memory plummets and people increasingly look for “real-time” response from MapReduce-related products like Hadoop. The leading player in in-memory computing is SAP’s HANA, but there are several alternatives. In-memory processing provides blazing speed, but at higher cost than a traditional paging database that moves data in and out of rotating disk drives. Performance can be enhanced by the use of flash memory, but that may still not be enough. In-memory will typically have the best performance, and several vendors showing at the conference, such as Qubole, Kognitio (which pre-dates Hadoop by quite a bit) and DataTorrent, were touting the benefits of their in-memory solutions. They provide a great performance boost, if that’s what your application needs.

DataLakes came up in the kickoff as a place to put your data until you figure out what to do with it. I immediately thought of data warehouses, but this is different. In a data warehouse you usually need to create a schema and scrub the data before it goes in, so you can process it more efficiently. The idea of a DataLake is to put the data in and figure out the schema as you do the processing. A number of people I spoke with are still scratching their heads about the details of how this might work, but the concept has some merit.

The 80% rule, the Pareto Principle, refers to 80% of the results coming from 20% of the work, be it customers, products or whatever. In regard to Big Data, this is how I view many of the application-specific products: due to the shortage of data scientists, creating products and platforms for people with more general skills provides 80% of the benefit of Big Data with only 20% of the skills required. I spoke with the folks at Talend and that is clearly their approach. They have a few application areas with specific solutions aimed at analyst-level skills, addressing the fat part of the market.

Finally, there remains tension between open source and proprietary products. There are other examples of open source as a mainstream product, and Linux comes to mind as the poster child for the movement, but most open source projects are less mainstream. Commercial companies need to differentiate their products to justify their existence. The push for Hadoop to be the real success story for open source is pretty exciting. Multiple participants I spoke with saw open source as the best way to innovate: it provides a far wider pool of talent to access, and has enough rigor to provide a base that other vendors can leverage for proprietary applications. The excitement at Hadoop Summit about moving this platform into the enterprise is audacious, and the promise of open source software seems to be coming true. Sometimes dreams do come true.
