Marketing Analytics Beyond the Clickstream

Big Data and analytics are closely related.  What good is Big Data if not as a better basis for analysis?  Indeed, it is commonly said that more data trumps better algorithms.  One area that is very popular, particularly with retail concerns, is marketing analytics.  UCLA professor Dominique Hanssens is a recognized expert in this area.  He noted that marketing analytics as an academic endeavor started in the 1960s with a more rigorous look at what happens in markets.  Some of the first work looked at how a new product gains acceptance (or not) in the market, and how fast.  There have also been many studies on segmentation since the 1960s, which is critical to developing marketing messages that are more apt to produce results than blanket marketing to an undifferentiated audience.  Market response models followed in the 1970s and consumer choice models in the 1980s. 

The bar has been raised for marketing in recent years, reflecting the more quantitative approach business now takes; ROI accountability for all parts of the business is now common.  Marketing analytics isn't just about PPC, SEO and analytics on the clickstream.  Big Data analytics in marketing allows a more complete picture of not just market segments but even individuals.  Together they can be used to better manage the marketing budget.  One area that can have significant impact on a company's profitability is pricing management with analytics.  Some industries, such as airlines and hotels, have a perishable offering that benefits from a dynamic pricing strategy; indeed, consumers have found it more challenging to shop just on price, since the timing of the purchase has a great deal to do with the price paid for the airline seat or hotel room.  This is much more about the vendor capturing the value of the offering, not just a cost-plus pricing model, and the difference can significantly impact the profitability of the company. 
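To make the idea concrete, here is a toy Python sketch of dynamic pricing for a perishable seat; the function name, multipliers and thresholds are all invented for illustration, not a real revenue-management model:

```python
def seat_price(base, days_out, load_factor):
    """Toy dynamic price for a perishable seat or room.

    Urgency rises as departure nears; scarcity rises as the plane fills.
    The multipliers below are illustrative assumptions only.
    """
    urgency = 1.0 + max(0, 14 - days_out) * 0.05   # climbs inside two weeks
    scarcity = 1.0 + load_factor ** 2              # climbs as inventory sells
    return round(base * urgency * scarcity, 2)

# Same seat, very different prices depending on when it is bought
print(seat_price(200, days_out=30, load_factor=0.2))  # early, mostly empty
print(seat_price(200, days_out=2, load_factor=0.9))   # late, nearly full
```

The point is not the particular formula but that the price becomes a function of observed demand signals rather than cost plus margin.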

One example of how traditional analytics combined with recent access to Big Data can change a company's behavior is the diffusion of a new product in the market.  By viewing pre-sales behavior among prospective customers, you can predict eventual sales in different segments.  Big Data allows a more comprehensive description of market segments, as well as detailed information on pre-sales behavior.  New products will typically show different rates of adoption among different market segments.  By segmenting the market more completely you can more accurately identify prospective customers, and perhaps target additional promotional efforts at the segments more likely to produce a higher return on investment, where a blanket outreach may not be appropriate for the entire market space.  The Bass diffusion model has been around since 1969, but Big Data capabilities now allow a more granular look at how a product will move into the market and will suggest segments appropriate for additional marketing outreach.
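The Bass model itself is compact enough to sketch. This minimal discrete-time Python simulation uses illustrative (not fitted) parameters to show how a segment with a higher imitation coefficient adopts faster:

```python
def bass_adoption(p, q, m, periods):
    """Cumulative adoption per period under the Bass (1969) diffusion model.

    p: coefficient of innovation, q: coefficient of imitation,
    m: total market potential, periods: number of time steps.
    """
    cumulative = 0.0
    history = []
    for _ in range(periods):
        # New adopters = innovators plus imitators influenced by prior adopters
        new = (p + q * cumulative / m) * (m - cumulative)
        cumulative += new
        history.append(cumulative)
    return history

# Two segments, same market size: the high-q segment saturates sooner
fast = bass_adoption(p=0.03, q=0.5, m=10_000, periods=20)
slow = bass_adoption(p=0.03, q=0.2, m=10_000, periods=20)
```

With segment-level Big Data you would fit p and q per segment and aim promotion at the segments whose curves respond most.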

One way that marketing outreach might gain greater effectiveness with Big Data is in the creation of "buzz" among highly enthusiastic fans of a particular product by identifying the leaders in a given market segment.  Big Data allows us to analyze the connectedness of different individuals, so we can focus on the perceived experts in a given segment, providing them extra information and motivation to transmit their support for the new product to their network and thereby spur adoption in the market. 
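A crude sketch of that connectedness analysis, assuming we already hold follower or mention pairs mined from social data; simple in-degree stands in for influence here, where real systems would use richer centrality measures:

```python
from collections import defaultdict

def top_influencers(edges, k=3):
    """Rank users by in-degree: how many others follow or mention them."""
    indegree = defaultdict(int)
    for follower, followed in edges:
        indegree[followed] += 1
    return sorted(indegree, key=indegree.get, reverse=True)[:k]

# Hypothetical mention graph for one market segment
edges = [("ann", "cam"), ("bob", "cam"), ("cam", "dee"),
         ("bob", "dee"), ("eve", "cam")]
print(top_influencers(edges, k=2))  # ['cam', 'dee']
```

The names at the top of the list are the candidates for extra information and motivation.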

By combining the traditional marketing analytic models developed and refined over the decades with relatively new looks at what used to be hard-to-analyze unstructured data from the web capturing consumer behavior, we have a new depth in our understanding of how markets work, and the basis for more focused and accountable marketing.


The Power of Good Questions

Kaushik Das of Greenplum, at a recent Big Data Cloud meetup, spoke about Big Data as an enabler of 21st-century storytelling.  One of the comments made was about the Chomsky vs. Norvig argument regarding patterns that can be found in random data.  Correlation is not cause and effect, a difference that most of the press still does not comprehend.  Chomsky, arguably the premier linguist in the world, feels we need to understand cause and effect, not just correlation as identified by Google's Norvig.

It does bring into focus the ability of analytics to deliver insights versus mere pattern searching.  The difference is in the quality of the questions asked.  Because analytics is, in essence, pattern matching, one can derive more insight with multiple questions, and I believe this is how most practitioners approach Big Data analytics.  The tools of analytics are powerful and provide greater reach than previously possible.  No longer are we limited to averages and samples; a complete analysis of the data is possible.  This is important because at the fringes things can become non-linear with spectacular impact.  Look at the financial crisis or hurricane prediction as a couple of examples of where data that behaves very predictably in the middle of a normal distribution can become very wild at the extremes.

Real insight becomes storytelling.  I mean this in the highest sense.  Storytelling that allows us to understand systems, be they natural or based on human behavior, can be tremendously powerful.  At best it becomes predictive.  At worst we are awaiting a hundred monkeys with typewriters to produce Shakespeare.  When in doubt, ask more questions.  Do more analytics, and get more perspective on the story.

Parallel File Systems for Big Data at SNW

SNW, formerly known as Storage Networking World, is a cooperative effort between SNIA and Computerworld. 

Interesting SNW observations:

Lots of flash storage: some for block architectures, some NAS, some cards, quite a variety of flash memory solutions.  The reason for all the interest is that the gap between server and storage performance has reached a critical level; combined with a whopping drop in the cost of NAND flash memory, it is changing the storage world.  This change will be felt in the Big Data world too.

Another observation: JBOD is now JBOSS.  There are no disk drives in the latest flash-based systems, so "just a bunch of disks" doesn't make sense.  I propose "just a bunch of solid state," or JBOSS.  It sounds cool too.

SNIA has proposed a next-generation distributed file system, but so have a bunch of others.  A presentation by Philippe Nicolas of Scality caught my attention.

Things don't scale well at the petabyte level, of course, and things based on hierarchical models don't parallelize well at all.  A potential solution for a Big Data approach is a distributed file system.  Think along the lines of how RAID storage spreads parity over multiple drives and reassembles data in the event of a drive failure.  In a parallel file system, the file system is spread over a number of servers and assembled as needed.  This means the relational DB model is out, and a key-value model (recognize that, Hadoop lovers?) replaces it.  It means that file systems become more of a peer-to-peer proposition.  In such architectures, the tradeoff can be performance against some sort of global namespace to hold the metadata; if done right, and if there is some replication of the data, the performance can be better.  The latest versions are built on approaches from Gnutella, Napster, BitTorrent and others.  The new approaches can actually be used for legitimate purposes.
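A minimal sketch of the peer-to-peer placement idea in Python, assuming a simple hash-based scheme (production systems use consistent hashing and richer metadata): any client can compute where a key lives with no central hierarchy, and replication means a single node loss loses no data.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICAS = 2

def place(key, nodes=NODES, replicas=REPLICAS):
    """Deterministically map a key to `replicas` distinct peer nodes."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

# No central directory needed: every client computes the same answer
print(place("/videos/clip-0042.chunk.7"))
```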

Philippe's tenet was that traditional file systems have hit their limit, and things will have to move to parallel structures.  One example of a distributed file system is Apache's HDFS.  It creates a federation of nodes that act as the file system, with a high-availability name node for the metadata.  It is built in Java and can scale to 120PB. 

Another example is the Google File System.  The reason Google needs a massive file system is obvious.  GFS is now at version 2, which takes data in 1MB chunks and distributes them across a multitude of nodes.  The metadata is kept in multiple distributed master nodes.  MooseFS and CloudStore have a similar approach.

Parallel NFS (pNFS) is an update of a long-time tool, NFS.  pNFS allows the metadata to be distributed over many nodes to improve performance and availability, and is being looked at closely in SNIA.

Lustre is an object-based system that has evolved from widespread academic roots, including Carnegie Mellon.  It is commonly used in academic HPC environments where massive scalability is required. 

Check the Scality presentation for more details on their view of file system evolution and the current state of the art.

My opinion is that Big Data is the right approach when there is no one single answer.  If the question is "What were the company's earnings this quarter?", it is best solved with a relational DB and traditional data approaches with ACID rigor.  If you want to know why your sales are off, that is a great Big Data problem that will require a lot of information and probably generate a thought-provoking answer that leads to more questions.  Big Data is here to stay, but so are relational approaches, so to get the best results, use both.


Extremely Large Database Conference

SLAC/Stanford Extremely Large Database Conference (XLDB), September 2012

The XLDB conference is targeted at highly sophisticated end users working on large commercial, scientific and government projects.  Companies included eBay, Zynga, Google, Comcast, Sears, Facebook, Cloudera, AMD, HP, 10gen, NTT, Teradata, EMC, StumbleUpon and Samsung.  Educational institutions included the San Diego Supercomputer Center, Michigan State, UC Santa Cruz, University of Chicago, University of Washington, Portland State, University of Toronto, UC San Diego, Georgetown University, Jacobs University and Stanford/SLAC.  National labs included CERN, Lawrence Berkeley, and the National Center for Biotechnology Information.

Presenters came from a variety of backgrounds, but the common thread is solving complex, large computing problems.  There is a bias toward open source software solutions, but commercial vendors were welcomed too, if they focus on large complex problems.  The audience is very familiar with the common approaches, and comes together to refine its approaches and tools.  I have compressed the sessions I attended to glean the more interesting points from a variety of speakers and topics.

The growth of extremely large databases is not coming from transactional data.  Traditional IT shops with relational databases grow a relatively small amount each year.  As such, they might be good candidates for in-memory computing or SSD storage.  Because of the high velocity of this type of data, it is considered "Hot Data", and a common rule of thumb is to place the data responsible for 90% of your I/O on high-performance storage to get the best performance.

Much of the extremely large database load is often "Cold Data" that is infrequently accessed but can still be valuable.  This is commonly data used for analytics, which might include clickstream data, sensor data, and other non-transactional data.  For most analytics, bandwidth is more important than IOPS.  This sort of data is also commonly referred to as "Big Data".
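That rule of thumb is easy to operationalize. This Python sketch, with hypothetical I/O counts, picks out the smallest working set responsible for 90% of observed I/O as the candidate for fast storage:

```python
def hot_working_set(io_counts, hot_fraction=0.9):
    """Smallest set of objects covering `hot_fraction` of total I/O.

    io_counts: dict of object name -> observed I/O operations.
    """
    total = sum(io_counts.values())
    hot, covered = [], 0
    for name in sorted(io_counts, key=io_counts.get, reverse=True):
        if covered >= hot_fraction * total:
            break
        hot.append(name)
        covered += io_counts[name]
    return hot

# Hypothetical counts: one table dominates the I/O
counts = {"orders": 900, "sessions": 50, "clicklog": 30, "archive": 20}
print(hot_working_set(counts))  # ['orders']
```

Everything outside that set is "Cold Data" and can live on cheaper, higher-bandwidth media.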

"Hot Data" is usually suitable for in-memory configurations.  The cost of memory is falling rapidly, and the performance benefits now outweigh the costs for high-value transaction data.  Relational databases are the usual engines for transactional data; for larger RDBMSs, flash memory and/or SSD storage can be a great fit.  There are some aspects of in-memory systems that need to be examined.  Not all "in-memory" databases allow writes: you may have to refresh the database to get new data in, then perform analytics, and refresh again.  HANA from SAP is an example of an "in-memory" solution.

For "Cold Data", traditional HDD storage should provide the right performance-for-cost balance.  As we continue to accumulate data, there is a natural consequence of Big Data: the mobility gap.  Because databases and data sets are getting so large, they can't be moved as a practical matter.  This implies that data centers need to move to 10Gb connections for everything as soon as possible to mitigate this phenomenon.

To bridge the gap between RDBMSs and NoSQL databases like Mongo and Cassandra, the array database is a new approach, an example being SciDB, which is typically used for scientific research.  No tables, and it is massively scalable.  It is an open source project.

It was an excellent conference, with some very knowledgeable people presenting and in the audience.  Sometimes the discussions with the presenters got a little loud, but everyone was focused on getting to the bottom of whatever problem was being discussed.


Flash Memory in a Big Data World

The Flash Memory Summit just concluded in Silicon Valley.  The vendors and attendees were legion and the excitement level was high.  Several vendors were fishing for employees, and the industry definitely felt like it was on a growth path.  One of the common footballs being passed back and forth was "What role does flash memory play in the enterprise?"  Refinements on that theme included "Will SSDs replace HDDs?" and "Will new applications, like Big Data, enable a new age for flash?"

Let's take a quick look at flash memory first; then we can assess how it might impact Big Data applications.  Flash memory is nonvolatile semiconductor memory.  That is to say, unlike DRAM, flash memory remembers even after the power is turned off.  It is like DRAM in that both are semiconductor technologies.  Why is flash memory getting a lot of attention now?  It's all about price.  The cost of flash memory has plummeted.  One speaker mentioned that in the first half of 2012 prices dropped 46%, and were expected to drop about 64% for the year.  It was pointed out that SSDs that use flash memory are now priced competitively with 15K HDDs, the high-end enterprise HDDs found in many storage arrays.  That will enable a lot of new applications and attention. 

Flash memory is the heart of the solid state disk, or SSD, but you can also get flash as a card or just chips.  SSDs differ from raw flash in that SSDs have a controller, some software, and an interface like SAS or SATA instead of a bus interface like PCIe.  The technology in flash is NAND semiconductor logic, and it can be found in lots of consumer electronics like your camera, cell phone and thumb drive.  It is incorporated into other components and packaging for your SSD.

It sounds great, so we can expect SSDs to take over from HDDs, right?  Not so fast.  It's fair to say every technology has some good and bad points.  Latency is a great advantage for solid state memory, either DRAM or flash.  But flash has some issues.  Write performance is a challenge, and can even be slower than an enterprise HDD.  Another issue for the enterprise is the traditional RAS concern (reliability, availability and serviceability); for flash, the issue is endurance.  Heavy use, especially writes, will cause a flash memory cell to wear out and cease to function.  Flash has a limited life.  The good news is that there are workarounds.  As mentioned earlier, an SSD isn't just flash; there are other components, and these can be used to ameliorate some of the downsides of flash.
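One of those workarounds is wear leveling in the SSD controller. A toy sketch, assuming the controller tracks per-block erase counts (real firmware is far more elaborate, adding error correction, over-provisioning and garbage collection):

```python
def wear_level_write(erase_counts, live_blocks):
    """Pick the least-worn free block for the next write.

    erase_counts: dict of block id -> erase cycles so far.
    live_blocks: set of block ids currently holding data.
    """
    free = [b for b in erase_counts if b not in live_blocks]
    # Spreading writes evenly keeps any one cell from wearing out early
    return min(free, key=erase_counts.get)

print(wear_level_write({"b0": 10, "b1": 3, "b2": 7}, {"b0"}))  # b1
```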

How does this impact Big Data?  In-memory applications will benefit from SSDs, since flash offers latency advantages similar to DRAM's (not as fast, but also not as expensive), so these jobs will see a benefit.  Looking at the needs of the Big Data job is the best way to predict whether there will be enough benefit to offset the expense of flash.  For instance, Hadoop is a batch system.  There are benefits to faster execution of a batch job, but consider how time-sensitive your requirements are, since flash can be pricey. 

Another issue is the configuration.  Most SSDs are found in storage arrays, where virtualization and Big Data might coincide.  If storage is being configured to support a NoSQL job like Cassandra, MongoDB and the like, this could be a good use of flash.  If the configuration is direct attached storage, in the range of traditional Hadoop clusters, it might not make sense.  LSI was making a case at the Flash Memory Summit that their hybrid drive with flash and rotating disk was a perfect solution for DAS in a Big Data cluster.  They claimed a 37% reduction in run time for a sample Terasort job in a Hadoop environment.

Other speakers feel strongly that bandwidth is the requirement for Big Data performance.  You can see why, since Big Data apps may involve moving massive amounts of data, and moving that data quickly can be a major determinant of overall performance.

The basics require that you understand the nature of your job so you can put the right storage in the right configuration, and there might be a vendor out there with enabling software that can optimize for your situation.  One vendor was showing how their algorithms can double the performance of their SSDs over standard SSDs.  This just demonstrates how variable your results might be.  Even so, it's nice to have options, and the Flash Memory Summit showed how even Big Data can benefit from this fast-changing technology. 

Big Data Storage

The fundamental hardware view of Big Data, with its open source mindset, is cheap commodity storage and servers.   A major tenet of the Big Data movement is to keep things inexpensive, both for software and hardware.  The desire is understandable: with the size of Big Data implementations, costs can easily become scary. 

From a hardware view of Big Data, the baseline theory is that you might have blade servers in a rack with a few SATA hard disk drives directly attached to each server.  These compute nodes are then replicated to meet the needs of the job.  With terabyte and larger hard drives now relatively cheap, multi-terabyte configurations can be implemented without the huge cost of just a few years ago.  The server configurations tend to have large amounts of main memory (maybe 24-48 GB of RAM) and multiple cores (4, 6, or 8, say), which benefits the parallel processing jobs that a MapReduce architecture typically produces.  The ratio of storage to compute power depends on the demands of the job.

Experienced systems administrators will wonder what is being done to protect the integrity of the system.  Clustering, redundancy and failover options are standard in the Hadoop/MapReduce environment.  The common wisdom is that RAID architectures are unneeded, since nodes are redundant in the typical Hadoop configuration.  Similarly, redundant power on the servers is not required, since entire nodes can be replicated, so there is no need to spend money on extra hardware.

Consideration should be paid to the kind of storage placed in the cluster, since not all nodes are created equal.  Name Nodes provide a control point for multiple Data Nodes and benefit from additional attention.  Similarly, Job Tracker nodes provide control for multiple Task Tracker nodes and should get additional attention.  These high-value nodes might benefit from better quality storage, like higher-reliability HDDs or SSDs, to improve uptime. 

For more complex configurations there might be a hierarchy of storage: SATA hard drives for Data Nodes and Task Tracker nodes, SSDs for Name Nodes and Job Tracker nodes.  Tape still has value as an archive medium for these jobs, since future analysis might require a longer historical picture, in which case the data would need to be reloaded onto active media like HDDs or SSDs. 

At the Hadoop Summit there were multiple traditional storage vendors making a case for their non-commodity storage.  The crux of their argument is that if your data is important enough to use, it ought to get their levels of reliability and availability.  Today's advanced storage arrays were created to solve a number of problems with direct attached storage, such as expandability, reliability, availability, throughput and backup. 

Additionally, VMware made a case for virtualization to better share resources, like storage, and manage these large clusters more efficiently. 

The bottom line is that the hardware configuration employed has to fit the needs of the job.  Knowing the size, complexity and importance of the job will help you design an appropriate platform to execute your Hadoop job.  Some trial and error is common in these early days of Hadoop deployments, so keep an open mind!

Hadoop Summit 2012 Summary

The Hadoop Summit, June 13 and 14, 2012, was attended by over 2,000 Big Data geeks according to the organizers.  There were over 100 sessions and keynotes, and several particularly interesting comments made by the speakers.


Real-time systems, resource management and hardware are the next Hadoop frontiers.  – Scott Burke, Yahoo!


We expect to see over 50% of the world's data on Hadoop by 2015.  – Shaun Connolly, Hortonworks


Data will grow from 1.2ZB in 2010 to 35.2ZB in 2020.  – IDC Digital Universe Study, quoted at the summit


Yahoo! has a Big Data configuration including over 40,000 nodes!  It uses proprietary management to handle this monster.  It services 3,000 users inside Yahoo! 


For more normal configurations, large was considered to be 3,000-6,000 nodes in operation. 


I was surprised at how many people were muttering about how to displace Oracle installations with Hadoop.  The open source movement is something like a religion to some people, and the community has its best shot at a real application-based beachhead with Hadoop and its attendant components.


There are several tools that try to paste together a comprehensive system, and some providers like Hortonworks and Cloudera have put together stable Hadoop platforms with a set of tools to make it usable.  The established players are getting in on the action too. 


IBM had a significant presence with their BigInsights version of Hadoop.  VMware was very active on the demo floor and with presentations discussing how their approach will bring some additional discipline and tools to the party. 


One of the very fun parts was hearing from companies and organizations that have implemented Hadoop and are using it for real results today.  Because Hadoop is often used to boost one's competitive position, in many cases the details were missing, but the stories were no less compelling.  @WalmartLabs presented, and was also recruiting.  Indeed, almost every presenter mentioned that the company was hiring. 


It’s a party like 1999.


VMware has a different take on storage for the Hadoop cluster.  By design, Hadoop uses local direct attached storage; VMware wants to bring the efficiencies of storage networking to the Hadoop cluster.  They showed some data about direct storage being cheaper than storage-networked configurations, and then talked about managing storage more efficiently with VMware.  The major benefit touted was availability through failover of critical components, like the name node.  They have a project, Serengeti, to manage Hadoop and provide some structure for availability.


A number of speakers addressed the issue of working Hadoop into existing relational database environments.  Sqoop was repeatedly mentioned as a mechanism to import data into Hadoop to make the analytic efforts more comprehensive and useful. 


Finally, log files, the long forgotten detritus of the data center, are getting some respect.  Now that there is a method (Hadoop) for using this data to predict upcoming faults and data center problems, log data is getting new attention.  Log files are now also used in security analytics to look for patterns of incursion or threats. 


Do You Need a PhD to Analyze Big Data?

It has been said that you need a doctorate to adequately run Hadoop and the many attendant programs for Big Data analysis.  Is this truth or fiction? 


The Hadoop environment is complex.  Hadoop itself is an open source program from Apache, the people that bring you a wide range of internet-related open source software projects.  Hadoop, as of May 2012, is at release 1.0.2.  There is still a lot to learn about Hadoop and the code is not super mature.  As with many open source projects, there are a lot of very talented people working on it, and it will gain more function and stability over time.  As you might imagine, it has a lot of Java and Unix related stuffing, and there are a number of support programs required to have a real Big Data solution based on Hadoop.


At Big Data University, an IBM™ education effort on Hadoop, you get free hands-on experience with Hadoop and some related software.  In Hadoop Fundamentals I, you can get Hadoop code by downloading it or by using Hadoop in the cloud on Amazon™.  The coursework takes you on an adventure with some great narration for each module and hands-on labs.  The objective is to get you comfortable with Hadoop and Big Data frameworks.


The introductory Hadoop class gets into the IBM version of Hadoop, called BigInsights, as well as HDFS, Pig, Hive, MapReduce, Jaql, and Flume.  Oh, by the way, there might be some user-developed code required to glue the parts together, depending on your needs.  Some assembly is required.


IBM InfoSphere BigInsights™ provides a stable configuration of Hadoop, and it is the version IBM provides for the class.  The software is free, and IBM can provide support for a price, of course.  You get a stable Hadoop version with the IBM spin, but other providers like Cloudera and Apache can also supply Hadoop. 


To actually build a system that does something with Hadoop, you'll need some other components.  HDFS is the file system that supports Hadoop.  The IBM course takes you through how Hadoop makes use of a file system designed to support parallel processing threads.  You also get an HDFS lab as part of the course.


Next they get into MapReduce: the methodology of parallel processing, the logic of how data flows, and how the pieces fit together.  This includes merge sort, data types, fault tolerance and scheduling/task execution.  Then you get the lab to see if you really got it.
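The MapReduce flow can be sketched outside Hadoop entirely. This minimal Python word count mimics the map, shuffle/sort and reduce phases the course walks through:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc):
    # Mapper: emit a (word, 1) pair for every word in the document
    return [(word.lower(), 1) for word in doc.split()]

def reduce_phase(pairs):
    # Shuffle/sort groups identical keys; the reducer sums each group
    pairs.sort(key=itemgetter(0))
    return {key: sum(n for _, n in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

docs = ["big data is big", "data flows in parallel"]
counts = reduce_phase([kv for doc in docs for kv in map_phase(doc)])
print(counts)  # {'big': 2, 'data': 2, ...}
```

Hadoop does the same thing, with the mappers and reducers scattered across the cluster and the shuffle moving data between them.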


Now that you have the big picture on Big Data, what do you do with it?  Creating a MapReduce job directly can require a lot of Java code, so you use high-level tools to create the job and thus reduce the effort and time required.  That's where Pig, Hive, and Jaql come in.   Pig originated at Yahoo™ and provides a great way to handle unstructured data, since there is no fixed schema; for unstructured data, think of analyzing Twitter comments.  Hive was developed by Facebook™ and is a data warehouse function for Big Data.  It is a declarative language, so you specify the desired output and it figures out a way to get it.  Jaql was developed by IBM and is another schema-optional language for data flow analysis. 


Finally, you might want to move data around your Hadoop cluster.  Flume was developed by Cloudera™ for this purpose.  With Flume you aggregate and move data into an internal store like HDFS. 


Let's return to the original question in the title of this article: do you need a PhD to run a Big Data job?  Probably not.  But it wouldn't hurt.  In any event, you will need to be a highly competent programmer to get good results.  Hadoop solutions are best considered an iterative process, so there will be some experimentation.  A combination of skills that includes programming and analytics methodologies will produce the best results.  So if not a PhD, a really accomplished computer science graduate would seem to be the right focus to enable a Big Data solution. 

Social Media and Big Data

Social media information can be defined as unstructured data: narratives, random facts, opinions, numbers, fabrications and fables, all mixed together with abandon, with unknown relationships among any of the data.  So why is this useful?

When you ask a question in a survey, you often get what the subject thinks you want to hear.  If you want to know what a person is really thinking, you might have to eavesdrop.  A better way is to check blogs, Facebook postings, forums, tweets, and the like to see what people are thinking.  In a traditional survey, all the data is about you, and you control the questions.  In a social media search, the hit rate for topics you want information on, like your brand or event, is a small minority of the comments made.  You might have to sort through half a million documents to get any useful information, and you might get better data if you look through millions.  Where does this info come from?

There are data aggregators who are always on the prowl, gathering information for rent.  You can get raw data from the aggregators and put it into your analytics engine.  You could also employ your own crawler to go through your documents, customer comments and the like to create the mass of unstructured data you want to understand.  The objective would be to create an aggregation metric that might look at what people are saying about your brand, product or company.  You might want to see how these comments change over time, or use them to gauge opinion on a marketing campaign or product launch.  This sort of input can be valuable to the way you run your business.

This information is still just correlated events, not causality.   To take the next step and decide what to do with this information, perhaps it's time to join the conversation.  By analyzing the information you can find out where people are talking about you, not just what they're saying.  You might want to monitor or participate in the more active places where you are being discussed.  This will require some thought to make sure you don't undermine the activity, but instead help people already thinking about your company or product get even more involved.

Analyzing social media is not an exact science but an iterative process in which you do some probing and testing to see what results might be useful.  The major reason for this is the raw data.  It's not about you.  It's about everything.  Some sliver of the information might be interesting to you; most will not.  You will build rule sets to apply against the text to find something, and more rule sets as you get results that aren't quite what you were looking for.  You'll end up creating structured data out of unstructured data.  Why?  Because you want to find actionable information, and that is best done with structured information. 
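A minimal sketch of such a rule set in Python; the category names and patterns are invented for illustration, and real systems grow to hundreds of rules plus proper NLP:

```python
import re

# Each rule turns a fuzzy notion into a testable pattern
RULES = {
    "praise":    re.compile(r"\b(love|great|awesome)\b", re.I),
    "complaint": re.compile(r"\b(broken|slow|refund)\b", re.I),
    "lead":      re.compile(r"\bwhere (can|do) i buy\b", re.I),
}

def classify(posts):
    """Turn unstructured posts into structured {text, tags} records."""
    return [{"text": p, "tags": [name for name, rx in RULES.items()
                                 if rx.search(p)]}
            for p in posts]

posts = ["Love the new model, where can I buy one?",
         "Screen arrived broken, want a refund."]
print(classify(posts))
```

Each pass over the results suggests the next rule to add, which is exactly the iterative loop described above.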

Once you've refined your rule set so that you're getting useful information from the mountain of data, you'll need to decide how often to search the mountain, what kinds of analytic queries will find new information in the same mountain, and where to find new mountains of data to pursue.  The results are extracted into a structured array of objects that might include potential leads or product interest.  That's where the gold resides.  You turn your marketing engine on those results, then refine the process and begin again.

Comments on “Competing on Analytics” by Thomas Davenport & Jeanne Harris

Much of American business is built on intuition and experience.  The authors make the case that the power is shifting to enterprises that quantify their decision processes.  There are several notable examples of success.  There are few documented failures, although I would expect that they exist too.  The task of identifying a problem that Big Data and business analytics can solve is not trivial, and executing on that plan can be a significant undertaking.  The price tag of business analytics needs to be viewed in the light of any other business expense, answering the question "What is the return on investment?"  Usually the outcome is unknown, so the benefit may be difficult to quantify until after the fact.  By then, of course, the money is spent.


The results can also be profound.  The authors take a look at several companies that have been able to create or extend competitive advantage based on a more quantitative view of decision making.  Several notable companies are counted among the quantitative winners, such as Google, Netflix, and Amazon.  According to the book, decision making based on executive whim is becoming an endangered species.  Indeed, once a customer base grows beyond the comprehension of any one person, you are better served using statistics to understand the behavior of your customers.  Big Data/analytics can also be used to sharpen manufacturing processes, operations, vendor management, pricing, and human resources. 


That last one, human resources, takes a little explaining.  If you saw the movie Moneyball with Brad Pitt, you saw how statistics can trump intuition.  Billy Beane’s move to evaluate players statistically, screening out their personal, emotional, qualitative aspects, produced one of the best runs in baseball.  The A’s didn’t win the World Series, but for a given budget they may have maximized their return on investment. 


The book makes a case that a company’s employees need to be as closely scrutinized as their customers.  An interesting theory, but with what data?  One of the problems with most professions is that their evaluations are largely qualitative.  Salespeople have quotas, so you can measure quota performance, but most positions defy easy quantification.  The risk is that trivial activities that can be measured become the focus instead of contributions to the business.  If analytics for human performance at work can be developed, I would expect to see a new class of executives moving to the forefront.  Current executives have strong social skills and business skills; in many cases their social skills eclipse their business skills.  If the proper metrics can be developed and tracked, and promotions follow them, a different kind of executive may emerge, with better business skills and perhaps lesser social skills.  This may be good or bad for the workplace.  But this isn’t about enjoying work.  It’s about maximum returns.


The authors create a spectrum for the role of analytics in a business, from non-players (Analytically Impaired, stage 1) to masters (Analytical Competitors, stage 5).  They portray a change that must be led by senior executives, because the transition will impact managers, employees, and their support structure, with significant budget and process disruption.  The highest use of analytics can create big results and a sustainable advantage enterprise wide.  Analytics then becomes the primary driver of performance and business value. 


Indeed, there is data to suggest that extensive use of analytics can create significant and durable competitive advantages.  The authors do acknowledge that not every industry can be transformed with analytics.  They point to the airline industry, which uses analytics extensively in pricing and operations, yet the largest players in the industry keep flirting with bankruptcy.  Of course, it might be worse if they were not so analytically inclined. 


They break the domains for analytics into internal, such as financial, manufacturing, R&D, and human resources, and external, for customers and suppliers.  An early project might be getting a handle on costs, which can be particularly tricky for services.  They also note that analytics is generally an iterative process, and some experimentation may be required.  This increases the price tag, but it is the predominant way an organization gains the experience and expertise to then leverage projects in other areas of the business.  The path from a stage one to a stage five company will be different for each, but most companies that can overcome the management issues and persevere can become dominant in their industries if they continue to evolve. 


The skill set for executives and analysts is significant and somewhat daunting.  The authors get a little hissy about a doctorate being the only proof of competence.  But certainly the skills laid out in the book are not pervasive in the industry, and there is wide concern about the scarcity of skilled people to execute an analytics plan.  Executives also need some comprehension of analytics so they can correctly direct the effort and resources toward high-return efforts. 


Finally, I don’t think there can be any doubt that Big Data/analytics is changing the world we live in.  This trend will only accelerate.  There will be out-sized rewards for companies that move quickly, and the laggards risk a changed market in which they cannot effectively compete.  Big Data/analytics will not change everything.  It will change most things.  The companies that brave the uncertainty and resource constraints will be the ones with the best chance of survival in the economies of the twenty-first century.