Busy and Lazy Data

Big Data files seem to come in two flavors: busy and lazy.  This is due to the nature of the data that we generate, and its usage.  Some data will be intensely used while it is fresh, and then relegated to a data parking lot for lazy data.  Other data, particularly unstructured data, will be used less often because the usefulness of the data is in identifying trends or outliers and may be processed in batches, instead of in an interactive manner. 

Busy data benefits from more aggressive management and higher performing systems, on the premise that busy data is more valuable and more time-sensitive.  There is a great study of the business value of latency for managing busy data, presented at the 2009 O’Reilly Velocity Conference in San Jose (http://www.youtube.com/watch?v=bQSE51-gr2s), and it reinforces the business need to provide a low latency environment for busy data to get the best result.  The video discusses how Microsoft’s Bing and Google independently came to a very similar conclusion: latency matters.  It was previously thought that people couldn’t perceive differences under 200ms.  These two studies show that people’s behavior responds to faster response (lower latency) until system response times fall below 50ms.  This means your system design should deliver consistent end user response times of about 50ms.  The slowest element in a system design is typically the electro-mechanical disk drive, which would indicate that it is time to go to an all-flash array architecture, like those from Violin Memory and others.  The business value is surprising.  Going from 1000ms to 50ms was found to leave users more engaged and spending 2.8% more money.  If your business has anything to do with ecommerce, or just getting more productivity from high-value employees, you ought to be looking at 50ms system response times.

Lazy data is a different story, of course.  There is an excellent paper from the USENIX association, a storage geek group, that bears review.  The USENIX 2008 paper by Andrew Leung et al., “Measurement and Analysis of Large-Scale Network File System Workloads,” provides some color on how files are used and how they behave over time (http://www.ssrc.ucsc.edu/Papers/leung-usenix08.pdf).  As the world of Big Data takes hold, the usage patterns for files are changing.  The authors report that read-to-write ratios have decreased, read-write access patterns have increased, sequential runs are longer, most bytes transferred come from larger files, and file sizes are larger.  Files live longer, with fewer than 50% deleted within a day of creation.  Files are rarely reopened; when they are, it is usually within a minute.  They also saw the noisy neighbor effect, with fewer than 1% of clients accounting for 50% of file requests.  76% of files are opened by only one client.  File sharing happens for only 5% of the files, and 90% of that sharing is read-only.  Interestingly, most file types do not have a common access pattern.  It sounds like disk might be just fine for lazy data.

The conclusion is that busy data needs a system focused on performance.  And that performance must be extreme by the standards we previously knew, down to a 50ms system response time.  The move to an all-flash storage environment and the elimination of hard disk drives for busy data seems indicated.  A 50ms system response time is difficult in online situations where network latency is highly variable, but it does provide an indicator of why Google is spending money to put whole communities online.  The other conclusion is that lazy data might be a good fit for lower cost storage, since the intensity of usage is less and the primary goal should be cost effectiveness.  Disk storage, with its sequential performance strength, is a good fit for lazy data.  To keep costs down you might want to consider some sort of RAID alternative such as SWARM or erasure coding.


SWARM Intelligence and Big Data

Swarm technology may be thought of as insect logic.  Ants behave in the colony’s interest, but without specific guidance.  How can an animal with a brain the size of a grain of sand create a nest, find food, support a queen and expand to new areas?  Each ant is an individual point of simple intelligence, following rules that are shared by all ants.  Everything that the ant colony needs to get done, gets done.  Simplicity also extends to ant communication: using pheromones, the ants communicate with each other without personal contact by reading the chemicals left behind by another ant.  Ants are great at parallel operations, since there is no single point of control for all actions, but only one ant, the queen, is empowered to reproduce.  In this way the colony is controlled, and the needs of the colony are met.

Think of a Big Data problem.  The MapReduce architecture creates multiple threads.  Simple key/value logic is applied, and the results are then shuffled to create a reduced data output that has been intelligently organized.  Each of the MapReduce nodes has simple intelligence to perform key/value matching.  But in the world of swarm intelligence, this could be done by a multitude of agents, not just a few processors in a batch job.  What if you could release multiple agents into your database of unstructured data to look for anomalies, weed out spurious data or corrupted files, organize data by individual attributes, or identify alternate routes if there are networking problems?  Perhaps swarm intelligence could comb Facebook data to find the next mentally unstable serial killer before they strike.  The beauty is that these agents can be programmed with different simple logic, and in doing so the cost is kept low and the performance is kept high.  The simple intelligence might be programmed to include filetype, age, data protection profile, source, encryption, etc.
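The map, shuffle and reduce steps described above can be sketched in a few lines of Python.  This is a toy, single-process word count to show the key/value flow, not Hadoop’s API; the function names and the sample documents are my own.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (key, value) pair for every word: (word, 1)."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    return reduce_phase(shuffle_phase(map_phase(documents)))


# Made-up input, just to exercise the three phases.
docs = ["big data is big", "lazy data is cheap"]
counts = word_count(docs)
```

In a real cluster the map and reduce functions run on many nodes at once and the shuffle moves data over the network; here everything runs in one process so the logic stays visible.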

One premise is that swarm logic makes file structures unnecessary.  All information about the data is included in the tokens (pheromones).  It would allow the integration of structured and unstructured data.  In short, swarm can change everything.  I won’t go so far as to say all file structures can go away.  The ACID test for data coherency is important for those tasks that have one correct answer, like what was my profit last year, or how many widgets are in the warehouse.  For jobs like this, structured data and relational databases might be best.  But for many analysis jobs, swarm may be a great improvement.

Swarm provides something that traditional data structures don’t: file intelligence.  In today’s structures we have limited space to specify a rigid set of metadata that is inflexible.  With swarm we can add more intelligence into the data using tokens with a combination of intelligence and information, and let them loose on our unstructured data to find organization, or to sort or otherwise manipulate data.  

Bridging the Gap Between Structured and Unstructured Data

Relational databases like structured data: tables of columns and rows in a defined schema, so everyone knows what to expect in every place.  Unstructured data, like text or numeric values, is a different story.  Hadoop certainly fills a need by using key value pairs to create some structure where there is none.  To get the two worlds of structured and unstructured data to work together, there is usually a bridge of some sort.  The relational database may allow an import of key value data so it can be incorporated into the relational schema, although usually with some work.  This handshaking between relational and key value databases is limping along, but it is workable.  Under heavy loads, the networking and performance impact of moving massive quantities of data around can be taxing.  Is there a better way?

To make some sense out of unstructured data some sort of framework needs to be overlaid on the raw data to make it more like information.  This is the reason that Hadoop and similar tools are iterative.  You’re hunting for logic in randomness.  You keep looking and trying different things till something looks like a pattern. 

Besides unstructured and structured data is the in-between land of semi-structured data.  This refers to data that has some beginnings of structure.  It doesn’t have formal discipline like the rows and columns of structured data.  It is usually schema-less or self-describing structure in that it has tags or something similar that provide a starting point for structure.  Examples of semi-structured data might include emails, XML, and similar entities that are grouped together.  Pieces can be missing, or size and type of attributes might not be consistent, so it represents an imperfect structure, but not entirely random.  Hence the in-between land of semi-structured data.

Hadapt takes advantage of this semi-structured data with a data exchange tool that creates a structured data construct.  They use JSON as the exchange format.  This is rather clever since JSON is a JavaScript derivative that is fairly well known.  JSON is very good at data exchange, which can solve the semi-structured to structured problem.  By starting with semi-structured data, they get a head start on structure.  JSON is particularly well suited to key value pairs or ordered arrays.

The semi-structured data must be parsed as JSON, which can then create an array of data that is available to be manipulated with SQL commands to complete the cycle.  Once there is a structure in place, SQL is quite comfortable with the further manipulation.  After all, it is the Structured Query Language.  The most sophisticated tools are in the relational world, hence most efforts to make sense of unstructured or semi-structured data add more structure to allow more analysis and reporting.  And after a few steps you can indeed get order out of chaos.
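To make the parse-then-query cycle concrete, here is a minimal sketch in Python.  It is not Hadapt’s implementation: sqlite3 from the standard library stands in for the SQL engine, and the sample records, the table name and the column names are all made up for illustration.  Note that one record is missing a field, which is typical of semi-structured data.

```python
import json
import sqlite3

# Made-up semi-structured records; the second one has no "size_kb" field.
raw_records = [
    '{"from": "ann@example.com", "subject": "Q2 forecast", "size_kb": 12}',
    '{"from": "bob@example.com", "subject": "Lunch?"}',
    '{"from": "ann@example.com", "subject": "Q2 forecast v2", "size_kb": 48}',
]


def load_into_table(conn, records):
    """Parse each JSON record and store it as a row; missing fields become NULL."""
    conn.execute(
        "CREATE TABLE messages (sender TEXT, subject TEXT, size_kb INTEGER)"
    )
    for line in records:
        doc = json.loads(line)
        conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?)",
            (doc.get("from"), doc.get("subject"), doc.get("size_kb")),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_into_table(conn, raw_records)

# Once the data has rows and columns, plain SQL takes over.
rows = conn.execute(
    "SELECT sender, COUNT(*), SUM(size_kb) FROM messages "
    "GROUP BY sender ORDER BY sender"
).fetchall()
```

The design choice worth noticing is that the schema is imposed at load time by picking which tags become columns; anything the tags do not supply simply shows up as NULL.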



The Problem with Machine Learning

Machine learning is widely perceived as getting its start with chess.  When the skills of the program exceeded the skills of the programmer, the logic went,  you’ve created machine learning.  The machine now has capabilities that the programmer didn’t.  Of course, this is something of a fiction.  Massive calculation capabilities alone don’t really mean there’s profound learning occurring.  Massive calculation capabilities might reveal learning, however. 

In the meantime, machine learning has done some really interesting things, such as driving a car, recognizing faces and identifying spam mail, and with it robots can even vacuum the house.

The core problem for machine learning is structure.   Machine learning will do well with structured environments, but not so well in unstructured environments.  Driving a car is a great example.  It did poorly until the problems it couldn’t identify such as subtleties in the driving surface, or the significance of certain objects were defined in a way that could be dealt with by the program.  The underlying problem is structuring a problem in a way that the program can analyze and resolve it. 

Consider the way people learn, and the way that machines learn.  A child starts learning in an unstructured way.  It takes a long time for a child to learn how to speak or walk.  Each child will learn at their own rate based on their environment and unique genetic makeup.  Only once they have an unstructured basis for learning do we add a structured learning environment: school.  With machines we provide a structured environment first, and then hope they can learn the subtleties of a complex world.

There is an excellent paper by Pedro Domingos of University of Washington looking at the growth areas of machine learning.  One observation is that when humans make a new discovery they can create language to describe the new concept.  These concepts are also comparable in the human mind where people can see the comparison between situations and apply new techniques to existing areas by taking skills from one area and applying to another.  An example of human learning is using physicists to create mathematical models for the financial industry to create models for high speed computer trading. 

Structured machine learning models are making progress in solving some interesting problems, like those mentioned previously.  The approaches mostly look at providing more layers of complexity in the way problems are analyzed and resolved.  Indeed, the future of machine learning is not in the volume of data, but in the complexity of the issues to be studied.  The world is a complex place, and human understanding of it is composed of a combination of literal and intuitive approaches.  The intuitive is the ability to reach across domains of knowledge, extend understanding to new areas, and make qualitative judgments.

Because machines are literal due to the nature of machine structure, programming has likewise been very literal.  The machine learning models are getting far more sophisticated in terms of complexity.  Still, the challenge of creating a structured tool (machine learning) to tackle an unstructured world may be a problem that will never be entirely resolved till we restructure the machine.

Perhaps we create a machine with a base level of “instinct”, enough to allow the machine to perceive the world around it, and then let it learn.  If we were to create a machine that might take a few years of observation of the world and create its own basis for understanding the universe, what kind of intelligence would be created?  Would we be able to control it?  Would it provide any useful service to us?

Hadoop Summit 2013

The Hadoop Summit for 2013 has just concluded in San Jose.  There were a few themes that seemed to recur throughout the two-day summit with over 2,500 people.  The overall story is the continued progress to take Hadoop out of the experimental and fringe case environment, and move it into the enterprise with all the appropriate controls, function and management.  A related objective is to have 50% of the world’s data on Hadoop in five years.

The latest Hadoop release, 2.0, is known as YARN (Yet Another Resource Negotiator).  To be a little more precise, Hadoop is still less than one at release 0.23, but MapReduce is now version 2.0, or MRv2.  The new MRv2 release addresses some of MapReduce’s long-known problems, such as security and scheduling limitations.  Hadoop’s Job Tracker, resource management and job scheduling have been re-engineered to provide more control with a global resource manager and an application-focused “application master”.  The new YARN APIs are backwards compatible with the previous version with a recompile.  Good news, of course.  You can get more details of the new Hadoop release at the Apache site, hadoop.apache.org.

The other themes in the Hadoop Summit included in-memory computing, DataLakes, 80% rule, and the role of open source in a commercial product.

Hadoop traditionally is a batch job.  Enterprise applications demand  an interactive capability.  Hadoop is moving into an interactive capability.  But it doesn’t stop there.  The step beyond interactive capability is stream processing with In-memory computing.  In-memory computing is becoming more popular as the cost of memory plummets and people are increasingly looking for “real-time” response from MapReduce related products like Hadoop.  The leading player with in-memory computing is SAP’s Hana, but there are several alternatives.  In-memory processing provides blazing speed, but higher costs than a traditional paging database that moves data in and out of rotating disc drives.  Performance can be enhanced by the use of Flash memory, but it may still not be enough.  In-memory typically will have the best performance, and several vendors like Qubole,  Kognitio (which pre-dates Hadoop by quite a bit), Data Torrent as well as others showing at the conference were touting the benefits of their in-memory solutions.  They provide a great performance boost, if that’s what your application needs.  

DataLakes came up in the kickoff as a place to put your data till you figure out what to do with it.  I immediately thought of data warehouses, but this is different.  In a data warehouse you will usually need to create a schema and scrub the data before it goes in the warehouse so you can process it more efficiently.  The idea of a DataLake is to put the data in, and figure out the schema as you do the processing.  A number of people I spoke with are still scratching their heads about the details of how this might work, but the concept has some merit.

The 80% rule, the Pareto Principle, refers to 80% of the results coming from 20% of the work, be it customers, products or whatever.  With regard to Big Data, this is how I view many of the application specific products.  Due to the shortage of Data Scientists, creating products and platforms for people with more general skills provides 80% of the benefit of Big Data with only 20% of the skills required.  I spoke with the guys at Talend and that is clearly their approach.  They have a few application areas with specific solutions aimed at user analyst skills to address the fat part of the market.

Finally, there remains tension between open source and proprietary products.  There are some other examples of open source as a mainstream product, and Linux comes to mind as the poster child for the movement.  Most of the open source projects are less mainstream.  Commercial companies need to differentiate their products to justify their existence.   The push behind Hadoop to be the real success story for open source is pretty exciting.  Multiple participants I spoke with saw open source as the best way to innovate.  It provides a far wider pool of talent to access, and has enough rigor to provide a base that other vendors can leverage for proprietary applications.  The excitement at Hadoop Summit generated by moving this platform into the enterprise is audacious, and the promise of open source software seems to be coming true.  Sometimes dreams do come true.        

Why Would I Use NoSQL?

In any job, it helps when you use the right tool for the job. In the Big Data universe there can be many different kinds of data. Structured data in tables. Text from email, tweets, Facebook, or other sources. Log data from servers. Sensor data from scientific equipment. To get answers out of this variety of data, there are a variety of tools.

As always with Big Data, it helps to have the end in mind before you start. This will guide you to the sources of data you need to address your desired result. It will also indicate the proper tool. Consider a continuum with a relational database management system (RDBMS) on one end and a Hadoop/MapReduce engine on the other. RDBMS architectures, like Oracle, have ACID (Atomicity, Consistency, Isolation, Durability), a set of properties to assure that database transactions are processed reliably. This reliability is why RDBMS is the standard for critical data that must be correct, where cost is secondary. For example, you want to know what amount should be on the payroll check. It has to be right. On the other end are the MapReduce solutions. Their primary concern is not coherency like the RDBMS, but parallel processing of massive amounts of data in a cost effective manner. Fewer assurances are required for this data because of the result desired. This is often the case when looking for trends or trying to find some correlation between events. MapReduce might be the right tool to see if your customer is about to leave you for another vendor.

The NoSQL world is somewhere in between. While the RDBMS has consistent coherency, the NoSQL world works on eventual consistency. The two-stage commit with the use of logs is a way to get things sorted out eventually, but at any given point in time, a user might get data that hasn’t been updated. This might be adequate for jobs that need faster turnaround time than MapReduce, but don’t justify the money to build out the expensive infrastructure for a full RDBMS. MapReduce is a batch job, meaning that the processing has a definite start and stop to produce results. If MapReduce can’t deliver adequate latency, NoSQL provides continuous processing instead of batch processing, for lower latency. Another advantage of NoSQL, similar to MapReduce, is scalability. NoSQL provides horizontal scaling up to thousands of nodes. Jobs are chopped up, as in MapReduce, and spread among a large number of servers for processing. It might be just the ticket for a Facebook update.
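Eventual consistency is easier to see than to describe, so here is a toy model in Python.  It does not represent any particular NoSQL product; the class name, the single primary copy and the simple pending-update log are assumptions I made to keep the sketch short.

```python
class ToyStore:
    """Toy model: one primary copy, several replicas, and an update log."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates applied to the primary but not yet replicated

    def write(self, key, value):
        """Writes land on the primary and are logged for later replication."""
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        """Reads are served by a replica, which may be behind the primary."""
        return self.replicas[replica_index].get(key)

    def replicate(self):
        """Replay the log so every replica catches up with the primary."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


store = ToyStore()
store.write("status", "at the beach")
stale = store.read(0, "status")  # the replica has not caught up yet
store.replicate()
fresh = store.read(0, "status")  # now every replica agrees with the primary
```

Between the write and the replicate call a reader sees old data (here, nothing at all), which is exactly the window that a Facebook update can tolerate and a payroll check cannot.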

One of the downsides of a NoSQL database is the potential for deadlock. A deadlock occurs when two processes are each waiting for the other to finish, and neither can proceed until the other does. Hence the stare-down called a deadlock. This might happen because the processes are updating records in a different sequence and come into conflict, resulting in a permanent wait state. There are some tools to minimize the impact of this potential. The workarounds might result in someone seeing outdated data, but again, if it is acceptable for the desired result, then NoSQL could be a good fit. Eventually things get sorted out, if properly designed.
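One general way to avoid the “different sequence” problem is to make every process take its locks in the same order.  The sketch below shows that technique with Python threads; it is a generic illustration rather than how any specific NoSQL database handles it, and the Account class and the transfer scenario are made up.

```python
import threading


class Account:
    """A record protected by its own lock (hypothetical example record)."""

    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    # Always lock the lower id first, so both directions agree on the order
    # and neither thread can hold one lock while waiting for the other.
    first, second = sorted((src, dst), key=lambda acct: acct.account_id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def _repeat_transfer(src, dst, rounds):
    for _ in range(rounds):
        transfer(src, dst, 1)


def run_opposing_transfers(rounds=1000):
    """Two threads update the same two records in opposite directions."""
    a, b = Account(1, 1000), Account(2, 1000)
    t1 = threading.Thread(target=_repeat_transfer, args=(a, b, rounds))
    t2 = threading.Thread(target=_repeat_transfer, args=(b, a, rounds))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance
```

If transfer locked src first and dst second instead, the two threads could each grab one lock and wait forever for the other, which is the stare-down described above.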

As you see, understanding the job at hand, the desired result, and what kind of issues are acceptable will determine if RDBMS, NoSQL or a MapReduce solution will fit. NoSQL options are growing all the time, which might indicate that this middle ground is finding more suitable jobs.

Managing a Flood of Data

With increasing connectedness of devices and people, the data just keeps coming. What to do with all that data is becoming an increasing problem, or opportunity if you have the right mindset. In general there are three things that can be done with this flood of data:

  1. Discard it, or some of it (sampling)

  2. Parallelize the processing of it (e.g. MPP- massively parallel processing architectures)

  3. In-memory processing with massive HW resources

Any combination of the above might make sense depending on the intent of the project, the amount and kinds of data, and of course, your budget. I find it interesting that the traditional RDBMS still has legs with the movement to utilize in-memory processing which is made possible by continually falling memory prices, making this a “not crazy” alternative. Of course it gets back to what did you want to do with what kind and amount of data. For instance, a relational database for satellite data may not make sense, even if you could do it.
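For the first option in the list above, sampling, one common technique is reservoir sampling, which keeps a fixed-size random sample from a stream whose length you don’t know in advance.  The sketch below is my own illustration of that general algorithm, not something tied to a particular product; the function name and the seed parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Usage: keep 5 records out of a stream of 1000 without storing the stream.
kept = reservoir_sample(range(1000), 5, seed=42)
```

Memory use stays at k items no matter how long the flood runs, which is the whole appeal of throwing data away on purpose.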

Here’s where the file system can become very interesting. It might be ironic that unstructured data must be organized to be able to analyze it, but I think of it as farming. You cultivate what you have to get what you want. Ideally, the file system will provide a structure for the analysis that will follow. There doesn’t seem to be a shortage of file systems out there, but because the flood of unstructured data is relatively recent, there might be even better file systems on the way.

There are a number of file structures available: local, remote, shared, distributed, parallel, high performance computing, network, object, archiving and security being some examples. The structure of these can be very different. For the flood of unstructured data, parallel file systems seem to offer a way to organize this data for analytics. In many cases the individual record is of little value; indeed, the value in most unstructured data streams is in the aggregate. Users are commonly looking for trends or anomalies within a massive amount of data.

An application with massive amounts of new data would suggest that traditionally structured file systems for static data (like data warehouses) might not be able to grow as needed, since the warehouse typically takes a point-in-time view. Traditional unstructured static data like medical imaging might be appropriate based on the application, but most analytics can’t do much with images. Dynamic data has its own challenges. Unstructured dynamic data like CAD drawings or MS Office data (text, etc.) may lend themselves to a different file structure than dynamic structured data like CRM and ERP systems where you are looking for a specific answer from the data.

Dealing with massive amounts of new data may be a recipe for a nonlinear approach to keep up with the traffic. Parallel file systems started life in the scientific high performance computing (HPC) world. IBM created a parallel file system in the 1990s called GPFS, but it was proprietary. The network file system (NFS) provided the ability to bring a distributed file system to the masses and share files more easily with a shared name space. Sun created NFS and made it available to everyone, and it was generally adopted and enhanced. There are some I/O bandwidth issues with NFS, which companies like Panasas and the open systems oriented Lustre have tried to address. I/O bandwidth remains the primary reason to consider a parallel file system. If you have a flood of data, it’s probably still the best way to deal with it.

I expect to see more parallel and object file systems to provide improved tools over what is available today to better manage the massive data flooding into our data centers. Increasingly, the sampling approach will be diminished since the cost of storage continues to fall, and some of the most interesting data are outliers. The “long tail” analysis to find situations where the rules seem to change when events become extreme can be very valuable. This may require the analysis of all the data, since sampling may not give sufficient evidence to “long tail” events that occur infrequently.
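A little arithmetic shows why sampling struggles with “long tail” events.  The sketch below computes the chance that a simple random sample (drawn without replacement) contains none of the rare records; the function name and the example figures are my own assumptions, chosen only to illustrate the point.

```python
def miss_probability(total_records, rare_records, sample_size):
    """Chance that a uniform sample without replacement holds no rare record.

    Equals C(N - k, n) / C(N, n), rewritten as a product of k terms.
    """
    n_total, k_rare, n_sample = total_records, rare_records, sample_size
    if n_sample + k_rare > n_total:
        return 0.0  # the sample is too large to dodge every rare record
    p = 1.0
    for j in range(k_rare):
        p *= (n_total - n_sample - j) / (n_total - j)
    return p


# Hypothetical case: 10 extreme events hidden in a million records, 1% sample.
p_miss = miss_probability(1_000_000, 10, 10_000)
```

With these made-up numbers the sample misses every one of the ten extreme events roughly nine times out of ten, which is the argument for keeping and analyzing all the data when the outliers are what you care about.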

In summary, managing the flood of data is a question of identifying what you want to get from the data. That combined with the nature of the data will guide you to an appropriate file system. In most cases a parallel file system will be the solution, but you have to know your application. The good news is as our sophistication grows, we will have more options to fine tune the systems we build to analyze the data we have to get the results we want.



Information Extraction - Ready for Prime Time?

Oren Etzioni of the University of Washington gave a talk at Adobe in March, with a rundown on the current state of the art in IE.  We’ll get to that in a minute, but what is IE?  Information Extraction is the science of making sense of unstructured human text.  The challenge is that human language can be imprecise.  Structured data is so named because of the systematic categorization of the data into tables in a way that optimizes its analysis.  Unstructured data, as in human speech, does not lend itself to tables or structure.  In analyzing human language, it is common to employ natural language processing to create a system that will derive useful information from human language.  This may not be possible in politics, but perhaps in business it could work.

Why is this useful?  Today’s technology allows us to ask “What is the best Mexican restaurant in San Jose” and get a ranking by star ratings that users have input.  IE allows us to ask “Where can I get the best margarita in San Jose?”  and get a ranking by comments about margaritas.  To get a ranking based on attributes that weren’t defined in advance, queries require a more advanced understanding of what is being said in reviews, not just star ranking. 

How do you analyze unstructured text?  The key to answering the attribute based questions is context.  Information extraction is machine learning.  Algorithms will attempt to determine what is relevant.  Scalability combined with algorithms is the key to generating useful results.  The IE model identifies the tuple and the probability of a relationship.  An example might be trying to find out who invented the light bulb, and getting results such as: invented(Edison, light bulb), 0.99, indicating a strong link between Edison and inventing the light bulb.
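The tuple-plus-probability idea can be pictured as a small data structure.  The Python sketch below is only an illustration of how such output might be stored and queried, not Open IE’s actual format; the class, the function and every row except the Edison example from the talk are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    relation: str
    subject: str
    obj: str
    confidence: float  # probability that the relationship holds


extractions = [
    Extraction("invented", "Edison", "light bulb", 0.99),  # example from the talk
    Extraction("invented", "Edison", "telephone", 0.05),   # invented for illustration
    Extraction("invented", "Bell", "telephone", 0.97),     # invented for illustration
]


def who(relation, obj, facts, threshold=0.5):
    """Answer 'who <relation> <obj>?' with the confident matches, best first."""
    matches = [
        f for f in facts
        if f.relation == relation and f.obj == obj and f.confidence >= threshold
    ]
    return sorted(matches, key=lambda f: f.confidence, reverse=True)


answers = who("invented", "light bulb", extractions)
```

The threshold is where the “answer, not a bunch of results” goal shows up: low-probability tuples are dropped instead of being handed to the user to sort through.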

If one is looking for examples of some attribute, they often occur in context with other terms, which we might consider as clues.  One can then use clues to find more instances of the attribute.  This is how we pick apart context in more detail.    

The challenge is extracting the information we’re looking for and ignoring the rest.  One of the more interesting applications is a shopping tool, Decide.com, that will check different websites for rumors about new product introductions.  It even goes further to estimate when a new product might come along based on the company’s previous history, what will happen to the pricing, and what kind of features are being talked about.  It creates a summary of rumors compiled from multiple sources, saving time, and results can be displayed on a mobile device.

Dr. Etzioni’s pet project is Open IE.  His premise is that word relations have canonical structure.  By looking at this structure you extract the relationships for analysis.  You don’t generally pre-identify the concepts; you want to be able to find interesting stuff.  His extractors find these relationships.  You can play with his model on the web at openie.cs.washington.edu or get Open IE extractors for download without license fees.

There are some early IE efforts out there today.  Google Knowledge Graph and Facebook Graph Search are a couple.  Oren is part of a startup, Decide.com, that is also in the space.  All of them are at a relatively early stage; as the algorithms improve, the usefulness of the data will improve.  People want an answer, not a bunch of results to sort through.  This becomes increasingly important as we are increasingly looking for these results on our mobile devices.  This forces a more succinct response, like talking to a person and getting an answer.  Oren did mention that Siri, which can respond to a query with an answer, is very limited.  He wants to use all available documents, tweets, reviews, posts, blogs and everything he can get his hands on to formulate an answer.

Check out his extractor on the website mentioned above for more information, and a free test drive.

In answer to the question, IE is almost ready for prime time.  There are promising signs for this technology, but I wouldn’t bet my house on it yet.


“Active Flash” for Big Data Analytics on SSD-based Systems

FAST13 USENIX Conference on File and Storage Technologies February 12–15, 2013 in San Jose, CA

If you’re not familiar with the geekfest called USENIX and their file and storage technology conference, it is a very scholarly affair. Papers are submitted on a variety of file and storage topics, and the top picks present their findings to the audience. The major component and system vendors are there along with a wide variety of academic and national labs.

Let’s review a paper about using SSDs in high performance computing where there are a large number of nodes.  See the reference at the end for details regarding the paper.*

The issue is how to manage two jobs on one data set.  The example in the paper is a two-step process in the high-end computing world.  Complex simulations are being done on supercomputers then the results are moved to separate systems where the data is subject to analytics.  Analytics are typically done on smaller systems in a batch mode. The high-end computing (HEC) systems that do the simulations are extremely expensive, and keeping them fully utilized is important. This creates a variety of issues in the data center that include the movement of data between the supercomputer and the storage farm, analytic performance and the power required for these operations. The approach proposed is called “Active Flash”.

The floating point operations performed on the HEC systems are designed for the simulation, not the typical analytic workload.  This results in the data being moved to another platform for analytics. The growth in the data (now moving to exabytes) is projected to increase costs so that just moving the data will be comparable in cost to the analytic processing. In addition, extrapolating the current power cost to future systems indicates this will become the primary design constraint on new systems. The authors expect the power to increase 1000X in the next decade while the power envelope available will only be 10X greater. Clearly, something must be done.

The authors have built a Flash Translation Layer (FTL) with data analysis functions on the OpenSSD platform to test their theory that an Active Flash configuration can reduce both the energy and performance problems of analytics in an HEC environment. Their 18,000 compute node configuration produces 30TB of data each hour. On-the-fly data analytics are performed in the staging area, avoiding the performance and energy costs of data migration.  By staging area, we mean the controllers in the SSDs.
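To put the 30TB per hour figure in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal terabytes and output spread evenly across all 18,000 nodes; neither assumption comes from the paper, and the function name is just for illustration.

```python
# Back-of-the-envelope data rates for the configuration quoted above.
# Assumptions (not from the paper): decimal terabytes, and output
# spread evenly across all compute nodes.

TB = 10**12  # bytes in a decimal terabyte


def data_rates(total_bytes_per_hour, nodes):
    """Return aggregate and per-node output rates as a dict."""
    aggregate_bytes_per_s = total_bytes_per_hour / 3600
    return {
        "aggregate_bytes_per_s": aggregate_bytes_per_s,
        "per_node_bytes_per_s": aggregate_bytes_per_s / nodes,
        "per_node_bytes_per_hour": total_bytes_per_hour / nodes,
    }


rates = data_rates(30 * TB, 18_000)
print(f"aggregate: {rates['aggregate_bytes_per_s'] / 1e9:.2f} GB/s")
print(f"per node:  {rates['per_node_bytes_per_s'] / 1e6:.2f} MB/s")
print(f"per node:  {rates['per_node_bytes_per_hour'] / 1e9:.2f} GB/hour")
```

Under those assumptions the aggregate stream is a bit over 8 GB/s, but each node contributes less than 0.5 MB/s on average. The volume is a problem in aggregate, not per node.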

High Performance Computing (HPC) tends to be bursty, alternating between I/O intensive and compute intensive activity. It’s common that a short I/O burst will be followed by a longer period of computation. These loads are not evenly split; indeed, I/O is usually less than 5% of overall activity. The nature of the workload leaves the SSD controller idle much of the time, which creates an opportunity to do analytics there. As SSD controllers move to multi-core, there is even more opportunity for analytics while the simulations are active.
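As a rough illustration of that idle window, the sketch below takes the 5% I/O share and the 30TB per hour figure quoted earlier and asks how fast the analytics would need to run, in aggregate, to finish within the idle time. The function name and the simplification that the controllers are completely free outside I/O bursts are my assumptions, not the paper's model.

```python
def idle_window_throughput(data_bytes, period_s, io_fraction):
    """Return (idle seconds per period, bytes/s needed to analyze
    data_bytes entirely within that idle time).

    Simplifying assumption: the SSD controllers do nothing else
    outside the I/O bursts.
    """
    if not 0 <= io_fraction < 1:
        raise ValueError("io_fraction must be in [0, 1)")
    idle_s = period_s * (1 - io_fraction)
    return idle_s, data_bytes / idle_s


# 30 TB (decimal) produced per hour, I/O taking 5% of the time.
idle_s, needed = idle_window_throughput(30e12, 3600, 0.05)
print(f"idle time per hour: {idle_s / 60:.0f} minutes")
print(f"aggregate analysis rate needed: {needed / 1e9:.2f} GB/s")
```

With these inputs the controllers have about 57 minutes of every hour available, so the analysis rate needed is only slightly higher than the rate at which the data is produced.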

The model to identify which nodes will be suitable for SSDs is a combination of capacity, performance, and write endurance characteristics. The energy profile is similarly modeled to predict the energy cost and savings of different configurations. The authors’ experimental models were tested in different configurations. The Active Flash version actually extends the traditional FTL with analytic functions. The analytic function is enabled with an out-of-band command. The result is elegant and outperforms the offline analytic or dedicated analytic node approach.

The details and formulas are in the referenced paper, and are beyond my humble blog. But for those thinking of SSDs for Big Data, it appears the next market is to enhance the SSD controller for an Active Flash approach to analytics.


*The paper is #119, “Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machines” by Devesh Tiwari (North Carolina State University), Simona Boboila (Northeastern University), Sudharshan S. Vazhkudai (Oak Ridge National Laboratory), Youngjae Kim (Oak Ridge National Laboratory), Xiaosong Ma (North Carolina State University), Peter J. Desnoyers (Northeastern University), and Yan Solihin (North Carolina State University).



Marketing Generalizations Based on Academic Research

The basis for this blog is a book from a UCLA professor, Dominique Hanssens, “Empirical Generalizations about Marketing Impact”. Current big data analytics focuses on discovering trends from massive amounts of data. I believe that combining big data analytics and marketing models from academia will provide the next step in analytic effectiveness. The referenced book is a summary of marketing research from all over the world. The studies are collected into chapters on topics like pricing, promotion, market adoption, etc.

One ox that gets gored right away is the link between market share and profitability. The correlation is 0.35, meaning that profits are only somewhat related to market share based on real world studies. If you want market share and profits, you had best look to new markets. Market pioneers tend to have higher market share and profits. If the market pioneer develops a broad product line early, it can force follow-on competitors into narrow niches. Indeed, the order of entry of new competitors relates to how much market the subsequent entrants can expect. Less for later, more for earlier. There is some benefit to ads and promotion to mitigate a later entry. That said, the pioneers are still risk takers, and 64% of the pioneers fail.

The Bass diffusion model predicts how a new product will be accepted by the market. This model has been around for decades, and indicates that sales peak 5-6 years after they start. Sales start to decline around 8 years after they take off. The diffusion of a product depends on the type of market. More developed markets adopt faster than less developed markets. Fun products like consumer electronics are adopted faster than work related products. Another interesting thing about adoption has to do with standards. When there are competing standards (like high capacity DVDs or Beta vs VHS video), things start out slower but grow faster subsequently. It also makes a difference how new the product is. A truly new category of product experiences the fastest growth. Moderately novel products show slower growth, even though they might be almost as complex as a truly novel product entry.
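For readers who want to play with the Bass model, here is a minimal sketch of its standard closed-form equations. The coefficient of innovation p and coefficient of imitation q below are illustrative values I picked, not figures from the book, and the function names are my own.

```python
import math


def bass_cumulative(t, p, q):
    """Fraction of eventual adopters who have adopted by time t."""
    e = math.exp(-(p + q) * t)
    return (1 - e) / (1 + (q / p) * e)


def bass_rate(t, p, q):
    """Adoption rate at time t (fraction of eventual adopters per year)."""
    e = math.exp(-(p + q) * t)
    return ((p + q) ** 2 / p) * e / (1 + (q / p) * e) ** 2


def bass_peak_time(p, q):
    """Time at which the adoption rate peaks; only meaningful when q > p."""
    if q <= p:
        raise ValueError("the adoption rate peaks at t = 0 unless q > p")
    return math.log(q / p) / (p + q)


# Illustrative coefficients (assumed, not taken from the book).
P, Q = 0.03, 0.38

t_peak = bass_peak_time(P, Q)
print(f"adoption rate peaks about {t_peak:.1f} years after launch")
for year in range(0, 11, 2):
    share = bass_cumulative(year, P, Q)
    rate = bass_rate(year, P, Q)
    print(f"year {year:2d}: cumulative {share:.2f}, rate {rate:.3f}")
```

With these assumed coefficients the peak lands a little past six years, in the same neighborhood as the generalization above. A larger q (stronger word of mouth) pulls the peak earlier and makes the curve steeper.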

A new product’s acceptance depends on consumer innovation, usage intensity and income. Consumers who are peer influenced have a lower trial probability. Ads lessen the impact of peer pressure. Innovative consumers are more affected by features and displays than by peers. Price promotions can increase the size of the pie for a new product category, not just market share in a new market. Use promotions to boost demand for a new category of product. The benefit of the promotion lessens after 10 weeks. Ads are more effective early in a product life cycle than later.

Price as a marketing tool in consumer markets can increase sales from existing customers who may be stockpiling product, as opposed to causing competitive customers to switch brands. Price promotions work best with merchandising changes.

The balance between a sales force and marketing is tested also. Personal selling budgets are most effective in early product life and less effective in later product life cycles. Personal selling is also more effective in Europe than in the US. Sales budgets seem to work best at about 12.5% of revenue.

Trade shows work best in the IT industry. They yield about twice the benefit of other industries like medical, entertainment, etc. Also the size of the booth at the trade show is less important than the number of salespeople that populate the booth.

Responding to competitor prices in your ads reduces competitive shopping by consumers. These ads can be effective for up to three quarters. Responding to competitive ads, however, seldom pays. Speaking of ads, only 20% of the sales effect attributed to ads comes from the actual campaign. Most of the benefit comes from the rest of the marketing efforts, underlining the need for a real marketing plan, not just a campaign. Marketing spending is commonly between 10% and 20% for successful companies. Higher gross margin products are at the high end, and lower margin products at the low end.

Marketing channels are key for consumer products. About 54% of the difference among brands has to do with distribution breadth.

Finally, it really is all about the customer. Customer relationships are far more important than business performance. Build trust with your customer and it will pay off. For internet companies customers look for privacy policies, good security, easy navigation, and impartial advice. Overselling will reduce the trust of your customers. Companies that are customer focused outperform those that are operationally oriented. Which kind do you want to be?