Marketing Generalizations Based on Academic Research

The basis for this post is a book from UCLA professor Dominique Hanssens, “Empirical Generalizations about Marketing Impact”. Big data analytics today focuses on discovering trends in massive amounts of data. I believe that combining big data analytics with marketing models from academia will provide the next step in analytic effectiveness. The book summarizes marketing research from all over the world, with the studies collected into chapters on topics like pricing, promotion, market adoption, etc.

One ox that gets gored right away is the link between market share and profitability. The correlation is 0.35, meaning that in real-world studies profits are only somewhat related to market share. If you want both market share and profits, you had best look to new markets. Market pioneers tend to have higher market share and profits. If the pioneer develops a broad product line early, it can force follow-on competitors into narrow niches. Indeed, order of entry relates to how much of the market subsequent entrants can expect: more for earlier entrants, less for later ones. Ads and promotion offer some benefit in mitigating a later entry. That said, pioneers are still risk takers, and 64% of them fail.

The Bass diffusion model predicts how a new product will be accepted by the market. This model has been around for decades, and indicates that sales peak 5-6 years after they start, and begin to decline around 8 years after they take off. The diffusion of a product depends on the type of market. More developed markets adopt faster than less developed markets. Fun products like consumer electronics are adopted faster than work-related products. Another interesting thing about adoption has to do with standards. When there are competing standards (like high capacity DVDs or Beta vs VHS video), things start out slower but grow faster subsequently. It also makes a difference how new the product is. Truly new product categories experience the fastest growth. Moderately novel products show slower growth, even though they might be almost as complex as a truly novel product entry.
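
To make the Bass model concrete, here is a minimal Python sketch of its standard closed form. The coefficients p (innovation) and q (imitation) are illustrative assumptions of mine, not values taken from the book. With these particular values the adoption rate peaks about six years after launch, which is in the same neighborhood as the generalization above.

```python
import math

# Illustrative coefficients only (assumed, not taken from the book).
P_INNOVATION = 0.03   # p: adoption driven by ads and other outside influence
Q_IMITATION = 0.38    # q: adoption driven by word of mouth from prior adopters


def bass_cumulative(t, p=P_INNOVATION, q=Q_IMITATION):
    """Fraction of eventual adopters who have adopted by time t (in years)."""
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


def bass_peak_time(p=P_INNOVATION, q=Q_IMITATION):
    """Time at which the adoption rate is highest (meaningful when q > p)."""
    return math.log(q / p) / (p + q)


if __name__ == "__main__":
    print(f"Peak adoption rate at about {bass_peak_time():.1f} years")
    for year in range(0, 13, 2):
        print(f"year {year:2d}: {bass_cumulative(year):.0%} of eventual adopters")
```

Fitting p and q separately for each market segment is one way the segment-level view described in the later posts could be put into practice.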

A new product’s acceptance depends on consumer innovation, usage intensity and income. Consumers who are peer influenced have a lower trial probability. Ads lessen the impact of peer pressure. Innovative consumers are more affected by features and displays than by peers. Price promotions can increase the size of the pie for a new product category, not just market share in a new market. Use promotions to boost demand for a new category of product. The benefit of the promotion lessens after 10 weeks. Ads are more effective early in a product life cycle than later.

Price as a marketing tool in consumer markets can increase sales from existing customers who may be stockpiling product, as opposed to causing competitive customers to switch brands. Price promotions work best with merchandising changes.

The balance between a sales force and marketing is tested also. Personal selling budgets are most effective in early product life and less effective in later product life cycles. Personal selling is also more effective in Europe than in the US. Sales budgets seem to work best at about 12.5% of revenue.

Trade shows work best in the IT industry. They yield about twice the benefit of other industries like medical, entertainment, etc. Also the size of the booth at the trade show is less important than the number of salespeople that populate the booth.

Responding to competitor prices in your ads reduces competitive shopping by consumers. These ads can be effective for up to three quarters. Responding to competitive ads, however, seldom pays. Speaking of ads, only 20% of the sales effect of ads comes from the actual campaign. Most of the benefit comes from the rest of the marketing efforts, underlining the need for a real marketing plan, not just a campaign. Marketing spending is commonly between 10% and 20% for successful companies. Higher gross margin products are at the high end, and lower margin products at the low end.

Marketing channels are key for consumer products. About 54% of the difference among brands has to do with distribution breadth.

Finally, it really is all about the customer. Customer relationships are far more important than business performance. Build trust with your customer and it will pay off. For internet companies customers look for privacy policies, good security, easy navigation, and impartial advice. Overselling will reduce the trust of your customers. Companies that are customer focused outperform those that are operationally oriented. Which kind do you want to be?


Marketing Analytics Beyond the Clickstream

Big Data and analytics are closely related.  What good is Big Data if not as a better basis for analysis?  Indeed it is commonly said that more data trumps better algorithms.  One area that is very popular, particularly with retail concerns, is marketing analytics.  UCLA professor Dominique Hanssens is a recognized expert in this area.  He noted that marketing analytics as an academic endeavor started in the 1960s with a more rigorous look at what happens in markets.  Some of the first work looked at how a new product gains acceptance (or not) in the market.  How fast does it gain acceptance?  There have also been many studies on segmentation since the 1960s.  Segmentation is critical to developing marketing messages that are more apt to produce results than blanket marketing to an undifferentiated audience.  Market response models followed in the 1970s and consumer choice models in the 1980s.

The bar has been raised for marketing in recent years, reflecting a more quantitative approach to business in which ROI is now expected from every part of the company.  Marketing analytics isn’t just about PPC, SEO and analytics on the clickstream.  Big Data analytics in marketing allows a more complete picture of not just market segments but even individuals.  Together they can be used to better manage the marketing budget.  One area that can have significant impact on the company’s profitability is pricing management with analytics.  Some industries such as airlines and hotels have a perishable offering that benefits from a dynamic pricing strategy, and indeed consumers have found it more challenging to shop just on price, since the timing of the purchase has a great deal to do with the price paid for the airline seat or hotel room.  This is much more about the vendor getting value for the offering than about a cost-plus pricing model.  The difference can significantly impact the profitability of the company.

One example of how traditional analytics combined with recent access to Big Data can change the company’s behavior is the diffusion of a new product in the market.  By viewing the pre-sales behavior among prospective customers, you can predict the eventual sales in different segments.  Big Data allows a more comprehensive description of market segments, as well as detailed information on pre-sales behavior.  New products will typically show different rates of adoption among different market segments.  By segmenting the market more completely you can more accurately identify prospective customers, and perhaps direct additional promotional efforts to the segments most likely to produce a higher return on investment, where a blanket outreach to the entire market might not be appropriate.  The Bass diffusion model has been around since 1969, but Big Data capabilities now allow a more granular look at how the product will move into the market and will suggest segments appropriate for additional marketing outreach.

One way that marketing outreach might gain effectiveness with Big Data is in creating “buzz” among highly enthusiastic fans of a particular product by identifying the leaders in a given market segment.  Big Data allows us to analyze how individuals are connected, so we can focus on the perceived experts in a segment, giving them extra information and motivation to pass their support for the new product on to their networks and thereby spur adoption in the market.
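
As a rough illustration of finding well-connected individuals, the sketch below ranks people by how many distinct connections they have (degree centrality).  This is only one simple measure of influence, and the names and connection list are made up for the example.

```python
def top_connected(edges, k=3):
    """Rank people by number of distinct connections (degree centrality)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Most connections first; ties broken alphabetically for a stable result.
    ranked = sorted(neighbors, key=lambda person: (-len(neighbors[person]), person))
    return [(person, len(neighbors[person])) for person in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical "who talks to whom" pairs within one market segment.
    sample_edges = [
        ("ann", "bob"),
        ("ann", "cat"),
        ("ann", "dan"),
        ("bob", "cat"),
    ]
    for person, degree in top_connected(sample_edges, k=2):
        print(person, degree)
```

In practice the connection data would come from social or web sources, and a measure that accounts for whose opinions actually get passed along would likely do better than a raw connection count.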

By combining the traditional marketing analytic models developed and refined over the decades, with relatively new looks at what used to be hard-to-analyze unstructured data from the web capturing consumer behavior, we have a new depth in our understanding of how markets work, and the basis for more focused and accountable marketing.


The Power of Good Questions

Kaushik Das of Greenplum, at a recent Big Data Cloud meetup, spoke about Big Data as an enabler of 21st Century storytelling.  One of the comments made was about the Chomsky vs Norvig argument regarding patterns that can be found in random data.  Correlation is not cause and effect, a difference that most of the press still does not comprehend.  Chomsky is the premier linguist in the world, and feels we need to understand cause and effect, not just correlation as identified by Google’s Norvig.

It does bring into focus the ability of analytics to bring insights versus pattern searching.  The difference is in the quality of the questions asked.  I think that because analytics is, in essence, pattern matching, one can derive more insight with multiple questions, and I believe this is how most practitioners approach Big Data analytics.  The tools of analytics are powerful and provide greater reach than previously possible.  No longer are we limited to averages and samples; a complete analysis of the data is possible.  This is important because at the fringes things can become non-linear with spectacular impact.  Look at the financial crisis or hurricane prediction as a couple of examples of where data that behaves very predictably in the middle of a normal distribution can become very wild at the extremes.

Real insight becomes storytelling.  I mean this in the highest sense.  Storytelling that allows us to understand systems, be they natural, or based on human behavior can be tremendously powerful.  At the best it becomes predictive.  At the worst we are awaiting 100 monkeys with typewriters to produce Shakespeare.  When in doubt, ask more questions.  Do more analytics, and get more perspective on the story.


Parallel File Systems for Big Data at SNW

SNW, formerly known as Storage Network World, is a cooperative effort between SNIA and Computerworld. 

Interesting SNW observations:

Lots of flash storage: some for block architectures, some NAS, some cards, quite a variety of flash memory solutions.  The reason for all the interest is that the gap between server and storage performance has reached a critical level, and that, combined with a whopping drop in the cost of NAND flash memory, is changing the storage world.  This change will be felt in the Big Data world too.

Another observation: JBOD is now JBOSS.  There are no disk drives in the latest flash-based systems, so “just a bunch of disks” doesn’t make sense.  I propose “just a bunch of solid state”, or JBOSS.  It sounds cool too.

SNIA has a proposed next generation distributed file system, but so have a bunch of others.  A presentation by Philippe Nicolas of Scality caught my attention.

Things don’t scale well at the petabyte level, of course.  And things based on hierarchical models don’t parallelize well at all.  A potential solution for a Big Data approach is a distributed file system.  Think more along the lines of how RAID storage parity is spread over multiple drives and assembled in the event of a drive failure.  In a parallel file system, the file system is spread over a number of servers and assembled as needed.  This means the relational DB model is out, and a key-value model (recognize that, Hadoop lovers?) replaces it.  It means that file systems become more of a peer-to-peer proposition.  In such architectures, the tradeoff can be performance, with some sort of global namespace to hold the metadata.  If done right, with some replication of the data, the performance can be better.  The latest versions are built on approaches from Gnutella, Napster, BitTorrent and others.  The new approaches can actually be used for legitimate purposes.
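
One common way to spread keys across peers with replication is consistent hashing.  The sketch below is a generic illustration of that idea, not a description of how Scality or any particular product places data; the node names, the use of MD5, and the number of virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring that maps a key to one or more storage nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points so keys spread more evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    def nodes_for(self, key, replicas=2):
        """Return the distinct nodes that should hold copies of this key."""
        if not self._ring:
            return []
        wanted = min(replicas, self._node_count)
        start = bisect.bisect(self._points, _hash(key))
        owners = []
        # Walk clockwise from the key's position, collecting distinct nodes.
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == wanted:
                    break
        return owners


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])  # hypothetical node names
    print(ring.nodes_for("reports/2012/q3.csv", replicas=2))
```

Because any peer can compute the same placement from the key alone, no central directory has to be consulted for every lookup, which is what makes the peer-to-peer approach attractive at scale.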

Philippe’s tenet was that traditional file systems have hit their limit, and things will have to move to parallel structures.  One example of a distributed file system is Apache Hadoop’s HDFS.  It creates a federation of nodes that act as the file system, with a high availability name node for the metadata.  It is built in Java and can scale to 120PB.

Another example is the Google File System.  The reason they need a massive file system is obvious.  GFS is now version 2, and takes the data in 1MB chunks and distributes it across a multitude of nodes.  The metadata is kept in multiple distributed master nodes.  Moose and Cloudstore have a similar approach.

Parallel NFS (PNFS) is an update of a long-time tool, NFS.  PNFS allows the metadata to be distributed over many nodes to improve performance and availability and is being looked at closely in SNIA.

Lustre is an object based system that has evolved from widespread academic roots, including Carnegie Mellon.  It is commonly used in academic HPC environments where massive scalability is required.

Check for more details on their view of file system evolution and current state of the art.

My opinion is that Big Data is the right approach when there is no one single answer.  If the question is “What were the company’s earnings this quarter?”  That question is best solved with a relational DB and traditional data approaches with ACID rigor.  If you want to know why your sales are off, this is a great Big Data problem that will require a lot of information and probably generate a thought provoking answer that will lead to more questions.  Big Data is here to stay, but so are relational approaches, so to get the best results, use both.



Extremely Large Database Conference

SLAC/Stanford Extremely Large Database Conference (XLDB), September 2012

The XLDB conference is targeted at highly sophisticated end users working on large commercial, scientific and government projects.  Companies included eBay, Zynga, Google, Comcast, Sears, Facebook, Cloudera, AMD, HP, 10gen, NTT, Teradata, EMC, StumbleUpon and Samsung.  Educational institutions included San Diego Supercomputer Center, Michigan State, UC Santa Cruz, University of Chicago, University of Washington, Portland State U, University of Toronto, UC San Diego, Georgetown University, Jacobs University and Stanford/SLAC.  National Labs included CERN, Lawrence Berkeley, and the National Center for Biotechnology Information.

Presenters come from a variety of backgrounds, but the common thread is solving complex large computing problems.  There is a bias toward open source software solutions, but commercial vendors were welcomed also, if they focus on large complex problems.  The audience is very familiar with the common approaches, and comes together to refine its approaches and tools.  I have compressed the sessions I attended to glean the more interesting points from a variety of speakers and topics.

The growth of extremely large databases is not coming from transactional data.  Traditional IT shops with relational databases grow a relatively small amount each year.  As such, they might be good candidates for in-memory computing or SSD storage.  Because of the high velocity of this type of data, it is considered “hot” data, and a common rule of thumb is: when 90% of your I/O is “Hot Data”, place it on high performance storage to get the best performance.
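
One way to apply that rule of thumb is to find the smallest set of data sets that accounts for about 90% of I/O and treat those as candidates for the fast tier.  The sketch below does this under my own reading of the rule; the data set names and I/O counts are hypothetical.

```python
def hot_set(io_counts, threshold=0.90):
    """Return the busiest data sets that together cover `threshold` of all I/O."""
    total = sum(io_counts.values())
    if total == 0:
        return []
    hot, covered = [], 0
    # Busiest first; ties broken by name so the result is repeatable.
    for name, count in sorted(io_counts.items(), key=lambda item: (-item[1], item[0])):
        if covered >= threshold * total:
            break
        hot.append(name)
        covered += count
    return hot


if __name__ == "__main__":
    # Hypothetical I/O operations observed per data set over some period.
    observed = {"orders": 950, "clickstream_archive": 30, "sensor_logs": 20}
    print(hot_set(observed))
```

Whatever falls outside the hot set is a candidate for the cheaper storage discussed below.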

Much of the extremely large database load is “Cold Data” that is infrequently accessed, but can still be valuable.  This is data commonly used for analytics, and might include clickstream data, sensor data, and other non-transactional data.  For most analytics, bandwidth is more important than IOPS.  This sort of data is also commonly referred to as “Big Data”.

“Hot Data” is usually suitable for in-memory configurations.  The cost of memory is falling rapidly, and the performance benefits now outweigh the costs for high value transaction data.  Relational databases are the usual engines for transactional data.  For larger RDBMS, flash memory and/or SSD storage can be a great fit.  There are some aspects of in-memory systems that need to be examined.  Not all “in-memory” databases allow writes; you may have to refresh the database to get new data in, then perform analytics, and refresh again.  Hana from SAP is an example of a technology looking at an “in-memory” solution.

For “Cold Data”, traditional HDD storage should provide the right balance of performance and cost.  As we continue to accumulate data there is a natural consequence of Big Data: the mobility gap.  Because databases and data sets are getting so large, they can’t be moved as a practical matter.  This implies that data centers need to move to 10Gb connections for everything as soon as possible to mitigate this phenomenon.
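
A quick back-of-the-envelope calculation shows why the mobility gap matters.  The sketch below estimates how long it takes to move a data set over a network link; the one petabyte size and the assumption of a fully utilized link are mine, chosen only for illustration.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, link_gbps, efficiency=1.0):
    """Days needed to move size_bytes over a link of link_gbps gigabits/second.

    efficiency is the fraction of the nominal link rate actually achieved
    (1.0 is the best case; real transfers will be lower).
    """
    bits = size_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    one_petabyte = 1e15  # decimal petabyte, an illustrative size
    for gbps in (1, 10):
        print(f"{gbps:>2} Gb/s: {transfer_days(one_petabyte, gbps):.1f} days")
```

Even in the best case, a petabyte takes more than a week to move at 10Gb and roughly three months at 1Gb, which is why the analysis increasingly has to go to the data rather than the other way around.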

To bridge the gap between RDBMS and NoSQL databases like Mongo and Cassandra, the array database is a new approach; an example is SciDB, which is typically used for scientific research.  It has no tables, and it is massively scalable.  It is an open source project.

It was an excellent conference, with some very knowledgeable people presenting and in the audience.  Sometimes the discussions with the presenters got a little loud, but everyone was focused on getting to the bottom of whatever problem was being discussed.



Flash Memory in a Big Data World

The Flash Memory Summit just concluded in Silicon Valley.  The vendors and attendees were legion and the excitement level was high.  Several vendors were fishing for employees and the industry definitely felt like it was on the growth path.  One of the common footballs being passed back and forth was “What role does flash memory play in the enterprise?”  Refinements on that theme include “Will SSD replace HDD?” and “Will new applications, like Big Data, enable a new age for flash?”

Let’s take a quick look at flash memory first, then we can assess how it might impact Big Data applications.  Flash memory is nonvolatile semiconductor memory.  That is to say, unlike DRAM, flash memory remembers even after the power is turned off.  It is like DRAM in that they are both semiconductor technologies.  Why is flash memory getting a lot of attention now?  It’s all about price.  The cost of flash memory has plummeted.  One speaker mentioned that in the first half of 2012 prices dropped 46%, and were expected to drop about 64% for the year!  It was pointed out that SSDs that use flash memory are now priced competitively with 15K HDDs, the high-end enterprise HDDs found in many storage arrays.  That will enable a lot of new applications and attention.
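
Those two figures can be combined to see what the speaker’s forecast implied for the second half of the year.  This is just arithmetic on the quoted numbers, assuming both percentages are measured against the price at the start of 2012.

```python
def implied_remaining_drop(first_drop, total_drop):
    """Price drop still needed in the rest of the period to reach total_drop.

    Both arguments are fractions of the starting price, e.g. 0.46 for 46%.
    The result is a fraction of the price at the end of the first part.
    """
    price_after_first = 1.0 - first_drop
    price_at_end = 1.0 - total_drop
    return 1.0 - price_at_end / price_after_first


if __name__ == "__main__":
    second_half = implied_remaining_drop(first_drop=0.46, total_drop=0.64)
    print(f"Implied second-half drop: {second_half:.0%}")
```

In other words, a 64% drop for the year after a 46% drop in the first half means prices would fall by about another third from their mid-year level.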

Flash memory is the heart of the solid state disk, or SSD.  But you can also get flash as a card or just chips.  SSDs are different from raw flash in that SSDs have a controller, some software, and an interface like SAS or SATA instead of a bus interface like PCIe.  The technology in flash is the NAND logic semiconductor, which can be found in lots of consumer electronics like your camera, cell phone and thumb drive.  It is incorporated into other components and packaging for your SSD.

It sounds great, and we can expect SSDs to take over from HDDs, right?  Not so fast.  It’s fair to say every technology has some good and bad points.  Latency is a great advantage for solid state memory, either DRAM or flash.  Flash has some issues.  Write performance is a challenge, and can even be slower than an enterprise HDD.  Another issue for the enterprise is the traditional RAS (reliability, availability and serviceability).  The flash issue is endurance.  Heavy use, especially writes, will cause a flash memory cell to wear out and cease to function.  Flash has a limited life.  The good news is that there are workarounds.  As mentioned earlier, the SSD isn’t just flash; there are other components, and these can be used to ameliorate some of the downsides of flash.

How does this impact Big Data?  In-memory applications will benefit from SSDs, since flash offers low latency similar to DRAM: not as fast, but not as expensive either.  Looking at the needs of the Big Data job is the best way to predict whether there will be enough benefit to offset the expense of flash.  For instance, Hadoop is a batch job.  There are benefits to getting faster execution of a batch job, but consider how time sensitive your requirements are, since flash can be pricey.

Another issue is the configuration.  Most SSDs are found in storage arrays, where virtualization and Big Data might coincide.  If storage is being configured to support a NoSQL job like Cassandra, MongoDB and the like, this could be a good use of flash.  If the configuration is direct attached storage, as in traditional Hadoop clusters, it might not make sense.  LSI was making a case at the Flash Memory Summit that their hybrid drive with flash and rotating disk was a perfect solution for DAS in a Big Data cluster.  They claimed a 37% reduction in run time for a sample Terasort job in a Hadoop environment.

Other speakers feel strongly that bandwidth is the requirement for Big Data performance.  You can see why, since Big Data apps may involve moving massive amounts of data.  Moving that data quickly can be a major determinant of overall performance.

The basics require that you understand the nature of your job to put the right storage in the right configuration, and there might be a vendor out there with some enabling software that might optimize for your situation.  One vendor was showing how their algorithms can double the performance of their SSDs over standard SSDs.  This just demonstrates how variable your results might be.  Even so, it’s nice to have options, and the Flash Memory Summit showed how even Big Data can benefit from this fast changing technology. 


Big Data Storage

The fundamental hardware view of Big Data, with its Open Source mindset, calls for cheap commodity storage and servers.  A major tenet of the Big Data movement is to keep things inexpensive for both software and hardware.  The desire for this is understandable: with the size of Big Data implementations, costs can easily become scary.

From a hardware view of Big Data, the baseline theory is that you might have blade servers in a rack with a few SATA hard disk drives directly attached to the servers.  These compute nodes would then be replicated to meet the needs of the job.  With terabyte and larger hard drives now relatively cheap, multi-terabyte configurations can be implemented without the huge cost of just a few years ago.  The server configurations tend to have large amounts of main memory (maybe 24-48 GB of RAM) with multiple cores (like 4, 6, or 8).  This benefits the parallel processing jobs that are the typical output of a MapReduce architecture.  The ratio of storage to compute power depends on the demands of the job.

Experienced systems administrators will wonder what is being done to protect the integrity of the system.  Hadoop/MapReduce software includes clustering, redundancy and failover options as standard.  The common wisdom is that RAID architectures are unneeded since nodes are redundant in the typical Hadoop configuration.  Similarly, redundant power on the servers is not required since entire nodes can be replicated, so there is no need to spend money on extra hardware.
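
Replicating data across nodes instead of using RAID has a capacity cost that is worth estimating up front.  The sketch below assumes HDFS’s default replication factor of three; the cluster size and drive counts are hypothetical, and it ignores space needed for temporary and intermediate job data.

```python
def raw_capacity_tb(nodes, drives_per_node, drive_tb):
    """Total raw disk capacity of the cluster in terabytes."""
    return nodes * drives_per_node * drive_tb


def usable_capacity_tb(nodes, drives_per_node, drive_tb, replication=3):
    """Capacity left for unique data when every block is stored `replication` times."""
    return raw_capacity_tb(nodes, drives_per_node, drive_tb) / replication


if __name__ == "__main__":
    # Hypothetical cluster: 20 data nodes, 4 SATA drives each, 2 TB per drive.
    raw = raw_capacity_tb(20, 4, 2.0)
    usable = usable_capacity_tb(20, 4, 2.0)
    print(f"raw: {raw:.0f} TB, usable at 3x replication: {usable:.1f} TB")
```

The point is simply that cheap commodity drives have to be bought roughly three times over, which should be part of any cost comparison with more reliable but pricier storage.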

Consideration should be paid to the kind of storage placed in the clusters since not all clusters are created equal.  Name Nodes provide a control point for multiple Data Nodes, and benefit from additional attention.  Similarly Job Tracker nodes provide control for multiple Task Tracker nodes and should get additional attention.  These high value nodes might benefit from better quality storage like higher reliability HDDs or SSD to improve uptime. 

For more complex configurations there might be a hierarchy of storage.  SATA hard drives for Data Nodes and Task Tracker Nodes.  SSD for Name Nodes and Job Tracker Nodes.  Tape still has value when used as an archive medium for these jobs since future analysis might require a longer historical picture so the data would need to be reloaded to active media like HDDs or SSDs. 

At the Hadoop Summit there were multiple traditional storage vendors that were making a case for using their non-commodity storage.  The crux of their argument is that if your data is important enough to use, it ought to get their levels of reliability and availability.  Today’s advanced storage arrays were created to solve a number of problems of direct attach storage.  Problems such as expandability, reliability, availability, throughput and backup are solved in today’s storage arrays. 

Additionally, VMware has made a case for virtualization to better share resources, such as storage, and to manage these large clusters more efficiently.

The bottom line is that the hardware configuration employed has to fit the needs of the job.  Knowing the size, complexity and importance of the job will help you design an appropriate platform to execute your Hadoop job.  Some trial and error in these early days of Hadoop deployments is common, so keep an open mind!
