Big Data Storage

The fundamental hardware view of Big Data, with its Open Source mindset, is built on cheap commodity storage and servers.  A major tenet of the Big Data movement is to keep things inexpensive, both for software and hardware.  The desire is understandable: at the size of typical Big Data implementations, costs can easily become scary. 

From a hardware view of Big Data, the baseline configuration is blade servers in a rack with a few SATA hard disk drives directly attached to each server.  These compute nodes are then replicated to meet the needs of the job.  With terabyte and larger hard drives now relatively cheap, multi-terabyte configurations can be implemented without the huge cost of just a few years ago.  The servers tend to have large amounts of main memory (maybe 24-48 GB of RAM) and multiple cores (like 4, 6, or 8), which suits the parallel processing jobs that a MapReduce architecture typically generates.  The ratio of storage to compute power depends on the demands of the job.
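
As a rough, back-of-the-envelope illustration of how that storage-to-compute ratio gets worked out, here is a minimal sizing sketch in Java.  The input numbers (100 TB of data, 3x replication, 25% headroom for intermediate output, eight 2 TB drives per node) are illustrative assumptions, not recommendations.

    // Hypothetical Hadoop cluster sizing sketch; every input below is an assumption.
    public class ClusterSizing {
        public static void main(String[] args) {
            double dataSetTb     = 100.0; // raw data to be stored (assumed)
            int    replication   = 3;     // HDFS replication factor
            double tempOverhead  = 1.25;  // headroom for intermediate MapReduce output (assumed)
            double drivesPerNode = 8;     // SATA drives per data node (assumed)
            double driveSizeTb   = 2.0;   // capacity per drive (assumed)

            double storageNeededTb = dataSetTb * replication * tempOverhead;          // 375 TB
            double rawPerNodeTb    = drivesPerNode * driveSizeTb;                     // 16 TB
            int    dataNodes       = (int) Math.ceil(storageNeededTb / rawPerNodeTb); // 24 nodes

            System.out.printf("Storage needed: %.0f TB, data nodes required: %d%n",
                    storageNeededTb, dataNodes);
        }
    }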

Experienced systems administrators will wonder what is being done to protect the integrity of the system.  Clustering, redundancy and failover are standard features of the Hadoop/MapReduce software in this environment.  The common wisdom is that RAID architectures are unneeded since data is replicated across nodes in the typical Hadoop configuration.  Similarly, redundant power supplies on the servers are not required since entire nodes can fail over, so there is no need to spend money on extra hardware.

Consideration should be paid to the kind of storage placed in the cluster, since not all nodes are created equal.  Name Nodes provide a control point for multiple Data Nodes and benefit from additional attention.  Similarly, Job Tracker nodes provide control for multiple Task Tracker nodes and deserve the same care.  These high-value nodes might benefit from better quality storage, like higher-reliability HDDs or SSDs, to improve uptime. 

For more complex configurations there might be a hierarchy of storage: SATA hard drives for Data Nodes and Task Tracker nodes, SSDs for Name Nodes and Job Tracker nodes.  Tape still has value as an archive medium for these jobs, since future analysis might require a longer historical picture, in which case the data would be reloaded onto active media like HDDs or SSDs. 

At the Hadoop Summit, multiple traditional storage vendors were making a case for using their non-commodity storage.  The crux of their argument is that if your data is important enough to use, it ought to get their levels of reliability and availability.  Today's advanced storage arrays were created to solve a number of the problems of direct-attached storage, such as expandability, reliability, availability, throughput and backup. 

Additionally, VMware has made a case for virtualization to better share resources, like storage, and to manage these large clusters more efficiently. 

The bottom line is that the hardware configuration employed has to fit the needs of the job.  Knowing the size, complexity and importance of the job will help you design an appropriate platform to execute your Hadoop workload.  Some trial and error in these early days of Hadoop deployments is common, so keep an open mind!


Hadoop Summit 2012 Summary

The Hadoop Summit, held June 13 and 14, 2012, was attended by over 2,000 Big Data geeks according to the organizers.  There were over 100 sessions and keynotes.  The speakers made several particularly interesting comments.

 

Real-time systems, resource management and hardware are the next Hadoop frontiers.  – Scott Burke, Yahoo!

 

We expect to see over 50% of the world’s data on Hadoop by 2015.  – Shaun Connolly, Hortonworks

 

Data will grow from 1.2 ZB in 2010 to 35.2 ZB in 2020.  – IDC Digital Universe Study, quoted at the summit

 

Yahoo! has a Big Data configuration including over 40,000 nodes!  It uses proprietary management to handle this monster, and it serves 3,000 users inside Yahoo! 

 

For more normal configurations, large was considered to be 3,000-6,000 nodes in operation. 

 

I was surprised at how many people were muttering about how to displace Oracle installations with Hadoop.  The Open Source movement is something like a religion to some people, but the community now has its best shot at a real application-based beachhead with Hadoop and its attendant components.

 

There are several tools for piecing together a comprehensive system, and providers like Hortonworks and Cloudera have assembled stable Hadoop platforms with a set of tools to make them usable.  The established players are getting in on the action too. 

 

IBM had a significant presence with their BigInsights version of Hadoop.  VMware was very active on the demo floor and with presentations discussing how their approach will bring some additional discipline and tools to the party. 

 

One of the most fun parts was hearing from companies and organizations that have implemented Hadoop and are using it for real results today.  Because Hadoop is typically used to boost one’s competitive position, in many cases the details were missing, but the stories were no less compelling.  @WalmartLabs presented, and was also recruiting.  Indeed, almost every presentation mentioned that the company was hiring. 

 

It’s a party like it’s 1999.

 

VMware has a different take on storage for the Hadoop cluster.  By design, Hadoop uses local direct-attached storage.  VMware wants to bring the efficiencies of storage networking to the Hadoop cluster.  They showed some data about direct storage being cheaper than storage-networked configurations, and then talked about managing storage more efficiently with VMware.  The major benefit touted was availability through failover of critical components, like the Name Node.  They have a project, Serengeti, to manage Hadoop and provide some structure for availability.

 

A number of speakers addressed the issue of integrating Hadoop with existing relational databases.  Sqoop was repeatedly mentioned as a mechanism for importing relational data into Hadoop to make the analytic efforts more comprehensive and useful. 

 

Finally, log files, the long-forgotten detritus of the data center, are getting some respect.  Now that there is a method (Hadoop) for using this data to predict upcoming faults and data center problems, log data is getting new attention.  Log files are now used in security analytics to look for patterns of incursion or threats. 

 


Do You Need a PhD to Analyze Big Data?

It has been said that you need a doctorate to adequately run Hadoop and the many attendant programs for Big Data analysis.  Is this truth or fiction? 

 

The Hadoop environment is complex.  Hadoop itself is an open-source program from Apache, the people who bring you a wide range of internet-related open-source software projects.  As of May 2012, Hadoop is at release 1.0.2.  There is still a lot to learn about Hadoop and the code is not super mature.  As with many open-source projects, there are a lot of very talented people working on it, and it will gain more function and stability over time.  As you might imagine it has a lot of Java and Unix related stuffing, and there are a number of supporting programs required to have a real Big Data solution based on Hadoop.

 

At Big Data University, an IBM™ education effort on Hadoop, you get FREE hands-on experience with Hadoop and some related software.  In Hadoop Fundamentals I, you can get Hadoop by downloading the code or by using Hadoop in the cloud on Amazon™.  The coursework will take you on an adventure with some great narration for each module and hands-on labs.  The objective is to get you comfortable with Hadoop and Big Data frameworks.

 

The introductory Hadoop class gets into the IBM version of Hadoop, called BigInsights, as well as HDFS, Pig, Hive, MapReduce, Jaql, and Flume.  Oh, and by the way, there might be some user-developed code required to glue the parts together, depending on your needs.  Some assembly is required.    

 

IBM InfoSphere BigInsights™, which IBM supplies for the class, provides a stable configuration of Hadoop.  The software is free, and IBM can provide support for a price, of course.  You get a stable Hadoop version with the IBM spin, but Hadoop is also available from Cloudera, Apache and others. 

 

To actually build a system that does something with Hadoop, you’ll need some other components.  HDFS is the file system that underpins Hadoop.  The IBM course will take you through how Hadoop makes use of a file system designed to support parallel processing, and you also get an HDFS lab as part of the course.
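
To give a feel for what the HDFS lab exercises look like, here is a minimal sketch using the standard Hadoop FileSystem API to write a small file into HDFS and read it back.  The file path is a made-up example, and it assumes the Hadoop configuration files on the classpath already point at your cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml / hdfs-site.xml from the classpath, so the
            // default file system is assumed to point at your cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/hdfs-hello.txt"); // hypothetical path

            // Write a small file into HDFS (overwrite if it already exists).
            FSDataOutputStream out = fs.create(file, true);
            out.writeBytes("hello from HDFS\n");
            out.close();

            // Read it back and copy the contents to stdout.
            FSDataInputStream in = fs.open(file);
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.close();
        }
    }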

 

Next they get into MapReduce: the methodology for parallel processing, the logic of how data flows, and how the pieces fit together.  This includes merge sort, data types, fault tolerance and scheduling/task execution.  Then you get a lab to see if you really got it.
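
To make the map-and-reduce data flow concrete, here is the classic word count example, roughly the amount of Java the labs walk you through for a single simple job.  It is a minimal sketch against the standard Hadoop MapReduce API; the input and output paths are taken from the command line.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input line.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: after the shuffle/merge sort, sum the counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Even this toy job is a page of Java, which is exactly the pain the higher-level tools discussed next are meant to remove.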

 

Now that you have the big picture on Big Data, what do you do with it?  Creating a MapReduce job by hand can require a lot of Java code.  High-level tools let you create a MapReduce job with far less effort and time, and that’s where Pig, Hive, and Jaql come in.  Pig originated at Yahoo™ and provides a great way to handle unstructured data since there is no fixed schema.  For unstructured data, think of analyzing Twitter comments.  Hive was developed by Facebook™ and provides a data warehouse function for Big Data.  It is a declarative language, so you specify the desired output and it figures out how to get it.  Jaql was developed by IBM and is another schema-optional language for data flow analysis. 

 

Finally, you might want to move data into and around your Hadoop cluster.  Flume was developed by Cloudera™ for this purpose.  With Flume you aggregate data and move it into a store like HDFS. 

 

Let’s return to the original question in the title of this article: do you need a PhD to run a Big Data job?  Probably not.  But it wouldn’t hurt.  In any event you will need to be a highly competent programmer to get good results.  Hadoop solutions are best treated as an iterative process, so there will be some experimentation.  A combination of skills spanning programming and analytics methodologies will produce the best results.  So if not a PhD, a really accomplished computer science graduate would seem to be the person to enable a Big Data solution. 


Social Media and Big Data

Social media information can be defined as unstructured data: narratives, random facts, opinions, numbers, fabrications and fables, all mixed together with abandon and with unknown relationships among any of the data.  So why is this useful?

When you ask a question in a survey, you often get what the subject thinks you want to hear.  If you want to know what a person is really thinking, you might have to eavesdrop.  A better way is to check blogs, Facebook postings, forums, tweets, and the like to see what people are thinking.  In a traditional survey all the data is about you, and you control the questions.  In a social media search, the topics you want information on, like your brand or event, show up in only a small minority of the comments made.  You might have to sort through half a million documents to get any useful information, and you might get better data if you look through millions.  Where does this info come from?

There are data aggregators who are always on the prowl, gathering information for rent.  You can get raw data from the aggregators and feed it into your analytics engine.  You could also employ your own crawler to go through your documents, customer comments and the like to create the mass of unstructured data you want to understand.  The objective would be to create an aggregation metric that looks at what people are saying about your brand, product or company.  You might want to see how these comments change over time, or use them to gauge opinion on a marketing campaign or product launch.  This sort of analysis can provide valuable input to the way you run your business.

This information is still just correlated events, not causality.  As a next step for acting on this information, perhaps it’s time to join the conversation.  By analyzing the information you can find out where people are talking about you, not just what they’re saying.  You might want to monitor or participate in the more active places where you are being discussed.  This will require some thought to make sure you don’t undermine the activity, but rather give people already thinking about your company or product a way to get even more involved.  

Analyzing social media is not an exact science but an iterative process in which you do some probing and testing to see what results might be useful.  The major reason for this is the raw data.  It’s not about you.  It’s about everything.  Some sliver of the information might be interesting to you; most will not be.  You will build rule sets to apply against the text to find something, and more rule sets as you get results that aren’t quite what you were looking for.  You’ll end up creating structured data out of unstructured data.  Why?  Because you want to find actionable information, and that is best done with structured information. 
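
As a toy illustration of what turning unstructured text into structured data with rule sets can look like, here is a hedged sketch: a handful of keyword rules applied to raw posts, producing a structured table of rule counts.  The brand name, keywords and sample posts are all made up for the example.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class SocialRuleSet {

        // Each rule is a label plus a pattern to look for (all illustrative).
        private static final Map<String, Pattern> RULES = new HashMap<String, Pattern>();
        static {
            RULES.put("brand_mention",   Pattern.compile("(?i)\\bacme\\b"));                // hypothetical brand
            RULES.put("purchase_intent", Pattern.compile("(?i)\\b(buy|bought|order)\\b"));
            RULES.put("complaint",       Pattern.compile("(?i)\\b(broken|refund|terrible)\\b"));
        }

        // Apply every rule to every post and return structured counts.
        public static Map<String, Integer> apply(List<String> posts) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String rule : RULES.keySet()) {
                counts.put(rule, 0);
            }
            for (String post : posts) {
                for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                    if (rule.getValue().matcher(post).find()) {
                        counts.put(rule.getKey(), counts.get(rule.getKey()) + 1);
                    }
                }
            }
            return counts; // structured output distilled from unstructured text
        }

        public static void main(String[] args) {
            List<String> posts = new ArrayList<String>();
            posts.add("Just bought the new Acme widget, love it!");
            posts.add("My Acme blender arrived broken. I want a refund.");
            posts.add("Lovely weather today.");
            System.out.println(apply(posts));
        }
    }

In practice the iteration happens in the rule set: each pass over the data suggests new rules or refinements until the structured output is actually actionable.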

Once you’ve refined your rule set so that you’re getting useful information from the mountain of data, you’ll need to decide how often to search the mountain, what kind of analytic queries will find new information in the same mountain, and where to find new mountains of data to pursue.  The results are extracted into a structured array of objects that might include potential leads or product interest.  That’s where the gold resides.  You turn your marketing engine on those results, refine the process, and begin again.


Comments on “Competing on Analytics” by Thomas Davenport & Jeanne Harris

Much of American business is built on intuition and experience.  The authors make the case that the power is shifting to enterprises that quantify their decision process.  There are several notable examples of success.  There are few documented failures, although I would expect that they exist too.  The task of identifying a problem that Big Data and business analytics can solve is not trivial, and the effort to execute on that plan can also be significant.  The price tag of business analytics needs to be viewed in the light of any other business expense, answering the question “what is the return on investment?”  Usually the outcome is unknown, so the benefit may be difficult to quantify until after the fact.  By then, of course, the money is spent.

 

The results can also be profound.  The authors take a look at several companies that have been able to create or extend competitive advantage based on a more quantitative view of decision making.  Google, Netflix and Amazon are counted among the quantitative winners.  According to the book, basing decisions on executive whim is becoming an endangered practice.  Indeed, once a customer base gets large enough to be beyond the comprehension of one person, you are better served using statistics to understand the behavior of your customers.  Big Data/analytics can also be used to sharpen up manufacturing processes, operations, vendor management, pricing and human resources. 

 

That last one, human resources, takes a little explaining.  If you saw the movie Moneyball with Brad Pitt, you saw how statistics can trump intuition.  Billy Beane’s move to take a statistical look at players, and screen out their personal, emotional, qualitative aspects, produced one of the best runs in baseball.  They didn’t win the World Series, but for a given budget they may have maximized their return on investment. 

 

The book makes the case that a company’s employees need to be as closely scrutinized as its customers.  An interesting theory, but with what data?  One of the problems with most professions is that their evaluations are largely qualitative.  Salespeople have quotas, so you can measure quota performance, but most positions defy easy quantification.  The risk is that trivial activities that can be measured become the focus instead of contributions to the business.  If analytics for human performance at work can be developed, I would expect to see a new class of executives moving to the forefront.  Current executives have strong social skills and business skills; in many cases their social skills eclipse their business skills.  If the proper metrics can be developed and tracked, and promotions follow, a different kind of executive may emerge with better business skills and perhaps lesser social skills.  This may be good or bad for the workplace.  But this isn’t about enjoying work.  It’s about maximum returns.

 

The authors create a spectrum for the role of analytics in a business, from non-players (Analytically Impaired, stage 1) to masters (Analytical Competitors, stage 5).  They portray the transition as a change that must be led by senior executives, because it will impact managers, employees and their support structure with big budget and process disruption.  The highest use of analytics can create big results and a sustainable advantage enterprise-wide.  Analytics then becomes the primary driver of performance and business value. 

 

Indeed, there is data to suggest that extensive use of analytics can create significant market and competitive advantages.  The authors do acknowledge that not every industry can be transformed with analytics.  They point to the airline industry, which uses analytics extensively in pricing and operations, yet whose largest players keep flirting with bankruptcy.  Of course, it might be worse if they were not so analytically inclined. 

 

They break the domains for analytics into internal, such as financial, manufacturing, R&D and human resources, and external, covering customers and suppliers.  An early example might be cost management and getting a handle on costs, which can be particularly tricky for services.  They also note that analytics is generally an iterative process, and some experimentation may be required.  This increases the price tag of analytics, but it is the predominant way an organization gains the experience and expertise to then leverage projects in different areas of the business.  The path from a stage one to a stage five company will be different for each company, but the majority of companies that can overcome the management issues and persevere can become dominant in their industries if they continue to evolve. 

 

The skill set required of executives and analysts is significant and somewhat daunting.  The authors get a little hissy about treating a doctorate as the only proof of competence.  But certainly the skills laid out in the book are not pervasive in the industry, and there is widespread concern about the scarcity of skilled people to execute an analytics plan.  Executives also need some comprehension of analytics so they can correctly direct effort and resources toward high-return work. 

 

Finally, I don’t think there can be any doubt that Big Data/analytics is changing the world we live in, and the trend will only accelerate.  There will be outsized rewards for companies that move quickly, and the laggards risk a changing market in which they cannot effectively compete.  Big Data/analytics will not change everything, but it will change most things.  The companies that brave the uncertainty and resource constraints will be the ones with the best chance of survival in the economies of the twenty-first century. 


Big Data Solutions

Big Data is commonly thought of as a large data processing problem characterized by the three “V”s: velocity, variety and volume.  If we think of Big Data workloads as just a parallel processing challenge or a traditional serial processing challenge, we may not be looking at the big picture.  What kind of problem are we trying to solve, and what kind of resources do we need to throw at it?  I’ve put together a little chart to try to summarize the landscape.


The Hardware Side of Big Data

Non-linear data processing, as found in Hadoop and other parallel processing models, works quite differently from traditional linear processing. The hardware to support parallel processing is a significant tweak on traditional models. Often, when setting up servers, networking and storage to handle a large job, the defaults don’t do the job for parallel processing. Intel has written an excellent paper on the topic titled “Optimizing Hadoop Deployments”. This posting will review some of the significant findings in that paper.

 

General Topology

It’s common to see two or three tiers of servers. Due to the large number of servers to be accommodated, they are usually rack mounted. The servers within each rack are interconnected with a 1GbE switch, and each rack switch is connected to a cluster-level switch, usually 10GbE.

 

Hardware Configurations

The study at Intel indicated that dual-socket servers are more cost-effective than large-scale multi-processor platforms for parallel processing. The Hadoop cluster as a whole takes on many of the features found in an enterprise data center server, so save your money.

 

The amount of storage required to handle a Big Data job translates into a large number of disk drives per server.  Intel used 4-6, but felt that a shop may want to experiment with even higher ratios, like 12 drives per server. I/O intensity will vary by job, so some experimenting might be appropriate. Also, their suggestion is for relatively modest 7,200 RPM SATA drives with capacities between 1 and 2 TB. Using RAID configurations on Hadoop servers is not recommended, because of Hadoop’s own data provisioning and redundancy in the cluster.

 

Server memory is usually large, like 12 GB to 24 GB. Large memory is required to allow a large number of map/reduce tasks to run simultaneously. They even recommend that the memory modules be populated in multiples of six to balance across memory channels. Make sure ECC is turned on. Intel also showed some nice performance increases from their Hyper-Threading Technology; it has to be turned on in the BIOS.
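
To see why the memory goes so fast, here is a tiny, illustrative calculation; the reserve for the OS and Hadoop daemons and the heap per task are assumptions made for the sake of the arithmetic, not figures from the Intel paper.

    public class TaskSlotMemory {
        public static void main(String[] args) {
            int ramGb          = 24; // total server memory (from the range above)
            int osAndDaemonsGb = 4;  // reserved for the OS, DataNode and TaskTracker (assumed)
            int heapPerTaskGb  = 2;  // JVM heap per map or reduce task (assumed)

            int concurrentTasks = (ramGb - osAndDaemonsGb) / heapPerTaskGb;
            // (24 - 4) / 2 = 10 simultaneous map/reduce tasks on this node
            System.out.println("Concurrent tasks this node can hold: " + concurrentTasks);
        }
    }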

 

Networking, as previously mentioned, will be 1GbE at the server and 10GbE at the cluster level. Intel also suggests twin 1GbE ports bonded together to create a bigger pipe for the data, and eight queues per port to ensure proper balancing of interrupt handling across the processing cores.

 

Software Configurations

A recent version of Linux is suggested for good energy conservation. Kernels older than 2.6.30 may use 60% more power, and if you’re configuring hundreds or thousands of servers, that adds up in a hurry. Linux should also have the open file descriptor limit raised to something like 64,000 (instead of the default 1,024).

 

Intel also suggests Sun Java 6 (specifically Java 6u14 or later) to take advantage of optimizations like compressed ordinary object pointers.

Hadoop configuration choices need to be reviewed too. Hadoop has several components, including its file system, HDFS. Mounting the underlying file systems with the noatime and nodiratime attributes disables the recording of access times and improves performance. Increasing the file system read-ahead buffer from 256 sectors to 1,024 or 2,048 also helps. Additional HDFS configuration settings are reviewed in the paper.
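
Many of these Hadoop-level settings are just key/value pairs in the configuration files (core-site.xml, hdfs-site.xml).  The sketch below shows the same idea through Hadoop's Configuration API, using a few standard 1.x keys; the specific values are assumptions to illustrate the mechanism and should be tested against your own workload.

    import org.apache.hadoop.conf.Configuration;

    public class HdfsTuningSketch {
        public static void main(String[] args) {
            // Normally these live in the XML config files; the programmatic
            // form just makes the key/value shape of the settings visible.
            Configuration conf = new Configuration();

            conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB HDFS blocks (assumed value)
            conf.setInt("dfs.replication", 3);                  // standard 3-way replication
            conf.setInt("io.file.buffer.size", 131072);         // larger I/O buffer (assumed value)

            System.out.println("dfs.block.size      = " + conf.get("dfs.block.size"));
            System.out.println("dfs.replication     = " + conf.get("dfs.replication"));
            System.out.println("io.file.buffer.size = " + conf.get("io.file.buffer.size"));
        }
    }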

 

The bottom line is that parallel processing platforms like Hadoop, and Big Data jobs in general, require a different mindset than traditional linear, batch or OLTP-type configurations. You might want to go to the Intel website and download their paper “Optimizing Hadoop Deployments”.


Big Data Effect

Notes from the Churchill Club meeting December 7, 2011

Panelists:

Keith Collins, SVP SAS

Gil Elbaz, CEO Factual

Ping Li, Partner Accel Partners

Luke Lonergan, CTO Greenplum/EMC

Anand Rajaraman, SVP WalmartLabs

Michael Chui, Senior Fellow McKinsey

 

The meeting was in the style of a panel responding to questions from Michael Chui, who took the role of moderator.  Eventually, the questions came from the audience.  Michael did a nice job of extracting comments from the whole panel.  The flow of the discussion tended to wander once the audience members were polled, so you may notice some issues emerging more than once.

 

I’ve captured some of the more interesting comments in this post.  For instance, a distinction was drawn between Open Data, which might be transparent, easy to access and community driven, and Big Data.  Big Data might include Open Data, but is not limited to it.  Information that has limitations on its use, such as medical information, is not Open Data, by law.  There were a number of questions probing different aspects of the relative openness of data.  The panel agreed this is an area that is still evolving, and that the use of abstracted or anonymized data is still evolving due to privacy concerns and laws.  Legislative change on data use and ownership is still an open issue. 

 

Social data might include public or Facebook information.  Facebook might be an example where the person owns the data and grants permission for its use on a broader basis.  One use of this might be something like Shopping Cat, where you can research what to get your friends based on what your agent finds crawling through Facebook. 

 

One tool available today for Big Data analysis is Hadoop.  Hadoop is an open-source framework built around a parallel file system and designed for Big Data.  Due to the difficulty of working with Hadoop, a few strategies have emerged.  One is to create a team of experts to build and manage the system around Hadoop to generate business intelligence from massive amounts of unstructured data.  The panel seemed to agree that there is a shortage of qualified people to perform these tasks.  The data scientist, who acquires, analyzes and tells stories with the data, is emerging as a high-demand position.  Additionally, the skill set needed to realize the potential of Big Data has to be collaborative.  New ideas and behavior modeling will come from outside traditional IT skills.  It is through the introduction of new approaches and the large amount of data to be analyzed that the power of Big Data will be felt.  The democratization of data will better exploit the data resource.  With Big Data comes the need for data quality, and the tools in this area are growing. 

One of the structural problems with Big Data is the break with the traditional CPU/memory/storage architecture.  To pursue a parallel file system like Hadoop’s, business intelligence applications need to be restructured.  Since you are no longer just sampling data, the sheer volume of information to be processed overwhelms traditional architectures. 

 

One of the significant new sources of Big Data is the mobile location data from smart phones and similar devices.  This also harkens back to the discussion of sensitive data and privacy concerns. 

 

Applications using the power of Big Data are still evolving.  Greenplum/EMC, creator of a Big Data appliance, has created a recognition event for its customers called DataHeros, which recognizes people who are making good use of data in new ways.  Some applications include new ways to identify credit card fraud and child abuse, and even helping the government identify tax cheats. 

 

Another example is Kaggle, a crowd-sourcing platform for data analytics that is co-sponsoring a contest with a $3 million purse to help solve difficult problems, in particular a preventative health care problem.  

 

New Big Data business ideas might include:

  • Changes in the way local services are bought, with coupons and local commerce opportunities identified by geo-locating capabilities.

 

  • Mobile and Big Data intersect in a way that might be useful, not harassing.  It might enable better management of a fleet of delivery trucks, for example.

 

  • There are ways to be explored that combine public and private data for a more complete view of a problem.

 

  • Smart grid management could be a natural for Big Data to help manage the nation’s energy. 

 

  • Data from security cameras has already helped security agencies worldwide, but the next generation might include millions of cheap cameras to provide a more complete view of our world.  They could get so cheap we could put cameras on pigeons.  (I trust you appreciate the humor of the panelists.)

 

  • To better monitor pollution we could put sensors on bikes.  As sensors become more affordable, they could become widespread and give us new ways of looking at the world around us. 

 

To move people-related information forward, a standardized privacy policy with different levels of permissions and capabilities would help.  We need to move Big Data toward a more useful and ethical capability.  For those consumed with privacy, there is the Lockbin project, which provides security as a shareware app. 

 

One thing happening right now: EMC is promoting the use of Big Data through an employee prize.  The winner at EMC looked at customer service issues and found a way to make products more reliable. 

 

Walmart.com is using Big Data to learn about products and customers in a social sense.  They look at blogs to see what products are interesting, and they can compare different geographical areas to see, via social media input, what products are popular where. 

 

By the way, data isn’t just black and white.  It would be easier to regulate and categorize if that were the case.  We live in a probabilistic world with a lot of grey areas, which will make going forward with Big Data and regulation tricky. 


IDC Big Data Perspectives

IDC June 2011

Extracting Value from Chaos

By John Gantz and David Reinsel

 

The IDC team put together a report that starts off discussing the mammoth amount of data that has arrived with the advent of social media and of data collection in the natural world and in man-made systems. The numbers are staggering: 1.8 zettabytes (a zettabyte is 10^21 bytes), roughly nine times as much data as existed just five years ago. IDC estimates that by 2015 there will be 7.9 ZB in existence.

 

Most of this data is “unstructured”. In the digital world, structured data lives in tables and databases and can be indexed, searched and managed. Unstructured data includes things like emails, video, photos and voice conversations, and it makes up the vast majority of data in existence.

 

Since over 90% of digital data is unstructured, and someone will want to extract value from it, the rush is on to create tools to manage this resource. In addition to new tools for organizing this information, a new place to locate it has emerged in cloud storage. These two technologies are changing the data storage environment.

 

Deriving value from all this data is still in its early steps. Protecting this data is also at an early stage, with perhaps only half of sensitive data being adequately protected today. The famous lost laptop with sensitive information on it, a regular feature in the news, reminds us that it’s not just technology but people and their behavior that can put data at risk.

 

IDC describes “Big Data” as the analysis and use of large volumes of data, not just the fact that there are large volumes of data. They are predicting that cloud storage will be responsible for much of the management of these massive amounts of data, and that the move to monetize the value of this data will be led by the cloud storage vendors. Cloud vendors will collect and analyze data, and enable third parties to use this technology to monetize their own data.

 

Security is another evolving requirement for Big Data. Different needs and different regulatory requirements create a hierarchy of security. IDC estimated that 28% of all data requires some level of security. For instance, a YouTube upload may require only privacy for the user’s email address. Information that MIGHT be required for litigation discovery in an organization would need to be protected to some extent. Sensitive personal information held by a custodian requires a higher standard of protection, as does organization-confidential information. Finally, military and banking information requires the highest protection.

 

IDC has also created the concept of a “digital shadow”, covering the information about a person rather than information that person has created. As technology and our presence in the digital world expand, so does our digital shadow. For instance, our Facebook page might have data we put there, but also data put there by other people, and digital shadow information would also include who our friends are and what can be derived from those relationships.

 

The point of the paper is to reinforce and expand current storage management strategies and practices, as well as to consider new approaches, like cloud storage, to manage the emergence of Big Data. This change typically has to be managed in a flat to modestly increasing resource environment, which will challenge even the industry’s best to keep up.


Big Data McKinsey Report

The 2011 McKinsey Big Data report (see the McKinsey.com website for a free copy) covers a broad range of issues with Big Data.  Given their expertise on all things technical, their 300-plus-page report carries significant gravitas on the topic.  Of course they are fishing for consulting assignments as a result, but they touch the nerve of a real issue: how can organizations stay on the leading edge?

The McKinsey answer is to apply technology to look at all aspects of your business and systematize your approach.  The data that may be accumulating without being analyzed could hold the key to getting closer to your customers by understanding their behavior better, to understanding your suppliers better, or to being more methodical, creating more predictable results and less variability in your operations.

The report doesn’t spend any significant time on the mechanics of Big Data technology, like Dynamo from Amazon or the open-source Hadoop software for analyzing massive amounts of data.  The paper is aimed at identifying the pain of not properly using Big Data, and at building the business case for why a significant amount of money should be spent to master your data.  In all fairness, they do provide some additional granularity, with some industries being more appropriate than others.  They spend time discussing four examples: health care, government services, manufacturing and telecom/mobile.

Major ways that Big Data can add value include mechanisms for both companies and consumers to become better informed and to create transparency for pricing and trends.  It can provide the platform to create new business models that are disruptive to existing approaches.  It might also take more human decision making out of the process and replace it with algorithms.  The intent is not to dehumanize the world, but to get over reinventing the wheel for the 6 billionth time.

The report also looks at the problems with Big Data implementations.  There are legal and structural reasons why the Big Data approach may hit roadblocks.  Data security, talent shortages and technology infrastructure are three areas that will slow or stop Big Data.

The report ends with a smidgen of their consulting approach, starting with a methodology outline of how to create a value potential index for your company, and some of the suitability metrics that they would employ to see if Big Data is appropriate.

The potential benefits of this approach are significant, if you are in an appropriate industry, have some budget available, and have the stomach for some real disruptive changes.  But this is what separates the leaders from the pack.
