How to Streamline DevOps for Edge AI

Staying competitive in today’s fast-paced world means continual innovation. On a more mundane level, it means getting applications built and deployed as quickly as possible. And of course you want a highly reliable way to deliver and monetize those applications. This is the DevOps tension: deliver applications quickly in an IT environment that views change as risky. Both perspectives are correct from their own points of view. DevOps philosophy manages this tension in multiple ways.

One way is to co-locate the development and operations teams to improve communication and sensitize each team to the other’s concerns. Working side by side builds the depth of understanding needed to avoid the “throw it over the wall” story that operations might cite. The development team may find their objectives hard to meet if “change management means we only deploy new apps every six months.” The business suffers if operations and development can’t create a process that works for both teams. Meaningful change comes from the face-to-face discussions that let both teams vent their frustrations and co-create solutions.

One common area of concern stems from the difference between the development environment and the deployment environment. Before the application can be deployed, it must first be modified to run on the deployment system. If the application is developed in the cloud and will be deployed at an edge site, the difference between the systems can be significant, and porting the application to the new platform adds delay. Using the same system for development and deployment can significantly shorten the deployment cycle and improve timeliness and quality. There may be some cultural issues to overcome, but remember that the shortest possible development time isn’t the objective; the shortest possible new-application deployment is.
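For teams that want to verify that the two environments really do match, a minimal Python sketch can help. It assumes Python is available on both the development and edge systems, and the package names (numpy, torch) are placeholders for whatever your stack actually uses:

```python
# Minimal sketch: capture an environment "fingerprint" on the development box,
# then run the same script on the edge box and diff the two outputs.
import json
import platform
import sys
from importlib import metadata

def fingerprint(packages=("numpy", "torch")):   # placeholder package list
    info = {
        "machine": platform.machine(),          # e.g. x86_64 vs aarch64
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for pkg in packages:
        try:
            info[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            info[pkg] = "missing"
    return info

if __name__ == "__main__":
    print(json.dumps(fingerprint(), indent=2, sort_keys=True))
```

If the two fingerprints are identical, the “port to a new platform” step disappears from the deployment checklist.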

Putting the right incentives in place focused on application deployment is a step in the right direction, but re-thinking the system environment is another powerful tool. Edge computing has a unique set of requirements for low power and small size that may not be considered optimal for app development.  However, considering the goal is shortest possible new application deployment cycle, it makes sense to consider a different approach.

Consider developing your applications on the platform you will ultimately deploy on. Once you’ve developed and tested to your satisfaction, you ship the same hardware and operating system to your edge locations, so there is no need to migrate the application to a new platform. Using one system for both development and deployment works especially well in a specialized environment like edge AI.

The winning combination of development and operations objectives means the business can get an edge on the competition and keep a steady stream of new applications that are easy to deploy.


5 Tips for Picking an Edge AI Platform

Picking an Edge AI Application Development Platform doesn’t seem that complicated. However, choosing the wrong platform could impact your productivity and the transition from development and training to inference and production. 

As the evolution of AI continues to shift towards executing analytics at the data source, it is important to point out how the environment on the edge impacts platform technology choices.  Let’s take a closer look at some of the considerations: 

  1. Build the app on the same platform you plan to deploy the solution on. Performance and model behavior become increasingly important as you roll out your applications.  The training platform should be similar to the production platform so you can appropriately judge how the model will perform when moved into production.  

  2. Build the app on a platform that supports the entire application stack (e.g. custom databases, web servers, and caching).  You need to support training and inference but don’t forget the other aspects of your application stack that also need support.  Having a highly capable node designed for the edge is important for flexibility and ROI.

  3. Edge environments need low power but lots of capability. This is one of the biggest reasons that you want the development and production platform to be the same. The edge is a different world from the data center in terms of available power and space.  Your platform has to reflect these realities of analytics at the edge.

  4. Decide on your data strategy: where does the data come from? That’s where you want your Edge AI platform to be. Some data you might want to keep forever. Some data will be of transitory use. If you are subject to audits to justify your training models, you’ll want to keep the data that was used in training available. Having a well thought out data strategy can help you avoid costly mistakes and wasted budget.

  5. Security is important: how do you get a secure network to protect your IP? Traditional approaches to security like a VPN can still leave you exposed to a man-in-the-middle attack or spoofing. A zero trust network that authenticates endpoints is more secure and should be considered (a minimal sketch of endpoint authentication follows this list).
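As a rough illustration of what endpoint authentication can look like, here is a minimal Python sketch of mutual TLS, one common building block of a zero trust design. The certificate paths, port, and private CA are placeholder assumptions, not a recommendation of any particular product:

```python
# Sketch of mutual-TLS endpoint authentication: the edge node only accepts
# clients presenting a certificate signed by a CA we control.
import socket
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="edge-node.crt", keyfile="edge-node.key")
context.load_verify_locations(cafile="private-ca.crt")   # our own CA
context.verify_mode = ssl.CERT_REQUIRED                   # no cert, no connection

with socket.create_server(("0.0.0.0", 8443)) as server:
    with context.wrap_socket(server, server_side=True) as tls_server:
        conn, addr = tls_server.accept()    # handshake authenticates the peer
        print("authenticated client:", conn.getpeercert().get("subject"))
        conn.close()
```

A client would present its own certificate signed by the same private CA; anything else is rejected at the handshake, before any application data flows.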

Picking an Edge AI platform does require balancing a number of factors. Generic computing platforms may work for Edge AI, but they are not optimized for it and can negatively impact your time to solution and ROI. In today’s competitive environment you need every advantage, so it’s worth some time to explore your options.

While you’re exploring your options, why not take a look into Cachengo? Cachengo is an Edge AI platform that can support your Edge AI project with purpose-built hardware and an operating environment that makes model development and deployment easy. 

 Go to https://cachengo.com/

Analytics at the Edge with IOT Data

In an odd way, today’s enterprise challenge is too much data, most of it going unused. A focus on traditional data infrastructure is no longer sufficient, and the variety of new technology choices is part of the reason. In particular, data at the edge of the enterprise includes the internet of things (IOT), which creates new opportunities and new challenges.

Examples of IOT include automation in warehouses, factories, mass transit, and utilities, as well as getting more out of your video feeds, powering autonomous vehicles, and monitoring jet turbines to optimize service schedules and operational efficiency. The existing IT infrastructure at the edge is usually targeted at solving a specific, well-defined problem. It moves processing to where the data resides and avoids shipping data to a central site, thereby avoiding network congestion. That is only part of the solution, however. You’re solving today’s problem. You have the opportunity to take a step ahead and consider a bigger picture.

Data strategy should play a bigger role in IOT implementations. As an example, the smart factory initiative seeks to automate functions that analyze machine and sensor data at the edge. Data coming from a specific tool or machine can often be analyzed in real time to improve the operation of the machine. If there is a data strategy in place, that data can also be used in aggregate to improve the entire manufacturing process by showing how the machines interact and how solving problems in one area can improve overall productivity.
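As a hedged illustration of what “analyze at the machine” can mean in practice, here is a minimal Python sketch that flags a sensor reading drifting away from its recent behavior. The window size, threshold, and sample values are invented for the example:

```python
# Minimal sketch of real-time analysis at the edge: flag a machine reading
# that drifts too far from its recent baseline.
from collections import deque
from statistics import mean, stdev

WINDOW = 120                        # keep only the most recent readings
readings = deque(maxlen=WINDOW)

def check(value, threshold=3.0, min_history=5):
    """Return True if value is an outlier versus the recent window."""
    if len(readings) >= min_history:
        mu, sigma = mean(readings), stdev(readings)
        if sigma > 0 and abs(value - mu) > threshold * sigma:
            return True             # flag it and keep it out of the baseline
    readings.append(value)
    return False

# Example: feed a few simulated vibration readings through the detector.
for sample in [1.0, 1.1, 0.9, 1.05, 0.98, 1.02, 5.7]:
    if check(sample):
        print("anomaly detected:", sample)
```

The same raw readings, stored under a data strategy, remain available later for the aggregate analysis described above.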

Treating data as a resource lets you reconsider the problem you are trying to solve. You quickly see that a data strategy can become a valuable tool for deciding how to structure workflows, improve operations, understand your customers and markets better, and perhaps even gain a competitive advantage. Data silos exist to solve specific business problems. One example of a more comprehensive data strategy would be using public cloud analytic tools to process edge or IOT data and repatriating the results to a central site so they can be used in a broader context for additional benefit. Think about moving beyond silos to a broader data strategy that gets the full value of your data. Don’t let IOT data languish at the edge. Stomp on the data accelerator to move ahead.

New possibilities in your data strategy emerge as new enabling technologies appear. Fundamental changes in the cost of processing, storage, and networking create new ways to monetize your data beyond the IOT silo. Moving compute to the edge has never been easier or less expensive. Reimagine how new technologies can be used for purposes beyond their original intent to bring more value.

There’s been a lot of discussion about how 5G will make mobile and edge data more accessible with significantly higher bandwidth and security. It can also be the backbone of private networks and ultimately replace WiFi and perhaps Bluetooth. There are additional technologies that will impact edge and IOT, such as NVMe-oF, a low-latency, highly scalable networking technology that can work over today’s Ethernet with minor adaptation. It’s not just hardware advances either: applications built with containers can make rolling out new applications easier and faster. More open source APIs allow interaction among different parts of your infrastructure and can improve capabilities at the edge. Composable disaggregated infrastructure is another way to reduce costs and make more efficient use of resources.

Take these infrastructure advances as a way to rethink your data strategy and how to make the most of your data and your business. We are just starting to see the benefits of IOT and edge computing.  Think outside the silo.

An extreme example of the benefit of analytics at the edge is NASA’s OSIRIS-REx mission, which set out to land a spacecraft on the asteroid Bennu, collect a surface sample, and return it to Earth. In this short NASA video, https://www.youtube.com/watch?v=xj0O-fLSV7c, you can see the spacecraft make a last-minute adjustment to avoid a rock. This is AI in action, saving a multi-million dollar project at the last minute. Due to the distances involved, there is no way to control this from Earth; onboard AI made the save. You can bet the data from this mission will be reviewed in aggregate for additional insights.

Finding the right mix of compute and capability at the edge for IOT applications and developing a data strategy that makes sure you are getting the full value of your data is a moving target. By keeping the strategic importance of data in mind as you design your IOT solutions you can create a culture of continual improvement and get the full value of your data.

Scale-out Analytics

Problem

Analytics have already changed the world.  The business world.  The science world.  The education world. The government world. And still, most of the data we have has not been used.  An IDC research report, funded by Seagate Technology, states that 68% of enterprise data goes unused.1 If we are to extract more value from data, we need to analyze more data.

Most of today’s enterprise data centers aren’t configured to handle the petabytes of data needed for large-scale analytics. The approach that many enterprises take is to move to the cloud to get the scale and ease of use that is needed for scale-out analytics. It can work.  But it’s not cheap.

To give you an idea of how expensive cloud computing can be, consider Amazon Web Services. AWS was responsible for 13% of Amazon’s overall revenue but about 71% of its overall operating profits.2 However, you don’t need to go to the cloud to get the benefits of the cloud, and you might be able to save some money. Having a data strategy that uses both cloud and on-prem resources appropriately gives you a framework for deciding which applications are best served where. This becomes more important over time, since you pay to export data from the cloud; if you change your mind down the road, it might be prohibitively expensive to move the function back on-site.
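A quick back-of-the-envelope calculation shows why egress matters. The per-GB rate below is an assumption for illustration only and should be replaced with your provider’s actual pricing:

```python
# Back-of-the-envelope sketch of cloud egress cost for repatriating data.
EGRESS_PER_GB = 0.09          # assumed list price, USD per GB (check your provider)
DATASET_TB = 200              # data you might want to move back on-site later

egress_cost = DATASET_TB * 1000 * EGRESS_PER_GB
print(f"One-time egress for {DATASET_TB} TB: ${egress_cost:,.0f}")
# 200 TB at $0.09/GB is $18,000 for a single move, a number worth knowing
# before the data, and the applications around it, settle into the cloud.
```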

So back to the original problem: how should an enterprise build scale-out analytic infrastructure to take advantage of the massive amount of data available but still unused?

Solution

A significant reason that enterprises haven’t been executing scale-out analytics on premises is that the existing infrastructure is largely designed to accommodate the bread-and-butter databases that manage transactions, inventory, and customer service. Moving to a scale-out analytics infrastructure is another matter.

One of the significant benefits that the cloud offers is a scale-out infrastructure for very large jobs. Cloud hyperscalers had to develop their own technology to provide scale-out infrastructure, and that technology was only available to you as a customer of the hyperscalers. One of the critical enabling technologies for scale-out infrastructure is only now coming to the general market: NVMe-oF.

The foundational technology, NVMe, has been around for a few years, and it has fundamentally changed the way flash storage communicates. Previous protocols were built to accommodate hard drives, which are much slower and lack the queue and command depth of NVMe, so the resulting improvement in performance has been profound. The limitation of NVMe on its own is that the great performance stays inside the box; once the box is connected to the network, you’re plugging into an older, slower, less scalable fabric. NVMe-oF, or NVMe over Fabrics, extends NVMe across a fabric that is both scalable and blazing fast. NVMe-oF can run over Ethernet and IP, so there is no need to struggle with an all-new, unproven technology; you can upgrade and expand your existing IP network to support it.
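As a rough sketch of how little ceremony NVMe over TCP requires on an existing IP network, the following Python wrapper drives the common nvme-cli utility on a Linux host. The target address, port, and NQN are placeholders, and the exact flags may vary with your nvme-cli version:

```python
# Sketch: discover and connect to an NVMe/TCP target from a Linux host,
# assuming nvme-cli is installed. Address, port and NQN are placeholders.
import subprocess

TARGET_ADDR = "192.168.10.20"
TARGET_PORT = "4420"                      # conventional NVMe/TCP port
TARGET_NQN = "nqn.2023-01.com.example:scaleout-pool1"

# Ask the target what subsystems it exposes.
subprocess.run(
    ["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True,
)

# Attach one subsystem; it then appears locally as an /dev/nvmeXnY device.
subprocess.run(
    ["nvme", "connect", "-t", "tcp", "-n", TARGET_NQN,
     "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True,
)
```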

Benefits

With the speed, scale, and flexibility of NVMe-oF, you can create a cloud-like architecture on your premises and reap the benefits in performance, scale, and flexibility as well as cost savings. Additionally, when you own your infrastructure on-prem, you need not worry about the future generosity of your cloud vendor. Enterprise IT operations can benefit from the flexibility of a composable disaggregated architecture built with NVMe-oF to create configurations that suit the application: lots of CPU power and ultra-fast storage for traditional database jobs, lots of GPU power and fast storage for scale-out analytics, or lots of CPU and less expensive storage for backup and archive workloads.

Summary

Strong competitive forces are driving enterprises to make the most of their data. When scaling and flexibility are no longer limitations you have an opportunity to get more out of your data by using more data. The opportunity to discover new insights with data, previously unseen with smaller data sets, can improve science, government, education and business.  The ability to understand your customers more profoundly, detail your operations more precisely, and manage your marketing more effectively are game changers that get better with more data.  To get more out of your data you need a scale-out architecture, and NVMe-oF is the way to make it happen.

1 https://apnews.com/press-release/business-wire/e7e1851ee8a74ca3acb1b089f6bd0fa8
2 https://www.geekwire.com/2019/amazon-web-services-growth-slows-missing-analyst-expectations/#:~:text=AWS%20was%20responsible%20for%2013,of%20Amazon’s%20overall%20operating%20profits.

Creating a Data Strategy for Analytics

Powerful analytic tools are at our disposal. These tools provide great insights into data, allowing us to make better decisions. But are we looking at the right data? Don’t better decisions depend on the quality of the data? What is the data strategy to make sure we are considering the right data in our analytics?

Golden Copy Concept

The Golden Copy is the agreed-upon version of truth to be used for analytics and other important applications. Sometimes known as the single source of truth, the Golden Copy is built from agreed-upon sources of data, delivered with the appropriate timeliness, which then pass through a data curation process that often includes traditional extract/transform/load (ETL) steps. The result is data that has been vetted and can be used confidently and consistently for a multitude of purposes.
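A minimal sketch of such a curation step, assuming pandas and two illustrative source files (the file names, columns, and validation rules are invented for the example):

```python
# Minimal ETL sketch for building a Golden Copy table with pandas.
import pandas as pd

# Extract: pull from the agreed-upon sources.
crm = pd.read_csv("crm_export.csv")            # e.g. customer_id, region, revenue
billing = pd.read_csv("billing_export.csv")    # e.g. customer_id, invoices

# Transform: normalize, validate, and reconcile the sources.
crm["region"] = crm["region"].str.strip().str.upper()
merged = crm.merge(billing, on="customer_id", how="inner", validate="one_to_one")
clean = merged.dropna(subset=["revenue"]).drop_duplicates("customer_id")

# Load: publish the vetted result as the single source of truth.
clean.to_parquet("golden_copy/customers.parquet", index=False)
```

The point is less the specific tooling than that every downstream consumer reads from the vetted output, not from the raw sources.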

Data Ownership

The process of validating appropriate sources of data is no simple task. This is not just an IT exercise, nor is it just a business unit (BU) process. The BU needs to define the decisions to be made and which sources of data are needed to support those decisions. IT and business analysts may be able to help the BU identify additional data sources. For data to become a competitive advantage, it may be necessary to combine traditional and non-traditional sources of data. There are plenty of data sources available, but care must be taken to consider the provenance and quality of the data. A common arrangement is for data ownership to reside with the BU, which makes data source decisions, while IT owns the infrastructure and processes for the data.

Data Management

Data management comprises several different processes, and there are different considerations for data sources inside and outside the company. Data management typically exists for data from inside the company, such as sales data, manufacturing process data, and supply chain data. There may be other data from inside the company, often unstructured, that has no existing management process to determine whether it is suitable for the Golden Copy repository. Examples include sensor data, customer service feedback, process machinery data, or other IOT data that needs to be adapted for more general use before it is placed in the Golden Copy repository.

There are also potential data sources from outside the company which may have very different ETL needs compared to internal data.  For instance, social media data for the marketing department may require a lot more screening than internally sourced data from the CRM application.  There needs to be a process to prepare data for the Golden Copy to make it useful.    

There can be such a thing as being too careful, however.  Remember, more data is often more insightful than better algorithms. A competitive advantage may be to incorporate new sources of data to help your analytics even if there is work to be done before it can go into the Golden Copy.

Multiple Versions of the Truth

Now you have a Golden Copy repository or data lake, and you get your first request for data that is outside the Golden Copy. Time to talk about changing business needs and multiple versions of the truth. It will be necessary to modify the Golden Copy data to accommodate new requests, but consider using the same process you created to validate the Golden Copy data in the first place. Get the group of IT and BU decision makers together to make sure that, as the new data is created, it can be trusted and curated, then either added to the Golden Copy or used for a one-off need, as appropriate.

Conclusion

Using data as a competitive advantage is a great concept.  Making data a competitive asset is a lot of work.  By putting the right strategy and processes in place you can create a model that will continue to improve over time and create a foundation for a sustainable competitive advantage. As always, data analytic success is best when the desired result is firmly in mind before proceeding.  In many cases, how you source and treat the data will be critical to data analytic success.

Big Data Success Factors

Computerworld has released a study by Computing, a UK company, regarding the state of Big Data in the real world. Big Data Review 2015 holds few surprises, but plenty of confirmation that Big Data tools need careful handling to get the right results. In the early days of Big Data there was a lot of experimentation just to understand the capability of the tools. There will certainly continue to be plenty of experimentation given the nature of the process, but success depends on business execution.

Computing’s 2015 findings show the rapid evolution of the industry, based on a shift in survey responses from just last year. The number of respondents concerned about the different topics has grown significantly, suggesting that Big Data methodologies are rapidly moving into the mainstream. Nobody wants to be left behind, of course. In terms of tools, data warehouses and analytics databases were, broadly speaking, tied with cloud-based Big Data services, and both saw big leaps in the number of respondents considering them.

This should come as no surprise to those practicing the art of Big Data. You probably spend most of your time importing, moving, scrubbing, and preparing data for analysis. Indeed, just finding the data that is important to your analysis can take quite a bit of work. Garbage in, garbage out still applies. The report takes time to understand why projects have been perceived as successful and wasn’t limited to just looking at the latest tools.

For instance, 76% of respondents focused on operational data to improve efficiencies, while 24% used Big Data for external opportunities. Why? Pragmatic business decision making. The fastest route to return on investment is to refine and improve operations that you already know. Operational savings go to the bottom line as increased profit; new sales only net what the gross margin allows, and improving the gross margin impacts all sales. One of the frustrations business leaders have with Big Data is using it to speculate about or predict future events when planning business spending.

The power of predictive analytics is immense. Descriptive analytics may allow you to refine existing operations with less risk, but prescriptive analytics that model how things could be allow you to move toward disruptive capabilities, with higher risk. This is traditionally the domain of the entrepreneur: the ability to remake markets and disrupt the status quo. The larger challenge for existing companies is how they decide to manage risk and failure.

This tension between the business decision maker and the analytics professional has been true from the start, of course.  The difference now is that decision makers have seen how analytics can improve their business, and funding for analytics is increasing based on that success.  The most visible expression of this is the democratization of Big Data with more self-service tools for business professionals.  The counter trend is reluctance by some departments to share data or cooperate with broader Big Data projects.

To be successful with Big Data projects, the survey identifies a number of factors, the top three being: 1) business buy-in, 2) knowing your data, and 3) a core understanding of the business. The implications of just these three are important. Business is moving more aggressively into analytics, but with a purpose. High return-on-investment objectives will keep projects focused and tend to lean toward the operational side, where a return is more likely. Knowing your data is another critical aspect of a Big Data project: integrating several sources of data and preparing it for analysis is not trivial and takes a lot of time and effort. Most survey respondents felt that 80-90% data accuracy is “good enough” for most projects; the diminishing returns of further improvement may not change the decision making. Finally, this isn’t an academic exercise. The project’s success will depend on a deep understanding of the business. A big part of analytics is deciding what problem you want to solve, and a poorly formed premise or problem will lead to unsatisfactory results.

Computing’s 2015 report has some great information in it, and it highlights the changes in the industry just in the last year.  As analytics becomes more mature it will find application in more companies and more projects, and that’s good for Big Data and the economy.

See the report at: http://resources.computerworld.com/ccd/assets/84539/detail

Beyond Correlation: Process Models

Big Data is the hunt for meaning in an ocean of data. Until tools like Hadoop and NoSQL became available, it wasn’t practical to derive much visibility from unstructured data, and certainly not much meaning from social media. Now, with these tools, we can bring order to chaos and look into data more closely. One belief regarding Big Data analysis is that we don’t need to understand the cause of the correlations we find in data: simply by understanding the relationships between factors that become apparent in the data, we can find useful information. Still, correlation is not causation. We may not understand why two factors are related, but it is still useful to understand the correlation.

To move our understanding beyond correlation, a step closer to causation can be Process Mining. Process Mining looks beyond correlation to further refine associations in the data. Indeed, Wil van der Aalst of Eindhoven University of Technology would posit that by looking at a more structured view of data relationships, we can discover processes in the data. From a Process Mining perspective we can identify processes, identify process bottlenecks, and, by looking at what people are actually doing, potentially improve processes. Finally, we can predict outcomes based on the processes we find by testing with real event data.

Process Mining is different from, but related to, data-oriented analysis. Both approaches start with event data. The difference is that Process Mining starts by mapping chains of events to create, refine, and test models that fit the data. You can then use the model suggested by this technique to look for bottlenecks in existing processes. By testing with actual event data you see the way events actually occur, not the way they should occur.

To test the relationships suggested by Process Mining, there are three approaches:

  • Play-Out: start from a proposed model, describe the work items and the workflow, and generate all possible runs to understand the range of behavior the model allows. Start with what is already designed.
  • Play-In: look at the existing process and infer a model from it, so the model reflects what actually happens rather than what you assume to be in place.
  • Replay: with a model from either method (Play-Out or Play-In), replay actual event log data on the model so you can assess the strengths and weaknesses of the different models. Replay shows deviations from expected behavior so you can tune the model accordingly (a minimal Replay sketch follows this list).
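To make Replay concrete, here is a tiny Python sketch, with invented step names, that treats a model as a set of allowed hand-offs and replays an event log against it to surface deviations:

```python
# Tiny Replay sketch: a model expressed as allowed hand-offs between steps,
# and an event log of real traces replayed against it.
MODEL = {
    ("receive_order", "check_credit"),
    ("check_credit", "ship_goods"),
    ("check_credit", "reject_order"),
}

event_log = [
    ["receive_order", "check_credit", "ship_goods"],   # conforms to the model
    ["receive_order", "ship_goods"],                    # skips the credit check
]

for trace in event_log:
    deviations = [
        (a, b) for a, b in zip(trace, trace[1:]) if (a, b) not in MODEL
    ]
    verdict = "conforms" if not deviations else f"deviates at {deviations}"
    print(trace, "->", verdict)
```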

Classic Big Data mining starts with data. Process Mining also starts with data. The difference is that data mining searches for correlations, while Process Mining searches for processes. Data mining might produce a simple set of Key Performance Indicators (KPIs). Dr. van der Aalst would argue that simple KPIs lead to problems because they are too simplistic. Understanding how people work together to accomplish a task is more informative. A KPI may show a deviation, but the user may have no idea where the deviation comes from, or how to get the process productive again.

By using Process Mining you can identify bottlenecks or unproductive process steps through conformance checking, making sure that what you think is happening really happens. Data Mining can be either supervised learning with labeled data or unsupervised learning with unlabeled data. Unsupervised learning often relies on cluster or pattern discovery, while supervised learning often employs regression to analyze relationships. There are lots of tools to assist in Data Mining, and most of the attention in the industry has focused on it.

A good example of a Process Mining tool is the Decision Tree. In a decision tree analysis, response variables are predicted from predictor variables, so decision trees can be used to predict outcomes based on an expected process flow identified in the tree. One of the limits of the Decision Tree is deciding how much of a good thing is practical. Multiple iterations of the tree with actual data (Replay) will allow you to keep refining the decision branches in the model. Complex processes can create a very large Decision Tree, so decide at the outset how many levels of tree you want to allow (for ease of use), or what purity thresholds you want the splits to reach, before you stop developing the model. As always, consider what you want the result to be before you start the process.
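As a small illustration, the scikit-learn sketch below fits a depth-limited decision tree on invented event features; max_depth is the knob for deciding how many levels of tree to allow:

```python
# Hedged sketch: fitting a depth-limited decision tree with scikit-learn.
# The event features and outcome labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [queue_length, handling_minutes]; label: 1 = missed deadline.
X = [[2, 10], [8, 45], [1, 5], [9, 60], [3, 12], [7, 50]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["queue_length", "handling_minutes"]))
print(tree.predict([[6, 40]]))   # predict the outcome for a new case
```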

There are free tools for both Data Mining and Process Mining.

The Process Mining website: http://www.processmining.org/

Check out RapidMiner for Data Mining at http://sourceforge.net/projects/rapidminer/.

For Process Mining, look at ProM, an open source software tool, at http://www.promtools.org/prom6/downloads/example-logs.zip

Enjoy the Process!

Big Data Preparation

Let’s face it, the big data world involves a lot of unglamorous heavy lifting. One of those less glamorous jobs is preparing the data for analysis. Taking a pile of unstructured data and creating some structure for further analysis takes thought, rigorous process, and careful documentation. Unstructured data lacks row and column structure, which makes it hard to apply traditional analytic tools to the raw information. Data preparation provides the structure that makes the data suitable for further analysis.

To assure reproducible results, every step should be documented so that an independent researcher can follow the procedure and obtain the same result. Documenting the steps taken to prepare the data also matters in the event of subsequent criticism, or if the process needs to be altered because the ultimate results are not satisfactory. Because the results of the downstream analysis depend on assumptions made during data preparation, the process must be carefully captured.

The raw data may be an unorganized, partial, or inconsistent (such as free text) collection of elements used as input to an analytic process. The objective is to create structured data suitable for further analysis, which according to Jeff Leek and colleagues at Johns Hopkins should have four attributes:

  • One variable in exactly one column
  • Each observation in exactly one row
  • One table for each kind of variable
  • Multiple tables with a column ID to link data between tables  

Best practice also includes plain-language descriptions of the variables at the top of the table, not just shorthand names that are easily forgotten or misidentified.
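A small pandas sketch of this structure, with invented sensor readings, shows one variable per column, one observation per row, and an ID column linking the tables:

```python
# Sketch of the tidy structure described above, using pandas.
import pandas as pd

readings = pd.DataFrame({
    "device_id": ["d1", "d1", "d2"],
    "timestamp": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 08:01",
                                 "2024-01-01 08:00"]),
    "temperature_c": [21.4, 21.6, 19.8],      # one variable, one column
})

devices = pd.DataFrame({                      # a separate table for device facts
    "device_id": ["d1", "d2"],
    "location": ["line 3", "warehouse"],
})

# The shared device_id column links the tables when analysis needs both.
print(readings.merge(devices, on="device_id"))
```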

Good experimental hygiene also includes a code book and an instruction list, again with the intent of creating reproducible results. The code book can be a text file with a short section on study design, descriptions of each variable, and the units they are measured in. Summarizing the choices made in data preparation and experimental design will help future researchers understand your experiment and reproduce it as necessary. The instruction list is ideally a script whose input is the literal unstructured data and whose output is the structured data. If every step can’t be captured in the script, then produce whatever documentation would allow an independent researcher to reproduce the results, so people know exactly how you prepared the data.

Data can be captured from downloaded files; Excel files are popular with business and science audiences, while XML files are commonly used for web data collection. XML is an interesting case: all of the data in a file can be processed as XML, but to pull elements out selectively you’ll need XPath, a related but different language. Finally, JSON is another structured file format somewhat akin to XML, though the structure and commands are all different. JavaScript Object Notation has its own following, is commonly used, and moving data into and out of JSON is well supported.
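For illustration, the Python standard library handles both cases; the XML and JSON content below is invented, and ElementTree’s findall supports a subset of XPath for selective extraction:

```python
# Sketch of pulling selected elements out of XML and JSON with the
# Python standard library.
import json
import xml.etree.ElementTree as ET

xml_doc = """<orders>
  <order id="1"><total>19.99</total></order>
  <order id="2"><total>5.00</total></order>
</orders>"""

root = ET.fromstring(xml_doc)
# A limited XPath expression selects just the totals.
totals = [float(e.text) for e in root.findall("./order/total")]
print(totals)                                     # [19.99, 5.0]

json_doc = '{"orders": [{"id": 1, "total": 19.99}, {"id": 2, "total": 5.0}]}'
data = json.loads(json_doc)                       # straight into dicts and lists
print(sum(o["total"] for o in data["orders"]))
```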

In summary, for repeatable data preparation you should document the process used, the variables, the structure, the input files, the output files, and the code that turns the raw data into your desired structured data. Once the data is structured, you can scrub it to get rid of the Darth Vaders and Mickey Mouses. Then you can start to think about your analysis.

Why Implement an All Flash Data Center?

The argument goes something like this: if flash costs more than disk, why would you spend the money on an all-flash data center?  Some might suggest that you just use flash for intense I/O applications like databases where you can justify the additional expense over disk.

What we see from Violin customers is different. Not all have gone all-flash, but for those that have, the benefits are many.

All-flash data centers can provide new sources of revenue.  Lower operating costs.  Elimination of slow I/O workarounds.  Improved application response times.  Faster report turnaround.   Simplified operations. Lower capital costs.

As a storage subsystem manufacturer, Violin puts together the best system it can design, but it is constantly being schooled by its customers. For instance, a large telecom customer was missing some billing opportunities and redesigned its customer accounting software. When the customer implemented it on their traditional storage system, they didn’t see much benefit. They saw that the application wanted even more I/O, and brought in Violin. As a result they found over $100 million in new revenue, which paid for the project handsomely, of course. This is revenue that wasn’t available with traditional storage but is captured thanks to Violin’s low latency.

Another example of how flash storage changes the data center is the impact of low latency on servers and the software that runs on them. Moving to a Violin All Flash Array speeds up I/O so much that the traditional layers of overprovisioning and caching can be eliminated. The result: better application performance at lower cost. Customers have also told me they can free up people through this consolidation and redeploy them on more productive efforts, since there is no need to manage the overprovisioning and caching infrastructure.

However, not all all-flash solutions are created equal. SSD-based solutions are inferior to a backplane-based approach like Violin’s Flash Fabric Architecture™. Consider key operating metrics such as power and floor space. For instance, 70 raw TB from Violin takes 3RU of space; common SSD-based solutions take 12RU or more for the same raw capacity. This density also translates into power: the Violin 70TB draws about 1500W, while common SSD approaches may draw over 3000W for the same capacity. This translates into operating expense savings. One customer recently estimated they would save 71% in operating costs with Violin over traditional storage.

Additionally, the Violin Flash Fabric Architecture provides superior performance, due to array-wide striping of data and parallel paths for high throughput that hold up under heavy loads. It also provides better resiliency, since hot spots are essentially eliminated. The result is not just a big step up over traditional disk storage; it is a significant improvement over SSD-based arrays.

Customers who have gone all-flash for active data have found they can buy the new storage and server equipment, and still have money left over.  This is in addition to any new sources of revenue realized, such as the Telecom example.  Flash is essentially free.

The last hurdle has been data services. Some customers who have Violin installed love the performance but were hesitant to put all their data on it because they wanted enterprise-level availability features. Capabilities such as synchronous and asynchronous replication, mirroring, and clustering give enterprises a robust tool kit for configuring their data centers to protect against local issues like fire, metro-area problems like hurricanes and typhoons, and regional issues with global replication. These capabilities now exist in the Concerto 7000 All Flash Array from Violin Memory, allowing enterprises that want transformative performance to also employ the operational capabilities they need to meet their data center design goals.

The move to the all-flash data center is upon us.

For more information go to www.violin-memory.com

OpenStack for Big Data in the Cloud

Big data places two kinds of requirements on storage: for much of the data, cost and scale are the primary considerations with performance secondary, while for real-time analytics, performance and scale are primary with cost secondary.

OpenStack has positioned itself as the platform for the Open Cloud and has the potential to impact your Big Data storage issues.  It comes in two flavors: one for block and one for file/object storage.

Block storage is the usual mode for traditional storage area networks and is served by OpenStack’s Cinder project. File/object storage, which is the home of files, video, logs, and the like, is served by OpenStack’s Swift.

Swift is for objects that aren’t used for transactions. Why? The data in Swift is eventually consistent, which isn’t appropriate for transaction data but is just fine for much of the static data found in big data: photos, video, log data, machine data, social media feeds, backups, archives, and so on. Readers might recognize a previous Big Data Perspectives blog discussing the differences between consistency models and their appropriate applications. A key/value pair might be a good fit for eventual consistency, but your bank records should be consistent with ACID guarantees. One potential issue is the need to change applications, because this is a new approach: applications have to speak Swift’s REST API, although a gateway can allow legacy applications to work with Swift. REST (representational state transfer) makes web-style access widely usable via HTTP-type commands. Riverbed is an example of a Swift implementation. Without a traditional hierarchical structure in place, Swift provides unlimited scalability, but with uncertain performance. The focus is on commodity hardware and open source software to keep the cost of storage low.
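As a hedged sketch of what that REST interaction looks like, the snippet below stores an object in Swift with plain HTTP PUTs; the storage URL, container name, and token are placeholders and assume you have already authenticated against your identity service:

```python
# Sketch of storing an object in Swift through its REST API, assuming a
# storage URL and auth token are already in hand. Values are placeholders.
import requests

STORAGE_URL = "https://swift.example.com/v1/AUTH_demo"
TOKEN = {"X-Auth-Token": "example-token"}

# Create a container, then upload a log file as an object inside it.
requests.put(f"{STORAGE_URL}/sensor-logs", headers=TOKEN).raise_for_status()
with open("edge-node-42.log", "rb") as f:
    resp = requests.put(
        f"{STORAGE_URL}/sensor-logs/edge-node-42.log", headers=TOKEN, data=f
    )
resp.raise_for_status()
print("stored, etag:", resp.headers.get("Etag"))
```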

Cinder is for block data that could be attached to your SAN and could include transaction data in the cloud. Where performance is more important, or for transactional and database requirements, Cinder is the more appropriate choice. It has big-time supporters such as IBM and NetApp. You can understand the major storage vendors’ dilemma, however: the whole focus of OpenStack is to use its software and commodity hardware to bring down the cost of storage. The vendors do provide API compatibility to allow their proprietary systems to communicate with an OpenStack node.

There might be a way to bring the worlds of proprietary and open together to get the best of both. By using proprietary systems for ACID-related data, typically transactions, databases, CRM, ERP, and real-time analytics, and OpenStack for less critical data, there is a way to put value where it is recognized and commodity where it is not.