Extremely Large Database Conference

SLAC/Stanford Extremely Large Database Conference (XLDB), September 2012

The XLDB conference is targeted at highly sophisticated end users working on large commercial, scientific and government projects.  Companies represented included eBay, Zynga, Google, Comcast, Sears, Facebook, Cloudera, AMD, HP, 10gen, NTT, Teradata, EMC, StumbleUpon and Samsung.  Educational institutions included the San Diego Supercomputer Center, Michigan State, UC Santa Cruz, University of Chicago, University of Washington, Portland State University, University of Toronto, UC San Diego, Georgetown University, Jacobs University and Stanford/SLAC.  National labs and research organizations included CERN, Lawrence Berkeley, and the National Center for Biotechnology Information.

Presenters came from a variety of backgrounds, but the common thread was solving large, complex computing problems.  There was a bias toward open source software solutions, but commercial vendors were also welcome if they focused on large, complex problems.  The audience was very familiar with the common approaches and came together to refine their approaches and tools.  I have condensed the sessions I attended to glean the more interesting points from a variety of speakers and topics.

The growth of extremely large databases is not coming from transactional data.  Traditional IT shops with relational databases grow by a relatively small amount each year.  As such, they might be good candidates for in-memory computing or SSD storage.  Because of the high velocity of this type of data, it is considered “hot” data, and a common rule of thumb is that when 90% of your I/O is against “Hot Data”, you should place that data on high performance storage to get the best performance.
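To make the 90% rule of thumb concrete, here is a minimal Python sketch.  The function names, the per-dataset I/O counters and the tier labels are all hypothetical, made up for illustration; they are not from any tool discussed at the conference.

```python
def hot_io_fraction(io_counts, hot_keys):
    """Return the fraction of all I/O operations that touch hot data.

    io_counts: dict mapping a dataset name to its observed I/O count
               (hypothetical counters, e.g. gathered from monitoring).
    hot_keys:  set of dataset names you consider "hot".
    """
    total = sum(io_counts.values())
    if total == 0:
        return 0.0
    hot = sum(count for name, count in io_counts.items() if name in hot_keys)
    return hot / total


def recommend_tier(io_counts, hot_keys, threshold=0.90):
    """Apply the rule of thumb: if at least `threshold` of the I/O is hot,
    suggest high performance storage; otherwise suggest plain HDD."""
    if hot_io_fraction(io_counts, hot_keys) >= threshold:
        return "high-performance"
    return "hdd"


# Illustrative numbers only: 9,200 of 10,000 I/Os hit the hot table.
sample_io = {"orders": 9200, "archive_2009": 800}
print(recommend_tier(sample_io, {"orders"}))
```

The threshold is a parameter because 90% is a rule of thumb, not a hard limit; a real decision would also weigh cost and capacity.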

Much of the extremely large database load is “Cold Data” that is infrequently accessed but can still be valuable.  This data is commonly used for analytics and might include clickstream data, sensor data, and other non-transactional data.  For most analytics, bandwidth is more important than IOPS.  This sort of data is also commonly referred to as “Big Data”.

“Hot Data” is usually suitable for in-memory configurations.  The cost of memory is falling rapidly, and the performance benefits now outweigh the costs for high value transaction data.  Relational databases are the usual engines for transactional data.  For larger RDBMS installations, flash memory and/or SSD storage can be a great fit.  Some aspects of in-memory systems need to be examined, however.  Not all “in-memory” databases allow writes; you may have to refresh the database to get new data in, then perform analytics, and then refresh again.  HANA from SAP is an example of a technology taking the “in-memory” approach.

For “Cold Data”, traditional HDD storage should provide the right balance of performance and cost.  As we continue to accumulate data, a natural consequence of Big Data emerges: the mobility gap.  Because databases and data sets are getting so large, moving them is no longer practical.  This implies that data centers need to move to 10Gb connections for everything as soon as possible to mitigate this phenomenon.
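A quick back-of-the-envelope calculation shows the size of the mobility gap.  The Python sketch below is my own illustration, not something presented at the conference: the 1 PB data set size and the assumption that the link runs at full line rate with no protocol overhead are both hypothetical.

```python
def transfer_hours(dataset_bytes, link_gbits_per_s, efficiency=1.0):
    """Hours needed to move a data set over a network link.

    dataset_bytes:    size of the data set in bytes
    link_gbits_per_s: nominal link speed in gigabits per second
    efficiency:       fraction of the nominal speed actually achieved
                      (1.0 is an idealized assumption)
    """
    bits = dataset_bytes * 8
    seconds = bits / (link_gbits_per_s * 1e9 * efficiency)
    return seconds / 3600


PETABYTE = 10**15  # decimal petabyte, chosen purely as an example size

for speed in (1, 10):
    hours = transfer_hours(PETABYTE, speed)
    print(f"{speed:>2} Gb/s: {hours:,.0f} hours ({hours / 24:.1f} days)")
```

Under these idealized assumptions, 1 PB takes roughly 93 days at 1 Gb/s and roughly 9 days at 10 Gb/s.  The faster link helps a great deal, but it narrows the gap rather than closing it.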

To bridge the gap between RDBMS and NoSQL databases like MongoDB and Cassandra, the array database is a new approach.  An example is SciDB (www.scidb.org), which is typically used for scientific research.  It stores data in arrays rather than tables, and it is massively scalable.  It is an open source project.

It was an excellent conference, with some very knowledgeable people both presenting and in the audience.  Sometimes the discussions with the presenters got a little loud, but everyone was focused on getting to the bottom of whatever problem was being discussed.