Big Data in Transition

Over the past month I’ve been attending a series of Big Data-related presentations in the Los Angeles area. I thought I’d provide a quick aerial view of observations from these presentations.

Big Data definition:

A term for data sets that are so large that traditional data processing applications are inadequate.

This definition traces back to around 2004-2006, the era in which Google published a paper with the details of its MapReduce concept. People recognized the value of the work, got together in the open source community, and put together Hadoop and HDFS.

The Big Data field is now about ten years old. When it started, the “traditional data processing applications” it proposed to supplant were SQL databases. Since the birth of Hadoop, the amount of data being generated has been doubling roughly every two years.

But as we all know, as things get older and circumstances change, questions start to arise. The questions surrounding the subject of Big Data are:

  • Isn’t the Hadoop/HDFS solution now the “traditional data processing application” itself?
  • The size of the data has grown by far more than an order of magnitude since the MapReduce, Hadoop, and HDFS concepts were architected. Would you do it the same way now?

In short, is it once again time for something new?

The job of an engineer is to build a cost-effective solution for the problem at hand. When a requirement increases by an order of magnitude, it’s rare that a new “greenfield” solution can’t do much better than an incremental enhancement to the original. When a requirement grows by two orders of magnitude, the original approach might be ridiculous.

Let’s say you were asked to design a vehicle that could move a single person at speeds of up to 5 miles/hour. A skateboard would be a fine, cost-effective solution.

What happens when people love the skateboard and ask for the top speed to be raised to 50 miles/hour? You decide to put a motor on the skateboard. Yes, it would be far from optimal, but you could build one, and it would work. You’d never do it this way given a clean start.

What if the requests don’t stop there? What if people ask to reach a goal of 500 miles/hour? Yes, you could strap a jet engine onto a skateboard; it’s theoretically possible. I wouldn’t ride it, would you? You could then successfully say you’ve “gone to plaid.”

Well, unless you want to say that what constitutes the label “BIG” was carved in stone 10 years ago and will never change again, you have to consider that the suite of Big Data tools is ripe for change.

Over the past 10 years:

  • Hardware changes have been dramatic
    • SSD vs HDD
    • Direct memory storage interfaces vs SATA, Fibre Channel, InfiniBand
    • CPU architecture focus moved from higher GHz to higher core counts
  • Data ingest rates and the sizes of retained data sets are up by orders of magnitude, and growth is not slowing

Thankfully I’m not the only one to notice an opportunity for improvement here.

Below are just some of the interesting developments underway in the Big Data arena:

Many organizations have noticed that incoming data streams still need to serve traditional transactional workloads alongside the Big Data analytics workload. In practice this often means keeping multiple copies of the data, with a lot of overhead associated with moving data between them. The “holy grail” is to come up with a means to efficiently support random queries and analytics simultaneously, from a single copy of the data.

At the beginning, the practitioners of Big Data analytics criticized the popular shared storage architectures of the era, such as SAN and NAS. They favored DAS underneath a distributed file system (HDFS). HBase was used to supplement performance for applications needing random access.

New developments on the SSD and storage interface fronts may be presenting opportunities to benefit from a new wave of re-architecture.

Amr Awadallah, CTO and co-founder of Cloudera, gave a convincing presentation on Kudu, a project that takes advantage of modern hardware to supplement HDFS and HBase by offering something that bridges the performance chasm between them: scan throughput approaching HDFS with random access latency approaching HBase. Based on measures of latency and throughput, Kudu achieves this goal.
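
To make the chasm-bridging concrete, below is a minimal sketch using the kudu-python client; the master address, table name, and columns are made up for illustration. A single Kudu table accepts row-level inserts the way HBase does, and serves predicate scans the way a columnar file on HDFS does.

```python
# Hedged sketch with the kudu-python client; the master address, table name,
# and columns are hypothetical. One table handles both row-level writes
# (HBase-style) and predicate scans (HDFS/columnar-style).
import kudu
from kudu.client import Partitioning

client = kudu.connect(host='kudu-master.example.com', port=7051)

# Define a simple schema with a primary key for random access.
builder = kudu.schema_builder()
builder.add_column('event_id').type(kudu.int64).nullable(False).primary_key()
builder.add_column('metric').type(kudu.string)
builder.add_column('value').type(kudu.double)
schema = builder.build()

# Hash-partition on the key so writes and lookups spread across tablets.
partitioning = Partitioning().add_hash_partitions(column_names=['event_id'],
                                                  num_buckets=3)
if 'events' not in client.list_tables():
    client.create_table('events', schema, partitioning)

table = client.table('events')

# Row-level insert: the "HBase side" of the chasm.
session = client.new_session()
session.apply(table.new_insert({'event_id': 1, 'metric': 'cpu', 'value': 0.42}))
session.flush()

# Filtered scan: the "HDFS side" of the chasm.
scanner = table.scanner()
scanner.add_predicate(table['metric'] == 'cpu')
print(scanner.open().read_all_tuples())
```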

John Leach, CTO and co-founder of Splice Machine, described a solution that addresses transactional and analytic workloads using a single copy of the data. It supports transactional workloads by layering a SQL query engine on top of an HBase/HDFS stack, with Spark on the side for analytics. The query engine is based on Derby.
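
As a sketch of the “one copy, two workloads” idea, here is roughly what it could look like from Python over JDBC (via the jaydebeapi package). The driver class, JDBC URL, credentials, jar path, and table are my assumptions, based only on Splice Machine being Derby-derived and speaking standard SQL over JDBC, so treat it as illustrative rather than copy-paste ready.

```python
# Illustrative only: driver class, URL, credentials, jar path, and table
# are assumptions; consult the vendor documentation for real values.
import jaydebeapi

conn = jaydebeapi.connect(
    'com.splicemachine.db.jdbc.ClientDriver',                # assumed driver class
    'jdbc:splice://splice-host.example.com:1527/splicedb',   # assumed URL format
    ['app_user', 'app_password'],                            # placeholder credentials
    'splice-jdbc-driver.jar')                                # path to the vendor jar
cur = conn.cursor()

# Transactional side: a point update served by the HBase-backed row store.
cur.execute("UPDATE orders SET status = 'SHIPPED' WHERE order_id = 42")
conn.commit()

# Analytical side: a scan-heavy aggregation over the very same table,
# which the engine can hand off to Spark for execution.
cur.execute("SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id")
for customer_id, revenue in cur.fetchall():
    print(customer_id, revenue)

cur.close()
conn.close()
```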

Arun Murthy, co-founder of Hortonworks, described achieving high-performance SQL using Apache Hive running on YARN (the resource manager and scheduler for Big Data processing). Other advances include moving HDFS from a single storage class into a tiered (memory + SSD + HDD) service, and YARN enhancements to support dynamic execution with data locality.
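
From the client side this still looks like plain SQL; the interesting work happens underneath, where the query is compiled into tasks that YARN schedules as containers close to the data. Below is a minimal sketch submitting a query to HiveServer2 with PyHive; the host, database, table, and partition value are placeholders.

```python
# Minimal PyHive sketch; host, database, table, and partition value are
# placeholders. On a Hive-on-YARN cluster this SQL is compiled into tasks
# that YARN schedules as containers near the data blocks.
from pyhive import hive

conn = hive.connect(host='hiveserver2.example.com', port=10000, database='default')
cur = conn.cursor()

cur.execute("""
    SELECT metric, AVG(metric_value)
    FROM events
    WHERE dt = '2016-06-01'
    GROUP BY metric
""")
for metric, avg_value in cur.fetchall():
    print(metric, avg_value)

cur.close()
conn.close()
```

On the tiered-storage side, hot table paths can be pinned to faster media with HDFS storage policies (for example, applying an ALL_SSD policy with the hdfs storagepolicies -setStoragePolicy command) while colder data stays on HDD.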

Google may be causing another shake-up, reminiscent of the original MapReduce paper, with the release of its Dataflow programming model as Apache Beam.
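
To give a sense of the model, here is a minimal word-count sketch using the Beam Python SDK; the file paths are placeholders. The pitch behind Beam is that the same pipeline definition can be handed to different runners (the local DirectRunner, Cloud Dataflow, Spark, Flink) without being rewritten.

```python
# Minimal Apache Beam (Python SDK) word count; file paths are placeholders.
# Swapping the runner in the pipeline options changes where it executes,
# not how the pipeline is written.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--runner=DirectRunner'])

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('input.txt')
     | 'Split' >> beam.FlatMap(lambda line: line.split())
     | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
     | 'Count' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
     | 'Write' >> beam.io.WriteToText('word_counts'))
```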

It is interesting to observe that the Big Data field that disrupted traditional data processing and storage now seems to be taking the wise step of disrupting itself with a refresh 10 years later.
