Over the past month I’ve been attending a series of Big Data-related presentations in the Los Angeles area. I thought I’d provide a quick aerial view of observations from these presentations.

Big Data definition:

A term for data sets that are so large that traditional data processing applications are inadequate.

This definition traces back to around 2004/2006. In this era Google published a paper with details on their MapReduce concept. People recognized the value of the work, got together in the open source community and put together Hadoop and HDFS.
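
For readers who haven’t read the paper, the core idea is simple: a map phase emits key/value pairs, the framework groups them by key, and a reduce phase aggregates each group. Here is a toy single-machine sketch in plain Python (an illustration of the concept, not Hadoop’s actual API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key (word), then Reduce: sum the counts per word.
    result = {}
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        result[word] = sum(count for _, count in group)
    return result

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(map_phase(docs))
# counts["the"] == 3, counts["fox"] == 2
```

In a real cluster the map and reduce phases run in parallel across many machines, with HDFS supplying fault tolerant storage underneath.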

The Big Data field is now about 10 years old. When it started, the “traditional data processing applications” it proposed to supplant were SQL databases. Since the birth of Hadoop, the amount of data being generated has been doubling every 2 years.
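
Doubling every 2 years compounds quickly; a tiny Python sketch makes the scale concrete:

```python
def growth_factor(years, doubling_period=2):
    # Multiplier on data volume after `years`, doubling every `doubling_period` years.
    return 2 ** (years / doubling_period)

# Ten years of doubling every 2 years:
print(growth_factor(10))  # 32.0
```

In other words, a design sized for the data volumes of Hadoop’s birth year is now facing roughly 32 times as much.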

But as we all know, as things get older and as things change, questions start to arise. The questions surrounding the subject of Big Data are:

  • Isn’t the Hadoop/HDFS solution now the “traditional data processing application” itself?
  • The size of data has grown by far more than an order of magnitude since the MapReduce, Hadoop, and HDFS concepts were architected. Would you do it the same way now?

In short, is it once again time for something new?

The job of an engineer is to build a cost effective solution for the problem at hand. When a requirement increases by an order of magnitude, it’s rare that a new “greenfield” solution can’t do much better than an incremental enhancement to the original. When a requirement grows by two orders of magnitude, the original approach might be ridiculous.

Let’s say you were asked to design a vehicle that could move a single person at speeds of up to 5 miles/hour. A skateboard would be a fine, cost effective, solution.

What happens when people love the skateboard and ask for a top speed improvement to 50 miles/hour? You decide to put a motor on the skateboard. Yes, it would be far from optimal, but you could build one, and it would work. You’d never do it this way given a clean start.

What if the requests don’t stop there? What if people ask to reach 500 miles/hour? Yes, you could strap a jet engine on a skateboard; it’s theoretically possible. I wouldn’t ride it, would you? You could then successfully say you’ve “gone to plaid.”

[youtube https://www.youtube.com/watch?v=ygE01sOhzz0&w=420&h=315]

Well, unless you want to say that what constitutes the label “BIG” got carved in stone 10 years ago, and will never change again, you have to consider that the suite of big data tools is ripe for change.

Over the past 10 years:

  • Hardware changes have been dramatic
    • SSD vs HDD
    • Direct memory storage interfaces vs SATA, Fibre Channel, InfiniBand
    • CPU architecture focus moved from higher GHz to higher core count
  • Data ingest rates and size of retained data sets are up orders of magnitude and growth is not slowing

Thankfully I’m not the only one to notice an opportunity for improvement here.

Below are just some of the interesting developments underway in the Big Data arena:

Many organizations have noticed that incoming data streams are still needed for traditional transactional workloads, alongside a Big Data analytics workload. In practice this can result in keeping multiple copies of data with a lot of overhead associated with data movement. The “holy grail” is to come up with a means to efficiently support random queries and analytics simultaneously, from a single copy of the data.

At the beginning, the practitioners of Big Data analytics criticized the popular shared storage architectures of the era, such as SAN and NAS. They favored DAS underneath a distributed file system (HDFS). HBase was used to supplement performance for applications needing random access.

New developments on the SSD and storage interface fronts may be presenting opportunities to benefit from a new wave of re-architecture.

Amr Awadallah, CTO and co-founder of Cloudera, gave a convincing presentation on Kudu, a project that takes advantage of modern hardware to supplement HDFS and HBase by offering something that bridges the performance chasm between them. Based on measures of latency and throughput, Kudu achieves this goal.

John Leach, CTO and co-founder of Splice Machine, described a solution that addresses transactional and analytic workloads using a single copy of data. It supports transactional workloads by placing a SQL query engine as a layer on top of an HBase/HDFS stack, with Spark on the side. The query engine is based on Apache Derby.

Arun Murthy, co-founder of Hortonworks, described achieving high performance SQL using Apache Hive running on YARN (a scheduler for big data processing). Other advances include moving HDFS from a single storage class to a tiered (memory+SSD+HDD) service, and YARN enhancements to support dynamic execution with data locality.

Google may be causing another shake up, reminiscent of the original MapReduce paper, with the release of Dataflow as Apache Beam.

It is interesting to observe that the Big Data field that disrupted traditional data processing and storage seems to be taking the wise step of disrupting itself, in a refresh 10 years later.

I recently returned from AT&T’s Developer Summit & Hackathon. This was a developer training and social event directed toward new applications enabled by device connectivity including home automation, wearable devices, wireless telemetry, and “Smart Cities”.

This area, sometimes called the Internet of Things, is fostering collection of data related to many aspects of environment, commerce, health, and transportation. I set out to write a brief trip report and I have to apologize in advance because this subject triggered some passions. This one “broke off the leash”.

Examples of data sources:

  • Crowd sourced temperature and humidity readings from privately operated devices with weather sensors (cars, consumer electronics).
  • Parking slot occupied/empty status.
  • Retail store refrigerator case status.
  • Bus/Train stop occupancy count.
  • Vehicle position and accelerometer readings
  • Cameras

Examples of applications:

  • Cars observing a rapid transition through the freezing point, under wet conditions could trigger icy road warnings specific to individual curves in the road.
  • Public transportation could actively adjust capacity and routing based on live demand.
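
To make the icy road example concrete, here is a sketch of the kind of rule a fleet service might apply. The function, thresholds, and the wipers-as-wetness proxy are all my own illustrative assumptions, not anything a presenter described:

```python
FREEZING_C = 0.0

def icy_road_alert(temp_readings_c, wipers_on):
    """Flag a road segment when a car reports a drop through the freezing
    point while its wipers indicate wet conditions (a crude wetness proxy).

    temp_readings_c: consecutive outside-temperature samples from one vehicle.
    wipers_on: whether the wipers were running during the samples.
    """
    if not wipers_on or len(temp_readings_c) < 2:
        return False
    # Rapid transition: started above freezing, ended below it.
    return temp_readings_c[0] > FREEZING_C and temp_readings_c[-1] < FREEZING_C

# A car descending into a shaded curve in light rain:
print(icy_road_alert([2.5, 1.0, -0.5], wipers_on=True))   # True
print(icy_road_alert([2.5, 1.5, 0.5], wipers_on=True))    # False: still above freezing
```

A real deployment would aggregate reports from many vehicles per road segment before raising a warning, to filter out sensor noise.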

AT&T cited an ABI Research forecast of 40.9 billion connected devices by 2020. Being a skeptic of claims made during marketing events, I looked for a second source and found this forecast from the Gartner analyst organization:

Internet of Things Units, Installed Base by Category (millions)

Category                   2014    2015    2016    2020
Business: Cross-Industry    632     815   1,092   4,408
Grand Total               3,807   4,903   6,392  20,797

Source: Gartner November 2015

Gartner’s forecast is smaller than ABI’s, but still, the predicted device count is a multiple of the world’s population. If forecasts are even remotely true, this will have huge impacts, not just on the IT industry, but on non-IT industries.

Think back on the impact of the internet and cell phone connectivity over the past two decades. Huge companies have been forged from nothing more than people and ideas. Others have been swept from market leader to extinction. It is obvious that connectivity (IOT) will fundamentally change the way “things” are built. Here is where I think this leads:


  1. IOT is going to generate way more data than we have today. Even a tiny device can generate a huge amount of data. And we will have lots of devices.
  2. Tiny devices are not going to hold data for long. The devices will forward data to storage in internet connected remote data centers. This will boost demand for data center storage.
  3. Data in the cloud will beget an environment conducive to data analysis in the cloud. This will boost demand for data center compute capacity.
  4. User interaction requirements will drive demand for applications that provide user interface for configuration and display. These applications will run on connected devices, or on general purpose devices like smartphones. It’s likely that the same cloud hosting the data and analysis, will be suitable for hosting the mobile device applications that use the data.
  5. Some control loops, such as those demanding low latency, and extremely high security and availability, should run at the local site. However, there are plenty of applications suited to remote cloud execution. Examples: furnace/AC filter life monitoring; vehicle brake pad wear prediction based on accelerometer measurement of driving patterns. There will be both local and cloud hosted apps.
  6. There will be a Metcalfe’s Law effect with cloud hosted IOT data. cantbewong’s adjunct to Metcalfe’s Law: cloud value is proportional to the square of the number of IOT data connections. When you live in the most popular cloud, you will have lower latency access to the universe of other data you interact with, without adding an additional point of failure. Hosting for IOT data has aspects of a “winner takes all” poker game.
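
The “winner takes all” point follows directly from the square term in item 6. A toy calculation shows why splitting the same device population across two clouds destroys value:

```python
def cloud_value(connections):
    # cantbewong's adjunct: value proportional to the square of IOT data connections.
    return connections ** 2

devices = 1_000_000
one_cloud = cloud_value(devices)
two_clouds = 2 * cloud_value(devices // 2)
print(one_cloud / two_clouds)  # 2.0 -- one shared cloud is worth twice two halves
```

Halving each cloud’s connections quarters its value, so two half-size clouds together are worth only half of one shared cloud.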

Ultimately, IOT will drive demands for carrier bandwidth, and cloud hosting business. Cloud hosting will include data storage, data analysis, and application hosting and delivery.

Existing companies who ignore an IOT strategy do so at their peril. Basis of argument: look at what internet version 1 did to the retail, music, video, and IT industries.

I came away from this conference with the impression that AT&T is in full “land grab” mode for this opportunity. What actually surprises me is how few other technology companies seem to be making strategic commitments in this sector.

Technology choices are likely to be “sticky”

The technology choices include connectivity (WiFi vs cellular) and cloud hosts (APIs, analytic services, application delivery services). Embedded devices are not updated frequently, or in an easy, risk-free fashion.

WiFi and cellular data networks are currently the leading candidates for device connectivity. Resiliency, power consumption, and widespread geographic availability tend to make cellular attractive for battery operated and roaming mobile devices.


During the event keynote, AT&T brought to stage these partners:

  • Red Bull demonstrated use of live vehicle telemetry feeds for use by the Formula 1 team they sponsor. Red Bull also demonstrated connected refrigerators that they provide to retailers of their energy drinks. There will be 200,000 of these deployed in the US, reporting GPS location, temperature, and door open/close cycles (which correlates predictably with sale of a beverage). Through these refrigerators, they expect to be able to monitor and optimize sales and inventory.
  • Ford announced an exclusive multi-year agreement that will result in connectivity of all new Ford vehicles sold in the US and Canada by 2020.

On the expo floor, I saw an intriguing demo of an augmented reality enhancement to a mountain bike. Force, strain, and angle sensors were installed on the bike. Sensor readings could be superimposed on live video of the bike while in operation. This is useful as a design tool for bike engineers, showing actual dynamic loads under real use conditions.

Augmented reality bike

augmented reality bike prototype

I’m not sure how big the market is, but this is an example of IOT inspiring human minds to consider things that couldn’t be done before. Eventually, some of these ideas are going to change the world. link: Santa Cruz bike video


The first 2 days of the conference were devoted to a hands-on Hackathon competition. I expected something like the Hackathon events that have been held at other conferences I have attended, such as LinuxCon. These generally involve participants being encouraged to develop code for an open source project, with a Raspberry Pi or Arduino awarded to the winner.

The AT&T event’s Hackathon was in a different league. Over $100,000 in cash was awarded, along with hundreds of thousands of dollars more in hardware. Many vendors of embedded system hardware and home automation were present, not just AT&T. Sponsors included Samsung SmartThings and Amazon Alexa.

I quickly realized I would be at a disadvantage, arriving as a solo contestant – many arrived in organized teams, with advance preparation. The Hackathon event attendance was bigger than some whole tech conferences. I’m guessing over 700 people. When I registered, I wasn’t paying attention to the Hackathon details, and enrolled for the educational value. I did not come away disappointed and I had a great time.

I decided to build an attic fan controller that operates based on interior temperature/humidity and exterior conditions, including weather forecasts obtained via REST API over wireless connection. My device logged all operational data to the AT&T cloud using their M2X API. This data was viewable from a smartphone.
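
The decision logic at the heart of the controller was simple. Here is an illustrative Python sketch of the rule; the function name and thresholds are my own placeholders, and in the real device the inputs came from local sensors and a weather REST call, with results logged to M2X:

```python
def fan_should_run(attic_temp_c, outside_temp_c, outside_humidity_pct,
                   rain_forecast):
    """Decide whether to run the attic fan.

    Run only when pulling outside air in actually helps: no rain expected
    to be drawn in through the vents, outside air not too moist, and the
    attic meaningfully hotter than outside.
    """
    if rain_forecast:
        return False
    if outside_humidity_pct > 85:        # avoid pumping moist air into the attic
        return False
    return attic_temp_c - outside_temp_c > 3.0   # need a useful differential

# Hot, dry afternoon with no rain in the forecast:
print(fan_should_run(45.0, 30.0, 40.0, rain_forecast=False))  # True
```

On the device, this check ran on a timer loop, and each decision plus the sensor readings behind it were posted to the cloud for the smartphone view.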

my hackathon creation

my Hackathon entry – attic fan controller using local weather forecast via API, with performance logging to cloud

In assembling my prototype I felt like a kid at the free candy store. Vendors contributed hardware, sensors, and live support for the asking. The Mbed (ARM) booth was particularly helpful. In the end, I got to keep all the hardware. I think the DIRECTV developer boxes, and Digital Life security systems were the only exceptions to the take-it-home offer.

I ended up learning so much in the Hackathon experience, that the API sessions during the main conference seemed repetitive. My summary review for the Hackathon is: If you are a hands-on hardware + software developer, and they hold another one, you should go.

Disruptive hardware

I got to try out the Oculus-powered Gear VR at the conference. 3D headset technology gets a lot of press, but I think that some of the home automation hardware might be even more disruptive in terms of enabling applications, and displacing incumbent vendors.

The early generations of home automation hardware optimized for cost. As home automation moves into applications such as door lock control, fire detection, and intrusion detection, devices are getting better.

UL compliance in this arena requires battery backup, security (including anti-jamming mitigation), and supervision. This will start out “good enough” for the consumer market, with very low prices compared to what is currently used in industrial markets such as SCADA. This has the earmarks of a classic Clayton Christensen style market disruption.

I used to work in the industrial control business, and I can tell you that industry has been clinging to a combination of ancient legacy data communication technologies, along with dubiously implemented security. You often hear users make the claim that “my network is safe because it is isolated from the Internet”.

But if your user interface stations are based on Windows and other software, keeping this software security-patched across an air gap is problematic. We live in a real world where your own employees are subject to foibles, and even terrorist recruitment. An internet isolation breach is just a routing accident or a tethered cell phone away.

How well is this working now? Not very well, according to the US government CERT team. It might be time to end the failing attempt to air gap SCADA systems, and put in place technology that keeps them safe even while a path to the Internet exists. I think it’s likely that the R&D investments and economies of scale of IOT hardware and firmware components will displace moribund industrial sensor and control technologies.

Disruptive software

Big Data is sometimes described as a data set so large that traditional data processing tools are inadequate.

If you think your current data is big, wait till you see what IOT is going to unleash – I am going to call it Big Data².

If Big Data is a fire hose, Big Data² is a dam overflowing the spillways – this is going to drive the creation of new technologies.

Corps tests Hartwell Dam spillway gates

Source: Photo by Doug Young, Lake Hartwell Association

Disruptions to Society

Growing up with internet entertainment and social networks has shaped the defining characteristics of the millennial generation. The internet usurped the role of video, music, and news delivery. The internet and cell phones usurped the role of written letters and landline calls. Internet hosting of written documents fostered demand for search, resulting in the formation of Google.

When you look at the internet of things, it is poised to collect and host data related to the environment, health, transportation, and public and private infrastructure. But it will go beyond simple data collection. It will also host analysis and reactions to this data. In short, it is poised to become the nervous system for society.

As an IOT grid drives control loops for commerce, transportation, and public safety, it will become a target for mischief, criminals, and terrorists. It will be important that the system is highly available and protected from tampering.

There is great opportunity to improve the quality of life for citizenry, but it should not be squandered by careless house-of-cards solutions. High availability, security, resistance to denial of service attacks, and protection against data forgery and tampering are all important.

Miscreants are not the only potential threat to open flow of information. Look at this “Smart City” website for the city of Los Angeles. You are greeted with this introduction:

A Message from Mayor Eric Garcetti

We are sharing city data with the public to increase transparency, accountability and customer service and to empower companies, individuals and non-profit organizations with the ability to harness a vast array of useful information to improve life in our city. I hope that this data will help drive innovation and problem solving within the public and private sectors and that Angelenos will use it to more deeply understand and engage with their city. I encourage you to explore data.lacity.org to conduct research, develop apps or simply to poke around.


This has the potential for real good, but it also has the potential to become politically charged. Much of the data related to government services could be interpreted as a “report card” for elected officials. There will be temptations to manipulate or restrict data.

There are already examples of government attacks on independently gathered data that predate IOT. The Fukushima nuclear disaster and the Flint, Michigan water contamination incident show that independent data sources can have great value.

When a Flint doctor submitted data showing high blood lead levels in toddlers, “the state publicly denounced her work, saying she was causing near hysteria”.

When Sean Bonner, entrepreneur and co-founder of the Los Angeles hackerspace Crash Space, found that no meaningful radiation measurements were being made available by the Japanese government, he started a revolutionary movement to commission an army of volunteers to construct instrumentation and crowd source data collection. The SAFECAST organization was born.

The internet may have deprecated newspapers and even the printing press, but open flow of data needs to be viewed as the modern form of Freedom of the Press.

There is much professionally organized PR about the wondrous benefits of IOT, enabling self-driving cars to lower traffic congestion and provide transportation for the disabled. Lurking skepticism leads me to suspect that taxation and revenue generation will be what really drives the IOT discussions in smoke filled rooms of government planners. You can already observe that some of the same cities that criticize Uber for surge pricing are eager to deploy parking meters that adjust for time-of-day supply/demand imbalances.

Just as an example of an unforeseen outcome, I can envision high parking prices inducing owners of self-driving cars to command their vehicles to circle the neighborhood, empty, rather than park. Outcome: self-driving cars lead to more, not less, traffic congestion. Likely second move on the chessboard: track vehicle usage and charge by the mile, with surcharges in congested areas (“surge” pricing?). Whenever “Smart Cities” involves revenue generation, cat and mouse arms races will be inspired. Some of these battles will become political “hot potatoes”.

The open data manifesto

IOT is going to disrupt the world. It’s not a question of if, but how much, and how soon.

Government agencies, and entrenched interests will occasionally feel threatened by IOT.

Open source software has benefited the world at large, but at times pressured commercial interests. Open data will do the same.

There is a great opportunity at stake. It will be important to establish a broad-based perception that data should be unrestricted.

From Wikipedia:
Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.[1] The goals of the open data movement are similar to those of other “open” movements such as open source, open hardware, open content, and open access…

I submit that there is a need for an Open Data Bill of Rights:

  1. There should be no restriction on publication of legally obtained data.
  2. Data collected by government entities, or at government expense, should be published with open and free access, in raw form, without subjective post processing or manipulation.
  3. Those who corrupt data, or interfere with its availability, including government, will be subject to criminal prosecution.
  4. The people have a right to be secure in their homes, personal effects and information stores. As such, government shall be prohibited from compelling publication, or seizing, private data without a court order based on probable cause. You cannot be compelled to publish a data source coming from within your home.

Call for your contribution

I am hoping that watchdogs, like the EFF, will be there to protect the public interest. But the technology here is complex. It is really up to technologists, such as those who I hope read this blog, to educate and inform the public, and those in public service. Everyone in the technology field stands to benefit personally, and perhaps even career-wise.

Community contribution on a technical, educational, and political basis will be needed to allow the IOT opportunity to deliver its full potential. If it does this, it can improve transportation, safety, healthcare, and the environment. It can also boost productivity and the economy. In short, it can raise the quality of life.

The IOT revolution is coming. It will not be televised. But it will be blogged and tweeted. Spread the word, amigos…