Back in December 2009, fresh off the heels of publishing what I hoped would be the definitive guide to strategy-driven analytics, Driven to Perform, and being accidentally admitted to the Enterprise Irregulars, I wrote The Top 10 Trends for 2010 in Analytics, Business Intelligence, and Performance Management, where I summarized where I thought the industry was at that time and where it needed to head in the 2010s and beyond. And like all such pieces, I managed to get a few trends right on, a few trends partially right, and a few that made me look like I had been hanging out with Rob Ford one too many nights in Toronto. As people far wiser than me have said, "the best way to predict the future is to create it," so over the last four years I've worked to make my own humble contributions to creating the future of the analytics industry. But A LOT can change in four years. For example, there are now FIVE people in the world who have actually read my book (thanks Mom and Dad!) and aren't solely using it for its sleep-inducing characteristics on long flights. Therefore I thought it was high time I took a look at the state of our industry and refreshed my perspective. Here are the five predictions I made:
- The bloom will come off the rose, and big data will need to prove its business value.
- Tableau and QlikView licenses will outpace Excel licenses.
- By the end of 2014, half of all BI vendors will begin seeking an exit.
- Internet of Things (IoT) starts to gain traction.
- We can finally realize the full potential of data scientists.
You can read the details of those predictions in the Forbes article, so I wanted to use this blog post to cover five key technology trends that have emerged in the last four years that I see facilitating these predictions:
- The arrival of the post-MapReduce-only Hadoop infrastructure fueling composite data management systems. While the existing MapReduce programming model has shown huge leverage and subsequent business value, it has also shown itself to be quite limited in the types of applications it can support. With the arrival of YARN as a key component of the Hadoop 2.0 infrastructure, multiple programming models can be hosted as "applications" on top of a generic distributed resource and storage infrastructure, with MapReduce-style processing as but one exemplar. With pluggable processing models for graphs like Apache Giraph, key-value pairs, and streams like Storm, all sitting on top of YARN, Hadoop really becomes a composite data management system that allows for purpose-specific processing on a shared data backbone. This is a big architectural change from existing relational database models that have also co-opted graphs and streams, as they assume an underlying relational substrate that all operators must be converted into. In the new "Enterprise Data Hub" world, purpose-specific systems can operate on each individual workload with no impedance mismatches forced by the assumption of a single underlying model. Even the venerable MapReduce will make a quantum leap forward with the ascent of Spark, already being described as "the leading candidate for successor to MapReduce," which provides significantly more flexibility in job construction and huge performance improvements via the in-memory Resilient Distributed Dataset model. With all of the different processing models available in the Hadoop 2.0 world, integrated metadata, lineage, provenance, etc. will become absolutely crucial to extracting value from an otherwise chaotic cacophony of data.
- The renewed focus on data preparation, led by the business, as a mission-critical topic for Big Data. We have witnessed tremendous innovation in the Big Data platform space with the rise of the so-called NoSQL movement with pioneering companies like Cloudera, and have seen similar levels of innovation in data discovery and visualization techniques as manifested in products like Tableau and QlikView. But all Big Data practitioners know that 80% of our time is spent preparing data, and until last year, no one was focused on bringing a next-generation approach to this problem to market. But with the launch of pioneering companies like Paxata (insert standard disclaimer here) with meaningful market traction, as well as the funding announcements of other companies attacking the data preparation problem head on for the first time, change is in the air, with the twist of focusing on business-led data preparation as opposed to the traditional IT-centric, procedural extract-transform-load (ETL) model. Call this new form of data preparation "agile information management": real-time collaboration, an IT-approved approach to governance over how data is used, and cutting-edge ways to enrich data sets beyond what users dream of today will all be expected.
- The ascendancy of semantic machine learning techniques that automate the sense-making of Big Data. We are well past the breaking point of humans' ability to intelligently process the increasing volumes, varieties, and velocities of data being presented in our environment, and it's getting exponentially worse. The only solution I see is to start leveraging intelligent technologies that can make sense of data in concert with our own cognitive abilities. With the rise of commercial solutions like IBM Watson, Wolfram Alpha, and Google Knowledge Graph, and the explosion of research in semantic techniques, machine learning, etc., I think we will see a whole new breed of commercial solutions that embed these techniques in the interest of facilitating decision-making for humans. After all, if Google is now at the point where it can build neural networks that autonomously do object recognition of cats, it doesn't seem that much of a stretch to think that we can use these state-of-the-art techniques from the natural language processing and machine learning worlds on commodity hardware to automate potentially simpler tasks, like automatically detecting entities, their semantic types, and the relationships among them.
- The evolution from one-off programs to scheduled batch procedures to ad hoc patterns of intent that facilitate business people's productive, interactive analytical explorations in volatile environments. In our continuing quest to democratize access to technology and make decisions as quickly as possible, the level of abstraction must continue to rise to meet the intents of the end user versus the needs of the machine. Our systems began, by necessity, as those that required hard-coded one-off programs created by computer professionals. We made a big leap forward about 20 years ago when we were able to abstract the program away and provide flow-centric models that allowed less technical professionals to draw BPMN diagrams or express ETL jobs and then submit them to a largely batch infrastructure. But to meet the requirements of the environment we find ourselves in today, where little is planned in advance and every individual needs to be empowered to make decisions, we must raise the level of abstraction even further, to the patterns of intent of those individuals, and give them immediate feedback on that intent to make them productive. This is already happening across many layers of the analytics stack. As but one example: just as few of us write nested-loop joins in code today, while many more of us are able to pull a join operator onto a canvas in a data flow diagram, we will see systems that expose the ability to combine data while completely abstracting away the underlying join itself, allowing the end user to fluidly compose information without regard to how it is being done and with the immediate response time they expect. This will allow truly emergent knowledge flows that can be evolved very quickly as the situation morphs.
- The democratization and ultimate dissolution of the notion of data service providers. When we think of data service providers, the organizations that immediately come to mind include Nielsen, IRI, Experian, Thomson Reuters, et al., whose entire business models were predicated on being able to aggregate data in a high-friction environment where such aggregation was exceedingly difficult. We've also seen the arrival of newer variants of the data service provider model for the newer "Big Data" sources of information, like Factual for location data and DataSift for social information. There is also strong interest in making government and scientific data available as part of the Open Data movement, as manifested through sites like data.gov. But in an environment where literally every item in the world, from your mobile phone to your FitBit, let alone every organization, now generates data and makes it available via APIs in the so-called API economy, the entire notion of the data service provider will be disrupted. Business value will no longer be derived just from having access to the information, which will be plentiful, but from aggregating, enriching, and ultimately synthesizing into a coherent tapestry the data available from companies that were never thought of as data service providers, like UPS for logistics information, Amazon for retail POS information, and PayPal for purchasing-pattern information, in addition to those companies providing such data services now. These companies will build their own highly valuable data product businesses to monetize information that until recently was merely a byproduct of their business processes.
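To make the first trend concrete: the MapReduce programming model that YARN now hosts as just one "application" among many boils down to three phases. Here is a minimal, single-machine sketch of those phases in plain Python, using the classic word-count example (the documents and function names are my own illustrative choices, not from any Hadoop API):

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Each mapper emits (key, value) pairs from its input split
    return [pair for record in records for pair in mapper(record)]

def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Each reducer collapses one key's values into a final result
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count expressed in the MapReduce model
docs = ["big data", "big insight"]
counts = reduce_phase(
    shuffle_phase(map_phase(docs, lambda line: [(w, 1) for w in line.split()])),
    lambda key, values: sum(values),
)
print(counts)  # {'big': 2, 'data': 1, 'insight': 1}
```

The rigidity is visible even in a toy: every job must contort itself into this map-shuffle-reduce shape, and each pass writes its results out before the next begins, which is exactly why iterative workloads benefit so much from Spark's in-memory RDD model.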
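On the second trend, the "80% of our time" claim is easy to feel in even a trivial case. A toy sketch, with made-up records, of the normalization-and-deduplication work that dominates data preparation; the point of the business-led tools is that this logic gets expressed interactively rather than hand-coded:

```python
# Three raw records that look distinct but describe only two real entities
raw = [
    {"company": " Acme Corp ", "state": "ca"},
    {"company": "ACME CORP",   "state": "CA"},
    {"company": "Globex",      "state": "ny"},
]

def clean(record):
    return {
        "company": record["company"].strip().title(),  # normalize casing/whitespace
        "state": record["state"].upper(),              # standardize state codes
    }

# Deduplicate only after normalization; before cleanup, all three rows differ
seen, prepared = set(), []
for rec in map(clean, raw):
    key = (rec["company"], rec["state"])
    if key not in seen:
        seen.add(key)
        prepared.append(rec)

print(prepared)  # two rows: Acme Corp/CA and Globex/NY
```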
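For the third trend, the "simpler task" of detecting entities and their semantic types can be sketched in miniature. Real systems like Watson or the Knowledge Graph learn these mappings at scale with statistical techniques; the hand-built dictionary below is purely a stand-in to show the shape of the output (surface form, semantic type):

```python
import re

# A deliberately tiny, hand-built gazetteer standing in for what real
# semantic/ML systems learn from data; entries are illustrative only
GAZETTEER = {
    "IBM Watson": "Product",
    "Google": "Organization",
    "Toronto": "Location",
}

def detect_entities(text):
    # Match known surface forms and tag each with its semantic type
    found = []
    for surface, etype in GAZETTEER.items():
        if re.search(re.escape(surface), text):
            found.append((surface, etype))
    return sorted(found)

sentence = "Google demoed IBM Watson-style tech in Toronto."
print(detect_entities(sentence))
# [('Google', 'Organization'), ('IBM Watson', 'Product'), ('Toronto', 'Location')]
```

The interesting commercial solutions are the ones that replace that static dictionary with models that infer types and relationships the system has never seen before.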
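And for the fourth trend, here is what "abstracting away the underlying join" looks like at its smallest. The user-facing verb expresses the intent ("combine these on this key") while the nested-loop join underneath stays an implementation detail the system is free to swap out; the `combine` helper and sample data are hypothetical, invented for illustration:

```python
def combine(left, right, on):
    # The user states intent; the nested-loop join below is an
    # implementation detail they never see and never wrote
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[on] == r[on]
    ]

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
orders = [{"id": 1, "total": 250}]

print(combine(customers, orders, on="id"))
# [{'id': 1, 'name': 'Ada', 'total': 250}]
```

The immediate-feedback part of the trend is the other half of the bargain: the same intent should re-execute fast enough that the user can fluidly iterate on it.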
These are my thoughts for what I see driving much of the key innovation in 2014. What are yours? I welcome your input!
Here are some other good reads that I found worth perusing while thinking about this topic:
- Big Data In 2014: Top Technologies, Trends
- Predicting Big Data’s 2014
- Big Data Myths Give Way To Reality In 2014
P.S. I’m still trying to figure what “Big Data” means but they haven’t told us yet in my MOOC on “Data Science”. 🙂