The success of the recent Strata and Structure conferences (conclusions from last year’s conference and resulting trends can be found here) reinforced the accelerating corporate interest in big data and the specific need for applications and techniques that take advantage of Hadoop. Based on the presentations I attended or read about, it appears that more companies are getting comfortable collecting petabyte-scale data sets from sensor networks, social applications, IT log files, and online advertising applications. However, many of the companies collecting these data sets appear to be struggling with the analysis of the collected data, primarily because of its volume and its complexity. That complexity is reflected both in the number of unique attributes and in the fact that data relevant for a particular type of analysis may be spread among several data sets that need to be brought together. Larger corporations that have already developed data warehousing and analytic infrastructures must understand how to use Hadoop, in the context of their existing infrastructures, to deal with the data volumes being collected. Smaller private companies, on the other hand, are establishing such infrastructures for the first time and trying to determine what role Hadoop can ultimately play.
Companies analyze data either through interactive queries or by creating models. Such models can predict the outcome of future actions or automatically describe characteristics of data sets. Examples of emerging analytic big data applications include social media sentiment analysis and the analysis of various forms of data logs for online advertising, computer security, etc. Hadoop’s MapReduce batch processing programming model has proven effective for model creation. One could even argue that, for model creation, Hadoop would benefit further from interfacing with open source modeling tools such as those offered by Revolution Analytics. But for applications that require real-time decisions using such models, e.g., deciding which online ad to display on a particular web page, or determining the likelihood that a credit card transaction is fraudulent before authorizing it, Hadoop’s batch model is inappropriate. Similarly, Hadoop’s batch processing model is inappropriate for applications that are driven by interactive analytic queries, e.g., identifying the effectiveness of particular price discounts in increasing the unit sales of a product in a specific geographic region. Hadoop’s strengths instead make it the ideal “preprocessor” for big data: essentially, the ETL tool for big data analytic infrastructures.
Let me elaborate. Users of analytic applications that are driven by interactive queries can use Hadoop to select the right data to query (which often involves testing the information content of various data sources before deciding which ones to use), determine whether data from the selected sources needs to be merged into a new source, or transform the data in some other way (for example, by creating new derived attributes). In other words, Hadoop can do a lot of what commercial ETL tools do today. In a company with a pre-existing data warehousing infrastructure, analysts can import (load) the preprocessed data into a commercial analytic database, e.g., Teradata, Netezza, Greenplum, or an in-memory database, e.g., Qliktech, and start interacting with it through queries. In a company without such infrastructure, as is the case with most startups that make heavy use of big data analytics, the preprocessed data can subsequently be loaded into a NoSQL database for further analysis.
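To make the preprocessing role a little more concrete, here is a minimal sketch of the merge-and-derive step in the map, shuffle, and reduce style that Hadoop uses. It is plain Python running in memory, not the Hadoop API, and everything in it is an assumption made for illustration: the two sample sources, the field names (`user_id`, `ad_id`), the derived `ctr` attribute, and the function names.

```python
from collections import defaultdict

# Hypothetical sample records from two separate sources (illustrative only).
AD_IMPRESSIONS = [
    {"user_id": "u1", "ad_id": "a1"},
    {"user_id": "u1", "ad_id": "a2"},
    {"user_id": "u2", "ad_id": "a1"},
]
AD_CLICKS = [
    {"user_id": "u1", "ad_id": "a1"},
]


def map_phase(records, source):
    """Emit (key, (source, 1)) pairs, tagging each record with its source."""
    for record in records:
        yield record["user_id"], (source, 1)


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(user_id, values):
    """Merge both sources for one user and derive a click-through rate."""
    impressions = sum(n for src, n in values if src == "impression")
    clicks = sum(n for src, n in values if src == "click")
    ctr = clicks / impressions if impressions else 0.0
    return {
        "user_id": user_id,
        "impressions": impressions,
        "clicks": clicks,
        "ctr": ctr,  # the new derived attribute
    }


def preprocess(impressions, clicks):
    """Run map, shuffle, and reduce over both sources; return merged rows."""
    pairs = list(map_phase(impressions, "impression"))
    pairs += list(map_phase(clicks, "click"))
    grouped = shuffle(pairs)
    return [reduce_phase(key, grouped[key]) for key in sorted(grouped)]


if __name__ == "__main__":
    for row in preprocess(AD_IMPRESSIONS, AD_CLICKS):
        print(row)
```

The rows that come out of `preprocess` stand in for the preprocessed data that would then be loaded into the analytic database or NoSQL store for querying. In a real deployment the map and reduce functions would run as a distributed batch job over files in the cluster rather than over in-memory lists.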
Where does this leave big data analytic workbenches like those offered by companies such as Karmasphere and Datameer? To reach the big data analytic mainstream, such workbenches must be part of the Hadoop ecosystem, as they have the potential to make it easy for users to set up analytic tasks and perform analyses. However, their ultimate success will depend on three factors: whether Hadoop continues to be viewed as an important component of big data analytics; the approach that big data stack vendors, e.g., Cloudera, decide to take regarding the end user/analyst, i.e., whether they will support such a user directly or through partners; and how well these workbenches become integrated with each big data stack.