10 Amazing Things to Do With a Hadoop-Based Data Lake

Cross posted from The Pivotal POV Blog.

The following is a summary of a talk I gave at Strata NY that is proving popular with people who are still trying to understand use cases for Apache Hadoop® and big data. In this talk, I introduce the concept of a Big Data Lake, which combines Apache Hadoop® as storage with powerful open source and Pivotal technologies. Here are 10 amazing things companies can do with such a big data lake, ordered by increasing impact on the business.


This is an architecture for a Business Data Lake, centered around Hadoop-based storage. It includes tools and components for ingesting data from different kinds of data sources, for processing data for analytics and insights, and for supporting applications that use data, implement insights, and contribute data back to the data lake as sources of new data. In this presentation, we will look at the various components of a business data lake architecture and show how, when put together, these technologies help maximize the value of your company’s data.

The Benefits of Hadoop-Based Storage

In the context of the overall architecture, the orange box in the diagram below marks the Hadoop-based storage, which allows companies to store massive data sets and mix disparate data sources. In this architecture, storage is part of Pivotal HD and its underlying Hadoop Distributed File System (HDFS).


Let’s first look at why Apache Hadoop® and HDFS-based storage make a lot of sense.

1. Store Massive Data Sets


The Hadoop Distributed File System, or HDFS, which underlies Apache Hadoop®, is a distributed file system that supports arbitrarily large clusters and scales out on commodity hardware. This means your data storage can theoretically be as large as needed at a reasonable cost. You simply add more nodes to the cluster as you need more space. Apache Hadoop® clusters also place computing resources close to storage, facilitating faster processing of the large stored data sets.
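To make the scale-out idea concrete, here is a rough back-of-the-envelope sketch in Python. The node counts and disk sizes are hypothetical. The replication factor of 3 is the HDFS default, and a real cluster also sets aside space for temporary and non-HDFS data, so treat the result as an upper bound rather than a sizing guide.

```python
def usable_capacity_tb(nodes: int, raw_tb_per_node: float, replication: int = 3) -> float:
    """Estimate usable HDFS capacity in TB.

    HDFS stores each block `replication` times, so usable space is roughly
    the raw disk space divided by the replication factor.
    """
    if nodes < 0 or raw_tb_per_node < 0 or replication < 1:
        raise ValueError("nodes and disk size must be non-negative, replication >= 1")
    return nodes * raw_tb_per_node / replication


# Hypothetical cluster: 48 TB of raw disk per node.
for node_count in (10, 20, 40):
    print(node_count, "nodes ->", usable_capacity_tb(node_count, 48.0), "TB usable")
```

Doubling the number of nodes doubles the estimate, which is the point of scaling out on commodity hardware.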

2. Mix Disparate Data Sources


HDFS is also schema-less, which means it can support files of any type and format.  This is great for storing unstructured or semi-structured data, as well as non-relational data formats such as binary streams from sensors, image data, or machine logging. It’s also just fine for storing structured, relational tabular data. There was a recent example where one of our data science teams mixed structured and unstructured data to analyze the causes of student success.

Storing these different types of data sets together simply is not possible in traditional databases. The result is siloed data sources, which hinders the integration of data sets.

Getting Different Types and Velocities of Data into Apache Hadoop®


Looking at the orange outline of the Ingestion Tier, there are multiple capabilities that allow data to be ingested in bulk or at high velocity. These include the Sqoop and Flume tools from the Apache Hadoop® ecosystem, as well as GemFire XD and Spring XD.

In this environment, data storage can take any kind of data from any kind of source at various speeds and in various volumes, and getting it all in can be a challenge. This is why a wide selection of ingest tools is used to implement a data lake.

3. Ingest Bulk Data


Bulk data ingest comes in two forms: regular batches and micro-batches. There are three scalable, open source tools that can be used, depending on the scenario.

Sqoop, for example, is great for handling large data batch loading and is designed to pull data from legacy databases.

In other cases, companies don’t want to just load the data; they also want to do something with the data as it is loaded. For example, a loading operation may need additional processing, formats may need to be transformed, metadata may need to be created as the data is loaded, or analytics, such as counts and ranges, may need to be captured as the data is ingested. In these cases, Spring XD provides a significant amount of scale and flexibility.
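As a minimal sketch of what "doing something with the data as it is loaded" can look like, the Python below transforms a field and captures a count and a range while records stream through. This is not Spring XD's API; the record layout, the `amount` field, and the names `IngestStats` and `ingest` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class IngestStats:
    """Running count and range, updated as each record passes through."""
    count: int = 0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def update(self, value: float) -> None:
        self.count += 1
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


def ingest(records: Iterable[dict], stats: IngestStats) -> Iterator[dict]:
    """Yield each record with `amount` converted to a number, updating stats on the way."""
    for record in records:
        amount = float(record["amount"])  # format transformation: text -> number
        stats.update(amount)              # analytics captured during the load
        yield {**record, "amount": amount}


# Hypothetical source records, e.g. rows read from a CSV export.
source = [
    {"id": 1, "amount": "19.99"},
    {"id": 2, "amount": "5.00"},
    {"id": 3, "amount": "42.50"},
]
stats = IngestStats()
loaded = list(ingest(source, stats))
print(stats)
```

Because the statistics are updated inside the same pass that loads the data, no second scan of the stored data is needed to answer simple questions about it.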

Micro-batches are smaller, recurring loads, such as data change deltas or event-triggered updates, and are handled well by Flume.

As with different formats, the different speeds and ingest use cases led to siloed data sets in traditional data management. Being able to address all ingest use cases helps facilitate more unified processing of all of a company’s data.

4. Ingest High Velocity Data


Streaming high-velocity data into Apache Hadoop® is a different challenge altogether. When there is a large volume to consider at speed, you need tools that can capture and queue data at any scale or volume until the Apache Hadoop® cluster is able to store it.

A data lake based on Pivotal Big Data Suite has two tools built for these use cases, and they work together:

  1. Spring XD can scale to handle data streaming in real time and provides the same processing and analysis capabilities mentioned before.
  2. Pivotal GemFire XD can work with Spring XD to provide advanced database operations, such as searching for duplicates in a window of time, and allows you to ensure consistency of data in writes. Since it is a SQL-based database, it is also great for helping convert or add structure to ingested data.
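The idea of searching for duplicates in a window of time can be sketched in a few lines of Python. This is an illustration of the technique only, not how GemFire XD implements it. The class name `WindowedDeduplicator` is hypothetical, and the sketch assumes events arrive with non-decreasing timestamps.

```python
from collections import OrderedDict
from typing import Hashable


class WindowedDeduplicator:
    """Flags a key as a duplicate if it was seen within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # Maps key -> last-seen timestamp, ordered from oldest to newest sighting.
        self._last_seen: "OrderedDict[Hashable, float]" = OrderedDict()

    def is_duplicate(self, key: Hashable, timestamp: float) -> bool:
        # Evict keys whose last sighting has fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._last_seen:
            oldest_ts = next(iter(self._last_seen.values()))
            if oldest_ts >= cutoff:
                break
            self._last_seen.popitem(last=False)

        duplicate = key in self._last_seen
        # Record this sighting and mark the key as the most recent entry.
        self._last_seen[key] = timestamp
        self._last_seen.move_to_end(key)
        return duplicate


# Hypothetical stream of (event id, arrival time in seconds).
dedup = WindowedDeduplicator(window_seconds=10)
events = [("a", 0), ("b", 1), ("a", 5), ("a", 20)]
flags = [dedup.is_duplicate(key, ts) for key, ts in events]
print(flags)
```

The second "a" arrives 5 seconds after the first and is flagged. The third arrives 15 seconds after the second, outside the 10-second window, and is treated as new.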

Beyond siloing data, traditional approaches to ingesting high volume, high velocity streams meant that a lot of data simply could not be captured and had to be thrown away. Now we can capture large, detailed data sets, which allow for more advanced processing, as you will see soon.

Processing, Analyzing, and Taking Action on Data in Apache Hadoop®


Once you have the ability to store and load data into your data lake, the next step is deriving business value by processing the data, gaining insights, and taking action on it.

There are a multitude of capabilities that apply in this context. For example, GemFire can be used in complex event processing or run as a high scale database for e-commerce. HAWQ and Pivotal Greenplum provide massively parallel analytics capabilities via SQL, alongside the Hadoop® components of MapReduce, Hive, and Pig. In addition, data can be moved into an application via middleware like RabbitMQ or a data structure server like Redis, or consumed by an application such as one running on Pivotal CF or within Spring. Together, these offer many possibilities for taking action on analytical insights.

5. Apply Structure to Unstructured/Semi-Structured Data


It’s great that one can get any kind of data into an HDFS data store. To be able to conduct advanced analytics on it, you often need to make it accessible to structure-based analysis tools.

This kind of processing may involve direct transformation of file types, transforming words into counts or categories, or simply analyzing and creating metadata about a file. For example, retail website data can be parsed and turned into analytical information and applications.

Structure can be applied on ingest with some of the tools described, or it can be processed after being stored in Apache Hadoop® as noted in the retail example above.

Other examples include transforming binary image formats into RDBMS tables to enable large scale image processing, part of speech tagging, or even simple ETL processes on web or system logs that are later turned into fact tables.
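As a small illustration of applying structure to semi-structured data, the Python sketch below parses a web server log line into a row that could feed a fact table. It assumes the logs follow the Common Log Format. The sample line, the output field names, and the `parse_line` helper are made up for the example.

```python
import re
from datetime import datetime
from typing import Optional

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one raw log line into a structured row, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # in practice, unparseable lines might go to a reject file
    fields = match.groupdict()
    return {
        "host": fields["host"],
        "timestamp": datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": fields["method"],
        "path": fields["path"],
        "status": int(fields["status"]),
        "bytes": 0 if fields["size"] == "-" else int(fields["size"]),
    }


# Hypothetical log line from a retail website.
sample = '203.0.113.7 - - [10/Oct/2014:13:55:36 -0700] "GET /cart/checkout HTTP/1.1" 200 2326'
row = parse_line(sample)
print(row)
```

Once every line is a row with typed columns, the data can be loaded into tables and queried with the structure-based tools discussed in this section.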

Processing data to apply structure is nothing new.  What is new is that we have the ability to do this with mixed data sets in the same data store with much richer detail, thus allowing us to process this data together in advanced scenarios.

6. Make Data Available for Fast Processing with SQL on Hadoop


Once you have structure applied to your data, using Pivotal’s Business Data Lake, you can reuse your SQL skills with modern SQL-based tools now available inside Hadoop® clusters to do fast processing on your data without actually moving the data. Only HAWQ provides full analytic SQL support in Hadoop® with a massively parallel SQL query processor. This allows you to enjoy very high performance on a very large data set and to leverage advanced analytics functions in MADlib, as well as analytics applications such as SAS. The diagram above shows how HAWQ utilizes compute resources within Hadoop® nodes to do massively parallel processing of data directly without moving data.

If you try pulling this large data set out of Apache Hadoop® for analytic processing, you will greatly increase your time to results and end up using far more system resources than needed.

7. Achieve Data Integration


With structure applied to your data, and the ability to deploy advanced analytics, now you can start doing some very powerful investigation, which is actually supported on a single storage layer provided by Apache Hadoop®.

By discovering relationships between otherwise seemingly unrelated data sets, like system commands and help tickets, it is possible to discover correlations and potential causation or create multi-dimensional analytical models that have higher precision in predictive analytics.
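To give a feel for relating two otherwise separate data sets, the Python sketch below counts which system commands were issued by a user shortly before that user opened a help ticket. All data, field layouts, and the `commands_before_tickets` name are hypothetical, and a co-occurrence count like this only hints at correlation. It says nothing about causation on its own.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical data sets that would normally live in separate silos.
commands = [  # (user, minute of day, command)
    ("alice", 0, "export"),
    ("alice", 3, "sync"),
    ("bob", 10, "export"),
    ("carol", 12, "login"),
    ("bob", 30, "sync"),
]
tickets = [("alice", 5), ("bob", 14)]  # (user, minute the ticket was opened)


def commands_before_tickets(
    commands: Iterable[Tuple[str, int, str]],
    tickets: Iterable[Tuple[str, int]],
    lookback_minutes: int,
) -> Counter:
    """Count commands a user ran within `lookback_minutes` before opening a ticket."""
    command_list = list(commands)
    counts: Counter = Counter()
    for user, opened in tickets:
        for cmd_user, issued, command in command_list:
            if cmd_user == user and 0 <= opened - issued <= lookback_minutes:
                counts[command] += 1
    return counts


preceding = commands_before_tickets(commands, tickets, lookback_minutes=10)
print(preceding.most_common())
```

With both data sets in the same store, a join like this becomes a routine query rather than a data integration project.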

The ability to find correlations and create context between data sets is aided significantly by the large data sets and by the ability to mix source data formats in the single storage layer of Apache Hadoop®. Without this richness and flexibility, integrating unrelated data sets is a much more difficult undertaking using traditional methods.

8. Improve Machine Learning & Predictive Analytics


Since HDFS allows you to store as much data as you want at a very cheap price, it is possible to store larger detail data sets such as time series feeds and application logs. In traditional data warehousing, ETL processes will aggregate and summarize this information while losing detail for the purposes of facilitating reporting. By saving the detail, it is possible to run machine learning algorithms on the data to help build more accurate and higher quality predictive analytics.
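A tiny Python example shows what is lost when detail is summarized away. The readings are hypothetical, and the z-score check is a deliberately simple stand-in for a real machine learning algorithm. The helper name `find_spikes` and the threshold of 2 standard deviations are assumptions for the sketch.

```python
from statistics import mean, pstdev
from typing import List, Sequence

# Hypothetical sensor readings, one per interval.
readings = [50, 51, 49, 50, 52, 95, 50, 49, 51, 50]

# What a summarizing ETL step might keep: a single aggregate value.
summary = mean(readings)


def find_spikes(values: Sequence[float], threshold: float = 2.0) -> List[int]:
    """Return the indexes of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


print("aggregate only:", summary)
print("spike positions from detail:", find_spikes(readings))
```

The aggregate looks unremarkable, while the detail data reveals exactly when the anomaly occurred, which is the kind of signal predictive models depend on.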

9. Deploy Real-Time Automation at Scale


Distributed in-memory databases such as Pivotal GemFire XD make it possible to deploy real-time, data-driven applications at very high scale in terms of processing happening in parallel. These kinds of applications tend to generate tremendous volumes of historical data and logs, which are perfect for storing in Apache Hadoop® for future analysis.

Examples of such applications include responding to and processing incoming streaming data, such as for Internet of Things events or sensor time series values. Or perhaps you are deploying a large scale mobile-web application, with intelligent user experiences, and smart automation and processing in the backend.  You want to be able to capture and store detailed logging of all interactions for further analysis to allow for continual improvement of the user experience.

10. Achieve Continuous Innovation at Scale


The ability to deploy automation at scale, capture and store all data, and analyze to discover insights and algorithms is an ongoing process of continuous improvement and innovation.


Using the full set of capabilities within a data lake, from storing massive data sets to achieving continuous innovation, your company can maximize the business value it generates from its data.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

