New Benchmark Results: Pivotal Query Optimizer Speeds Up Big Data Queries Up To 1000x

Cross posted from my original blog at Pivotal P.O.V

Have you heard about the new super-efficient Pivotal Query Optimizer developed by the Greenplum engineering team? Previously codenamed “Orca”, this new feature has been released as part of the HAWQ query engine in Pivotal HD, Pivotal’s commercially-supported distribution of Apache Hadoop.

This new optimizer has been undergoing months of performance testing and improvements and is nearly ready for market. Pivotal will be showcasing a peer-reviewed paper on the results of this performance study at the ACM SIGMOD Conference 2014, June 22 – 27. Titled “Orca: A Modular Query Optimizer Architecture for Big Data”, the paper explains how the team built the query optimizer and shows the results they’ve seen so far in customer usage and ongoing testing. If you would like to get a copy of the paper yourself and see the detailed benchmark results, ask at the Pivotal booth (booth S32) at this week’s Hadoop Summit in San Jose.

The Pivotal Query Optimizer is now also available to Pivotal Greenplum DB customers as part of an early access program. Customers who are interested in trying it out can register here.

Sophisticated Computer Science

Developing a query optimizer involves some very sophisticated computer science. The team wanted to create a new SQL-compliant query technology that was better suited to the trends we are seeing in big data:

  • Increasing volume from companies keeping detail data, not aggregates, from many more sources.
  • More variety in the types of data to be incorporated into queries, such as application logs, sensor time series, geospatially tagged data, genomics data, and social media feeds.
  • Diverse storage due to an increasing variety of data technologies being used instead of traditional RDBMS for storing and managing this data.
  • Complex queries generated by advanced analytics algorithms being applied to all this data.

This technology is laser-focused on providing fast SQL query results on petabytes of data and on being portable across data architectures, such as Pivotal HD and Pivotal Greenplum.

© 2014 ACM, used with permission.

Figure 1. The Pivotal Query Optimizer is a standalone optimizer that is portable across databases that implement Data eXchange Language (DXL).

Along with further enhancements in the release of Pivotal HD 2.0, this new query optimizer lets customers run fully ANSI SQL-compliant queries against Hadoop up to 1000X faster than they could with Pivotal HD 1.0. Not only does it speed up your queries, it also makes Hadoop more practical for serious data science work. Faster queries in HAWQ open up more analytics use cases on Hadoop, and HAWQ comes with support for GraphLab, MADlib, languages such as R, Python, and Java, and all-new support for Parquet files.

© 2014 ACM, used with permission.

Figure 2. The Pivotal Query Optimizer finds the fastest query plans for full ANSI SQL-compliant queries hitting either Pivotal HD or Pivotal Greenplum Database.

Performance Testing on Hadoop

I’m pleased to be able to preview some of these test results with you in this blog, and for a specific purpose. Pivotal is looking for a few customers of Greenplum DB to help with final testing and validation of the new query optimizer. We’d love for you to join the early access program and experience for yourself the performance benefits and new use cases you can achieve with the new Pivotal Query Optimizer on Greenplum DB.

Part of validating the new Pivotal Query Optimizer is performance testing against the TPC-DS benchmark. As mentioned, testing Pivotal HD 2.0 versus Pivotal HD 1.0 against the benchmark showed that some of the queries improved by up to 1000X. More importantly, with the new query optimizer, Pivotal HD 2.0 is able to complete the entire benchmark of 111 queries. For the first time in the market, a commercially supported Hadoop stack can be used effectively for ad hoc analytical use cases while also leveraging existing applications and expertise.

Performance Testing on Pivotal Greenplum DB

We did similar performance testing of the new version of Greenplum DB vs. the prior version of Greenplum DB using the TPC-DS benchmark. GPDB configured with the Pivotal Query Optimizer, compared with GPDB configured to use the legacy “planner” query optimizer, showed an overall 5X improvement in running the entire benchmark of 111 queries. For some specific queries we see as much as a 1000X improvement. We timed out the test at that point.

© 2014 ACM, used with permission.

Figure 3. TPC-DS performance testing results of Pivotal Greenplum with Pivotal Query Optimizer vs. Pivotal Greenplum with “planner” query optimizer.
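
For readers who want to run a similar comparison on their own workloads, here is a minimal Python sketch of how per-query and overall speed-up figures like the ones above could be computed. It is an illustration under stated assumptions, not the method used in the study: the timings are made up, the names `speedup`, `timings`, and `TIMEOUT_CAP` are hypothetical, and it assumes that a query which times out under the legacy planner is simply reported at the 1000X cap and that the overall figure is a ratio of total run times.

```python
# Assumption: speed-ups are reported as at most 1000X, and a query that
# times out under the legacy planner is recorded at that cap.
TIMEOUT_CAP = 1000.0


def speedup(planner_secs, optimizer_secs, planner_timed_out=False):
    """Return how many times faster the new optimizer ran one query."""
    if planner_timed_out:
        return TIMEOUT_CAP
    return min(planner_secs / optimizer_secs, TIMEOUT_CAP)


# Hypothetical timings in seconds: (planner, new optimizer, planner timed out?)
timings = {
    "q1": (120.0, 60.0, False),
    "q2": (900.0, 30.0, False),
    "q3": (10000.0, 4.0, True),
}

# Per-query speed-up ratios.
per_query = {name: speedup(*t) for name, t in timings.items()}

# One way to define an overall figure: total planner time / total optimizer time.
total_planner = sum(t[0] for t in timings.values())
total_optimizer = sum(t[1] for t in timings.values())
overall = total_planner / total_optimizer

for name, ratio in per_query.items():
    print(f"{name}: {ratio:.0f}X")
print(f"overall: {overall:.1f}X")
```

Note that a cap like this makes the largest reported ratios a lower bound on the real improvement, since the slower run was stopped before it finished.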

What many of these significantly improved queries have in common is layers of nested queries, often with window functions. We find these kinds of queries occur when users are working with advanced analytics packages such as SAS on top of Greenplum DB. We expect to see significant improvement in analysis times for users of these tools as we ramp up early access.
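
To make that query shape concrete, here is a small Python sketch of a nested query with a window function. It is only an illustration: it runs against an in-memory SQLite database as a stand-in (not Greenplum DB or HAWQ), it assumes SQLite 3.25 or later for window function support, and the `sales` table and its data are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
INSERT INTO sales VALUES
  ('east', 'a', 100), ('east', 'b', 250), ('east', 'c', 50),
  ('west', 'a', 300), ('west', 'b', 120), ('west', 'c', 80);
""")

# Three layers: aggregate per product, rank within each region using a
# window function, then keep only the top-ranked product per region.
QUERY = """
SELECT region, product, amount
FROM (
    SELECT region, product, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM (
        SELECT region, product, SUM(amount) AS amount
        FROM sales
        GROUP BY region, product
    ) AS totals
) AS ranked
WHERE rnk = 1
ORDER BY region
"""

top_sellers = conn.execute(QUERY).fetchall()
for region, product, amount in top_sellers:
    print(region, product, amount)
```

Analytics tools often generate SQL with many more layers than this, which is where the choice of query plan starts to dominate run time.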
