Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics, and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. Running Presto on Amazon EMR is a popular choice because Amazon EMR provides the latest, stable, open-source community Presto innovations and Amazon EMR platform-level optimizations for Presto workloads.
With Amazon EMR 5.31, you have one more reason to run Presto on Amazon EMR. Amazon EMR now includes EMR runtime for Presto, a performance-optimized runtime environment for Presto that includes custom performance improvements. With EMR runtime for Presto, your queries run up to 2.6 times faster. EMR runtime for Presto is 100% API compatible with open-source Presto. Therefore, you can run Presto applications on Amazon EMR without having to make any changes. EMR runtime for Presto is available by default on Amazon EMR release 5.31 and later, and 6.1 and later.
Results observed using TPC-DS benchmark
To measure performance improvements, we compared Amazon EMR 5.31.0, which includes EMR runtime for Presto and is compatible with open-source Presto version 0.238.3, against Amazon EMR 5.29.0, which includes open-source Presto version 0.227. We used TPC-DS benchmark queries at 1 TB scale and ran them on a 6-node r4.8xlarge EMR cluster with data in Amazon Simple Storage Service (Amazon S3).
We measured performance improvements as the total query runtime across all queries and as the geometric mean of query runtimes. The geometric mean of runtime for n queries is calculated by multiplying the query times for the n queries and taking the nth root of the product. Unlike the arithmetic mean, the geometric mean isn’t skewed by outliers, whether large performance improvements or regressions in individual queries, and is generally used for comparing results from TPC-DS benchmarks. We observed total runtime improving by 4.91 times and the geometric mean of query runtime improving by 2.62 times, with individual queries improving by up to 7.9 times (and one query, query 72 from TPC-DS, improving by over 400 times). In our tests, query 4 from TPC-DS failed to run on the EMR 5.29 cluster and was excluded from the comparison.
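To make the two metrics concrete, here is a minimal Python sketch of how total-runtime speedup and geometric-mean speedup could be computed from per-query timings. The function names and the sample timings are hypothetical and chosen only for illustration; they are not our benchmark data or tooling. The sketch takes the mean of the logarithms instead of multiplying the runtimes directly, which is mathematically equivalent to the nth-root definition but avoids overflow on long query lists.

```python
import math


def geometric_mean(runtimes):
    """Return the nth root of the product of n positive runtimes.

    Computed as exp(mean(log(t))) so the product never overflows.
    """
    if not runtimes:
        raise ValueError("need at least one runtime")
    if any(t <= 0 for t in runtimes):
        raise ValueError("runtimes must be positive")
    return math.exp(sum(math.log(t) for t in runtimes) / len(runtimes))


def total_speedup(baseline, optimized):
    """Ratio of summed runtimes: baseline total divided by optimized total."""
    return sum(baseline) / sum(optimized)


def geomean_speedup(baseline, optimized):
    """Ratio of geometric means: baseline divided by optimized."""
    return geometric_mean(baseline) / geometric_mean(optimized)


if __name__ == "__main__":
    # Hypothetical per-query runtimes in seconds (NOT measured results).
    # The third query is an outlier that dominates the total runtime.
    baseline = [10.0, 40.0, 1000.0]
    optimized = [5.0, 10.0, 10.0]

    print(f"total runtime speedup:  {total_speedup(baseline, optimized):.2f}x")
    print(f"geometric mean speedup: {geomean_speedup(baseline, optimized):.2f}x")
```

With these made-up numbers, the single outlier query pulls the total-runtime speedup far above the geometric-mean speedup, which is the same pattern as the gap between the 4.91 times and 2.62 times figures reported above.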
The following graph shows performance improvements measured as total runtime for TPC-DS queries. Amazon EMR 5.31 with EMR runtime has the better (lower) runtime.
The following graph shows performance improvements measured as the geometric mean for TPC-DS queries. Amazon EMR 5.31 with EMR runtime has the better (lower) geometric mean.
The following graph shows performance improvements in Amazon EMR 5.31 with EMR runtime compared to Amazon EMR 5.29 without EMR runtime for short-running queries (running for less than 60 seconds in Amazon EMR 5.29). Higher numbers are better.
The following graph shows the performance improvements in Amazon EMR 5.31 with EMR runtime compared to Amazon EMR 5.29 without EMR runtime for long-running queries (running for more than 60 seconds). Query 72, which has a speedup of over 400 times, is excluded and shown in a separate graph. Higher numbers are better.
Queries running for less than 60 seconds are up to 6 times faster, as seen in query 98. Queries running for more than 60 seconds are up to 7.9 times faster, as seen in query 61. One query, query 72, is 421 times faster with EMR runtime than without.
With every new release of Amazon EMR, you benefit from better query performance. To keep up to date, subscribe to the AWS Big Data blog’s RSS feed to learn about more Presto optimizations, configuration best practices, and tuning advice.
About the Authors
Al MS is a product manager for Amazon EMR at Amazon Web Services.
Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.