Customers use Apache Hive with Amazon EMR to provide SQL-based access to petabytes of data stored on Amazon S3. Amazon EMR 6.0.0 adds support for Hive LLAP, providing an average performance speedup of 2x over EMR 5.29, with up to 10x improvement on individual Hive TPC-DS queries. This post shows you how to enable Hive LLAP, and outlines the performance gains we’ve observed using queries from the TPC-DS benchmark.

Twice as fast compared to Amazon EMR 5.29.0

To evaluate the performance benefits of running Hive with Amazon EMR release 6.0.0, we’re using 70 TCP-DS queries with a 3 TB Apache Parquet dataset on a six-node c4.8xlarge EMR cluster to compare the total runtime and geometric mean with results from EMR release 5.29.0.

The results show that the TPC-DS queries run twice as fast in Amazon EMR 6.0.0 (Hive 3.1.2) compared to Amazon EMR 5.29.0 (Hive 2.3.6) with the default Amazon EMR Hive configuration.

The following graph shows performance improvements measured as total runtime for 70 TPC-DS queries. Amazon EMR 6.0.0 has the better (lower) runtime.

EMRHive1 rev

The following graph shows performance improvements measured as geometric mean for 70 TPC-DS queries. Amazon EMR 6.0.0 has the better (lower) geometric mean.

EMRHive2 rev

The following graph shows the performance improvements on a per-query basis sorted by highest performance gain. In this comparison, the higher numbers are better.

EMRHive3 rev

Hive Live Long and Process (LLAP)

Hive LLAP enhances the execution model of Hive, using persistent daemons with dynamic in-memory caching for low-latency queries. These daemons run on the core and task nodes in EMR clusters, caching data and metadata, and avoid the container startup overhead of traditional Hive queries because they are long lived processes.

By default, LLAP daemons do not start as part of EMR cluster start-up. You can enable LLAP in Amazon EMR 6.0.0 with the following hive configuration:

[
  {
    "Classification": "hive",
    "Properties": {
      "hive.llap.enabled": "true"
    }
  }
]

Since Hive LLAP uses persistent daemons that run on YARN, a percentage of the EMR cluster’s YARN resources will be reserved for Hive LLAP daemons when LLAP is enabled. You can override the following properties, which are predefined/calculated by EMR, using the hive configuration when launching an EMR cluster.

PropertyDescriptionDefault
hive.llap.num-instancesDefines the number of LLAP instances to run on the EMR cluster. There can be at max one LLAP instance per Node manager node (Core/Task).Number of Core/Task nodes in the cluster
hive.llap.percent-allocationDefines percentage of YARN NodeManager resources allocated to LLAP instance. This only applies to nodes running an LLAP instance defined by hive.llap.num-instances.0.6 (60%)

For example, to define 80% of YARN NodeManager resources to LLAP, use the following configuration:

 [
      {
        "classification": "hive",
        "properties": {
          "hive.llap.percent-allocation": "0.8",
          "hive.llap.enabled": "true"
        },
        "configurations": []
      }
    ]

You can override the following properties which are predefined/calculated by EMR using hive-site classification when launching an EMR cluster.

PropertyDescription
hive.llap.daemon.yarn.container.mbYARN container size for each LLAP daemon
hive.llap.daemon.memory.per.instance.mbLLAP Xmx
hive.llap.io.memory.sizeSize of LLAP IO cache in an LLAP daemon
hive.llap.daemon.num.executorsNumber of executors (tasks that can execute in parallel) per LLAP daemon

For more details on configuring Hive LLAP in Amazon EMR 6.0.0, please refer to Using Hive LLAP.

Resizing LLAP Daemons

You can modify the number of LLAP instances using the YARN CLI. Here’s a sample command:

$ yarn app -flex <Application Name> -component llap -1

  • Application Name – llap0 (Default)

You can check the status of the Hive LLAP daemons with the following command:

$ hive --service llapstatus

Hive LLAP Web Services

The Hive LLAP Monitor listens on port 15002 in each core and task node running the LLAP daemon.

The following table provides an overview of the Web Services available in LLAP.

URIDescription
http://coretask-public-dns-name:15002Shows the overview of heap, cache, executor and system metrics
http://coretask-public-dns-name:15002/confShows the current LLAP configuration
http://coretask-public-dns-name:15002/peersShows the details of LLAP nodes in the cluster extracted from the Zookeeper server
http://coretask-public-dns-name:15002/iomemShows details about the cache contents and usage
http://coretask-public-dns-name:15002/jmxShows the LLAP daemon’s JVM metrics
http://coretask-public-dns-name:15002/stacksShows JVM stack traces of all threads
http://coretask-public-dns-name:15002/conflogShows the current log levels
http://coretask-public-dns-name:15002/statusShows the status of LLAP

 Hive Performance (LLAP vs Container)

In EMR 6.0.0, Hive LLAP is optional and all Hive queries are executed using dynamically allocated containers when Hive LLAP is disabled. To illustrate the differences of running Hive queries with persistent Hive LLAP daemons versus dynamically allocated containers, we’ve used a subset of the TCP-DS benchmark. The results show an overall performance improvement of 27%, with some queries sped up by up to 4x.

The following graph shows performance improvements measured as total runtime for 50 TPC-DS queries. Amazon EMR 6.0.0 using LLAP has the better (lower) runtime.

EMRHive4 rev

The following graph shows performance improvements measured as geometric mean for 50 TPC-DS queries. Amazon EMR 6.0.0 using LLAP has the better (lower) runtime.

EMRHive5 rev

The following graph shows the performance improvements on a per-query basis sorted by highest performance gain. In this comparison, the higher numbers are better.

EMRHive6 rev

Summary

This post demonstrated the performance improvement of Hive on Amazon EMR 6.0.0 in comparison to the previous Amazon EMR 5.29 release. This improvement in performance helps to reduce query runtime and cost. You also learned about to use Hive LLAP with Amazon EMR 6.0.0, how to configure it, how to view the status and metrics using LLAP Monitor, and saw the performance gains when Hive LLAP is enabled. Stay tuned for additional updates on new features and further improvements in Apache Hive on Amazon EMR.

 


About the Author

Suthan1Suthan Phillips is a big data architect at AWS. He works with customers to provide them architectural guidance and helps them achieve performance enhancements for complex applications on Amazon EMR. In his spare time, he enjoys hiking and exploring the Pacific Northwest.

 

 

from AWS Big Data Blog: https://aws.amazon.com/blogs/big-data/apache-hive-is-2x-faster-with-hive-llap-on-emr-6-0-0/