Apache Hive is an open-source data warehouse package that runs on top of an Apache Hadoop cluster. You can use Hive for batch processing and large-scale data analysis. Hive uses Hive Query Language (HiveQL), which is similar to SQL.
ACID (atomicity, consistency, isolation, and durability) properties make sure that the transactions in a database are atomic, consistent, isolated, and reliable.
Amazon EMR 6.1.0 adds support for Hive ACID transactions so it complies with the ACID properties of a database. With this feature, you can run INSERT, UPDATE, DELETE, and MERGE operations in Hive managed tables with data in Amazon Simple Storage Service (Amazon S3). This is a key feature for use cases like streaming ingestion, data restatement, bulk updates using MERGE, and slowly changing dimensions.
This post demonstrates how to enable Hive ACID transactions in Amazon EMR, how to create a Hive transactional table, how it can achieve atomic and isolated operations, and the concepts, best practices, and limitations of using Hive ACID in Amazon EMR.
Enabling Hive ACID in Amazon EMR
To enable Hive ACID as the default for all Hive managed tables in an EMR 6.1.0 cluster, use the following hive-site configuration:
For the complete list of configuration parameters related to Hive ACID and descriptions of the preceding parameters, see Hive Transactions.
Hive ACID use case
In this section, we explain the Hive ACID transactions with a straightforward use case in Amazon EMR.
Enter the following Hive command in the master node of an EMR cluster (6.1.0 release) and replace <s3-bucket-name> with the bucket name in your account:
After Hive ACID is enabled on an Amazon EMR cluster, you can run the CREATE TABLE DDLs for Hive transaction tables.
To define a Hive table as transactional, set the table property
The following CREATE TABLE DDL is used in the script that creates a Hive transaction table acid_tbl:
This script generates three partitions in the provided Amazon S3 path. See the following screenshot.
The first partition,
trans_date=2020-08-01, has the data generated as a result of sample INSERT, UPDATE, DELETE, and MERGE statements. We use the second and third partitions when explaining minor and major compactions later in this post.
ACID is achieved in Apache Hive using three types of files:
delete_delta. Edits are written in delta and
base file is created by the
Insert Overwrite Table query or as the result of major compaction over a partition, where all the files are consolidated into a single
base_<write id> file, where the write ID is allocated by the Hive transaction manager for every write. This helps achieve isolation of Hive write queries and enables them to run in parallel.
The INSERT operation creates a new
delta_<write id>_<write id> directory.
The DELETE operation creates a new
delete_delta_<write id>_<write id> directory.
To support deletes, a unique
row__id is added to each row on writes. When a DELETE statement runs, the corresponding
row__id gets added to the
delete_delta_<write id>_<write id> directory, which should be ignored on reads. See the following screenshot.
The UPDATE operation creates a new
delta_<write id>_<write id> directory and a
delete<write id>_<write id> directory.
The following screenshot shows the second partition in Amazon S3,
A Hive transaction provides snapshot isolation for reads. When an application or query reads the transaction table, it opens all the files of a partition/bucket and returns the records from the last transaction committed.
With the previously mentioned logic for Hive writes on a transactional table, many small
delete_delta files are created, which could adversely impact read performance over time because each read over a particular partition has to open all the files (including
delete_delta) to eliminate the deleted rows.
This brings the need for a compaction logic for Hive transactions. In the following sections, we use the same use case to explain minor and major compactions in Hive.
A minor compaction merges all the
delete_delta files within a partition or bucket to a single
delta_<start write id>_<end write id> and
delete_delta_<start write id>_<end write id> file.
We can trigger the minor compaction manually for the second partition (
trans_date=2020-08-02) in Amazon S3 with the following code:
If you check the same second partition in Amazon S3, after a minor compaction, it looks like the following screenshot.
You can see all the
delete_delta files from write ID 0000005–0000009 merged to single
delete_delta files, respectively.
A major compaction merges the
delete_delta files within a partition or bucket to a single
base_<latest write id>. Here the deleted data gets cleaned.
A major compaction is automatically triggered in the third partition (
trans_date='2020-08-03') because the default Amazon EMR compaction threshold is met, as described in the next section. See the following screenshot.
To check the progress of compactions, enter the following command:
The following screenshot shows the output.
Compaction in Amazon EMR
Compaction is enabled by default in Amazon EMR 6.1.0. The following property determines the number of concurrent compaction tasks:
- hive.compactor.worker.threads – Number of worker threads to run in the instance. The default is 1 or vCores/8, whichever is greater.
Automatic compaction is triggered in Amazon EMR 6.1.0 based on the following configuration parameters:
- hive.compactor.check.interval – Time period in seconds to check if any partition requires compaction. The default is 300 seconds.
- hive.compactor.delta.num.threshold – Triggers minor compaction when the total number of
deltafiles is greater than this value. The default is 10.
- hive.compactor.delta.pct.threshold – Triggers major compaction when the total size of
deltafiles is greater than this percentage size of base file. The default is 0.1, or 10%.
The following are some best practices when using this feature:
- Use an external Hive metastore for Hive ACID tables – Our customers use EMR clusters for compute purposes and Amazon S3 as storage for cost-optimization. With this architecture, you can stop the EMR cluster when the Hive jobs are complete. However, if you use a local Hive metastore, the metadata is lost upon stopping the cluster, and the corresponding data in Amazon S3 becomes unusable. To persist the metastore, we strongly recommend using an external Hive metastore like an Amazon RDS for MySQL instance or Amazon Aurora. Also, if you need multiple EMR clusters running ACID transactions (read or write) on the same Hive table, you need to use an external Hive metastore.
- Use ORC format – Use ORC format to get full ACID support for INSERT, UPDATE, DELETE, and MERGE statements.
- Partition your data – This technique helps improve performance for large datasets.
- Enable an EMRFS consistent view if using Amazon S3 as storage – Because you have frequent movement of files in Amazon S3, we recommend using an EMRFS consistent view to mitigate the issues related to the eventual consistency nature of Amazon S3.
- Use Hive authorization – Because Hive transactional tables are Hive managed tables, to prevent users from deleting data in Amazon S3, we suggest implementing Hive authorization with required privileges for each user.
Keep in mind the following limitations of this feature:
- The AWS Glue Data Catalog doesn’t support Hive ACID transactions.
- Hive external tables don’t support Hive ACID transactions.
- Bucketing is optional in Hive 3, but in Amazon EMR 6.1.0 (as of this writing), if the table is partitioned, it needs to be bucketed. You can mitigate this issue in Amazon EMR 6.1.0 using the following bootstrap action:
This post introduced the Hive ACID feature in EMR 6.1.0 clusters, explained how it works and its concepts with a straightforward use case, described the default behavior of Hive ACID on Amazon EMR, and offered some best practices. Stay tuned for additional updates on new features and further improvements in Apache Hive on Amazon EMR.
About the Authors
Suthan Phillips is a big data architect at AWS. He works with customers to provide them architectural guidance and helps them achieve performance enhancements for complex applications on Amazon EMR. In his spare time, he enjoys hiking and exploring the Pacific Northwest.
Chao Gao is a Software Development Engineer at Amazon EMR. He mainly works on Apache Hive project at EMR, and has some in-depth knowledge of distributed database and database internals. In his spare time, he enjoys making roadtrips, visiting all the national parks and traveling around the world.