Many organizations have an Apache Hive metastore that stores the schemas for their data lake. You can use Amazon Athena due to its serverless nature; Athena makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. You may also want to reliably query the rich datasets in the lake, with their schemas hosted in an external Hive metastore. In response to customers’ requests, AWS announced Athena’s support for Hive metastore in early 2020. This extends the ability in Athena to query external data stores, with a combined benefit of better performance and lower cost.
In this post, we provide an AWS CloudFormation template that configures a remote Hive metastore based on Amazon Relational Database Service (Amazon RDS) and MySQL with Amazon EMR located in a private subnet to perform ETL tasks. We then demonstrate how you can use a Spark step to pull COVID-19 datasets from a public repository and transform the data into a performant Parquet columnar storage format. We also walk through the steps to query the data with Athena and visualize it with Amazon QuickSight. QuickSight is a fully managed data visualization service; it lets the you easily create and publish interactive dashboards by analyzing data from various data sources, including Athena.
The following diagram shows the architecture for our solution.
As shown in the preceding architecture, we have an Availability Zone within a VPC in an AWS Region. The Availability Zone hosts subnets that are either public or private. A multi-master EMR cluster that has Hive software components installed is launched in a private subnet with egress internet traffic through a NAT gateway to download data from public sites for analysis. The multi-master feature also ensures that the primary nodes are highly available. The Hive metastore is backed by a remote RDS for MySQL instance located in the same private subnet.
We also have an Amazon Simple Storage Service (Amazon S3)-based data lake. A Spark step in Amazon EMR retrieves the data in CSV format, saves it in the provisioned S3 bucket, and transforms the data into Parquet format. The Spark step creates an external Hive table referencing the Parquet data and is ready for Athena to query.
The connector is called a catalog, which when invoked in a SQL statement with Athena, invokes the Lambda function. The function exits if the connector is not active for 15 minutes. For queries that run longer than 15 minutes, it’s recommended to let the query complete before retrieving the query results in an Amazon S3 location you can specify.
In Athena, you can compose a SQL statement against the Hive tables with predicates to further limit the size of the query result for faster visualization by QuickSight.
Deploying the resources with AWS CloudFormation
To demonstrate our solution, we provide a CloudFormation template that you can download to easily deploy the necessary AWS resources. The template creates the following resources to simulate the environment:
- VPC and subnets – A VPC with one public and one private subnets. A NAT gateway is also created to allow outbound internet access from the EMR cluster to download public COVID-19 datasets from the Johns Hopkins GitHub repo.
- EMR cluster – A multi-master EMR cluster with Hive, running on three primary nodes (m5.xlarge) and two core nodes (m5.xlarge), is launched to support the thrift connection required by the Athena Lambda connectors.
- Amazon RDS for MySQL database – An RDS for MySQL primary instance is launched in the same subnet as the EMR cluster. The RDS instance serves as the Hive metastore backend data store.
- S3 bucket – An S3 bucket stores files in Parquet format by Amazon EMR and is accessed later by Athena.
- AWS IAM users – Two AWS Identity and Access Management (IAM) users belonging to different user groups. The first user, the data engineer, has permissions to access the Lambda-based Athena data source connector. The other user, the salesperson, does not have permissions to access the connector.
To get started, you need to have an AWS account. If you don’t have one, go to aws.amazon.com to sign up for one. Then complete the following steps:
- Sign in to the AWS Management Console as an IAM power user, preferably an admin user.
- Choose Launch Stack to launch the CloudFormation template:
This template has been tested in the US East (N. Virginia) Region.
- Choose Next.
You’re prompted to enter a few launch parameters.
- For Stack name, enter a name for the stack (for example,
- For Hive Metastore Host Number, choose the Amazon EMR primary node to connect to (or use the default value).
- Continue to choose Next and leave other parameters at their default.
- On the review page, select the three check boxes to confirm that AWS CloudFormation might create resources.
- Choose Create stack.
The stack takes 15–20 minutes to complete.
- On the Outputs tab of the stack details, save the key-value pairs to use later.
When the EMR cluster is provisioned, it uses a bootstrap action to install the necessary Python libraries. It runs an Amazon EMR Spark step (a PySpark script) that downloads three COVID-19 datasets for confirmed, recovered, and death cases from the John Hopkins GitHub repo in CSV format and stores them in the
csv subfolder of the S3 bucket created by the CloudFormation stack. Lastly, the final transformed data is converted to Parquet format and external Hive tables are created referencing the Parquet data located in the
parquet subfolder of the S3 bucket.
The following are the source codes for the bootstrap and Spark step actions for your reference:
To validate the data, on the Amazon S3 console, choose the bucket name from the CloudFormation template outputs. You should see a
covid_data folder in the bucket. The folder contains the two subfolders,
parquet, which store the raw CSV and transformed Parquet data, respectively.
Querying the data in Athena
The CloudFormation template creates two users belonging to two different AWS IAM groups. The
de_user_us-east-1_athena-hms user is a data engineer with permissions to access the Lambda function to communicate with the Hive metastore using the Athena data source connector. They belong to the group
sales_user_us-east-1_athena-hms user is a salesperson with no access to the connector. They belong to the group
To query the data, first retrieve your secret values in AWS Secrets Manager:
- On the Secrets Manager console, choose Secrets.
- Choose DataEngineerUserCreds.
- Choose Retrieve secret value.
- Save the username and password.
- Repeat these steps for
With the data in place, we can now use Athena to query it.
Accessing data as the data engineer
To query the data as the
de_user_us-east-1_athena-hms user, complete the following steps:
- Sign in to the Athena console as the
de_user_us-east-1_athena-hmsuser with the credentials retrieved from Secrets Manager.
After logging in, we need to create a data source for Hive metastore.
- On the navigation bar, choose Data sources.
- Choose Connect data source.
- For Choose where you data is located, select Query data in Amazon S3.
- For Choose a metadata catalog, select Apache Hive metastore.
- Chose Next.
- For Lambda function, choose the function the CloudFormation template created with the key
- For Catalog name, enter a name for the Athena catalog (for example,
- Choose Connect.
You should now have a catalog named
demo_hive_metastore with the catalog type
- On the navigation bar, choose Query editor.
- Enter the following SQL statement:
This SQL statement selects all the columns in the
covid_confirmed_cases table with predicates to only include a few countries of interest. We use a table name with the pattern
<hive database name>.
<hive table name in the database>, which for this post translates to
- Choose Run query.
The following screenshot shows your query results.
Make sure you completely sign out of the console before moving on to the next steps.
Accessing data as the salesperson
Sign in to the console as
sales_user_us-east-1_athena-hms user. Because the salesperson user doesn’t have the appropriate IAM policies to access the Hive metastore connection, you can’t see the tables.
The data engineer has additional policies attached to their IAM group in addition to the managed
AmazonAthenaFullAccess policy: the
<stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policies created by the CloudFormation template. Therefore, the data engineer can view the tables, but the salesperson can’t.
These policies are available on the IAM console, on the Permissions tab for the group
The following sample JSON code is the
<stack-name>-DataBucketReadWriteAccessPolicy-xxxxx policy to allow access to the provisioned S3 bucket:
The following sample JSON code is the
<stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy to allow access to the Lambda Hive metastore function:
Next, we walk through using QuickSight to visualize the results. Make sure you completely sign out of the console as the salesperson user before proceeding to the next steps.
Signing up for QuickSight
You can skip this section if you have already signed up for QuickSight previously.
- Sign in to the console as the IAM power user who deployed the CloudFormation template or any user with enough IAM privileges to set up QuickSight.
- On the QuickSight console, choose Sign up for QuickSight.
- Select either the Standard or Enterprise edition for QuickSight.
- Choose Continue.
- For QuickSight account name, enter your AWS account ID.
- For Notification email address, enter your email.
- Select Amazon S3.
- Select the provisioned S3 bucket to grant QuickSight permission to access.
- Choose Finish.
Your QuickSight account should now be set up.
Attaching the Hive metastore access policy
Before you can use QuickSight, you have to attach the Hive metastore access policy to the QuickSight service role.
- On the IAM console, search for the service role
- Choose the role.
- Search for the
- Select the policy and attach it to the QuickSight service role.
Creating your data source and performing data conversions
Before we can create visualizations, we need to set up our data source.
- Download the SQL script covid-19.sql.
- On the QuickSight console, choose Datasets in the navigation pane.
- Choose New dataset.
- Choose Athena.
- In the pop-up window, for Data source name, enter
- Choose Validate.
- When the connection is validated, choose Create data source.
- In the next window, choose Use custom SQL.
- Enter the content of the covid-19.sql script in the query window.
- Choose Confirm query.
- Leave Import to SPICE for quicker analytics
- Choose Visualize.
Now we perform a few data type conversions before visualizing the data.
- Choose the Edit icon next to Data set on the menu bar.
- Choose the … icon.
- Choose Edit.
- Expand date and choose Change data type.
- Choose Date.
yyyy-MM-ddto convert the date format.
- Choose Update.
Now we create a coordinate using the latitude and longitude values.r
- Expand lat and choose Ad to coordinates.
- Leave Create new geospatial coordinates
- Chose Add.
- In the pop-up window, for Name your coordinates, enter coordinate.
- For Field to use for longitude, choose lon.
- Choose Create coordinates.
- Choose Save and visualize on the menu bar.
Creating visualizations in QuickSight
Now we can visualize our data. For this post, we create a map visualization.
- For Visual types, choose the map
- For Geospatial, drag coordinates.
- For Size, drag confirmed.
- For Color, drag country.
This world map shows the accumulated confirmed cases for selected countries over time; you need to use a filter to look at confirmed cases on a specific date.
- In the navigation pane, choose Filter.
- Choose the + icon.
- For Filter type, choose Time range.
- Choose start and end dates, such as 2020-09-10 00:00 and 2020-09-11 00:00, respectively.
- Choose Apply.
This plots the confirmed cases on September 10, 2020, for these countries.
Similarly, you can choose other visual types, such as a line chart, and generate the mortality rate for selected countries over time.
Using highly available primary nodes of the Amazon EMR cluster
The EMR cluster has a multi-master configuration with three primary nodes running to meet high availability requirements. At any time, the Lambda function communicates with one of these three EMR primary nodes. In the rare event that this node goes down, you can quickly re-establish the Athena data source connector to the external Hive metastore by failing over to another active primary node.
To perform this failover, complete the following steps:
- On the AWS CloudFormation console, choose Stacks.
- Choose athena-hms.
- Choose update.
- Choose Use current update.
- Choose Next.
- For Hive Metastore Host Number, choose a host other than the current one you’re using.
- Choose Next.
- Acknowledge that AWS CloudFormation might create IAM resources.
- Choose Update stack.
In less than a minute, you should be able to access the Hive metastore and continue to query on the Athena console.
You may want to clean up the demo environment when you’re done. To do so, on the AWS CloudFormation console, select the template and choose Delete.
This action also deletes the S3 bucket and any data in it. If you want to retain the data for future use, you should make a copy of the bucket before you delete it.
In this post, we walked through a solution using Athena to query external Hive tables with public COVID-19 datasets hosted in an S3 bucket and visualizing the data with QuickSight. We provided a CloudFormation template to automate the deployment of necessary AWS services for the demo. We encourage you to use these managed and scalable services for your specific use cases in production.
About the Authors
James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.
Chinmayi Narasimhadevara is a Solutions Architect with Amazon Web Services. Chinmayi has over 15 years of experience in information technology and has worked with organizations ranging from large enterprises to mid-sized startups. She helps AWS customers leverage the correct mix of AWS services to achieve success for their business goals.
Gagan Brahmi is a Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.