Data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for enterprise data and serve as a common choice for a large number of users querying from a variety of analytics and ML tools. Often you want to ingest data continuously into the data lake from multiple sources and query it concurrently from multiple analytics tools. Previously, getting consistent results meant building custom pipelines to keep the data consistent, which delayed the data's availability to the consuming tools.

We announced AWS Lake Formation transactions, row-level security, and acceleration for preview at AWS re:Invent 2020. In previous posts, we focused on setting up Lake Formation governed tables (Part 1) and streaming ingest into the data lake (Part 2). In this post, we focus on atomicity, consistency, isolation, and durability (ACID) transactions, and we demonstrate how ACID transactions work on Lake Formation governed tables on Amazon S3.

Effective data lakes using AWS Lake Formation

ACID transactions in Lake Formation

AWS Lake Formation ACID transactions make it easy to keep S3 data up to date while continuously ingesting new data and simultaneously running analytics queries that return consistent results. Lake Formation ACID transactions provide snapshot isolation, which ensures that concurrent transactions see a consistent version of the database. These versions are isolated from each other and run concurrently without impacting each other. ACID transactions enable multiple users to concurrently and reliably add and delete S3 objects in an atomic manner, while isolating existing queries by maintaining read consistency for queries against the data lake.

Lake Formation introduces a new table type called governed tables that supports ACID transactions on S3 objects. A transaction can include multiple operations and span multiple governed tables, and it can be performed by multiple users using multiple tools. If an operation within a transaction fails, the entire transaction is rolled back. In-progress transactions are isolated from changes performed by other transactions. Transactions protect the integrity of governed tables by ensuring that you always see a consistent view of your metadata and data in S3.

To illustrate read consistency with ACID transactions, consider the following example. Assume a user, Andy, starts a query in Amazon Athena that scans all partitions of a table to aggregate sales figures. Assume another user, Betty, adds a partition to the same table while Andy's scan is still in progress. If Andy performed the query within a transaction, they see query results that reflect the state of the table at the start of the query; the results don't include data from Betty's newly added partition.
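The snapshot-isolation behavior in this example can be made concrete with a toy in-memory model. This is illustrative Python only, not the Lake Formation API; every name in it is made up:

```python
class ToyTable:
    """Toy model of snapshot isolation; not the Lake Formation API."""

    def __init__(self):
        self.versions = [[]]  # committed snapshots of the partition list

    def begin(self):
        # A "transaction ID" is just the index of the snapshot it sees.
        return len(self.versions) - 1

    def read(self, tx_id):
        # Every read within a transaction sees the same snapshot.
        return list(self.versions[tx_id])

    def commit_add(self, partition):
        # A commit creates a new snapshot; older snapshots are untouched.
        self.versions.append(self.versions[-1] + [partition])


table = ToyTable()
table.commit_add("2021-01")
andy_tx = table.begin()          # Andy starts his scan
table.commit_add("2021-02")      # Betty adds a partition mid-scan
assert table.read(andy_tx) == ["2021-01"]                    # Andy's view is unchanged
assert table.read(table.begin()) == ["2021-01", "2021-02"]   # a new transaction sees both
```

Andy's transaction keeps reading the snapshot that existed when it began; only a transaction started after Betty's commit observes the new partition.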

Setting up resources with AWS CloudFormation

In this post, I demonstrate how to create a new governed table and read from and write to it using an Amazon SageMaker notebook and an AWS Glue development endpoint. The setup for this demonstration is a fictional e-commerce company that sells multiple products. We use two separate notebooks to illustrate ACID transactions; both update and read the products table. Notebook1 starts by populating the products table, which is then read and modified by Notebook2. Within the same transaction, Notebook1 then reads the products table and is isolated from the changes made by Notebook2. After Notebook2 finishes its transaction, Notebook1 starts a new transaction and sees the modified data.

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.

The CloudFormation template generates the following resources:

  • AWS Identity and Access Management (IAM) users, roles, and policies
  • Lake Formation data lake settings and permissions
  • An AWS Glue development endpoint
  • An AWS Glue SageMaker notebook

When following the steps in this section, use the us-east-1 or us-west-2 Region because, as of this writing, this Lake Formation preview feature is available only in those two Regions.

To create your resources, complete the following steps:

  1. Sign in to the CloudFormation console in the us-east-1 or us-west-2 Region.
  2. Choose Launch Stack.
  3. Choose Next.
  4. For DatalakeAdminUserName and DatalakeAdminUserPassword, enter the IAM user name and password for the data lake admin user.
  5. For DataLakeBucketName, enter the name of your data lake bucket.
  6. For DatabaseName, leave as the default.
  7. Choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation can take approximately 15 minutes.

Setting up sample notebooks on your SageMaker notebook

We also provide two Jupyter notebooks to replicate the same code in your environment.

  1. Sign in to the Amazon SageMaker console in the same Region using the DatalakeAdmin3 user.
  2. Choose Notebook instances.
  3. Select the notebook instance aws-glue-<CloudFormation stack name> and choose Open JupyterLab.
  4. Choose Terminal.
  5. Run the following commands in the terminal:
    $ cd /home/ec2-user/SageMaker/
    $ curl 'https://aws-bigdata-blog.s3.amazonaws.com/artifacts/lakeformation_acid_transactions/notebook1.ipynb' -o notebook1.ipynb
    $ curl 'https://aws-bigdata-blog.s3.amazonaws.com/artifacts/lakeformation_acid_transactions/notebook2.ipynb' -o notebook2.ipynb

Now you can see these two new notebooks in JupyterLab’s left pane.


Running ACID transactions on SageMaker notebook

To run your ACID transactions, complete the following steps:

  1. Open the Amazon SageMaker console and choose Notebook instances.
  2. Select the notebook instance aws-glue-<CloudFormation stack name> and choose Open JupyterLab.

Let’s go through the notebooks and see how Lake Formation transactions work.

  1. Open the tab of notebook1.ipynb. This notebook simulates the ETL job and creates the products in the company catalog.
  2. In the first cell, replace datalake-bucket with the bucket name you used for the CloudFormation template's DataLakeBucketName parameter.
  3. Run the first cell for initialization.

When the first cell is completed, Spark application information is visible.


  1. Run the cell under Create a DynamicFrame from sample data to create sample data as a Spark DataFrame and convert it to an AWS Glue DynamicFrame.

Now you can see a sample DynamicFrame including three product records: Television, USB charger, and Blender.
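The three sample records can be sketched as plain Python rows. The product_id values and column names below are illustrative assumptions, not necessarily the notebook's exact schema; only the three product names come from the post.

```python
# Sample product records from the walkthrough, as plain Python rows.
# product_id values and column names are assumptions for illustration.
products = [
    {"product_id": "00001", "product_name": "Television"},
    {"product_id": "00002", "product_name": "USB charger"},
    {"product_id": "00003", "product_name": "Blender"},
]

# In the Glue notebook, the equivalent conversion would look like this sketch:
#   df = spark.createDataFrame(
#       [(p["product_id"], p["product_name"]) for p in products],
#       ["product_id", "product_name"])
#   dyf = DynamicFrame.fromDF(df, glueContext, "products")
```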


  1. Run the cell under Begin a new transaction to start a new Lake Formation transaction by calling the begin_transaction API.

The transaction ID is printed out. We use this transaction ID for writing the data.
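Outside the Glue notebook, one way to make this call is through a boto3-style Lake Formation client. The sketch below injects the client so it stays self-contained; the begin_transaction operation name and TransactionId response key follow the preview API named in this post and are assumptions beyond it.

```python
def begin_lf_transaction(lf_client):
    """Start a Lake Formation transaction and return its ID.

    `lf_client` is assumed to be a boto3 Lake Formation client, e.g.
    boto3.client("lakeformation"). The operation and response key names
    follow the preview API described in the post.
    """
    response = lf_client.begin_transaction()
    return response["TransactionId"]
```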


  1. Run the cell under Write the DynamicFrame into a Governed table to write the sample DynamicFrame into a new Lake Formation governed table.

A new governed table is automatically created. You don't have to create the table in advance on the Lake Formation console or via the AWS SDK.
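The write can be sketched as follows. The transactionId option key is an assumption based on the preview API, and the database name in the usage comment is hypothetical; glueContext.write_dynamic_frame.from_catalog is the standard AWS Glue write call.

```python
def governed_write_options(tx_id):
    # Ties the write to the open Lake Formation transaction.
    # The "transactionId" key is an assumption based on the preview API.
    return {"transactionId": tx_id}

# Usage sketch inside the Glue notebook (not runnable outside it):
#   glueContext.write_dynamic_frame.from_catalog(
#       frame=dyf,
#       database="lakeformation_tutorial",   # hypothetical database name
#       table_name="products",
#       additional_options=governed_write_options(tx_id))
```

Reads within the same transaction would pass the same option, which is why uncommitted writes are visible to their own transaction later in the walkthrough.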


  1. On the Lake Formation console, choose Tables.

You can see the products table on the Lake Formation console.


  1. On the Amazon S3 console, choose your data lake bucket.

You can see that new data has also been ingested into Amazon S3.


  1. Go back to JupyterLab, and run the cell under Get transaction ID on notebook1 to start another transaction.
  2. Run the cell under Read original table from notebook1 to read from the governed table.

You can see the data being read from the governed table using the transaction ID.


Next, let's open another notebook and run the cells to simulate another query user's activity.

  1. Open the browser tab of notebook2.ipynb.
  2. In the first cell, replace datalake-bucket with the bucket name you used for the CloudFormation template's DataLakeBucketName parameter.
  3. Run the first cell for initialization.
  4. Run the cell under Get transaction ID on notebook 2 to start a new Lake Formation transaction. This ensures that notebook2 uses a new transaction ID, within which all of its operations occur.
  5. Run the cell under Read original table from notebook 2 to read from the governed table you created in notebook1.ipynb. This read uses a new transaction ID, different from the one used in notebook1.

You can see the data being read from the governed table that you created in notebook1.


  1. Run the subsequent two cells under Notebook 2 – Write 2 additional rows to write two new records (product_id=00004 and product_id=00005) into the governed table.

Note that this transaction is not committed yet.


  1. Run the next cell to read from the governed table using the same transaction ID; the transaction is not committed yet.

You can see the uncommitted changes (product_id=00004 and product_id=00005) because you’re using the same transaction ID that you used in writing the records.


Then let’s move to notebook1.ipynb and see how the uncommitted changes look in the other user’s notebook.

  1. Switch to the browser tab of notebook1.ipynb.
  2. Run the cell under No change – waiting for notebook 2 to write to confirm that you don’t see the two new records yet.

This is because the transaction you performed in notebook2.ipynb hasn’t been committed.


Now let’s move to notebook2.ipynb to commit the ongoing transaction to see how it looks in notebook1.ipynb after the commit.

  1. Switch to the browser tab of notebook2.ipynb.
  2. Run the cell under Commit notebook 2 updates to commit the change you made in notebook2.ipynb.
  3. Switch to the browser tab of notebook1.ipynb.
  4. Run the cell under Notebook 2 committed but still notebook 1 pointing original transaction ID to confirm that you don’t see the two new records yet.

This is because you’re using the original transaction ID txId2 when reading the data. This ensures that you are isolated from changes happening outside of this transaction.

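The commit in notebook2 is a single API call. A minimal sketch, again assuming a boto3-style Lake Formation client and the preview operation name (the TransactionId parameter casing is an assumption in the boto3 style):

```python
def commit_lf_transaction(lf_client, tx_id):
    """Commit an open Lake Formation transaction (preview API sketch).

    `lf_client` is assumed to be a boto3 Lake Formation client; the
    commit_transaction operation name follows the preview API in the post.
    """
    return lf_client.commit_transaction(TransactionId=tx_id)
```

After this call, the two new rows become visible to any transaction that starts afterward, while transactions that began earlier keep their original snapshot.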

  1. Run the subsequent cells under Notebook 1 gets new transaction ID which reflects the changes from notebook 2 to commit txId2, start another new transaction txId3, and read the data committed in Notebook 2.

Now you can see the committed change.


  1. Switch to the browser tab of Notebook 2.
  2. Run the subsequent cells under Both notebooks using the most recent transaction ID to start another new transaction, and read the data.

Now you can see that both notebooks show the latest data when using the most recent transaction ID.


Cleaning up

Now to the final step, cleaning up the resources.

  1. Empty the Amazon S3 data lake bucket, then delete the bucket.
  2. Delete the CloudFormation stack.

The governed table you created is also automatically deleted when you delete the CloudFormation stack.

Conclusion

In this post, we explained how to use Lake Formation transactions and illustrated atomicity and isolation between two applications simultaneously reading from and writing to the same table. With Lake Formation transactions, you can achieve ACID transactions across multiple tables and statements and get consistent, up-to-date results.

Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) and US West (Oregon) AWS Regions. To get early access to these capabilities, sign up for the preview.


Appendix: Setting up resources via the console

When following the steps in this section, use the us-east-1 or us-west-2 Region because, as of this writing, this Lake Formation preview feature is available only in those two Regions.

Configuring IAM roles and users

First, you need to set up two IAM roles: one for AWS Glue ETL jobs and another for the Lake Formation data lake location.

IAM policies

To create the IAM policies for your roles, complete the following steps:

  1. On the IAM console, create a new policy for Amazon S3.
  2. Save the policy as S3DataLakePolicy, replacing <datalake-bucket> with your bucket name:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:DeleteObject"
                ],
                "Resource": ["arn:aws:s3:::<datalake-bucket>/*"]
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::<datalake-bucket>"]
            }
        ]
    }

  3. Create a new IAM policy named LFTransactionPolicy with the following statements:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "lakeformation:BeginTransaction",
                    "lakeformation:CommitTransaction",
                    "lakeformation:AbortTransaction",
                    "lakeformation:ExtendTransaction",
                    "lakeformation:GetTransaction",
                    "lakeformation:GetTransactions",
                    "lakeformation:UpdateTableObjects"
                ],
                "Resource": "*"
            }
        ]
    }

  4. Create a new IAM policy named LFLocationPolicy with the following statements:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "execute-api:Invoke",
                "Resource": "arn:aws:execute-api:*:*:*/*/POST/reportStatus"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "lakeformation:BeginTransaction",
                    "lakeformation:CommitTransaction",
                    "lakeformation:AbortTransaction",
                    "lakeformation:GetTableObjects",
                    "lakeformation:UpdateTableObjects"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTableVersions",
                    "glue:GetPartitions",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:UpdateTable"
                ],
                "Resource": "*"
            }
        ]
    }

  5. Create a new IAM policy named LFQueryPolicy with the following statements:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "execute-api:Invoke",
                "Resource": "arn:aws:execute-api:*:*:*/*/POST/reportStatus"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "lakeformation:BeginTransaction",
                    "lakeformation:CommitTransaction",
                    "lakeformation:AbortTransaction",
                    "lakeformation:ExtendTransaction",
                    "lakeformation:PlanQuery",
                    "lakeformation:GetTableObjects",
                    "lakeformation:GetQueryState",
                    "lakeformation:GetWorkUnits",
                    "lakeformation:Execute"
                ],
                "Resource": "*"
            }
        ]
    }

IAM role for AWS Glue development endpoint

Create a new IAM role for the AWS Glue development endpoints:

  1. On the IAM console, create the role GlueETLServiceRole with the following trust relationship for AWS Glue:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": ["glue.amazonaws.com"]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  2. Attach the following AWS managed policies:
    1. AWSGlueServiceRole
    2. AWSLakeFormationDataAdmin
  3. Attach the following customer managed policies:
    1. S3DataLakePolicy
    2. LFTransactionPolicy

IAM role for AWS Lake Formation data lake location

To create your IAM role for Lake Formation, complete the following steps:

  1. Create a new IAM role called LFRegisterLocationServiceRole with the following Lake Formation trust relationship:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": ["lakeformation.amazonaws.com"]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  2. Attach the following customer managed policies:
    1. S3DataLakePolicy
    2. LFLocationPolicy

IAM users

To create your IAM users, complete the following steps:

  1. Create an IAM user named DatalakeAdmin3.
  2. Attach the following AWS managed policies:
    1. AWSLakeFormationDataAdmin
    2. AmazonAthenaFullAccess
    3. AWSGlueConsoleFullAccess
    4. AWSGlueConsoleSageMakerNotebookFullAccess
    5. IAMReadOnlyAccess
  3. Attach the customer managed policy LFQueryPolicy.

Creating your AWS Glue development endpoint

To create your AWS Glue development endpoint, complete the following steps:

  1. On the AWS Glue console, choose Dev endpoints.
  2. Choose Add endpoint.
  3. For Development endpoint name, enter acid-transaction.
  4. For IAM role, choose GlueETLServiceRole.
  5. Under Security configuration, script libraries, and job parameters (optional), for Worker type, choose G.1X.
  6. Choose Next.
  7. For How do you want to provide the network information used to create the development endpoint?, leave as the default, and choose Next.
  8. Choose Next.
  9. Choose Finish.

Development endpoint creation can take up to 10 minutes to complete.

Creating your AWS Glue SageMaker notebook

To create your AWS Glue SageMaker notebook, complete the following steps:

  1. On the AWS Glue console, choose Notebooks.
  2. Under SageMaker notebooks, choose Create notebook.
  3. For Notebook name, enter acid-transactions-notebook.
  4. For Attach to development endpoint, choose acid-transaction.
  5. Select Create an IAM role.
  6. For IAM role, enter acid-transactions-notebook.
  7. Choose Create notebook.

Notebook creation can take up to 5 minutes to complete.


About the Author

Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue & Lake Formation. He loves distributed data processing systems, especially Apache Spark and Apache Hadoop. He enjoys working on launching new features by collaborating with talented engineers across multiple teams.