To grow and develop your business, you must collect data from a myriad of sources (such as relational and NoSQL databases, clickstream, and IoT events) and combine it using analytics to extract insight. AWS Lake Formation enables you to ingest data from many different sources into a data lake based in Amazon S3. Furthermore, you can use Lake Formation to control access to this data from a single place.
This post shows how to ingest data from Amazon RDS into a data lake on Amazon S3 using Lake Formation blueprints and how to have column-level access controls for running SQL queries on the extracted data from Amazon Athena.
A blueprint is a data management template that enables you to ingest data into a data lake easily. Lake Formation provides several blueprints, each for a predefined source type, such as a relational database or AWS CloudTrail logs. From a blueprint, you can create a workflow. Workflows consist of AWS Glue crawlers, jobs, and triggers that orchestrate the loading and update of data. Blueprints take the data source, data target, and schedule as input to configure the workflow.
For more information about Lake Formation, see AWS Lake Formation: How It Works.
A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.
Lake Formation makes it easy to set up a secure data lake quickly. The walkthrough contains the following steps:
- Register an S3 bucket as a data lake storage
- Create a logical database to hold the tables that represent your data
- Create a workflow for ingesting data from Amazon RDS for SQL Server to the data lake
- Grant permissions to access the tables in the data lake
To complete this walkthrough, you must have the following:
- An Amazon RDS for SQL Server up and running
- An IAM user with access to Lake Formation, Amazon S3, AWS Glue, and Athena
Registering your Amazon S3 storage
To register your S3 bucket, complete the following steps:
- On the AWS Lake Formation console, choose Get Started.
- Choose Register location.
- For Amazon S3 path, choose Browse.
- Choose the S3 bucket you designated for your data lake.
- Make sure that the S3 bucket you selected has the correct IAM permissions to work with Lake Formation.
- Choose Review location permissions.
- For IAM role, choose an IAM role designated to allow Lake Formation to read and write data in the S3 bucket.You have the option to let Lake Formation create an IAM role with the necessary permissions, or you can select an existing IAM role.
- Choose Register location.
- On the AWS Lake Formation console, under Register and ingest, choose Data lake locations.You can see your S3 bucket registered.
Creating a database
You are now ready to create a database to hold your data lake tables. Databases are logical and can be treated as namespaces. They are containers for the metadata tables that the AWS Glue Data Catalog stores. Databases can have an optional location property. This location needs to be within an Amazon S3 location that is registered with Lake Formation.
To create your database, complete the following steps:
- Choose Create database.
- For Name, enter a name for your database.
- For Location, enter the location of your S3 bucket.
- Choose Create database.
Creating a workflow to ingest data
Now that your data lake is set up, you are ready to ingest data. Lake Formation includes a feature called blueprints, which allows you to ingest data from MySQL, PostgreSQL, Oracle, and SQL Server databases into your data lake. Blueprints allow you to ingest data by copying a table or incrementally loading newly inserted data. When you use blueprints, they create an AWS Glue workflow and all relevant jobs and crawlers to enable the discovery and movement of data.
To create your workflow to ingest data, complete the following steps:
- On the Lake Formation console, under Register and ingest, choose Blueprints.
- In the Blueprint overview section, choose Use blueprints.
- For Blueprint type, select what type of data you want to ingest to your data lake.For this walkthrough, choose Database snapshot, because the source table is not very large. For large tables, an incremental database may be a better option.To ingest data to your data lake, you must define a database connection with parameters to allow AWS Glue to access the source database.
- In the Import source section, for Database connection, choose Create a connection in AWS Glue.This step redirects you to the AWS Glue console.
- On the Connections page, for Connection name, enter the name of your connection.
- For Connection type, choose your driver.You can choose between Amazon RDS, Amazon Redshift, and JDBC as the driver. For this walkthrough, choose JDBC, so your job can talk to both Amazon RDS and SQL Server on Amazon EC2 using the same connection configuration.
- Select if the connection requires SSL.
- Choose Next.
- For JDBC URL, you must follow the path structure mentioned in the AWS Glue documentation.Change the value MYRDSENDPOINT in the JDBC URL with the real Amazon RDS endpoint from your environment.
- Enter a username and password that has read permission to your database.
- Choose the VPC and Subnet where your database instance is running.
- Choose a Security Group that allows AWS Glue to connect to your database instance.
- Choose Next.
- On the Review all steps page, choose Finish.You return to the Lake Formation console.
- In the Import source section, choose the refresh button next to the Database connection drop-down menu.
- Choose the connection you just created.
- For Source data path, enter the database name.If you choose to only read a particular table, you can update this field to reflect the exact database, schema, and table you want. Wildcards are allowed.
- As an option, enter any exclude patterns to exclude any data you may not want to ingest into your data lake.For more information, see Include and Exclude Patterns.You are now ready to configure the target of the workflow.
- In the Import target section, for Target database, choose the database you just created.
- For Target storage location, enter the data lake location you created.
- For Data format, choose Parquet.For more information about Parquet, see the Apache Parquet website.
- For Import frequency, choose your import frequency.For this walkthrough, choose Run on demand.
- In the Import options section, for Workflow name, enter a name for the workflow.
- For IAM Role, choose an IAM Role that allows AWS Glue to call AWS services on your behalf.For instructions on setting up this IAM role, see documentation.
- For Table prefix, enter the table prefix the job uses when creating the final output table in the data lake.This post enters
sqlserveras a way to discover the tables the workflow ingests quickly.
- For Maximum capacity, enter the number of DPUs to allocate when the job runs.This post enters
- For Concurrency, enter the maximum number of concurrent runs that are allowed for this job.This post enters
- Choose Create.
- On the Lake Formation console, choose Blueprints.You can see the workflow you created under Workflows.
- Choose the workflow.
- From the Actions drop-down menu, choose Start.The process of ingestion and transformation varies based on the database size.
- Choose the workflow.
- Choose the run ID of the current run.This step redirects you to the AWS Glue console, where you can dig deeper into each task of the workflow.
Now that you have data in your data lake on Amazon S3 and metadata tables in your database on the Data Catalog, you can grant users permissions to access the data. Complete the following steps:
- Choose Grant permissions.
- For IAM users and roles, choose the IAM users or roles that are allowed to access the data.You have the option to grant access to a database, a set of tables within a database, or a specific set of columns within a table.
For this post, the login is federated through Active Directory Federation Services (ADFS). Active Directory Domain Services Groups (ADDS) manages access privileges, and users can select which IAM Role to assume.
This post assumes the IAM Role
ADFS-Marketing. It has access to the database
sqlserver_appdatabase01_dbo_customer. Furthermore, the role does not have permission to access the phone column.You can now verify that permissions are working as expected.
- Log in with a federated user assuming the
ADFS-Marketingrole.The following screenshot shows that the
phonecolumn isn’t available.
However, if you assume a role that has access to all of the specific table, you can view the phone column. See the following screenshot.
This post demonstrated how to quickly register an Amazon S3 data lake in Lake Formation, ingest data from an operational SQL Server database using blueprints and grant permissions at the column level. Not only this but, also show customers that is possible to reduce third-party tools license costs using AWS Services.
About the Author
Marcio Morales is a Senior Microsoft Specialist SA at AWS.
from AWS Big Data Blog: https://aws.amazon.com/blogs/big-data/integrating-aws-lake-formation-with-amazon-rds-for-sql-server/