A common challenge ETL and big data developers face is working with data files that don’t have proper name header records. They’re tasked with renaming the columns of the data files appropriately so that downstream application and mappings for data load can work seamlessly.
One example use case is while working with ORC files and Hive as a metadata store. ORC files are completely self-describing and contain the metadata information. Working with Hive can create challenges such as discrepancies with Hive metadata when exporting the files for downstream processing. For more information, see Hive does not store column names in ORC.
Another example is when a file contains the name header record but needs to rename column metadata based on another file of the same column length. Traditionally, you can use manual column renaming solutions while developing the code, like using Spark DataFrame’s withColumnRenamed method or writing a static ApplyMapping transformation step inside the AWS Glue job script. However, this solution has scalability challenges when you consider hundreds or thousands of different files that an enterprise solution developer might have to deal with and can be prone to manual errors (such as typos and incorrect order of mappings).
In this two-part post, I show how we can create a generic AWS Glue job to process data file renaming using another data file. I then show how can we use AWS Lambda, the AWS Glue Data Catalog, and Amazon Simple Storage Service (Amazon S3) Event Notifications to automate large-scale automatic dynamic renaming irrespective of the file schema, without creating multiple AWS Glue ETL jobs or Lambda functions for each file.
In this post, we cover creating the generic AWS Glue job. We use two Data Catalog tables for this purpose: the first table is the actual data file that needs the columns to be renamed, and the second table is the data file with column names that need to be applied to the first file.
Dataset and files
For this post, we use a dataset comprising of Medicare provider payment data: Inpatient Charge Data FY 2011. The data is available in CSV format. For our example, I have converted the data into an ORC file and renamed the columns to generic names (
_Col1, and so on). The second file, which is our name file, contains just the column name headers and a single row of data, so the type of data doesn’t matter for the purposes of this post.
The crawler has already run for these files, so the schemas of the files are available as tables in the Data Catalog. For more information about preparing the catalog tables, see Working with Crawlers on the AWS Glue Console.
The following screenshot shows the data file when queried from Amazon Athena. The columns need to be renamed.
The crawler created the table
sample1 in the database
The following screenshot shows the name file when queried from Athena. This has the column names, which needs to be applied to the data file.
The crawler created the preceding table
sample1namefile in the database
You can use any two files to follow along with this post, provided they have the same number of columns. In Part 2 of this series, we automate the process of crawling and cataloging the data.
For this walkthrough, you should have the following prerequisites:
- An AWS Account
- Knowledge of working with AWS Glue crawlers
- Knowledge of working with the AWS Glue Data Catalog
- Knowledge of working with AWS Glue ETL jobs and PySpark
- Knowledge of working with roles and policies using AWS Identity and Access Management (IAM)
- Knowledge of working with Amazon S3
- Optionally, knowledge of using Athena to query Data Catalog tables
Overview of solution
The following diagram showcases the overall solution steps and the integration points with AWS Glue and Amazon S3.
The workflow includes the following steps:
- An AWS Glue crawler crawls the data file and name file in Amazon S3.
- The crawler creates tables for the data file and name file in the Data Catalog.
- An AWS Glue job processes and renames the file.
- The job writes the renamed file to the destination S3 bucket.
Crawling the files and creating tables
Our walkthrough assumes that you already completed Steps 1–2 of the solution workflow, so your tables are registered in the Data Catalog and you have your data and name files in their respective buckets.
For more information about crawling the files, see Working with Crawlers on the AWS Glue Console.
The crawled files create tables in the Data Catalog. The name of the table is created based upon the last prefix of the file path. For example, the data file table is named
sample1, and the name file table is named
sample1namefile. This is important when we automate this solution in Part 2.
For this post, I use the following file paths:
- Data file path –
- Name file path –
- Renamed file path –
The following screenshot shows the cataloged tables.
Creating the AWS Glue job
Now let’s create the AWS Glue job that runs the renaming process. Although we use the specific file and table names in this post, we parameterize this in Part 2 to have a single job that we can use to rename files of any schema.
To create a new job, complete the following steps:
- On the AWS Glue console, choose Jobs.
- Choose Add job.
- For Job name, enter
- For IAM role¸ choose a role that has access to the Data Catalog database, tables, and Amazon S3 resources.
For more information about IAM roles, see Step 2: Create an IAM Role for AWS Glue.
- For Type, choose Spark.
- For Glue version, choose Spark 2.4, Python with improved startup times (Glue Version 2.0).
- For This job runs, select A new script authored by you.
- For Script file name, enter a name for your script file.
- For S3 path where the script is stored, enter the appropriate S3 path.
- For Temporary directory, enter the appropriate S3 path.
- Leave the other properties as their default.
We change the concurrency parameters and add job parameters in Part 2.
- Choose Next.
- Leave settings on the Connection page as their default, because we’re connecting to the Data Catalog tables.
- Choose Save job and edit script.
Now let’s walk through the script that you author, which is the heart of the file renaming process.
- Enter the following code:
- We now create two DynamicFrames from the Data Catalog tables:
In this post, we’re hardcoding the table names. We look at using the job arguments so the job can process any table in Part 2.
- To extract the column names from the files and create a dynamic renaming script, we use the schema() function of the dynamic frame.
Now that we have all the information ready, we generate the applymapping script dynamically, which is the key to making our solution agnostic for files of any schema, and run the generated command. ApplyMapping is an AWS Glue transform in PySpark that allows you to change the column names and data type.
- Generate the script with the following code:
We now have our new DynamicFrame ready with the correct column names applied. Note that the data types aren’t changed. We now write the DynamicFrame back to the S3 bucket in the destination location, where it can be picked up for further processing. Alternatively, you can choose to further transform the data as needed and then sink it into any of the destinations supported by AWS Glue, for example Amazon Redshift, directly.
- Enter the following script, providing your S3 destination bucket name and path:
- Save and exit the script editor.
Running the AWS Glue job
You’re now ready to run your job.
- On the AWS Glue console, on the Jobs page, select your job.
- On the Action menu, choose Run job.
The job creates the new file in the destination bucket of your choosing. After you create the file, you can run the AWS Glue crawler to catalog the file, and then you can analyze it with Athena, load it into Amazon Redshift, or perform additional actions. The file now has the required column names.
To avoid incurring future charges, delete the data in the S3 buckets.
With AWS Glue, you pay an hourly rate, billed by the second, for crawlers (discovering data) and ETL jobs (processing and loading data). If you’re not running an ETL job or crawler, you’re not charged. You can store up to a million objects in the Data Catalog for free.
Alternatively, you can delete the AWS Glue ETL job, Data Catalog tables, and crawlers.
In this post, we looked at one of the common problems that enterprise ETL developers have to deal with while working with data files, which is renaming columns. We looked at how we can use AWS Glue ETL jobs and Data Catalog tables to create a generic file renaming job. We also touched on how to use AWS Glue transforms for
In Part 2 of this series, we look at scaling this solution to automate this task
About the Author
Divyesh Sah is as a Sr. Enterprise Solutions Architect in AWS focusing on financial services customers, helping them with cloud transformation initiatives in the areas of migrations, application modernization, and cloud native solutions. He has over 18 years of technical experience specializing in AI/ML, databases, big data, containers, and BI and analytics. Prior to AWS, he has experience in areas of sales, program management, and professional services.