Whether you are building a data lake, a data analytics pipeline, or a simple data feed, you may have small volumes of data that need to be processed and refreshed regularly. This post shows how you can build and deploy a micro extract, transform, and load (ETL) pipeline to handle this requirement. In addition, you configure a reusable Python environment to build and deploy micro ETL pipelines using your own data source.
What’s a micro ETL pipeline? It’s a short process that you can schedule to handle a small volume of data. Sometimes you only need to ingest, transform, and load a subset of a larger dataset without using expensive and complex computational resources. This is where micro ETL comes to the rescue.
A micro ETL process is helpful when you deal with small data feeds that need to be refreshed regularly, such as daily currency exchange rates, hourly stock availability for a small category of products, 5-minute weather measurements for a city, daily test results, and so on.
Micro ETL processes work best with serverless architecture, which is why you use AWS Serverless Application Model (AWS SAM) for this solution.
AWS SAM is an open-source framework for building serverless applications. You can define your resources with just a few lines of YAML, which enables you to build serverless applications quickly.
Overview of the solution
The project includes a local environment to inspect the data, experiment, and deploy the stack using the AWS SAM CLI. The deployment includes a time-based event that triggers an AWS Lambda function. The function ingests and stores the raw data from an external source, transforms the content, and saves the clean information. The raw and clean data is stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following diagram illustrates our architecture.
The solution includes the following high-level steps:
- Download the code.
- Set up the working environment.
- Analyze the data with a Jupyter notebook.
- Inspect the function and the AWS SAM template.
- Build and deploy the ETL process.
For this walkthrough, you should have the following prerequisites:
- An AWS account
- The AWS Command Line Interface (AWS CLI) installed and configured
- Python (3.8 preferred)
- Conda or Miniconda
Downloading the code
You can download the code from GitHub or clone it with Git using the following command:
git clone https://github.com/aws-samples/micro-etl-pipeline.git
After that, step into the directory you just created.
Setting up the environment
The code comes with a preconfigured Conda environment, so you don’t need to spend time installing the dependencies. A Conda environment is a directory that contains a specific collection of Conda packages, in our case defined in the environment.yml file.
You can create the environment with the following code:
conda env create -f environment.yml
Then activate the environment:
conda activate aws-micro-etl
You can deactivate the environment with the following code:
conda deactivate
Analyzing the data with a Jupyter notebook
Jupyter notebooks are perfect for experimenting and collaboration. You can run a Jupyter notebook on AWS in many different ways, such as with Amazon SageMaker. In this post, we run the notebook locally.
After you activate the environment, you’re ready to launch your Jupyter notebook and follow the narrative text.
From the command line, start the notebook with the following code:
jupyter notebook
A browser window appears with a Jupyter dashboard opened into the root project folder.
Choose the file aws_mini_etl_sample.ipynb and follow the narrative.
This Jupyter notebook contains a sample micro ETL process. The ETL process uses publicly available data from the HM Land Registry, which contains the average price by property type series. Feel free to experiment and substitute the data source with your own.
The notebook provides some useful scenarios, such as:
- The possibility to make partial HTTP requests and fetch only a small portion of a larger file
- The ability to inspect and manipulate the data using pandas to achieve the right outcome
- The support of file types other than CSV
- A quick way to save a CSV file directly into an S3 bucket
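For the last scenario, pandas can write a CSV file straight to an S3 path when the s3fs package is installed. The sketch below writes locally so it runs anywhere; the data and the S3 URI in the comment are placeholders, not the notebook's actual values:

```python
import pandas as pd

# A tiny DataFrame standing in for the cleaned data (values are made up).
df = pd.DataFrame({'date': ['2021-04-01'], 'avg_price': [355000]})

# Write locally here; with the s3fs package installed, the same call
# accepts an S3 URI such as 's3://my-bucket/avg-price.csv' directly.
df.to_csv('avg-price.csv', index=False)
```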
Inspecting the function and the AWS SAM template
The project directory includes an additional folder called micro-etl-app, which contains our ETL process defined with the AWS SAM template, ready to be deployed as a Lambda function.
AWS SAM provides shorthand syntax to express functions, APIs, databases, and event source mappings. With just a few lines per resource, you can define the application you want and model it using YAML. During deployment, AWS SAM transforms and expands the AWS SAM syntax into AWS CloudFormation syntax, which enables you to build serverless applications faster.
The AWS SAM app is composed of three main files:
- template.yml – Contains the configuration to build and deploy the Lambda function
- app/app.py – Contains the code of our application coming from the Jupyter notebook
- app/requirements.txt – Contains the list of Python libraries needed for our function to run
Let’s go through them one by one.
The template.yml file contains all the details to build and deploy our ETL process, such as permissions, schedule rules, variables, and more.
The most important factor to consider in this type of micro application is allocating the right amount of memory and timeout to avoid latency issues or resource restrictions. Memory and timeout for the Lambda function are under the Globals statement.
Other important settings are defined inside the Properties statement. For instance, environment variables allow you to control settings such as the URL to fetch without redeploying the code.
Finally, a definition of a cron event is under the Events statement, which triggers the Lambda function every day at 8:00 AM.
For more information about scheduling cron expressions, see Cron Expressions.
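As an illustrative sketch (resource names and values here are placeholders, not the repository's exact template), the relevant parts of template.yml might look like the following:

```yaml
Globals:
  Function:
    Timeout: 60        # seconds
    MemorySize: 512    # MB

Resources:
  MicroEtlFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: app/
      Handler: app.lambda_handler
      Runtime: python3.8
      Environment:
        Variables:
          URL: https://example.com/data.csv   # placeholder data source
      Events:
        DailySchedule:
          Type: Schedule
          Properties:
            Schedule: cron(0 8 * * ? *)       # every day at 8:00 AM UTC
```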
The app.py file has an initial section to import the required dependencies, an area for environment variables and other supporting statements, and the main code inside the Lambda handler.
app.py contains comments that explain each statement, and you can see how the majority of the code comes from the Jupyter notebook.
Let’s see in detail the two most important statements used to fetch the data.
The first statement uses the requests library to fetch the last 2,000,000 bytes of our data source file, defined in the URL environment variable:
res = requests.get(URL, headers=range_header(-2000000), allow_redirects=True)
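The range_header helper isn't shown in the snippet above; a minimal implementation, assuming it simply builds a standard HTTP Range header (where a negative value requests the last N bytes of the file), could look like this:

```python
def range_header(byte_range):
    """Build an HTTP Range header for a partial request.

    A negative value means a suffix range (the last N bytes of the
    file); a non-negative value means 'from this byte offset onward'.
    """
    if byte_range < 0:
        return {'Range': f'bytes={byte_range}'}   # e.g. 'bytes=-2000000'
    return {'Range': f'bytes={byte_range}-'}      # e.g. 'bytes=1024-'
```

For example, range_header(-2000000) produces {'Range': 'bytes=-2000000'}, which tells the server to return only the tail of the file.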
The second statement creates a pandas DataFrame directly from the source stream and removes the first row with the skiprows parameter. It removes the first row because, when fetching a byte range, it's difficult to pinpoint the exact beginning of a row. Finally, the statement assigns predefined column headers, which are missing from this chunk of the file. See the following code:
df = pd.read_csv(io.StringIO(res.content.decode('utf-8')), engine='python', error_bad_lines=False, names=columns, skiprows=1)
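To see why skiprows and explicit column names matter when reading from the middle of a file, here is a small self-contained illustration (the chunk contents and column names below are made up for the example):

```python
import io
import pandas as pd

# Simulate a chunk fetched from the middle of a CSV file: the first
# line is almost certainly a truncated row, so we drop it with
# skiprows=1 and supply the column headers ourselves.
chunk = "04-01,Flat,2450\n2021-04-01,Detached,355000\n2021-05-01,Flat,250000\n"
columns = ['date', 'property_type', 'avg_price']

df = pd.read_csv(io.StringIO(chunk), names=columns, skiprows=1)
# df now holds only the two complete rows.
```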
The last file in the application is requirements.txt, which the AWS SAM CLI uses to build and package the dependencies needed for the function to work correctly. You can use additional Python libraries in your application, but remember to define them in the requirements.txt file as well.
Building and deploying the ETL process
You’re now ready to build and deploy the application using the AWS SAM CLI.
- From the command line, move inside the micro-etl-app folder.
- Run sam build to let the AWS SAM CLI process the template file and bundle the application code and any applicable dependencies.
- Run sam deploy --stack-name my-micro-etl --guided to deploy the process, providing and saving parameters for future deployments.
The deployment outputs the Lambda function ARN, which you can use to test the process.
You can invoke the function and inspect the log at the same time from the command line with the following code:
aws lambda invoke --function-name FUNCTION_ARN out --log-type Tail --query 'LogResult' --output text | base64 -d
The base64 utility is available on Linux, macOS, and Ubuntu on Windows. On macOS, use base64 -D instead of base64 -d.
Alternatively, you can invoke the function on the Lambda console and inspect the CloudWatch log group associated with it, which is named /aws/lambda/FUNCTION_NAME.
The last line of the log shows the URL for the generated file in the S3 bucket. It should look similar to the following code:
## FILE PATH s3://micro-etl-bucket-xxxxxxxx/avg-price-property-uk.csv
You can use the AWS CLI to inspect the content of the file and see that it contains only rows from the byte range requested by the function:
aws s3 cp s3://micro-etl-bucket-xxxxxxxx/avg-price-property-uk.csv local_file.csv
From here, you can extend the solution in various ways, such as the following:
- Create a REST API as an Amazon S3 proxy in API Gateway
- Share an object with others securely
- Query the data in place or build pipelines that extract data from multiple data sources
Cleaning up
To avoid incurring future charges, delete the resources by running the following command:
aws cloudformation delete-stack --stack-name my-micro-etl
The preceding command removes all the resources created during this post, S3 bucket included. Be careful when running this command and be absolutely sure to point to the right stack. Alternatively, you can delete the stack on the AWS CloudFormation console.
In addition, you can deactivate the Conda environment with the following code:
conda deactivate
Conclusion
In this post, you saw how quick and easy it is to build and deploy a cost-effective infrastructure to manage and transform a small amount of data regularly.
The solution provided can be another tool in your toolbox and useful when working with multiple available data sources.
AWS offers many other ways to build a secure and agile serverless architecture. In addition, you can extend this micro pipeline with analytics services like AWS Glue, Amazon Athena, and more. Finally, you can connect multiple sources and deliver useful dashboards or reports with Amazon QuickSight.