This post was written by Dzenan Softic and Sam Dengler.
Many organizations rely on Apache Airflow, an open source project, to orchestrate their data pipelines. In 2020, Amazon Web Services (AWS) released Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which lets engineers focus on business solutions rather than on running and maintaining infrastructure for Airflow. Apache Airflow is written in Python, so developers can use Python's rich ecosystem of libraries or write their own.
Development teams commonly create in-house libraries and host them in private repositories. AWS CodeArtifact is a fully managed software artifact repository service that makes it easier to securely store, publish, and share packages. With CodeArtifact, you can also connect to a public repository, such as PyPI, to consume open source libraries.
In this post, we demonstrate how to use a CodeArtifact repository with Apache Airflow. We focus on Amazon MWAA, but the same approach can be applied to self-hosted Apache Airflow on AWS.
Amazon MWAA is deployed to private subnets across two Availability Zones. In this example solution, Amazon MWAA has no internet access and uses VPC endpoints to communicate with other AWS services. Amazon MWAA fetches directed acyclic graphs (DAGs) and a requirements file from an Amazon Simple Storage Service (Amazon S3) bucket. It connects to an AWS CodeArtifact private repository to install required Python packages. This repository is configured with an external connection to the public PyPI repository, which lets it fetch open source packages.
To connect to CodeArtifact, the pip index-url is constructed from the repository URL and an authorization token. Because a CodeArtifact authorization token is valid for a maximum of 12 hours, we need a way to refresh it automatically. We use an AWS Lambda function to obtain a new authorization token and update the index-url, and we trigger it every 10 hours using Amazon CloudWatch Events. During initial infrastructure provisioning, the Lambda function is invoked via an AWS CloudFormation custom resource.
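The token refresh described above can be sketched roughly as follows. This is a minimal illustration, not the Lambda function from the project: the function names `build_index_url` and `fetch_index_url` are hypothetical, and `client` is assumed to behave like `boto3.client("codeartifact")`. The sketch only builds the URL; writing it to the file that Amazon MWAA reads is left out.

```python
from urllib.parse import quote


def build_index_url(repository_endpoint: str, auth_token: str) -> str:
    """Embed the token as basic-auth credentials in the pip index URL."""
    scheme, _, rest = repository_endpoint.partition("://")
    # CodeArtifact uses the fixed user name "aws"; the token acts as the password.
    return f"{scheme}://aws:{quote(auth_token, safe='')}@{rest.rstrip('/')}/simple/"


def fetch_index_url(client, domain: str, domain_owner: str, repository: str) -> str:
    """Request a fresh token and the PyPI endpoint, then build the index URL.

    `client` is assumed to expose the same methods as boto3's CodeArtifact client.
    """
    token = client.get_authorization_token(
        domain=domain,
        domainOwner=domain_owner,
        durationSeconds=43200,  # 12 hours, the maximum token lifetime
    )["authorizationToken"]
    endpoint = client.get_repository_endpoint(
        domain=domain,
        domainOwner=domain_owner,
        repository=repository,
        format="pypi",
    )["repositoryEndpoint"]
    return build_index_url(endpoint, token)
```

Running such a function every 10 hours leaves a two-hour margin before the previous 12-hour token expires.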
This architecture does not require Amazon MWAA to have access to the public internet to fetch libraries from PyPI, so we don’t need to provision a pair of NAT gateways in our VPC. It also means that we can use a single private repository for both in-house and public open source libraries.
You can deploy this solution from a local machine. You need the following prerequisites:
- An AWS account
- npm package manager
- AWS Command Line Interface (AWS CLI)
- AWS Cloud Development Kit (AWS CDK) version 1.102.0
- Python version 3.6 or higher
Project setup and deployment
To get started, clone the GitHub repository to a local machine:
This repository contains multiple projects, so we must navigate to the correct folder:
Create a Python virtual environment:
This rule will create a virtual environment in infra/venv and install all required dependencies for the project. Before we can deploy, we must set environment variables in .env for AWS CDK. Edit the .env file with an AWS Region of your choice and a unique Amazon S3 bucket name:
You can choose between two supported versions of Apache Airflow on Amazon MWAA: 1.10.12 or 2.0.2.
We are now ready to deploy. To do that, run:
The AWS CDK CLI will ask for permission to deploy specific resources, so acknowledge by typing y in your terminal and pressing Enter. Deployment can take up to 30 minutes. You can track the deployment status via the CLI or in the AWS Management Console.
Once deployment has finished, we can investigate whether the provisioned Amazon MWAA environment successfully connected to the CodeArtifact repository to install preferred packages in
If you look more closely at requirements.txt, the first line points to codeartifact.txt, which should contain the correct --index-url to a private PyPI repository in CodeArtifact. It tells pip to install packages from the CodeArtifact repository, in this case the numpy library. The Lambda function generated --index-url during the deployment phase, and will update it with a new authorization token every 10 hours:
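As a rough sketch of how the two files could fit together, the snippet below renders their contents. The helper names are hypothetical, the index URL is a placeholder, and the include path `codeartifact.txt` is an assumption: the actual path depends on where the file is placed in the Amazon MWAA environment. The `-r` line is pip's standard way of including one requirements file from another.

```python
from typing import Sequence


def render_codeartifact_txt(index_url: str) -> str:
    """Contents of the file that carries only the pip index option."""
    return f"--index-url {index_url}\n"


def render_requirements_txt(include_path: str, packages: Sequence[str]) -> str:
    """First line includes the index file; the remaining lines list packages."""
    lines = [f"-r {include_path}"] + list(packages)
    return "\n".join(lines) + "\n"


# Placeholder URL for illustration only; a real one embeds the current token.
print(render_codeartifact_txt("https://aws:<token>@<repository-endpoint>/simple/"), end="")
print(render_requirements_txt("codeartifact.txt", ["numpy"]), end="")
```

Keeping the token in a separate file means that only that file changes on each refresh, while requirements.txt stays stable.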
Navigate to Amazon MWAA in the AWS Management Console and open the mwaa_codeartifact_env environment that we provisioned. We will now inspect the Airflow scheduler logs to confirm that it connected to the CodeArtifact repository to install numpy. Navigate to Monitoring and open the Airflow scheduler log group.
From the scheduler logs, we can observe that it connected to the CodeArtifact repository with the authorization token to download and install numpy. You can also open the Airflow UI from the AWS Management Console and try to run example_dag, which prints the
You can also navigate to CodeArtifact to verify that the numpy package was fetched and is available in the repository.
Add new Python dependencies
Install preferred Python dependencies to an Amazon MWAA environment by updating requirements.txt. To make these changes take effect, you must upload requirements.txt to an Amazon S3 bucket and update the Amazon MWAA environment with the new file version. You can do it in the AWS Management Console or via the AWS CLI.
Add a library of your choice and run the following to upload requirements.txt to Amazon S3:
requirements.txt versions, run:
Finally, update the Amazon MWAA environment with the latest version:
If you build your own Python packages, you can publish those to the same CodeArtifact repository and update the Amazon MWAA environment as a part of a release pipeline.
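In a release pipeline, the upload and update steps could be scripted along these lines. This is a sketch under stated assumptions rather than the project's own tooling: `publish_requirements` is a hypothetical helper, and the two client arguments are assumed to behave like `boto3.client("s3")` and `boto3.client("mwaa")`. It relies on the bucket having versioning enabled, so that the upload response carries a version ID.

```python
def publish_requirements(
    s3_client,
    mwaa_client,
    bucket: str,
    key: str,
    environment_name: str,
    body: bytes,
) -> str:
    """Upload a requirements file and point the environment at the new version."""
    response = s3_client.put_object(Bucket=bucket, Key=key, Body=body)

    # S3 returns a VersionId only when versioning is enabled on the bucket.
    version_id = response.get("VersionId")
    if not version_id:
        raise ValueError(f"Bucket {bucket!r} did not return a version ID")

    # Updating the environment makes Amazon MWAA pick up the new file version.
    mwaa_client.update_environment(
        Name=environment_name,
        RequirementsS3Path=key,
        RequirementsS3ObjectVersion=version_id,
    )
    return version_id
```

Passing the clients in as arguments keeps the function easy to exercise with stand-ins before pointing it at a real account. With boto3 installed, a call might look like `publish_requirements(boto3.client("s3"), boto3.client("mwaa"), bucket, "requirements.txt", "mwaa_codeartifact_env", data)`, where `bucket` and `data` are your own values.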
Once you are finished exploring this solution, you can clean up the account to avoid unnecessary cost. To delete all resources associated with this blog post, run the following command:
In this post, we demonstrated how to integrate Amazon MWAA with AWS CodeArtifact for Python dependencies.
We created a private CodeArtifact repository that can be used for both in-house and public libraries. We also experimented with VPC endpoints, AWS Lambda, and Amazon CloudWatch Events.
Finally, we deployed the infrastructure with AWS CDK.
You can find the source code from this post on GitHub and use it as a basis to build your own solution. If you have any questions or suggestions, please comment on the blog or open an issue in the GitHub repository.