This post was written by Dzenan Softic and Sam Dengler.

Many organizations rely on Apache Airflow, an open source project, to orchestrate their data pipelines. In 2020, Amazon Web Services (AWS) released Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which lets engineers focus on business solutions rather than on running and maintaining infrastructure for Airflow. Apache Airflow is written in Python, letting developers use its rich ecosystem of libraries or even write their own.

Development teams commonly create in-house libraries and host them in private repositories. AWS CodeArtifact is a fully managed software artifact repository service that makes it easier to securely store, publish, and share packages. A CodeArtifact repository can also connect to a public repository, such as PyPI, to consume open source libraries.

In this post, we demonstrate how to use a CodeArtifact repository with Apache Airflow. We focus on Amazon MWAA, but the same approach can be applied to self-hosted Apache Airflow on AWS.

Solution overview

Amazon MWAA is deployed to private subnets across two Availability Zones. In this example solution, Amazon MWAA has no internet access and uses VPC endpoints to communicate with other AWS services. Amazon MWAA fetches directed acyclic graphs (DAGs) and a requirements file from an Amazon Simple Storage Service (Amazon S3) bucket. It connects to an AWS CodeArtifact private repository to install the required Python packages. This repository is configured with an external connection to the public PyPI repository, so open source packages can be fetched through it.

To connect to CodeArtifact, the index-url for pip is constructed from the repository URL and an authorization token. Because the CodeArtifact authorization token is valid for a maximum of 12 hours, we need a way to refresh the token automatically. We use an AWS Lambda function to obtain a new authorization token and update the index-url, and trigger it to run every 10 hours using Amazon CloudWatch Events, so a fresh token is always in place before the previous one expires. During initial infrastructure provisioning, the Lambda function is invoked via an AWS CloudFormation custom resource.
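The refresh step described above can be sketched roughly as follows. This is a minimal illustration, not the Lambda function from the sample repository: the function names `build_index_url` and `refresh_index_url` and their parameters are hypothetical, and the clients are passed in so the logic can be exercised without AWS access. The calls `get_authorization_token`, `get_repository_endpoint`, and `put_object` are the boto3 operations one would typically use for this.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# 12 hours, the maximum CodeArtifact token lifetime mentioned in the post.
TOKEN_TTL_SECONDS = 12 * 60 * 60


def build_index_url(repository_endpoint: str, token: str) -> str:
    """Embed the token as the password of user 'aws' and append /simple/."""
    parts = urlsplit(repository_endpoint)
    # Percent-encode the token defensively so it cannot break the URL.
    netloc = f"aws:{quote(token, safe='')}@{parts.netloc}"
    path = parts.path.rstrip("/") + "/simple/"
    return urlunsplit((parts.scheme, netloc, path, "", ""))


def refresh_index_url(codeartifact, s3, *, domain, domain_owner, repository,
                      bucket, key="dags/codeartifact.txt"):
    """Fetch a fresh token and overwrite the pip options file in Amazon S3.

    `codeartifact` and `s3` are boto3-style clients; all other arguments
    are illustrative names chosen for this sketch.
    """
    token = codeartifact.get_authorization_token(
        domain=domain,
        domainOwner=domain_owner,
        durationSeconds=TOKEN_TTL_SECONDS,
    )["authorizationToken"]
    endpoint = codeartifact.get_repository_endpoint(
        domain=domain,
        domainOwner=domain_owner,
        repository=repository,
        format="pypi",
    )["repositoryEndpoint"]
    index_url = build_index_url(endpoint, token)
    body = f"--index-url {index_url}\n".encode()
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return index_url
```

In a Lambda handler, the two clients would typically be created with `boto3.client("codeartifact")` and `boto3.client("s3")`. Running the function every 10 hours against a 12-hour token leaves a two-hour margin for retries.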

This architecture does not require Amazon MWAA to have access to the public internet to fetch libraries from PyPI, so we don’t need to provision a pair of NAT gateways in our VPC. It also means that we can use a single private repository for both in-house and public open source libraries.

solution architecture described in blog post

Walkthrough

You can deploy this solution from a local machine.

Prerequisites

Project setup and deployment

To get started, clone the GitHub repository to a local machine:

bash $ > git clone git@github.com:aws-samples/amazon-mwaa-examples.git

This repository contains multiple projects, so we must navigate to the correct folder:

bash $ > cd amazon-mwaa-examples/usecases/mwaa-with-codeartifact

Create a Python virtual environment:

bash $ > make venv

This make target creates a virtual environment in infra/venv and installs all required dependencies for the project. Before we can deploy, we must set environment variables in .env for AWS CDK. Edit the .env file with an AWS Region of your choice and a unique Amazon S3 bucket name:

AWS_REGION=eu-west-1
BUCKET_NAME=my-unique-bucket-name
AIRFLOW_VERSION=2.0.2

You can choose between two supported versions of Apache Airflow on Amazon MWAA: 1.10.12 or 2.0.2.
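Before deploying, it can help to check the .env file for missing keys or an unsupported version. The following is a small sketch of such a check, not part of the sample project: the function names are hypothetical, and the parser only handles plain KEY=VALUE lines like the ones shown above.

```python
# Versions named in the post as supported by Amazon MWAA at the time.
SUPPORTED_AIRFLOW_VERSIONS = {"1.10.12", "2.0.2"}
REQUIRED_KEYS = ("AWS_REGION", "BUCKET_NAME", "AIRFLOW_VERSION")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


def validate_settings(settings: dict) -> list:
    """Return a list of human-readable problems; empty means it looks fine."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not settings.get(key)]
    version = settings.get("AIRFLOW_VERSION")
    if version and version not in SUPPORTED_AIRFLOW_VERSIONS:
        supported = ", ".join(sorted(SUPPORTED_AIRFLOW_VERSIONS))
        problems.append(f"unsupported AIRFLOW_VERSION {version} (expected one of: {supported})")
    return problems
```

For example, `validate_settings(parse_env(open(".env").read()))` returns an empty list when all three settings are present and the version is one of the two listed.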

We are now ready to deploy. To do that, run:

bash $ > make deploy

The AWS CDK CLI will ask for permission to deploy specific resources, so acknowledge by typing y in your terminal and pressing Enter. Deployment can take up to 30 minutes. You can track the deployment status via CLI or in the AWS Console.

console showing status create_complete in green

Once deployment has finished, we can check whether the provisioned Amazon MWAA environment successfully connected to the CodeArtifact repository to install the packages listed in requirements.txt.

// mwaa-with-codeartifact/mwaa-ca-bucket-content/requirements.txt
-r /usr/local/airflow/dags/codeartifact.txt
numpy==1.20.3

If you look more closely at requirements.txt, the first line points to codeartifact.txt, which should contain the correct --index-url for the private PyPI repository in CodeArtifact. It tells pip to install packages (in this case, the numpy library) from the CodeArtifact repository. The Lambda function generated the --index-url during the deployment phase, and will update it with a new authorization token every 10 hours:

// YOUR_S3_BUCKET/dags/codeartifact.txt
--index-url https://aws:AUTH_TOKEN@DOMAIN-ACCOUNT_ID.d.codeartifact.REGION.amazonaws.com/pypi/mwaa_repo/simple/
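To inspect the generated file without printing the token, the line can be split into its parts. The helper below is a hypothetical sketch that assumes the file holds a single line of the form `--index-url URL`, with the token carried as the password of the user `aws`. It reports only whether a token is present, not its value.

```python
from urllib.parse import urlsplit


def describe_index_url(line: str) -> dict:
    """Summarize a '--index-url URL' line without exposing the token."""
    option, _, url = line.strip().partition(" ")
    if option != "--index-url" or not url.strip():
        raise ValueError("expected a line of the form '--index-url URL'")
    parts = urlsplit(url.strip())
    return {
        "host": parts.hostname,
        "user": parts.username,
        "has_token": bool(parts.password),  # never return the secret itself
        "path": parts.path,
    }
```

A healthy file should yield the user `aws`, a host ending in `amazonaws.com`, a token that is present, and a path ending in `/simple/`.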

Navigate to Amazon MWAA in the AWS Management Console and open the mwaa_codeartifact_env environment that we provisioned. We will now inspect Airflow scheduler logs to confirm that it connected to the CodeArtifact repository to install numpy. Navigate to Monitoring and open the Airflow scheduler log group.

airflow scheduler log group airflow-mwaa_codeartifact_env-Scheduler in blue as a clickable link

From the scheduler logs, we can observe that it connected to the CodeArtifact repository with the authorization token to download and install numpy. You can also open the Airflow UI from the AWS Management Console and try to run example_dag, which prints the numpy array.

message in console log events showing successfully installed numpy-1.20.3
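The same check can be scripted once the scheduler log messages have been retrieved, for example from the airflow-mwaa_codeartifact_env-Scheduler log group in Amazon CloudWatch Logs. The sketch below assumes the messages are already available as plain strings and looks for the "Successfully installed" line that pip prints. The function name is hypothetical, and the surrounding log format may differ from what is assumed here.

```python
import re

# pip reports installed distributions as "Successfully installed name-version ...".
INSTALLED_RE = re.compile(r"Successfully installed (?P<packages>.+)")


def installed_packages(messages) -> dict:
    """Map package name to version for every pip success line found."""
    found = {}
    for message in messages:
        match = INSTALLED_RE.search(message)
        if not match:
            continue
        for item in match.group("packages").split():
            # Split on the last hyphen so names like "foo-bar" stay intact.
            name, _, version = item.rpartition("-")
            if name:
                found[name] = version
    return found
```

Checking `installed_packages(messages).get("numpy")` against the version pinned in requirements.txt confirms that the expected release was installed.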

Also, you can navigate to CodeArtifact to verify that the numpy package is fetched and available in the repository.

numpy listed under packages

Add new Python dependencies

To install additional Python dependencies in an Amazon MWAA environment, update requirements.txt. For the changes to take effect, you must upload requirements.txt to the Amazon S3 bucket and update the Amazon MWAA environment with the new file version. You can do this in the AWS Management Console or via the AWS CLI.

Add a library of your choice and run the following to upload requirements.txt to Amazon S3:

bash $ > aws s3 cp mwaa-ca-bucket-content/requirements.txt s3://YOUR-BUCKET-NAME/

To get requirements.txt versions, run:

bash $ > aws s3api list-object-versions --bucket YOUR-BUCKET-NAME --prefix requirements.txt

Finally, update the Amazon MWAA environment with the latest version:

bash $ > aws mwaa update-environment --name mwaa_codeartifact_env --requirements-s3-object-version OBJECT_VERSION
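The last two commands can also be combined in a short script, which is convenient in a release pipeline. This is a sketch under assumptions: the function names are hypothetical, the clients are boto3-style objects passed in by the caller, and the bucket is assumed to have versioning enabled. `list_object_versions` and `update_environment` are the boto3 counterparts of the CLI commands above.

```python
def latest_version_id(s3, bucket: str, key: str = "requirements.txt") -> str:
    """Return the VersionId of the current version of `key` in `bucket`."""
    # A single call returns up to 1,000 versions; pagination is omitted here.
    response = s3.list_object_versions(Bucket=bucket, Prefix=key)
    for version in response.get("Versions", []):
        # Prefix matching can return other keys, so compare the key exactly.
        if version["Key"] == key and version["IsLatest"]:
            return version["VersionId"]
    raise LookupError(f"no current version of {key!r} found in bucket {bucket!r}")


def apply_latest_requirements(s3, mwaa, bucket: str,
                              environment: str = "mwaa_codeartifact_env") -> str:
    """Point the Amazon MWAA environment at the newest requirements.txt."""
    version_id = latest_version_id(s3, bucket)
    mwaa.update_environment(
        Name=environment,
        RequirementsS3ObjectVersion=version_id,
    )
    return version_id
```

With real clients this would be called as `apply_latest_requirements(boto3.client("s3"), boto3.client("mwaa"), "YOUR-BUCKET-NAME")`. The environment update itself still takes some time to complete after the call returns.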

If you build your own Python packages, you can publish those to the same CodeArtifact repository and update the Amazon MWAA environment as a part of a release pipeline.

Cleaning up

Once you are finished exploring this solution, you can clean up the account to avoid unnecessary cost. To delete all resources associated with this blog post, run the following command:

bash $ > make destroy

Conclusion

In this post, we demonstrated how to integrate Amazon MWAA with AWS CodeArtifact for Python dependencies.

We created a private CodeArtifact repository that can be used for both in-house and public libraries. We also experimented with VPC endpoints, AWS Lambda, and Amazon CloudWatch Events.

Finally, we deployed the infrastructure with AWS CDK.

You can find the source code from this post on GitHub and use it as a basis to build your own solution. If you have any questions or suggestions, please comment on the blog or open an issue in the GitHub repository.

Further reading

Sam Dengler

Sam Dengler is a Principal Serverless Solutions Architect for AWS, focused on the Serverless platform. Sam is responsible for helping customers design and operate Serverless applications and event-driven solutions using services like Lambda, API Gateway, EventBridge, SNS, and SQS. He is a regular speaker at AWS Summits, re:Invent, and various tech events. Sam holds a Bachelor of Science and Masters of Computer Science from NC State University.

Dzenan Softic

Dzenan Softic is a Solutions Architect at AWS. He works with startups to help them define and execute their ideas. Previously, his main focus was in data engineering and infrastructure.
