Amazon Studios invests in content that drives global growth of Amazon Prime Video and IMDb TV. Amazon Studios has a number of internal-facing applications that aim to streamline end-to-end business processes and information workflows for the entire content creation lifecycle. The Amazon Studios Data Infrastructure (ASDI) is a centralized, curated, and secure data lake that stores data, both in its original form and processed for analysis and machine learning (ML). The centralized ASDI is essential to break down data silos and combine different types of analytics, thereby allowing Amazon Studios to gain valuable insights, guide better business decisions, and innovate using the latest ML concepts.

What are the primary goals for Amazon MWAA adoption?

Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed service that makes it easier to run open-source versions of Apache Airflow on AWS. Builders at Amazon.com are engineering Amazon MWAA Directed Acyclic Graphs (DAGs) with two requirements: grant least-privilege access to the underlying services and resources, and restrict the blast radius of any given task.

Apache Airflow connections provide mechanisms for securely accessing resources during DAG execution and are intended for coarse-grained access. Fine-grained access requires different mechanisms for implementation and for code review prior to deployment. Codifying the infrastructure and stitching multiple systems together is an additional challenge that can introduce redundant work when implementing fine-grained access patterns in Airflow.

How did Amazon achieve this goal?

The goal is to enforce security at the lowest possible granularity, which for a DAG is the individual task. The solution integrates Amazon MWAA task security with AWS Identity and Access Management (IAM) and AWS Security Token Service (AWS STS). The engineers customized the existing Airflow PythonOperator to tightly couple each task's access requirements to a separately deployed IAM role. The customized operator uses AWS STS to assume the associated IAM role, and the temporary session that AWS STS returns is used within the PythonOperator to access the resources required to run the task.
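As a minimal sketch of the role-assumption step, the temporary credentials could be obtained roughly as follows. The helper name assume_task_role and its defaults are assumptions made for illustration and are not taken from the post's repository; the STS client is passed in so the function can be exercised without AWS access.

```python
from typing import Any, Dict


def assume_task_role(
    sts_client: Any,
    role_arn: str,
    session_name: str,
    duration_seconds: int = 900,
) -> Dict[str, str]:
    """Assume an IAM role and return keyword arguments for boto3.Session.

    sts_client is expected to behave like boto3.client("sts").
    The function name and defaults are illustrative assumptions.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,  # 900 seconds is the STS minimum
    )
    credentials = response["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }
```

Inside a task, the result could be unpacked into a session, for example boto3.Session(**assume_task_role(boto3.client("sts"), role_arn, "process-task")), so that every client created from that session carries only the task role's permissions.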

In this post, we discuss how to strengthen security in Amazon MWAA with role-based access control.

Prerequisites

To implement this solution, complete the following prerequisites:

  1. Create an AWS account with admin access.
  2. Create an Amazon MWAA environment.
    1. Note down the execution role ARN associated with the Amazon MWAA environment. This is available in the Permissions section of the environment.


  3. Create two Amazon Simple Storage Service (Amazon S3) buckets:
    1. s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/
    2. s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/
  4. Create two IAM roles, one for each bucket:
    1. write_access_processed_bucket with the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/*"
    }
  ]
}

    2. write_access_published_bucket with the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/*"
    }
  ]
}

  5. Update the trust relationship for the preceding two roles so that they trust the Amazon MWAA execution role obtained from the Amazon MWAA environment page. Note that the ARN of an assumed-role session uses the sts service prefix:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:sts::<AWS_ACCOUNT_ID>:assumed-role/<MWAA-EXECUTION_ROLE>/AmazonMWAA-airflow"
        ],
        "Service": "s3.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

In the preceding policy, replace <AWS_ACCOUNT_ID> and <MWAA-EXECUTION_ROLE> with your AWS account number and the name of the Amazon MWAA execution role, respectively.
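The two bucket write policies above differ only in the bucket name, so they could be generated rather than hand-edited. The following is a small sketch under that assumption; the function name bucket_write_policy and the sample account ID and Region are placeholders, not values from this post.

```python
import json
from typing import Any, Dict


def bucket_write_policy(account_id: str, region: str, suffix: str) -> Dict[str, Any]:
    """Build the write-only policy document for one of the two buckets.

    suffix is "processed" or "published", matching the bucket names above.
    """
    bucket = f"{account_id}-{region}-mwaa-{suffix}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and Region; replace with your own values.
    for suffix in ("processed", "published"):
        policy = bucket_write_policy("111122223333", "us-east-1", suffix)
        print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console or fed to whatever provisioning tool you use.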

Run the DAG

The proposed DAG has two tasks, each of which accesses one of the buckets created earlier:

  • Process task – Mocks a transformation using the Python sleep() function, then adds a control file with the current timestamp to the processed S3 bucket.
  • Publish task – Mocks a similar transformation using the Python sleep() function, then adds a control file with the current timestamp to the published S3 bucket.
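A task callable of this shape might look like the following sketch. The function name run_mock_task, the default sleep duration, and the timestamp layout are assumptions; the layout is inferred from the sample control files shown later in this post, and the S3 client is injected so the function can run without AWS access.

```python
import json
import time
from datetime import datetime
from typing import Any


def run_mock_task(
    s3_client: Any,
    bucket: str,
    stage: str,
    sleep_seconds: float = 5.0,
) -> str:
    """Mock a transformation, then write a control file with a timestamp.

    s3_client is expected to behave like boto3.client("s3") or
    session.client("s3"); stage is "processed" or "published".
    Returns the object key of the control file.
    """
    time.sleep(sleep_seconds)  # stands in for real transformation work

    # Assumed timestamp layout, inferred from the sample control files.
    timestamp = datetime.now().strftime("%m/%d/%Y %H:%M:%S")
    key = f"control_file/{stage}.json"
    body = json.dumps({f"{stage}_dt": timestamp})
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key
```

Because the client is a parameter, the same function serves both tasks; only the bucket, the stage name, and the credentials behind the client change.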

The fine-grained access restriction is enforced by a custom implementation of a widely used Airflow operator: PythonOperator. The custom PythonOperator calls AWS STS to assume the task's IAM role and obtain a temporary session. The task's callable uses this session exclusively to access the underlying AWS resources. The following diagram shows the sequence of events.

[Figure: sequence diagram of the custom PythonOperator obtaining a session from AWS STS for the task's callable]

The source code for the preceding implementation is available in the mwaa-rbac-task GitHub repository.
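The repository holds the actual implementation. As a rough illustration of the same idea, the sketch below wraps a task callable instead of subclassing the operator, which is a simplification and not the repository's approach. The name with_role_session, the session name "mwaa-task", and both factory parameters are hypothetical.

```python
from typing import Any, Callable, Dict


def with_role_session(
    python_callable: Callable[..., Any],
    role_arn: str,
    sts_client_factory: Callable[[], Any],
    session_factory: Callable[..., Any],
) -> Callable[..., Any]:
    """Wrap a task callable so it receives a session scoped to role_arn.

    sts_client_factory should return an object like boto3.client("sts");
    session_factory should accept the same keywords as boto3.Session.
    """

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        response = sts_client_factory().assume_role(
            RoleArn=role_arn,
            RoleSessionName="mwaa-task",  # hypothetical session name
        )
        credentials: Dict[str, Any] = response["Credentials"]
        session = session_factory(
            aws_access_key_id=credentials["AccessKeyId"],
            aws_secret_access_key=credentials["SecretAccessKey"],
            aws_session_token=credentials["SessionToken"],
        )
        # Hand the callable the role-scoped session to use in place of
        # the execution role's default credentials.
        return python_callable(*args, session=session, **kwargs)

    return wrapper
```

With boto3, the factories would be lambda: boto3.client("sts") and boto3.Session, and the wrapped function could be passed as python_callable to a standard PythonOperator. Because Airflow may pass context values as keyword arguments, the task callable should accept **kwargs in addition to session.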

The code base is set up in the following location in Amazon S3, as seen from the Amazon MWAA environment on the Amazon MWAA console.

[Screenshot: DAG code location in Amazon S3, as shown on the Amazon MWAA console]

Run the DAG and monitor its progress, as shown in the following screenshot.

[Screenshot: DAG run progress]

After you run the DAG, the following files are created with timestamps updated:

  • s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/control_file/processed.json { "processed_dt": "03/05/2021 01:03:58" }

  • s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/control_file/published.json { "published_dt": "03/05/2021 01:04:12" }

The updated control files confirm that each task was able to write to its bucket using the role and policy defined for that task.

Create custom Airflow Operators to support least privilege access

You can extend the demonstrated methodology for enabling fine-grained access using a customized PythonOperator to other Airflow operators and sensors as needed. For more information about how to customize operators, see Creating a custom Operator.

Conclusion

In this post, we presented a solution to bolster security in Amazon MWAA with role-based access controls. You can extend the concept to other Airflow operators in order to enhance workflow security at the task level. In addition, using the AWS Cloud Development Kit (AWS CDK) can make provisioning the Amazon MWAA environment and fine-grained IAM task roles seamless. We look forward to sharing more about fine-grained access patterns for Airflow tasks in a future post.


About the Authors

Kishan Desai is a Data Engineer at Amazon Studios building a data platform to support the content creation process. He is passionate about building flexible and modular systems on AWS using serverless paradigms. Outside of work, Kishan enjoys learning new technologies, watching sports, experiencing SoCal’s great food, and spending quality time with friends and family.

Virendhar (Viru) Sivaraman is a strategic Big Data & Analytics Architect with Amazon Web Services. He is passionate about building scalable big data and analytics solutions in the cloud. Besides work, he enjoys spending time with family, hiking & mountain biking.
