In this article, we’ll show how to stand up an Exploratory Data Analysis (EDA) dashboard for business users using Amazon Web Services (AWS) with Streamlit. Streamlit is an open source framework that lets data scientists create interactive web-based data applications in pure Python. The approach in this tutorial deploys the dashboard end to end with minimal effort, and the application and database layers can scale out as needed. It lets you serve insights in a secure and robust way without getting bogged down in time-consuming front-end development.

Architecture diagram

The database layer is backed by Amazon Simple Storage Service (Amazon S3), AWS Glue, and Amazon Athena. Business users can upload flat files into the Amazon S3 bucket; this then triggers an AWS Glue crawler, which loads the data into a database table for querying by Amazon Athena.

The application layer makes use of a combination of Streamlit, Amazon Cognito, an Application Load Balancer (ALB), Amazon Elastic Container Service (Amazon ECS), and Amazon SageMaker. The Streamlit application is implemented via a SageMaker notebook and hosted on ECS behind an Application Load Balancer. Business users then use Amazon Cognito to log in and run queries against the Amazon Athena database, receiving analytic results visually from the dashboard.

Getting started

To get started, you will first need to install the required packages on your local machine or on an Amazon EC2 instance. To learn more, read Getting Started with Amazon EC2. If you are using your local machine, credentials must first be configured, as explained in the Set up AWS Credentials and Region for Development documentation. Additionally, make sure you have Docker and the AWS Command Line Interface (AWS CLI) already installed. This tutorial assumes that you have an environment with the necessary AWS Identity and Access Management (IAM) permissions.

First, clone the GitHub repo into a local folder:

git clone https://github.com/aws-samples/streamlit-application-deployment-on-aws.git

Building out the infrastructure

In the cloned directory, there should be a file called standup.sh. We will use this script to build out the application layer and database layer infrastructure.

A quick preview of the first lines shows that the resource names are set here. Specifically, this includes the stack and sub-stack names, along with the names for the S3 bucket where the data is stored, names for Glue, and the region that hosts the dashboard.

For this tutorial, we will leave these variables for resource names, including the default region, as they are. The resource names and region are changeable inside the first few lines of the Bash script, as shown below:

#!/bin/bash
stack_name=streamlit-dashboard
# Using Default aws region
AWS_DEFAULT_REGION=$(aws configure list | grep region | awk '{print $2}')
# Variables set from the stack
S3_BUCKET_NAME=${stack_name}-$(uuidgen | cut -d '-' -f 1)
DATABASE_NAME=${S3_BUCKET_NAME}
GLUE_CRAWLER_NAME=${stack_name}-glue-crawler
TABLE_NAME=$(echo ${DATABASE_NAME} | tr - _)
# Cognito user parameter for first login
[email protected]
...
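
The naming logic above can be mirrored in a few lines of Python, which may help when checking which names the script will produce. This is only an illustrative sketch: the function name derive_resource_names is hypothetical and is not part of the repository.

```python
import uuid


def derive_resource_names(stack_name: str = "streamlit-dashboard") -> dict:
    """Mirror the naming logic at the top of standup.sh (illustrative only)."""
    # Equivalent of `uuidgen | cut -d '-' -f 1`: the first group of a random UUID.
    # Python renders the UUID in lowercase.
    suffix = str(uuid.uuid4()).split("-")[0]
    bucket = f"{stack_name}-{suffix}"
    return {
        "S3_BUCKET_NAME": bucket,
        "DATABASE_NAME": bucket,
        "GLUE_CRAWLER_NAME": f"{stack_name}-glue-crawler",
        # Equivalent of `tr - _`: replace every hyphen with an underscore.
        "TABLE_NAME": bucket.replace("-", "_"),
    }


names = derive_resource_names()
for key, value in names.items():
    print(f"{key}={value}")
```

The table name swaps hyphens for underscores, as the tr command in the script does, while the bucket and database keep the hyphenated form.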

With that in mind, we’ll first make sure that the Python dependencies are installed:

python3 -m pip install --upgrade pip
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Next, to kick off the Yahoo! Finance data pull and cloud infrastructure creation, run the following command:

bash standup.sh

This step will start building the necessary resources in your default AWS account using the AWS CLI and AWS CloudFormation templates.

Specifically, this script will:

  • Create an S3 bucket to hold the Yahoo! Finance forex data.
  • Run a Python job that downloads the Yahoo! Finance forex data and pushes it to S3.
  • Create an Amazon Athena database to query the data and an AWS Glue crawler to load the data into Athena.
  • Package the application layer into a single CloudFormation template and deploy its resources, including the SageMaker notebook for standing up Streamlit.

More details on standing up the dashboard

Note that the Python script will download files temporarily into your local folder before loading them into S3. The Python downloader script, ./script/yahoo_idx.py, is passed parameters that set a specific date range and specific indexes. In this example, we will pull dates from May 2006 to February 2021 for the SP500, AX200, and certain currencies against the Australian dollar.

start_dates = ["2006-05-16"] + start_dates
end_dates = end_dates + ["2021-02-18"]
# dict of the name for the output file and then query string for Yahoo Finance
tickers = {
"SP500": "^GSPC",
"AX200": "^AXJO",
"AUDUSD": "AUDUSD=X",
"AUDCNY": "AUDCNY=X",
"AUDJPN": "AUDJPY=X",
"AUDEUR": "AUDEUR=X",
}
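
The paired start_dates and end_dates lists suggest that the overall range is split into consecutive windows, one start and one end per window. How the repository builds the intermediate dates is not shown here, so the following is only a sketch under that assumption; build_windows and its step_days parameter are hypothetical.

```python
from datetime import date, timedelta


def build_windows(first: date, last: date, step_days: int = 365) -> list:
    """Split [first, last] into consecutive (start, end) ISO-date pairs."""
    windows = []
    start = first
    while start <= last:
        # Each window covers at most step_days days and never runs past `last`.
        end = min(start + timedelta(days=step_days - 1), last)
        windows.append((start.isoformat(), end.isoformat()))
        # The next window starts the day after this one ends.
        start = end + timedelta(days=1)
    return windows


windows = build_windows(date(2006, 5, 16), date(2021, 2, 18))
print(windows[0], windows[-1], len(windows))
```

Each (start, end) pair could then be combined with every entry in the tickers dictionary to request one slice of data at a time.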

In addition to requiring an S3 bucket, AWS Glue needs an Amazon Athena primary workgroup to be in place before it can load data from S3 into a data table. In this article, we are creating one from scratch in our account. Note that if you already have one set up, it will need to be manually updated with the S3 URL. The CloudFormation create-change-set and execute-change-set commands do this:

aws cloudformation create-change-set --stack-name ${stack_name}-athena --change-set-name ImportChangeSet --change-set-type IMPORT \
--resources-to-import "[{\"ResourceType\":\"AWS::Athena::WorkGroup\",\"LogicalResourceId\":\"AthenaPrimaryWorkGroup\",\"ResourceIdentifier\":{\"Name\":\"primary\"}}]" \
--template-body file://cfn/01-athena.yaml --parameters ParameterKey="DataBucketName",ParameterValue=${S3_BUCKET_NAME}
aws cloudformation execute-change-set --change-set-name ImportChangeSet --stack-name ${stack_name}-athena
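
The escaped JSON passed to --resources-to-import is easy to mistype. As a sketch of one way to avoid hand-escaping, the same payload can be generated from a Python structure; the values below are copied from the command above, and the variable names are only illustrative.

```python
import json
import shlex

# Same resource description as in the create-change-set command above.
resources_to_import = [
    {
        "ResourceType": "AWS::Athena::WorkGroup",
        "LogicalResourceId": "AthenaPrimaryWorkGroup",
        "ResourceIdentifier": {"Name": "primary"},
    }
]

# Compact separators reproduce the whitespace-free form used on the command line.
payload = json.dumps(resources_to_import, separators=(",", ":"))

# shlex.quote wraps the JSON so a POSIX shell passes it through unchanged.
print("--resources-to-import", shlex.quote(payload))
```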

Now that Amazon Athena is taken care of, we can create a Glue crawler and Glue database to run over the forex data CSV files in S3 and consolidate them into a single data table.

aws cloudformation create-stack --stack-name ${stack_name}-glue \
--template-body file://cfn/02-crawler.yaml --capabilities CAPABILITY_NAMED_IAM \
--parameters ParameterKey=RawDataBucketName,ParameterValue=${S3_BUCKET_NAME} \
ParameterKey=CrawlerName,ParameterValue=${GLUE_CRAWLER_NAME}

Once this is up and running, the only thing left is to run the Glue crawler job.

aws glue start-crawler --name ${GLUE_CRAWLER_NAME}
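
The crawler runs asynchronously, so the table may not be queryable the moment start-crawler returns. The sketch below shows one way to wait for it, assuming the crawler reports the state READY once it is idle again. The helper wait_for_crawler is hypothetical; it takes the state lookup as a callable so it can be exercised without AWS access.

```python
import time
from typing import Callable


def wait_for_crawler(
    get_state: Callable[[], str],
    poll_seconds: float = 15.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the crawler reports READY; return False if the timeout passes."""
    waited = 0.0
    while waited <= timeout_seconds:
        if get_state() == "READY":
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False


# Possible wiring with boto3 (not executed here; needs AWS credentials):
#   import boto3
#   glue = boto3.client("glue")
#   get_state = lambda: glue.get_crawler(Name="streamlit-dashboard-glue-crawler")["Crawler"]["State"]
#   wait_for_crawler(get_state)

# Offline demonstration with a fake sequence of crawler states.
fake_states = iter(["RUNNING", "STOPPING", "READY"])
print(wait_for_crawler(lambda: next(fake_states), sleep=lambda seconds: None))
```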

Next, the script will build out the infrastructure for the dashboard application layer. This code is taken directly from SageMaker dashboards for ML with minor permission modifications. The script packages custom AWS Lambda functions needed for deploying the resources.

cd ./deployment/sagemaker-dashboards-for-ml
cd ./cloudformation/deployment/self-signed-certificate/ && pip install -r requirements.txt -t ./src/site-packages
cd ../../..
cd ./cloudformation/deployment/string-functions/ && pip install -r requirements.txt -t ./src/site-packages
cd ../../..
cd ./cloudformation/assistants/solution-assistant/ && pip install -r requirements.txt -t ./src/site-packages
cd ../../..
cd ./cloudformation/assistants/bucket-assistant/ && pip install -r requirements.txt -t ./src/site-packages
cd ../../../..
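
The four install steps follow the same pattern, so they can also be written as a loop. The sketch below only builds the command lists and does not run them. The directory names are taken from the commands above, while pip_commands is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Lambda function folders, relative to deployment/sagemaker-dashboards-for-ml.
LAMBDA_DIRS = [
    "cloudformation/deployment/self-signed-certificate",
    "cloudformation/deployment/string-functions",
    "cloudformation/assistants/solution-assistant",
    "cloudformation/assistants/bucket-assistant",
]


def pip_commands(root: str = "deployment/sagemaker-dashboards-for-ml") -> list:
    """Build one `pip install -r ... -t ...` command per Lambda folder."""
    commands = []
    for relative in LAMBDA_DIRS:
        folder = PurePosixPath(root) / relative
        commands.append([
            "python3", "-m", "pip", "install",
            "-r", str(folder / "requirements.txt"),
            "-t", str(folder / "src" / "site-packages"),
        ])
    return commands


for command in pip_commands():
    print(" ".join(command))
```

Each list could be handed to subprocess.run from the repository root, which avoids the repeated cd calls.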

After this step is complete, all the CloudFormation templates are packaged into a single deployment file with the AWS CLI command aws cloudformation package:

aws cloudformation package \
--template-file ./sagemaker-dashboards-for-ml/cloudformation/template.yaml \
--s3-bucket ${S3_BUCKET_NAME} \
--s3-prefix cfn \
--output-template-file ../deployment/sagemaker-dashboards-for-ml/packaged.yaml

This step creates a single CloudFormation template, packaged.yaml, in which all of the necessary application layer resources are configured. The script then deploys these resources as a nested stack to your AWS environment.

aws cloudformation create-stack \
--stack-name "${stack_name}-sdy" \
--template-body file://./sagemaker-dashboards-for-ml/packaged.yaml \
--capabilities CAPABILITY_IAM \
--parameters ParameterKey=ResourceName,ParameterValue=streamlit-dashboard-cfn-resource \
ParameterKey=SageMakerNotebookGitRepository,ParameterValue=https://github.com/aws-samples/streamlit-application-deployment-on-aws.git \
ParameterKey=CognitoAuthenticationSampleUserEmail,ParameterValue=${COGNITO_USER} --disable-rollback

Note the last two ParameterKey and ParameterValue pairs provided to the create-stack command. The first points to the repository containing the Streamlit front end, which is cloned directly into the SageMaker notebook (in this case, the Streamlit application deployment on AWS example). The second is the Cognito login that you will need to get inside the dashboard once it is up and running.

CloudFormation will take some time to stand up all the resources. Once the process has completed, you can go into the AWS console for CloudFormation and confirm that all resources are created. The console should show all stacks and nested stacks with a green status (for example, CREATE_COMPLETE).
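
The same check can be done programmatically. The sketch below assumes you have already collected each stack's status string, for example from the StackStatus field returned by describe-stacks. The helper unhealthy_stacks and the example statuses are illustrative, not output from a real deployment.

```python
# Terminal statuses that indicate a stack finished successfully.
HEALTHY = {"CREATE_COMPLETE", "UPDATE_COMPLETE", "IMPORT_COMPLETE"}


def unhealthy_stacks(statuses: dict) -> list:
    """Return the names of stacks whose status is not a successful terminal state."""
    return sorted(name for name, status in statuses.items() if status not in HEALTHY)


# Illustrative statuses for the three stacks created by standup.sh.
example = {
    "streamlit-dashboard-athena": "IMPORT_COMPLETE",
    "streamlit-dashboard-glue": "CREATE_COMPLETE",
    "streamlit-dashboard-sdy": "CREATE_IN_PROGRESS",
}
print(unhealthy_stacks(example))
```

An empty list means every stack has reached a successful state; anything else names the stacks still in progress or failed.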

At the end of the Bash script, the script takes the environment variables set for the stack resource names and writes them to two key files. The first of these is streamlit-package/dashboard/script/config.py. This file configures the front-end deployment of the dashboard once it is inside the SageMaker notebook. For reference, the code below shows the file before standup.sh runs. After the script runs, these values are replaced with the names set in the stack environment variables.

REGION = "your_region_name"
BUCKET = "your_bucket_name"
DATABASE = "your_database_name"
TABLE = "your_table_name"
INDEX_COLUMN_NAME = "date"
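
To make the substitution concrete, here is a minimal sketch of how such a file could be rendered from the resource names. It is not the code standup.sh uses; render_config is a hypothetical helper, and the region and names in the example are placeholders.

```python
import json


def render_config(region: str, bucket: str, database: str, table: str,
                  index_column: str = "date") -> str:
    """Render the contents of config.py as Python assignments."""
    values = {
        "REGION": region,
        "BUCKET": bucket,
        "DATABASE": database,
        "TABLE": table,
        "INDEX_COLUMN_NAME": index_column,
    }
    # json.dumps yields double-quoted string literals that are valid Python
    # for plain ASCII names like these.
    return "".join(f"{key} = {json.dumps(value)}\n" for key, value in values.items())


# Placeholder values for illustration only.
print(render_config(
    region="us-east-1",
    bucket="streamlit-dashboard-1a2b3c4d",
    database="streamlit-dashboard-1a2b3c4d",
    table="streamlit_dashboard_1a2b3c4d",
))
```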

The second file, delete_resources.sh, contains similar values but is intended for cleanup: it tears down the CloudFormation stacks and deletes the S3 bucket with the data. This script is also updated with the stack environment variable names.

Front-end deployment

Now that all the underlying infrastructure is fully constructed and the data is loaded into Amazon Athena, we can go to the next step of deploying the Streamlit application for users to access.

In the AWS console, confirm that the notebook instance is started by navigating to the Amazon SageMaker service menu. From there, go to Notebooks and then select Notebook instances. The one created by the script should be displayed.

Navigate into the notebook via Jupyter and to the folder /deployment/sagemaker-dashboards-for-ml/examples/yahoo_finance.

Next, locate the config.py file under dashboard/script.

Update the values in the notebook’s config.py to match the values that were populated in the config.py in your local directory after you ran the standup.sh script. Save the changes to the config file.

Once the configurations are set, navigate back to the yahoo_finance directory level and open the notebook titled yahoo_finance.ipynb.

Instructions on how to stand up the Streamlit application can be found inside this notebook. The notebook will walk through how to build the Streamlit Docker container locally and how to test that the dashboard is running.

Next we will push the Docker container to Amazon Elastic Container Registry (Amazon ECR), where it is deployed to Amazon ECS as a service. Amazon ECS is a fully managed service for running Docker containers. You don’t need to provision or manage servers; just define the task that needs to be run and specify the resources the task needs. The AWS CloudFormation stack already created a number of Amazon ECS resources for the dashboard: most notably a cluster, a task definition, and a service.

Build and push the image by running the following commands inside the SageMaker notebook, using the environment variables you set there.

(cd dashboard && docker build -t {image_name} --build-arg DASHBOARD_SAGEMAKER_MODEL={model_name} .)
docker tag {image_name} {AWS_ACCOUNT_ID}.dkr.ecr.{AWS_REGION}.amazonaws.com/{DASHBOARD_ECR_REPOSITORY}:latest
# Note: get-login exists only in AWS CLI v1; AWS CLI v2 replaced it with get-login-password.
eval $(aws ecr get-login --no-include-email)
docker push {AWS_ACCOUNT_ID}.dkr.ecr.{AWS_REGION}.amazonaws.com/{DASHBOARD_ECR_REPOSITORY}:latest
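
The tag and push commands both depend on the same registry path, so it is worth building it once. The following sketch shows how the pieces fit together; ecr_image_uri is a hypothetical helper, and the account ID, region, and repository name in the example are placeholders.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the ECR image URI used by the docker tag and docker push commands."""
    # AWS account IDs are 12-digit numbers; catch obvious mistakes early.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder values for illustration only.
print(ecr_image_uri("123456789012", "us-east-1", "dashboard"))
```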

After this, all that remains is to update the prebuilt ECS service so that it picks up the new container image from ECR:

aws ecs update-service --cluster {DASHBOARD_ECS_CLUSTER} --service {DASHBOARD_ECR_SERVICE} --desired-count 2

The Amazon ECS service sits behind an Application Load Balancer, which distributes traffic across tasks. When a task fails, the service de-provisions the failing task and provisions a replacement. The notebook automatically prints the URL for accessing the dashboard with the following code:

if DASHBOARD_URL != DASHBOARD_ALB:
    warnings.warn('\n' + '\n'.join([
        "Add CNAME record on your domain before continuing!",
        "from: {}".format(DASHBOARD_URL),
        "to: {}".format(DASHBOARD_ALB),
        "Otherwise you will see 'An error was encountered with the requested page' with Amazon Cognito."
    ]))
print(f"DASHBOARD_URL: https://{DASHBOARD_URL}")

Note that you will receive a warning from your browser when accessing the dashboard if you didn’t provide a custom SSL certificate when launching the CloudFormation stack. A self-signed certificate is created and used as a backup, but this is certainly not recommended for production use cases. You should obtain an SSL certificate that has been validated by a certificate authority, import it into AWS Certificate Manager, and reference this when launching the AWS CloudFormation stack.

If you want to continue with the self-signed certificate (for development purposes), you should be able to proceed past the browser warning page. With Chrome, you will get a Your connection is not private error message (NET::ERR_CERT_AUTHORITY_INVALID), but by selecting Advanced, you should then receive a link to proceed.

Once the page has loaded, you will be greeted with an Amazon Cognito login screen. Enter the user email credentials you set within the standup.sh script. If you encounter any difficulty logging in, you can always go to the Cognito console to manage the user pool.

Once the credentials are entered, you will be able to access the dashboard.

Cleaning up

To clean up the resources and prevent further charges, run the following script:

bash delete_resources.sh

This step will tear down the CloudFormation stacks and delete the S3 bucket in which the data is stored. To confirm that everything is deleted, go to the CloudFormation console. The console should no longer list any of the related stacks.

Conclusion

In this article, we showed how to stand up an interactive dashboard for EDA with AWS and Streamlit, but this setup is really only a starting point. Using Amazon Cognito combined with ECS and an AWS Application Load Balancer allows for the application layer to scale out as needed for business users. Likewise, the AWS Glue and Amazon Athena database backend allows for new data sources to be added and provides a way in which data is easily refreshed. Finally, one can extend the dashboard further by using Amazon SageMaker to run machine learning on the data as it comes into the dashboard.
