Modern data science environments often involve many independent projects, each spanning multiple accounts. In order to maintain a global overview of the activities within the projects, a mechanism to collect data from the different accounts into a central one is crucial. In this post, we show how to leverage existing services—Amazon DynamoDB, AWS Lambda, Amazon EventBridge—to deploy a lightweight infrastructure that allows the flow of relevant metrics from spoke accounts to hub accounts.

To be general enough, such a mechanism must also be highly modular, permitting each user to choose which quantities to monitor, and possibly to implement their own code to extract and monitor custom metrics. The open source approach lets you contribute custom code back to this project, thus encouraging sharing and reuse. In time, this should evolve into a rich library of key performance indicators (KPIs) that this project is able to monitor.

We will focus on the general architecture of the solution and on the data-exchange mechanisms. We also provide code for extracting a few simple metrics, and we rely on the open source community to contribute additional modules to extract more metrics. We will focus on scalar metrics (that is, numbers not vectors). Extension to multi-dimensional metrics is trivial.

In this example, we monitor quantities that are closely related to Amazon SageMaker. The same architecture can be extended to monitor any other metric.

General architecture

The overview of the solution is shown in the following diagram:

[Figure: overview of the solution with spoke account, hub account, and related services]

We use Amazon EventBridge for the cross-account information exchange, and Amazon DynamoDB as a data store in the hub account. AWS Lambda functions are used to extract information from the spoke accounts and to store it in the hub. The red arrows show the configuration flow, which happens only once. Green lines describe the flow for requesting new data from the spokes. Blue lines show the flow of data from the spokes to the hub account.

Configuration

Using Amazon EventBridge as the communication layer means that the permissions needed to operate the dashboard are minimal. The information extraction runs in the spoke account, and the hub account does not need any cross-account access.

We also chose to let the hub trigger a refresh of the values for all spokes. This is done by generating a special event in an AWS Lambda function and sending it to the spokes, where a rule will invoke the extraction function.

The only cross-account permissions that must be set are those that allow events to be forwarded from the spokes to the hub and from the hub to the spokes. This requires that:

  • The hub account permits (in the resource policy of the receiving event bus) events:PutEvents from each of the spokes to which it is connected. The spokes must permit the same operation from the hub.
  • The spoke account defines an Amazon EventBridge rule that forwards events generated by the information extraction to the hub account. The hub must have a rule to forward the refresh command to the spokes.

We use the AWS Systems Manager Parameter Store to store, within each account, the information needed to configure the event forwards. This offers the advantage that the information concerning the structure of hubs and spokes is explicitly stored in the accounts.

A dedicated Lambda function reads the configuration from the Parameter Store and applies the needed configuration in each account. The code is set up in such a way as to let any account be connected to multiple monitors, and itself to serve (at the same time) as monitor for other accounts. A connection requires two parameters to be set: one in the spoke (pointing it to the hub), and one in the hub (pointing it to the spoke).
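The two requirements above map onto a small number of EventBridge API calls. The following is a minimal sketch, not the project's actual connection function. It assumes boto3's EventBridge client (put_permission, put_rule, put_targets) and the default event bus. The helper names allow_sender and forward_to, the rule and statement naming schemes, the event pattern, and the IAM role passed as role_arn are all hypothetical. The client is passed in as an argument so that the logic can be exercised with a stub.

```python
import json


def allow_sender(events_client, sender_account_id, bus_name="default"):
    """Let another account put events on this account's event bus."""
    events_client.put_permission(
        EventBusName=bus_name,
        Action="events:PutEvents",
        Principal=sender_account_id,
        StatementId=f"allow-{sender_account_id}",  # hypothetical naming scheme
    )


def forward_to(events_client, receiver_account_id, region, role_arn,
               source="metric_extractor", bus_name="default"):
    """Create a rule that forwards matching events to another account's bus."""
    rule_name = f"forward-to-{receiver_account_id}"  # hypothetical naming scheme
    events_client.put_rule(
        Name=rule_name,
        EventBusName=bus_name,
        # assumption: events are selected by their "source" field only
        EventPattern=json.dumps({"source": [source]}),
        State="ENABLED",
    )
    target_bus_arn = (
        f"arn:aws:events:{region}:{receiver_account_id}:event-bus/default"
    )
    events_client.put_targets(
        Rule=rule_name,
        EventBusName=bus_name,
        # role_arn: a role EventBridge can assume to deliver to the other bus
        Targets=[{"Id": "forward", "Arn": target_bus_arn, "RoleArn": role_arn}],
    )
    return rule_name
```

In a real account, events_client would be boto3.client("events"). allow_sender has to run in the receiving account before forward_to runs in the sending account.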

Information extraction

An AWS Lambda function in each spoke account takes care of extracting the needed information. We chose to write this part of code to be highly modular, and to allow fine-grained, least privilege permissions management. In detail:

  • Each metric is implemented in an independent Python class.
  • All metrics inherit from a base class that implements core functionality, such as communication with the event bus.
  • All metrics also define, as a class variable, the AWS Identity and Access Management (IAM) permissions they need to extract the information from the account.
  • When the solution is deployed in the spoke, the list of metrics to be monitored must be provided.
  • At deployment, the extraction function is given only the permissions it needs to extract the requested metrics.
  • At runtime, the extraction function loops over the metrics, emitting one event for each of them.
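The structure described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code shipped in metric.py: the constructor arguments, the methods build_entry and emit, the helpers required_permissions and extract_all, the base permission on events:PutEvents, and the toy ConstantMetric class are hypothetical. Only the names Metric, _iam_permissions, and _compute_value come from the project.

```python
import json
from datetime import datetime, timezone


class Metric:
    # assumption: every metric needs to put events on the local event bus
    _iam_permissions = [{"Action": "events:PutEvents", "Resource": "*"}]

    def __init__(self, events_client, project_name, environment):
        self.events_client = events_client
        self.project_name = project_name
        self.environment = environment

    def _compute_value(self):
        raise NotImplementedError("each metric implements its own extraction")

    def build_entry(self):
        """Wrap the metric value in an EventBridge entry."""
        detail = {
            "MetricName": type(self).__name__,
            "MetricValue": self._compute_value(),
            "ExtractionDate": datetime.now(timezone.utc).isoformat(),
            "Metadata": {},
            "Environment": self.environment,
            "ProjectName": self.project_name,
        }
        return {
            "Source": "metric_extractor",
            "DetailType": "metric_extractor",
            "Resources": [],
            "Detail": json.dumps(detail),
        }

    def emit(self):
        self.events_client.put_events(Entries=[self.build_entry()])


class ConstantMetric(Metric):
    """Hypothetical toy metric, used only to exercise the base class."""

    def _compute_value(self):
        return 42


def required_permissions(metric_classes):
    """Collect the IAM statements needed by the requested metrics, deduplicated."""
    statements = []
    for cls in metric_classes:
        for statement in cls._iam_permissions:
            if statement not in statements:
                statements.append(statement)
    return statements


def extract_all(metric_classes, events_client, project_name, environment):
    """Runtime loop: one event per requested metric."""
    for cls in metric_classes:
        cls(events_client, project_name, environment).emit()
```

Keeping the permissions next to the extraction code means that the deployment step can derive the function's IAM policy from the list of requested metrics alone.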

Fetching new data

To request new data from all spokes, the hub must emit to its own event bus an event with contents:

{
  "source": "metric_extractor",
  "detail-type": "metric_extractor",
  "resources": [],
  "detail": "{}"
}

This event will be forwarded to all spokes, which are configured to start a new extraction upon its reception. The results of the extractions are sent back to the hub, again through Amazon EventBridge.
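As an illustration, the refresh event could be emitted from Python with boto3's put_events call. This is a minimal sketch: the helper names build_refresh_entry and request_refresh are hypothetical, and we assume the hub uses its default event bus. The client is passed in as an argument, so that in a real account it would be boto3.client("events").

```python
import json


def build_refresh_entry(bus_name="default"):
    """Build the put_events entry that asks all spokes for fresh values."""
    return {
        "Source": "metric_extractor",
        "DetailType": "metric_extractor",
        "Resources": [],
        "Detail": json.dumps({}),  # put_events expects Detail as a JSON string
        "EventBusName": bus_name,
    }


def request_refresh(events_client, bus_name="default"):
    """Emit the refresh event and fail loudly if EventBridge rejects it."""
    response = events_client.put_events(Entries=[build_refresh_entry(bus_name)])
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"refresh event was rejected: {response}")
    return response
```

Checking FailedEntryCount matters because put_events can report rejected entries in its response instead of raising an exception.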

Information archive

The hub account receives events from all the spokes to which it is connected. It extracts the payload and stores it to an Amazon DynamoDB table. In this example, we use a simple schema for the event:

{
  "source": "metric_extractor",
  "resources": [],
  "detail-type": "metric_extractor",
  "detail": {
    "MetricName": "aName",
    "MetricValue": "aValue",
    "ExtractionDate": "aTimeStamp",
    "Metadata": {"field1": "value1"},
    "Environment": "dev",
    "ProjectName": "aProject"
  }
}

Each MetricValue is identified by its MetricName and its ExtractionDate. Filtering by ProjectName is also possible. To support the case in which a single project owns more than one account, the additional field Environment is also stored. This typically refers to the stages of the CI/CD pipeline within a project (dev/int/prod).

An additional field is also supported to store metadata concerning this particular extraction.
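The storage step in the hub can be sketched as a small function that copies the event payload into a table item. This is a sketch, not the project's actual handler: the names item_from_event and store_event are hypothetical, and table is assumed to behave like a boto3 DynamoDB Table resource, for example boto3.resource("dynamodb").Table(...) with the real table name.

```python
def item_from_event(event):
    """Map an incoming EventBridge event to a DynamoDB item."""
    detail = event["detail"]  # EventBridge delivers "detail" as a JSON object
    return {
        "MetricName": detail["MetricName"],
        "ExtractionDate": detail["ExtractionDate"],
        "MetricValue": detail["MetricValue"],
        "ProjectName": detail["ProjectName"],
        "Environment": detail["Environment"],
        "Metadata": detail.get("Metadata", {}),
    }


def store_event(table, event):
    """Write one metric event to the table and return the stored item."""
    item = item_from_event(event)
    # note: the boto3 Table resource does not accept Python floats;
    # non-integer numbers would have to be converted to decimal.Decimal first
    table.put_item(Item=item)
    return item
```

A Lambda handler in the hub would then reduce to a single call to store_event with the table and the received event.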

The Amazon DynamoDB table in the hub account uses MetricName as the partition key and ExtractionDate as the sort key.
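With MetricName and ExtractionDate as the table keys, the history of one metric over a time window can be read with a single Query call. The sketch below uses the low-level DynamoDB client, for example boto3.client("dynamodb"). The function name metric_history and the table name passed in are hypothetical, and we assume that ExtractionDate is stored as a string that sorts chronologically, such as an ISO 8601 timestamp. Pagination through LastEvaluatedKey is left out for brevity.

```python
def metric_history(dynamodb_client, table_name, metric_name, start, end):
    """Return the stored items for one metric between two timestamps."""
    response = dynamodb_client.query(
        TableName=table_name,
        KeyConditionExpression=(
            "MetricName = :name AND ExtractionDate BETWEEN :start AND :end"
        ),
        ExpressionAttributeValues={
            ":name": {"S": metric_name},
            ":start": {"S": start},
            ":end": {"S": end},
        },
    )
    return response["Items"]
```

Filtering by ProjectName or Environment is not part of the key, so it would need a FilterExpression or client-side filtering on top of this query.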

Deployment

We use the AWS Cloud Development Kit (AWS CDK) to deploy the solution in both hub and spokes.

For the deployment, we will need two Amazon Web Services (AWS) accounts:

  • Hub account: The hub account will contain the DynamoDB table, the EventBridge rules, and the associated Lambda functions to receive events from the spoke accounts.
  • Spoke account: We use one spoke account for the purposes of this demonstration, but this solution will scale to any number of spoke accounts.

Prerequisites

The tools used in the steps below must be installed and set up: Python with pip, the AWS CLI, and the AWS CDK.

To get started, download the code from GitHub on a local machine. Perform the following steps from the folder in which you downloaded the code.

Steps

First, prepare the local Python environment. The downloaded code includes a requirements.txt file with the necessary packages. In a terminal, run:

pip install -r requirements.txt

Next, you must be authenticated into the AWS account used as the hub account. For more information on how to authenticate into AWS accounts, refer to the documentation.

To deploy the hub account infrastructure, run the following command:

cdk deploy --app "python3 hub.py"

Approve any prompts for adding the IAM policies.

Next, authenticate to the account you want to use as spoke, and run the following:

cdk deploy \
-c metrics=TotalCompletedTrainingJobs,NumberEndPointsInService,CompletedTrainingJobs24h \
-c environment=dev \
-c project_name=Project1

The -c flag (short for --context) passes variables to the AWS CDK code. More information can be found in the Developer Guide.

We will use these variables for the following purposes:

  • metrics: The metrics variable is a comma-separated list that allows us to choose which metrics we want to retrieve from a spoke account. More metrics can be added. In this example, we will deploy:
    • TotalCompletedTrainingJobs
    • CompletedTrainingJobs24h
    • NumberEndPointsInService
  • project_name: This is a string used to identify one particular machine learning project.
  • environment: This variable identifies the deployment environment, for example, development, pre-production, or production. It is a string and can take any value. When a project owns more than one account, we can use it to identify each of them.
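To illustrate how these context variables could be consumed, here is a minimal sketch of the parsing step. The helper names parse_metrics and read_spoke_context are hypothetical and are not taken from the project's CDK code. The lookup is passed in as a callable: in an AWS CDK Python app it could be app.node.try_get_context, which returns None for keys that were not supplied.

```python
def parse_metrics(raw):
    """Split the comma-separated metrics value into a list of class names."""
    if not raw:
        return []
    return [name.strip() for name in raw.split(",") if name.strip()]


def read_spoke_context(get_context):
    """Read the three context variables through any key-to-value callable."""
    return {
        "metrics": parse_metrics(get_context("metrics")),
        "environment": get_context("environment"),
        "project_name": get_context("project_name"),
    }
```

With the values from the command above, the metrics entry becomes a list of three class names that the deployment code can map to metric classes and their IAM permissions.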

Once the hub and spoke are deployed, we must set up the connection between the two. We keep the connection step separated from deployment on purpose. The idea is to be able to add new spokes without having to redeploy resources.

The following script summarizes the commands we need:

# run this in each Spoke account
aws ssm put-parameter \
--name "/monitors/TestHub" \
--type "String" \
--value "HUB_ACCOUNT_ID" \
--overwrite

# run this in the Hub account, once for each Spoke you want to connect
aws ssm put-parameter \
--name "/monitored_projects/TestProject/dev" \
--type "String" \
--value "SPOKE_ACCOUNT_ID" \
--overwrite

Now that the deployment is done and the configuration data is stored, we can configure the accounts. The only constraint is the order of operations: we cannot configure a rule to send events to another account until the receiving account has permitted the sender to put events. So we must first configure the cross-account events:PutEvents permission on both hub and spoke, and only then configure the forwarding rules on both:

# with AWS CLI v2, add --cli-binary-format raw-in-base64-out to each command
# so that the JSON payload is passed as written

# in the Hub
aws lambda invoke --function-name ds-dashboard-connection \
--payload "{ \"action\": \"EBPut\"}" lambda.out.json

# in the Spoke
aws lambda invoke --function-name ds-dashboard-connection \
--payload "{ \"action\": \"EBPut\"}" lambda.out.json
aws lambda invoke --function-name ds-dashboard-connection \
--payload "{ \"action\": \"EBRule\"}" lambda.out.json

# in the hub, again, now we can create the event forward rule
aws lambda invoke --function-name ds-dashboard-connection \
--payload "{ \"action\": \"EBRule\"}" lambda.out.json
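The same sequence can be scripted with boto3, which makes the required order explicit. This is a sketch under assumptions: the helper names run_action and connect are hypothetical, and each client is assumed to be a Lambda client created with credentials for the respective account, for example through boto3.Session(profile_name=...).client("lambda") with your own profile names.

```python
import json

FUNCTION_NAME = "ds-dashboard-connection"  # function name used in the commands above


def run_action(lambda_client, action):
    """Invoke the connection function with the given action."""
    response = lambda_client.invoke(
        FunctionName=FUNCTION_NAME,
        Payload=json.dumps({"action": action}).encode("utf-8"),
    )
    if response.get("FunctionError"):
        raise RuntimeError(f"{action} failed: {response['FunctionError']}")
    return response


def connect(hub_lambda_client, spoke_lambda_client):
    """Connect one spoke to the hub in the order required by EventBridge."""
    # 1. both sides first allow the other account to put events
    for client in (hub_lambda_client, spoke_lambda_client):
        run_action(client, "EBPut")
    # 2. only then can the forwarding rules be created
    for client in (spoke_lambda_client, hub_lambda_client):
        run_action(client, "EBRule")
```

Stopping at the first failed invocation avoids creating a forwarding rule toward an account that has not yet granted the permission.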

Implementing a new metric

In order to implement a new metric, we must add a class in the file metric.py. The new class must inherit from Metric, as defined in the same file. Here is the implementation for one of the example metrics we provide:

class NumberEndPointsInService(Metric):
    # this class variable defines the Action and Resource for the IAM
    # permissions needed for this metric
    _iam_permissions = Metric._iam_permissions + [
        {
            "Action": "sagemaker:ListEndpoints",
            "Resource": "*"
        }
    ]

    # this internal method MUST be implemented. This is what computes and
    # returns the actual value
    def _compute_value(self):
        eps = sagemaker_client.list_endpoints(
            StatusEquals='InService',
        )['Endpoints']
        return len(eps)

As shown, the amount of code to be written is minimal because most of the operations are handled by the parent class. When specifying the IAM permissions for the metric, we are allowed to use **ACCOUNT_ID** and **REGION** as placeholders for the real account and region, which will only be known at deploy time.

In case you need more fine-grained placeholders (for example, a bucket name in the resource section), you can implement your own get_iam_permissions method in the new class to override the one provided by Metric.
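The placeholder mechanism can be pictured as a simple string substitution applied to each IAM statement at deploy time. The sketch below is an assumption about how such a step might look, not the implementation of get_iam_permissions. The function name resolve_placeholders is hypothetical, and the exact placeholder tokens are parameters that must match whatever metric.py actually uses.

```python
def resolve_placeholders(permissions, account_id, region,
                         account_token="ACCOUNT_ID", region_token="REGION"):
    """Replace account and region placeholders in the Resource of each statement."""
    resolved = []
    for statement in permissions:
        resource = statement["Resource"]
        resource = resource.replace(account_token, account_id)
        resource = resource.replace(region_token, region)
        # copy the statement so that the class-level definition stays untouched
        resolved.append({**statement, "Resource": resource})
    return resolved
```

A statement whose Resource is "*" passes through unchanged, so metrics that do not use placeholders need no special handling.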

Example dashboard

The technology to use for analysis and visualization of the collected data depends on the constraints of the specific setup (that is, which solutions are already available and in use within the environment). A detailed discussion is beyond the scope of this example.

Instead, we connected two spokes to the hub and ran a few training jobs, deploying one model to production. The Amazon DynamoDB table was connected to Amazon QuickSight. Following is a simple table visualization with two historical plots:

[Figure: Amazon QuickSight dashboard with two historical plots (summary of metrics in DEV, summary of metrics in PROD) and a table with project name, metric name, metric value, environment, and extraction date]

Clean up

To avoid unnecessary costs, remove the resources created in this walkthrough.

In the terminal, assume a role in the hub account and run the following command to remove the hub stack:

cdk destroy --app "python3 hub.py"

Assume a role in the spoke account and run the following command to remove the spoke stack:

cdk destroy

Additionally, the resources created by the connection Lambda function must be removed:

  • In the hub and spokes, navigate to the Amazon EventBridge console and delete rules whose names start with forward.
  • In the hub and spoke, clean up the AWS Systems Manager Parameter Store.

Conclusion

With data science projects becoming increasingly complex, monitoring becomes an essential feature. We provide code that uses existing AWS services to deploy and operate a simple data collection and distribution mechanism, which can be used to generate dashboards.

We follow a fully modular approach, relying on the open source community to contribute more modules for the extraction of more quantities. A few examples of interesting modules include:

  • Training performance (validation and test)
  • Billable training time
  • Billable inference time

As far as the core functionality is concerned, new features that could be implemented include:

  • Support for multi-dimensional metrics
  • A richer schema for the Amazon DynamoDB table
  • Metrics with runtime arguments (for example, extract the value of this parameter)
  • An improved definition of the hub and spoke structure

In all these cases, maintainers welcome pull requests from contributors.
