When a model gets deployed to a production environment, inference speed matters. Models with fast inference speeds require fewer resources to run, which translates to cost savings, and applications that consume the models’ predictions benefit from the improved performance.

For example, let’s say your website uses a regression model to predict mortgage rates so that aspiring home buyers can see what type of rate they could expect, based on inputs they provide such as the size of the down payment, their loan term, and the county in which they’re looking to buy. A model that can send a prediction back in 10 milliseconds rather than 200 milliseconds each time an input is updated makes a massive difference in terms of the website’s responsiveness and user experience.

Amazon SageMaker Neo allows you to unlock such performance improvements and cost savings in a matter of minutes. It does this by compiling models into optimized executables through various open-source libraries, which can then be hosted on supported devices on the edge or on Amazon SageMaker endpoints. Neo is compatible with eight different machine learning (ML) frameworks, and in the context of gradient boosted tree algorithms such as XGBoost, Neo uses Treelite to optimize model artifacts. Due to the popularity of XGBoost and its unique categorization as a more classical ML framework, we use it as our framework of choice throughout this post. A near 3x speedup will be demonstrated for the optimized XGBoost model compared to the unoptimized one. The Abalone dataset from UCI will be used to train the model. Please feel free to use your own model and dataset, however, and let us know in the comments what type of acceleration was achieved.

This post will take a deeper dive into compiling XGBoost model artifacts using Neo and will show you how to accurately measure and test the performance gains of these Neo-optimized models in general. By the end of this walkthrough, you’ll have your own framework for quickly training, deploying, and benchmarking XGBoost models. In turn, this can help you make data-driven decisions on what type of instance configurations best fit your unique cost profile and inference performance needs.

Solution overview

The following diagram visualizes the services we use for this solution and how they interact with one another.

[Figure: solution architecture diagram]

The steps to implement the solution are as follows:

  1. Download and process the popular Abalone dataset with a Jupyter notebook, and then run an XGBoost SageMaker training job on the processed data. We use a local mode SageMaker training job to produce the unoptimized XGBoost model, which can be faster and easier to prototype compared to a remote one.
  2. Deploy the unoptimized XGBoost model artifact to a SageMaker endpoint.
  3. Take the unoptimized artifact and optimize it with a Neo compilation job.
  4. Deploy the Neo-optimized XGBoost artifact to a SageMaker endpoint.
  5. Create an Amazon CloudWatch Dashboard from the SageMaker notebook to monitor inference speed and performance under heavy load of both endpoints.
  6. Deploy Serverless Artillery from the SageMaker notebook, which we use as our load testing tool. We set up Serverless Artillery entirely from the SageMaker notebook, and directly invoke your SageMaker endpoints from the internet through manually signed AWS Signature Version 4 requests—no need for Amazon API Gateway as an intermediary.
  7. Perform load tests against both endpoints.
  8. Analyze the performance of both endpoints under load in the CloudWatch dashboard, and look at how the optimized endpoint outperforms the unoptimized one.


Before getting started, you must have administrator access to an AWS account, and complete the following steps:

  1. Create an AWS Identity and Access Management (IAM) role for SageMaker that has the AmazonSageMakerFullAccess managed policy attached along with an inline policy that contains additional required permissions.

The following screenshot is an example of a properly configured role called NeoBlog.

[Figure: NeoBlog IAM role configuration]

The AdditionalRequiredPermissionsForSageMaker inline policy contains the following JSON:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "cloudwatch:PutDashboard",
            "Resource": "arn:aws:cloudwatch::*:dashboard/NeoDemo"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteBucket",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion",
                "s3:PutLifeCycleConfiguration",
                "s3:GetEncryptionConfiguration",
                "s3:PutEncryptionConfiguration",
                "s3:PutBucketPolicy",
                "s3:DeleteBucketPolicy",
                "s3:GetBucketPolicy",
                "s3:GetBucketPolicyStatus"
            ],
            "Resource": "arn:aws:s3:::serverless-artillery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:CreateStack",
                "cloudformation:UpdateStack",
                "cloudformation:DeleteStack",
                "cloudformation:DescribeStacks",
                "cloudformation:DescribeStackEvents",
                "cloudformation:DescribeStackResource",
                "cloudformation:DescribeStackResources",
                "cloudformation:ListStackResources"
            ],
            "Resource": "arn:aws:cloudformation:*:*:stack/serverless-artillery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:ValidateTemplate"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:CreateRole",
                "iam:DeleteRolePolicy",
                "iam:PutRolePolicy",
                "iam:DeleteRole",
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::*:role/serverless-artillery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sns:CreateTopic",
                "sns:DeleteTopic",
                "sns:GetTopicAttributes"
            ],
            "Resource": "arn:aws:sns:*:*:serverless-artillery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "lambda:UpdateFunctionCode",
                "lambda:ListVersionsByFunction",
                "lambda:PublishVersion",
                "lambda:InvokeFunction",
                "lambda:GetFunction",
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:GetFunctionConfiguration",
                "lambda:AddPermission"
            ],
            "Resource": "arn:aws:lambda:*:*:function:serverless-artillery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "logs:CreateLogGroup"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:serverless-artillery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DeleteLogGroup",
                "lambda:RemovePermission"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "events:DescribeRule",
                "events:PutRule",
                "events:DeleteRule",
                "events:PutTargets",
                "events:RemoveTargets"
            ],
            "Resource": "arn:aws:events:*:*:rule/serverless-artillery-*"
        }
    ]
}

Our next step is to create a SageMaker notebook instance.

  1. On the SageMaker console, under Notebooks, choose Notebook instances.
  2. Choose Create notebook instance.
  3. For Notebook instance name, enter NeoBlog.
  4. For Notebook instance type, choose your instance (for this post, the default ml.t2.medium should be enough).
  5. For IAM role, choose the NeoBlog role that you created.
  6. In the Git repositories section, select Clone a public Git repository to this notebook instance only.
  7. For Git repository URL, enter https://github.com/aws-samples/amazon-sagemaker-neo-performance-gains.
  8. Choose Create notebook instance.
  9. After the notebook has reached a Running status, choose Open Jupyter to connect to your notebook instance.
  10. Navigate to the neo-blog repository in Jupyter and choose the NeoBlog.ipynb notebook to start it.

You’re now ready to walk through the remainder of this post and run the notebook’s contents.

Notebook walkthrough

The code snippets in this post match the code in the NeoBlog notebook. This post contains the most relevant commentary, and the notebook provides additional detail. When extra information is provided in the notebook, it’s called out accordingly. Let’s get started!

First, we must retrieve the Abalone dataset and split it into training and validation sets. We store the data in libsvm format.

  1. Run the following two cells in the Jupyter notebook:
from pathlib import Path
import boto3

for p in ['raw_data', 'training_data', 'validation_data']:
    Path(p).mkdir(exist_ok=True)

s3 = boto3.client('s3')
s3.download_file('sagemaker-sample-files', 'datasets/tabular/uci_abalone/abalone.libsvm', 'raw_data/abalone')

from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.model_selection import train_test_split

X, y = load_svmlight_file('raw_data/abalone')
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1984, shuffle=True)

dump_svmlight_file(x_train, y_train, 'training_data/abalone.train')
dump_svmlight_file(x_test, y_test, 'validation_data/abalone.test')
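For readers unfamiliar with the libsvm format, each row is a label followed by `index:value` pairs, and indices that are omitted are treated as zero. The following is a minimal sketch of a parser using only the standard library; the `parse_libsvm_line` helper and the sample row are illustrative assumptions, not code or data from the notebook (which relies on scikit-learn for this).

```python
def parse_libsvm_line(line):
    """Parse 'label idx:value idx:value ...' into (label, {idx: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Hypothetical row: a label followed by eight 1-indexed features.
sample = "15 1:2 2:0.675 3:0.55 4:0.175 5:1.689 6:0.694 7:0.371 8:0.474"
label, features = parse_libsvm_line(sample)
print(label, len(features))
```

Storing features in a dictionary keyed by index mirrors the sparse nature of the format: only the nonzero entries need to appear in the file.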

Now that we have our data shuffled and prepared, we can train an unoptimized XGBoost model. Refer to the commentary in the Jupyter notebook for details related to the container framework version, hyperparameters, and training mode being used.

  1. Train the model by running the following code cell:
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

bucket = Session().default_bucket()
role = sagemaker.get_execution_role()

# initialize hyperparameters
hyperparameters = {
    "max_depth":"5",
    "eta":"0.2",
    "gamma":"4",
    "min_child_weight":"6",
    "subsample":"0.7",
    "verbosity":"1",
    "objective":"reg:squarederror",
    "num_round":"10000"
}

# construct a SageMaker XGBoost estimator
# specify the entry_point to your xgboost training script
estimator = XGBoost(entry_point = "entrypoint.py",
                    framework_version='1.2-1', # 1.x MUST be used
                    hyperparameters=hyperparameters,
                    role=role,
                    instance_count=1,
                    instance_type='local',
                    output_path=f's3://{bucket}/neo-demo') # gets saved in bucket/neo-demo/job_name/model.tar.gz

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput('file://training_data', content_type=content_type)
validation_input = TrainingInput('file://validation_data', content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input}, logs=['Training'])

When the local training job finishes running (it should only take a few minutes), the next step is to deploy the XGBoost model artifact to a SageMaker endpoint. The Jupyter notebook contains additional information related to why we use the c5 instance family class, along with how the model artifact is saved in Amazon Simple Storage Service (Amazon S3).

  1. Deploy the model artifact by running the following cell:
from sagemaker.xgboost.model import XGBoostModel

# grab the model artifact that was written out by the local training job
s3_model_artifact = estimator.latest_training_job.describe()['ModelArtifacts']['S3ModelArtifacts']

# we have to switch from local mode to remote mode
xgboost_model = XGBoostModel(
    model_data=s3_model_artifact,
    role=role,
    entry_point="entrypoint.py",
    framework_version='1.2-1',
)

unoptimized_endpoint_name = 'unoptimized-c5'

xgboost_model.deploy(
    initial_instance_count = 1,
    instance_type='ml.c5.large',
    endpoint_name=unoptimized_endpoint_name
)

After the unoptimized model is deployed (the cell has stopped running), we run a Neo compilation job to optimize the model artifact. In the following code, we use the c5 instance type family, choose the XGBoost framework, and include an input shape vector. The input shape is unused by Neo, but the compilation job throws an error if no value is provided. The compilation job also uses the 1.2.1 version of XGBoost by default, which again is why we specified the 1.2-1 framework version during model training.

  1. Run the Neo compilation job with the following code:
job_name = s3_model_artifact.split("/")[-2]
neo_model = xgboost_model.compile(
    target_instance_family="ml_c5",
    role=role,
    input_shape =f'{{"data": [1, {X.shape[1]}]}}',
    output_path =f's3://{bucket}/neo-demo/{job_name}', # gets saved in bucket/neo-demo/model-ml_c5.tar.gz
    framework = "xgboost",
    job_name=job_name # what it shows up as in console
)
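The two string manipulations in the compilation cell are easy to misread, so here is a small standalone sketch of what they produce. The S3 URI and the feature count below are hypothetical stand-ins for the notebook's `s3_model_artifact` and `X.shape[1]`; the `json.dumps` variant is offered only as an equivalent alternative, not as what the notebook does.

```python
import json

# Hypothetical values standing in for the notebook's variables.
s3_model_artifact = "s3://example-bucket/neo-demo/example-job-name/model.tar.gz"
num_features = 8  # assumed stand-in for X.shape[1]

# The second-to-last path segment of the artifact URI is the training job name.
job_name = s3_model_artifact.split("/")[-2]

# f-string form used in the cell: doubled braces produce literal braces.
shape_from_fstring = f'{{"data": [1, {num_features}]}}'

# Equivalent form that lets the json module handle quoting and braces.
shape_from_json = json.dumps({"data": [1, num_features]})

print(job_name)
print(shape_from_fstring)
```

Both forms yield the same JSON string, so the choice between them is a matter of readability.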

  1. When the cell stops running and the compilation job is complete, we deploy the Neo-optimized model to its own separate SageMaker endpoint:
optimized_endpoint_name = 'neo-optimized-c5'

neo_model.deploy(
    initial_instance_count = 1,
    instance_type='ml.c5.large',
    endpoint_name=optimized_endpoint_name
)

  1. Next, we validate that the endpoints are functioning as expected. When you run the following code block, you should see numerical predictions returned from both endpoints.
import boto3

smr = boto3.client('sagemaker-runtime')

resp = smr.invoke_endpoint(EndpointName='neo-optimized-c5', Body=b'2,0.675,0.55,0.175,1.689,0.694,0.371,0.474', ContentType='text/csv')
print('neo-optimized model response: ', resp['Body'].read())
resp = smr.invoke_endpoint(EndpointName='unoptimized-c5', Body=b'2,0.675,0.55,0.175,1.689,0.694,0.371,0.474', ContentType='text/csv')
print('unoptimized model response: ', resp['Body'].read())
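The request bodies above each carry a single row, but a text/csv body can hold several rows, one per line. The following sketch shows one way to build such a body and to parse a numeric response. Both helper names are hypothetical, and the response format (comma- or newline-separated numbers) is an assumption that may differ depending on the serving container, so treat the parser as illustrative only.

```python
def build_csv_body(rows):
    """Encode rows of numeric features as a newline-separated text/csv body."""
    lines = [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines).encode("utf-8")


def parse_predictions(body):
    """Parse a body of comma- or newline-separated numbers (assumed format)."""
    text = body.decode("utf-8").replace("\n", ",")
    return [float(token) for token in text.split(",") if token.strip()]


row = [2, 0.675, 0.55, 0.175, 1.689, 0.694, 0.371, 0.474]
body = build_csv_body([row, row])
print(body)

# Made-up response bytes, used only to exercise the parser.
print(parse_predictions(b"9.5,10.25\n"))
```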

With both endpoints up and running, we can create the CloudWatch dashboard that we use to analyze endpoint performance. For this post, we monitor the metrics CPUUtilization, ModelLatency (which measures how long it takes for a model to return a prediction), and Invocations (which helps us monitor the progress of the load test against the endpoints).

  1. Run the following cell to create the dashboard:
import json

cw = boto3.client('cloudwatch')

dashboard_name = 'NeoDemo'
region = Session().boto_region_name # get region we're currently in

body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0,
            "y": 0,
            "width": 24,
            "height": 12,
            "properties": {
                "metrics": [
                    [ "AWS/SageMaker", "Invocations", "EndpointName", optimized_endpoint_name, "VariantName", "AllTraffic", { "stat": "Sum", "yAxis": "left" } ],
                    [ "...", unoptimized_endpoint_name, ".", ".", { "stat": "Sum", "yAxis": "left" } ],
                    [ ".", "ModelLatency", ".", ".", ".", "." ],
                    [ "...", optimized_endpoint_name, ".", "." ],
                    [ "/aws/sagemaker/Endpoints", "CPUUtilization", ".", ".", ".", ".", { "yAxis": "right" } ],
                    [ "...", unoptimized_endpoint_name, ".", ".", { "yAxis": "right" } ]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": region,
                "stat": "Average",
                "period": 60,
                "title": "Performance Metrics",
                "start": "-PT1H",
                "end": "P0D"
            }
        }
    ]
}

cw.put_dashboard(DashboardName=dashboard_name, DashboardBody=json.dumps(body))

print('link to dashboard:')

After you run the cell, you can choose the output link to go to the dashboard, but you won’t see any meaningful data plotted just yet.
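Because the dashboard body is built as a Python dictionary and then serialized, Python literals such as `False` end up as valid JSON. The sketch below illustrates this with a smaller widget that spells out each metric in full instead of using the shorthand entries. The `make_metric_widget` helper and the hardcoded Region are assumptions made for illustration; nothing here calls AWS.

```python
import json


def make_metric_widget(title, metrics, region, period=60):
    """Build a full-width time-series metric widget as a plain dict."""
    return {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 24,
        "height": 12,
        "properties": {
            "metrics": metrics,
            "view": "timeSeries",
            "stacked": False,  # serialized by json.dumps as JSON false
            "region": region,
            "stat": "Average",
            "period": period,
            "title": title,
        },
    }


# Each metric written out in full: namespace, metric name, then dimension pairs.
metrics = [
    ["AWS/SageMaker", "ModelLatency", "EndpointName", "neo-optimized-c5",
     "VariantName", "AllTraffic"],
    ["AWS/SageMaker", "ModelLatency", "EndpointName", "unoptimized-c5",
     "VariantName", "AllTraffic"],
]

widget = make_metric_widget("Model latency", metrics, region="us-east-1")
dashboard_body = json.dumps({"widgets": [widget]})
print(dashboard_body)
```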

Now that the dashboard is created, we can proceed with setting up the Serverless Artillery CLI. To do this, we install Node.js, the Serverless Framework, and Serverless Artillery on our SageMaker notebook instance. The cell that installs Node.js can take a long time to run, which is normal.

  1. Run the following cell to install Node.js and the Serverless Framework:
%conda install -c conda-forge nodejs 

Next, we deploy Serverless Artillery. The code first changes directories into the directory that contains the code for our load generating AWS Lambda function. Then it installs the function’s dependencies and uses the Serverless Artillery CLI to package and deploy the load generating function into our account via the Serverless Framework. For more information on what Serverless Artillery is doing under the hood, refer to the Jupyter notebook.

We set up Serverless Artillery to directly hit our SageMaker endpoints with manually signed requests using the AWS Signature Version 4 algorithm. The benefit of this approach is that we get to directly hit and measure the performance of exclusively the endpoints during the load test. If we front our endpoints with intermediary services like a Lambda-backed API Gateway, the load test results capture the performance characteristics of all three services together rather than just the SageMaker resources.
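To make the signing step less opaque, here is a minimal sketch of Signature Version 4 for a POST to a SageMaker runtime endpoint, written with only the Python standard library. It is an illustration of the algorithm, not the code that the load generating function runs (that lives in the repository's processor.js). The function names, the placeholder credentials, and the fixed timestamp are all assumptions, and the sketch assumes an endpoint name that needs no URI encoding.

```python
import datetime
import hashlib
import hmac


def _hmac_sha256(key, message):
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service):
    """Derive the signing key by chaining HMAC-SHA256 over the credential scope."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign_invocation(access_key, secret_key, region, endpoint_name, body, now):
    """Return headers for a signed text/csv POST to an endpoint's invocations URL."""
    service = "sagemaker"
    host = f"runtime.sagemaker.{region}.amazonaws.com"
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # Step 1: canonical request (method, path, query, headers, signed headers, body hash).
    signed_headers = "content-type;host;x-amz-date"
    canonical_headers = f"content-type:text/csv\nhost:{host}\nx-amz-date:{amz_date}\n"
    canonical_request = "\n".join([
        "POST",
        f"/endpoints/{endpoint_name}/invocations",
        "",  # empty query string
        canonical_headers,
        signed_headers,
        hashlib.sha256(body).hexdigest(),
    ])

    # Step 2: string to sign, which binds the request hash to the credential scope.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Step 3: signature computed with the derived signing key.
    signing_key = derive_signing_key(secret_key, date_stamp, region, service)
    signature = hmac.new(
        signing_key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    return {
        "Content-Type": "text/csv",
        "Host": host,
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }


# Placeholder credentials and a fixed timestamp, for illustration only.
headers = sign_invocation(
    access_key="AKIDEXAMPLE",
    secret_key="example-secret-key",
    region="us-east-1",
    endpoint_name="neo-optimized-c5",
    body=b"2,0.675,0.55,0.175,1.689,0.694,0.371,0.474",
    now=datetime.datetime(2021, 1, 1, 12, 0, 0, tzinfo=datetime.timezone.utc),
)
print(headers["Authorization"])
```

When the caller uses temporary credentials, as a Lambda function does, the session token must also be sent in an X-Amz-Security-Token header; that detail is omitted here to keep the sketch short.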

  1. Deploy Serverless Artillery with the following code:
!cd serverless_artillery && npm install && slsart deploy --stage dev

After running these cells, you should have Node.js version 12.4.0 or higher, Serverless Framework version 1.80.0, and Serverless Artillery version 0.4.9.

The next task is to create the load test definition, which we do by running two cells. The first cell defines a custom magic command, and the second cell creates the load test definition and saves it into script.yaml.

The test definition has six phases, each of which runs for 2 minutes. The first phase begins with an arrival rate of 20 users per second, meaning that approximately 10 requests are generated and sent to each endpoint every second for 2 minutes. The next three phases each scale up by an additional 20 users per second, and the last two phases each scale up by 40. Each request contains 125 rows for inference. The documentation for Artillery (the tool that Serverless Artillery is based on) is a good resource for learning about the structure and additional features of load test definitions.
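The arithmetic behind these phases can be checked with a short sketch. It assumes that arrivals are split roughly evenly between the two endpoints, as described above; the `summarize_phases` helper is a hypothetical name introduced for illustration.

```python
PHASE_DURATION_S = 120
ARRIVAL_RATES = [20, 40, 60, 80, 120, 160]  # users per second, one entry per phase
NUM_ENDPOINTS = 2  # assumes arrivals are split evenly across both endpoints
ROWS_PER_REQUEST = 125


def summarize_phases(rates, num_endpoints, rows_per_request):
    """Compute approximate per-endpoint load for each phase."""
    summary = []
    for rate in rates:
        rps_per_endpoint = rate / num_endpoints
        summary.append({
            "arrival_rate": rate,
            "rps_per_endpoint": rps_per_endpoint,
            "invocations_per_minute_per_endpoint": rps_per_endpoint * 60,
            "rows_per_second_per_endpoint": rps_per_endpoint * rows_per_request,
        })
    return summary


summary = summarize_phases(ARRIVAL_RATES, NUM_ENDPOINTS, ROWS_PER_REQUEST)
total_duration_min = len(ARRIVAL_RATES) * PHASE_DURATION_S / 60
total_requests = sum(ARRIVAL_RATES) * PHASE_DURATION_S

for phase in summary:
    print(phase)
print(total_duration_min, total_requests)
```

Under these assumptions, the final phase sends roughly 80 requests per second to each endpoint, which at 125 rows per request is about 10,000 rows per second per endpoint.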

  1. Create the load test definition with the following code:
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writefilewithvariables(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))

# Get region that we're currently in
region = Session().boto_region_name

%%writefilewithvariables script.yaml
config:
  variables:
    unoptimizedEndpointName: {unoptimized_endpoint_name} # the xgboost model has 10000 trees
    optimizedEndpointName: {optimized_endpoint_name} # the xgboost model has 10000 trees
    numRowsInRequest: 125 # Each request to the endpoint contains 125 rows
  target: 'https://runtime.sagemaker.{region}.amazonaws.com'
  phases:
    - duration: 120
      arrivalRate: 20 # 1200 total invocations per minute (600 per endpoint)
    - duration: 120
      arrivalRate: 40 # 2400 total invocations per minute (1200 per endpoint)
    - duration: 120
      arrivalRate: 60 # 3600 total invocations per minute (1800 per endpoint)
    - duration: 120
      arrivalRate: 80 # 4800 invocations per minute (2400 per endpoint... this is the max of the unoptimized endpoint)
    - duration: 120
      arrivalRate: 120 # only the neo endpoint can handle this load...
    - duration: 120
      arrivalRate: 160
  processor: './processor.js'

scenarios:
  - flow:
      - post:
          url: '/endpoints/{{{{ unoptimizedEndpointName }}}}/invocations'
          beforeRequest: 'setRequest'
  - flow:
      - post:
          url: '/endpoints/{{{{ optimizedEndpointName }}}}/invocations'
          beforeRequest: 'setRequest'
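The quadruple braces in the scenario URLs exist because the custom magic passes the cell through Python's `str.format`, which consumes one level of braces. The sketch below shows the effect on a two-line excerpt; the hardcoded Region is a placeholder for the notebook's `region` variable.

```python
# Two-line excerpt of the load test definition, as written in the cell.
template = (
    "target: 'https://runtime.sagemaker.{region}.amazonaws.com'\n"
    "url: '/endpoints/{{{{ optimizedEndpointName }}}}/invocations'\n"
)

# Single braces are substituted; each doubled brace collapses to one literal brace.
rendered = template.format(region="us-east-1")
print(rendered)
```

After formatting, the URL contains a double-brace placeholder, which is the template syntax that Artillery resolves from the `variables` section when the test runs.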

With the load test defined, we’re now ready to start it! Because there are six stages with each stage taking 2 minutes, the test runs for a total of 12 minutes. You can monitor the progression of the load test by clicking on the link generated by running the second cell. The link redirects you to the CloudWatch dashboard that you created earlier.

  1. Perform the load test with the following code:
!slsart invoke --stage dev --path script.yaml

print("Here's the link to the dashboard again:")

Review the CloudWatch metrics

After 12 minutes have passed, refresh the dashboard and look at the metrics that have been captured.

The plotted data should look similar to the following screenshot, which has several interesting observations to unpack.

First of all, even at the very beginning of the load test, when both endpoints were only handling about 10 requests per second (RPS), the model latency of the Neo-optimized SageMaker endpoint was already almost three times lower than that of the unoptimized endpoint. This shows you the power of Neo: with one quick compilation job, we unlocked a nearly threefold performance improvement in our XGBoost model hosted on SageMaker!

[Figure: CloudWatch dashboard graph]

Secondly, by the end of the load test, the ModelLatency metric of the unoptimized model spiked to almost 1.5 seconds per request. The unoptimized model’s CPUUtilization metric also reached 181%, which is close to the endpoint’s theoretical maximum of 200% given that the ml.c5.large instance type has 2 vCPUs. On the other hand, the optimized endpoint’s ModelLatency metric never crossed 10,000 microseconds (10 milliseconds), and its CPUUtilization metric stayed well below capacity at under 50%. This indicates that the Neo-optimized endpoint could handle even more load if needed, well beyond the load test’s maximum of 80 requests per second per endpoint.
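The figures in this paragraph mix units, so a short sketch may help when reading the dashboard. It uses only the rounded numbers quoted above; the helper names are hypothetical.

```python
def cpu_capacity_percent(vcpus):
    """Theoretical CPUUtilization ceiling: 100% per vCPU."""
    return 100 * vcpus


def microseconds_to_ms(microseconds):
    """ModelLatency is reported in microseconds; convert to milliseconds."""
    return microseconds / 1000


capacity = cpu_capacity_percent(2)       # ml.c5.large has 2 vCPUs
unoptimized_share = 181 / capacity       # fraction of capacity in use at peak
optimized_share_bound = 50 / capacity    # upper bound, since usage stayed under 50%
optimized_latency_ms = microseconds_to_ms(10_000)

print(capacity, unoptimized_share, optimized_share_bound, optimized_latency_ms)
```

In other words, the unoptimized endpoint was using roughly 90% of its available CPU at peak, while the optimized endpoint stayed under a quarter of it.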

[Figure: CloudWatch dashboard graph]

Looking at the following graph, we can also see that the unoptimized endpoint’s performance begins to drop off drastically around the 21:27 timestamp. To get a better idea of what’s going on, deselect the ModelLatency metric for the unoptimized endpoint (the green line) to get the graph in the subsequent image. The Invocations metrics then confirm the story. Up until the 21:27 mark, both endpoints were handling almost exactly the same number of requests from the load test (indicated by the blue and orange lines). Past the 21:27 mark, when the number of requests per second starts to go above 40, the unoptimized endpoint begins to struggle to keep up. This indicates that the maximum load that the unoptimized endpoint can sustain is around 40 RPS.

[Figure: CloudWatch dashboard graph]

The load test report generated by Serverless Artillery is also available to us by navigating to CloudWatch in the console, choosing Log groups under Logs, and searching for the log group that has serverless-artillery in its name. If you choose the log group and then choose the most recent log stream, you can see that the last entries comprise a report that looks similar to the following image. This report’s metrics are an aggregate of the performance of both SageMaker endpoints, so in this case it’s not very useful to us. The one interesting thing to point out is that under the heavier arrival rates, the unoptimized endpoint started to return 400 status response codes, a sign of it being overwhelmed.

[Figure: Serverless Artillery load test report]

Clean up

With the load test completed and the results analyzed, all that’s left to do is to clean up the deployed resources by running the following two cells. The first cell deletes the two SageMaker endpoints (and their endpoint configurations) that were deployed, and the second cell destroys the Serverless Artillery resources.

# delete endpoints and endpoint configurations
sm = boto3.client('sagemaker')

for name in [unoptimized_endpoint_name, optimized_endpoint_name]:
    sm.delete_endpoint(EndpointName=name)
    sm.delete_endpoint_config(EndpointConfigName=name)

!slsart remove --stage dev

After you run the preceding cells, exit this notebook and stop or delete the notebook instance. To stop the notebook instance, on the SageMaker console, choose Notebook instances, select the NeoBlog notebook, and on the Actions menu, choose Stop.


Congratulations! You have successfully finished walking through this post. We were able to accomplish the following:

  • Optimize an XGBoost model artifact generated through a local training job with a Neo compilation job
  • Deploy both versions of the artifact to SageMaker endpoints
  • Deploy Serverless Artillery from our Jupyter notebook and configure the tool so that it directly invokes our SageMaker endpoints
  • Perform load tests against both endpoints with Serverless Artillery
  • Analyze our load test results and view how the Neo-optimized model outperforms the unoptimized model

The performance improvements gained through Neo can translate to significant cost savings. As a next step, you should look at your existing portfolio of models to evaluate them as potential candidates for optimization jobs. Creating Neo-optimized artifact versions allows you to achieve equivalent (if not better) performance metrics with less powerful resources, and it’s one of the easiest ways to save money on SageMaker endpoints.

Additionally, you can apply the load testing approach demonstrated in this post to any SageMaker endpoint. When used in tandem, Serverless Artillery and CloudWatch combine into a powerful framework for profiling the performance characteristics of your endpoints, which can then help you make data-driven decisions on what resource configurations best fit your needs. Simply deploy your models, update your load test definition, and start testing!

For more information about Neo, see Compile and Deploy Models with Neo. For other topics and services related to SageMaker, check out the AWS Machine Learning Blog.

About the Author

Adam Kozdrowicz is a Data and Machine Learning Engineer for AWS Professional Services. He specializes in bringing ML proof of concepts into production and automating the entire ML lifecycle. This includes data collection, data processing, model development and training, model deployments, and model monitoring. He also enjoys working with frameworks such as AWS Amplify, AWS SAM, and AWS CDK. During his free time, Adam likes to surf, travel, practice photography, and build machine learning models.