Machine learning (ML) is realized in inference. The business problem you want your ML model to solve is the inferences or predictions that you want your model to generate. Deployment is the stage in which a model, after being trained, is ready to accept inference requests. In this post, we describe the parameters that you can tune to maximize performance of both CPU-based and GPU-based Amazon SageMaker real-time endpoints. SageMaker is a managed, end-to-end service for ML. It provides data scientists and MLOps teams with the tools to enable ML at scale. It provides tools to facilitate each stage in the ML lifecycle, including deployment and inference.

SageMaker supports both real-time inference with SageMaker endpoints and offline and temporary inference with SageMaker batch transform. In this post, we focus on real-time inference for TensorFlow models.

Performance tuning and optimization

For model inference, we seek to optimize costs, latency, and throughput. In a typical application powered by ML models, we can measure latency at various time points. Throughput is usually bounded by latency. Costs are calculated based on instance usage, and price/performance is calculated based on throughput and SageMaker ML instance cost per hour. Finally, as we continue to advance rapidly in all aspects of ML including low-level implementations of mathematical operations in chip design, hardware-specific libraries will play a greater role in performance optimization. Rapid experimentation that SageMaker facilitates is the lynchpin in achieving business objectives in a cost-effective, timely, and performant manner.

Performance tuning and optimization is an empirical field. The number of parameters to tune is combinatorial such that each set of configuration parameter values aren’t independent of each other. Various factors such as payload size, network hops, nature of hops, model graph features, operators in the model, and the model’s CPU, GPU, memory, and I/O profiles affect the optimal parameter tuning. The distribution of these effects on performance is a vast unexplored space. Therefore, we begin by describing these different parameters and recommend an empirical approach to tune these parameters and understand their effects on your model performance.

Based on our past observations, the function of effect of these parameters on an inference workload is, approximately, plateau-shaped or Gaussian-uniform. The values to maximize the performance of an endpoint lie along the ascendant curve of this distribution, demarcated by latencies. Typically, latencies increase with an increase in throughput. Improvements in throughput levels out or plateaus at a point where respective increases in concurrent connections don’t result in any significant improvement in throughput. Certain cases may show a detrimental effect from increasing certain parameters, such that the throughput rapidly decreases as the system is saturated with overhead.

The following chart illustrates transactions per second demarcated by latencies.

1 1766

SageMaker TensorFlow Deep Learning Containers (DLCs) recently introduced new parameters to help with performance optimization of a CPU-based or GPU-based endpoint. As we discussed earlier, an ideal value of each of these parameters is subjective to factors such as model, model input size, batch size, endpoint instance type, and payload. What follows next is a description of these tunable parameters.

TensorFlow serving

We start with parameters related to TensorFlow serving.


For TensorFlow-based models, the tensorflow_model_server binary is the operational piece that is responsible for loading a model in memory, running inputs against a model graph, and deriving outputs. Typically, a single instance of this binary is launched to serve models in an endpoint. This binary is internally multi-threaded and spawns multiple threads to respond to an inference request. In certain instances, if you observe that the CPU is respectably employed (over 30% utilized) but the memory is underutilized (less than 10% utilization), increasing this parameter might help. We have observed in our experiments that increasing the number of tensorflow_model_servers available to serve typically increases the throughput of an endpoint.


This parameter governs the fraction of the available GPU memory to initialize CUDA/cuDNN and other GPU libraries. 0.2 means 20% of the available GPU memory is reserved for initializing CUDA/cuDNN and other GPU libraries, and 80% of the available GPU memory is allocated equally across the TF processes. GPU memory is pre-allocated unless the allow_growth option is enabled.

Deep learning operators

Operators are nodes in a deep learning graph that perform mathematical operations on data. These nodes can be independent of each other and therefore can be run in parallel. In addition, you can internally parallelize nodes for operators such as tf.matmul() and tf.reduce_sum(). Next, we describe two parameters to control running these two operators using the TensorFlow threadpool.


This ties back to the inter_op_parallelism_threads variable. This variable determines the number of threads used by independent non-blocking operations. 0 means that the system picks an appropriate number.


This ties back to the intra_op_parallelism_threads variable. This determines the number of threads that can be used for certain operations like matrix multiplication and reductions for speedups. A value of 0 means that the system picks an appropriate number.

Architecture for serving an inference request over HTTP

Before we look at the next set of parameters, let’s understand the typical arrangement when we deploy Nginx and Gunicorn to frontend tensorflow_model_server. Nginx is responsible for listening on port 8080; it accepts a connection and forwards it to Gunicorn, which serves as a Python HTTP Web Server Gateway Interface. Gunicorn is responsible for replying to /ping and forwarding /invocations to tensorflow_model_server. While replying to /invocations, Gunicorn invokes tensorflow_model_server with the payload.

The following diagram illustrates the anatomy of a SageMaker endpoint.

2 1766


This governs the number of worker processes that Gunicorn is requested to spawn for handling requests. This value is used in combination with other parameters to derive a set that maximizes inference throughput. In addition to this, SAGEMAKER_GUNICORN_WORKER_CLASS governs the type of workers spawned, typically async, typically gevent.

OpenMP (Open Multi-Processing)

OpenMP is an implementation of multithreading, a method of parallelizing whereby a primary thread (a series of instructions ran consecutively) forks a specified number of sub-threads and the system divides a task among them. The threads then run concurrently, with the runtime environment allocating threads to different processors. Various parameters control the behavior of this library; in this post, we explore the impact of changing one of these many parameters. For a full list of parameters available and their intended use, refer to Environment Variables.


Python internally uses OpenMP for implementing multithreading within processes. Typically, threads equivalent to the number of CPU cores are spawned. But when implemented on top of a Simultaneous Multi Threading (SMT) such Intel’s HypeThreading, a certain process might oversubscribe a particular core by spawning twice the threads as the number of actual CPU cores. In certain cases, a Python binary might end up spawning up to four times the threads as available actual processor cores. Therefore, an ideal setting for this parameter, if you have oversubscribed available cores using worker threads, is 1 or half the number of CPU cores on a SMT-enabled CPU.

In our experiments, we changed the values of these parameters as a tuple and not independently. Therefore, all the results and guidance assume the preceding scenario. As the results illustrate, we observed an increase of over 1,900% to over 87% in some models.

The following table shows an increase in TPS by adjusting parameters for a retrieval type model on an ml.c5.9xlarge instance.

Number of workersNumber of TFSOMP_NUM_THREADInter Op ParallelizationIntra Op ParallelizationTPS

The following table shows an increase in TPS by adjusting parameters for a Single Shot Detector type model on an ml.p3.2xlarge instance.

Number of workersNumber of TFSOMP_NUM_THREADInter Op ParallelizationIntra Op ParallelizationTPS

The following diagram shows the resultant increase in TPS by adjusting parameters.

3 1766

Observe results in your own environments

Now that you know about these various parameters, how can you try them out in your environments? We first discuss how to set up these parameters, then describe a tool and methodology to test it and observe variations in latency and throughput.

Set up an endpoint with custom parameters

When you create a SageMaker endpoint, you can set values of these parameters by passing them in a dictionary for the env parameter in sagemaker.model.Model. See the following example code:

sagemaker_model = Model(image_uri=image_uri, model_data=model_location_in_s3, role=get_execution_role(), env={'SAGEMAKER_GUNICORN_WORKERS': ’10’, 'SAGEMAKER_TFS_INSTANCE_COUNT': ’20’, 'OMP_NUM_THREADS': '1', 'SAGEMAKER_TFS_INTER_OP_PARALLELISM': '4', 'SAGEMAKER_TFS_INTRA_OP_PARALLELISM': '1'})
predictor = sagemaker_model.deploy(initial_instance_count=1,instance_type=test_instance_type, wait=True, endpoint_name=endpoint_name)

Test for success

Now that our parameters are set up, how do we test for success? How do we standardize a test that is uniform across our runs? We recommend the open-source tool Locust. In its simplest form, it allows us to control the number of concurrent connections being sent across to a target (in this case, SageMaker endpoints). Via one concurrent connection, we’re invoking inference (using invoke_endpoint) as fast as possible, sequentially. So, although the connections (users in Locust parlance) are concurrent, the invocations against the endpoint requesting inference are sequential.

The following graph shows invocations tracked with respect to Locust users peaking at about over 45,000 (with 95 TFS server spawned).

4 1766

The following graph shows invocations for same instance peaking at around 11,000 (with 1 TFS server spawned).

5 1766

This allows us to observe, as an output of this Locust command, the end-to-end P95 latency and TPS for the duration of test. So roughly speaking, lower latency and higher TPS (users) is better. As we tune our parameters, we observe TPS ascend delta between two users (n and n+1), and test for it to reach a point at which with every respective increase in users, the TPS stays constant. At such a point, past a certain number of users, the latency usually explodes due to resource contention in the endpoint. The point just before this latency explosion is where we have an endpoint at its functional best.

Although we observe this increase in TPS and decrease in latency while we tune parameters, you should also focus on two other metrics: average CPU utilization and average memory utilization. When you’re adjusting the number of SAGEMAKER_GUNICORN_WORKERS and SAGEMAKER_TFS_INSTANCE_COUNT, your aim should be to drive both CPU and memory to the maximum and treat that as a soft threshold to understand the high watermark for the throughput of this particular endpoint. The hard threshold is the latency that you can tolerate.

The following graph tracks an increase in ModelLatency with respect to increased load.

6 1766

The following graph tracks an increase in CPUUtilization with respect to increased load.

7 1766

The following graph tracks in increase in MemoryUtilization with respect to increased load.

8 1766

Other optimizations to consider

You should consider a few other optimizations to further maximize the performance of your endpoint:

  • To further enhance performance, optimize the model graph by compiling, pruning, fusing, and so on.
  • You can also export models to an intermediate representation such as ONNX and use ONNX runtime for inference.
  • Inputs can be batched, serialized, compressed, and passed over the wire in binary format to save bandwidth and maximize utilization.
  • You can compile the TensorFlow Model Server binary to use hardware-specific optimizations (such as Intel optimizations like AVX-512 and MKL) or model optimizations such as compilation provided by SageMaker Neo. You can also use an optimized inference chip such as AWS Inferentia to further improve performance.
  • In SageMaker, you can gain an additional performance boost by deploying models with automatic scaling.


In this post, we explored a few parameters that you can use to maximize the performance of a TensorFlow-based SageMaker real-time endpoint. These parameters are in essence overprovisioning serving processes and adjusting their parallel processing capabilities. As we saw in the tables, this overprovisioning and adjustment leads to better utilization of resources and higher throughput, sometimes an increase as much as 1,000%.

Although the best way to derive the correct values is through experimentation, by observing the combinations of different parameters and its effect on performance across ML models and SageMaker ML instances, you can start to build empirical knowledge on performance tuning and optimization.

SageMaker provides the tools to remove the undifferentiated heavy lifting from each stage of the ML lifecycle, thereby facilitating rapid experimentation and exploration needed to fully optimize your model deployments.

For more information, see Maximize TensorFlow* Performance on CPU: Considerations and Recommendations for Inference Workloads, Meaning of inter_op_parallelism_threads and intra_op_parallelism_threads, and the model SageMaker inference API.

About the Authors

Chaitanya Hazarey Chaitanya Hazarey is a Senior ML Architect with the Amazon SageMaker team. He focuses on helping customers design, deploy, and scale end-to-end ML pipelines in production on AWS. He is also passionate about improving explainability, interpretability, and accessibility of AI solutions.



Karan KothariKaran Kothari is a software engineer at Amazon Web Services. He is on the Elastic Inference team working on building Model Server focused towards low latency inference workloads.




Liang MaLiang Ma is a software engineer at Amazon Web Services and is fascinated with enabling customers on their AI/ML journey in the cloud to become AWSome. He is also passionate about serverless architectures, data visualization, and data systems.



Santosh BhavaniSantosh Bhavani is a Senior Technical Product Manager with the Amazon SageMaker Elastic Inference team. He focuses on helping SageMaker customers accelerate model inference and deployment. In his spare time, he enjoys traveling, playing tennis, and drinking lots of Pu’er tea.



Aaron Keller 1Aaron Keller is a Senior Software Engineer at Amazon Web Services. He works on the real-time inference platform for Amazon SageMaker. In his spare time, he enjoys video games and amateur astrophotography.