Running machine learning (ML) inference on large datasets is a challenge faced by many companies. There are several approaches and architecture patterns to help you tackle this problem. But no single solution may deliver the desired results for efficiency and cost effectiveness. In this blog post, we will outline a few factors that can help you arrive at the most optimal approach for your business. We will illustrate a use case and architecture pattern with Amazon SageMaker to perform batch inference at scale.

ML inference can be done in real time on individual records, such as with a REST API endpoint. Inference can also be done in batch mode as a processing job on a large dataset. While both approaches push data through a model, each has its own target goal when running inference at scale.

With real-time inference, the goal is usually to optimize the number of transactions per second that the model can process. With batch inference, the goal is usually tied to time constraints and the service-level agreement (SLA) for the job. Table 1 shows the key attributes of real-time, micro-batch, and batch inference scenarios.

Real TimeMicro BatchBatch
Execution Mode
Prediction Latency
SubsecondSeconds to minutesIndefinite
Data BoundsUnbounded/streamBoundedBounded
Execution Frequency
Invocation Mode
Continuous stream/API callsEvent-basedEvent-based/scheduled
ExamplesReal-time REST API endpointData analyst running a SQL UDFScheduled inference job

Table 1. Key characteristics of real-time, micro-batch, and batch inference scenarios

Key considerations for batch inference jobs

Batch inference tasks are usually good candidates for horizontal scaling. Each worker within a cluster can operate on a different subset of data without the need to exchange information with other workers. AWS offers multiple storage and compute options that enable horizontal scaling. Table 2 shows some key considerations when architecting for batch inference jobs.

  • Model type and ML framework. Models built with frameworks such as XGBoost and SKLearn require smaller compute instances. Those built with deep learning frameworks, such as TensorFlow and PyTorch require larger ones.
  • Complexity of the model. Simple models can run on CPU instances while more complex ensemble models and large-scale deep learning models can benefit from GPU instances.
  • Size of the inference data. While all approaches work on small datasets, larger datasets come with a unique set of challenges. The storage system must provide sufficient throughput and I/O to reliably run the inference workload.
  • Inference frequency and job concurrency. The volume of jobs within a fixed interval of time is an important consideration to address Service Quotas. The frequency and SLA requirements also proportionally impact the number of concurrent jobs. This might create additional pressure on the underlying Service Quotas.
ML FrameworkModel Complexity
Inference Data Size
Inference Frequency
Job Concurrency
  • Traditional
    • XGBoost
    • SKLearn
  • Deep Learning
    • Tensorflow
    • PyTorch
  • Low (linear models)
  • Medium (complex ensemble models)
  • High (large scale DL models)
  • Small (<1 GB)
  • Medium (<100 GB)
  • Large (<1 TB)
  • Hyperscale (>1 TB)
  • Hourly
  • Daily
  • Weekly
  • Monthly
  • 1
  • <10
  • <100
  • >100

Table 2. Key considerations when architecting for batch inference jobs

Real world Batch Inference use case and architecture

Often customers in certain domains such as advertising and marketing or healthcare must make predictions on hyperscale datasets. This requires deploying an inference pipeline that can complete several thousand inference jobs on extremely large datasets. The individual models used are typically of low complexity from a compute perspective. They could include a combination of various algorithms implemented in scikit-learn, XGBoost, and TensorFlow, for example. Most of the complexity in these use cases stems from large volumes of data and the number of concurrent jobs that must run to meet the service level agreement (SLA).

The batch inference architecture for these requirements typically is composed of three layers:

  • Orchestration layer. Manages the submission, scheduling, tracking, and error handling of individual jobs or multi-step pipelines
  • Storage layer. Stores the data that will be inferenced upon
  • Compute layer. Runs the inference job

There are several AWS services available that can be used for each of these architectural layers. The architecture in Figure 1 illustrates a real world implementation. Amazon SageMaker Processing and training services are used for compute layer and Amazon S3 for the storage layer. Amazon Managed Workflows for Apache Airflow (MWAA) and Amazon DynamoDB are used for the orchestration and job control layer.

Figure 1. Architecture for batch inference at scale with Amazon SageMaker

Figure 1. Architecture for batch inference at scale with Amazon SageMaker

Orchestration and job control layer. Apache Airflow is used to orchestrate the training and inference pipelines with job metadata captured into DynamoDB. At each step of the pipeline, Airflow updates the status of each model run. A custom Airflow sensor polls the status of each pipeline. It advances the pipeline with the successful completion of each step, or resubmits a job in case of failure.

Compute layer. SageMaker processing is used as the compute option for running the inference workload. SageMaker has a purpose-built batch transform feature for running batch inference jobs. However, this feature often requires additional pre and post-processing steps to get the data into the appropriate input and output format. SageMaker Processing offers a general purpose managed compute environment to run a custom batch inference container with a custom script. In the architecture, the processing script takes the input location of the model artifact generated by a SageMaker training job and the location of the inference data, and performs pre and post-processing along with model inference.

Storage layer. Amazon S3 is used to store the large input dataset and the output inference data. The ShardedByS3Key data distribution strategy distributes the files across multiple nodes within a processing cluster. With this option enabled, SageMaker Processing will automatically copy a different subset of input files into each node of the processing job. This way you can horizontally scale batch inference jobs by requesting a higher number of instances when configuring the job.

One caveat of this approach is that while many ML algorithms utilize multiple CPU cores during training, only one core is utilized during inference. This can be rectified by using Python’s native concurrency and parallelism frameworks such concurrent.futures. The following pseudo-code illustrates how you can distribute the inference workload across all instance cores. This assumes the SageMaker Processing job has been configured to copy the input files into the /opt/ml/processing/input directory.

from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import cpu_count
import os
from glob import glob
import pandas as pd

def inference_fn(model_dir, file_path, output_dir):

model = joblib.load(f"{model_dir}/model.joblib")
data = pd.read_parquet(file_path)
data["prediction"] = model.predict(data)

output_path = f"{output_dir}/{os.path.basename(file_path)}"


return output_path

input_files = glob("/opt/ml/processing/input/*")
model_dir = "/opt/ml/model"
output_dir = "/opt/ml/output"

with ProcessPoolExecutor(max_workers=cpu_count()) as executor:
futures = [executor.submit(inference_fn, model_dir, file_path, output_dir) for file in input_files]

results =[]
for future in as_completed(futures):


In this blog post, we described ML inference options and use cases. We primarily focused on batch inference and reviewed key challenges faced when performing batch inference at scale. We provided a mental model of some key considerations and best practices to consider as you make various architecture decisions. We illustrated these considerations with a real world use case and an architecture pattern to perform batch inference at scale. This pattern can be extended to other choices of compute, storage, and orchestration services on AWS to build large-scale ML inference solutions.

More information: