This post is contributed by Kinnar Sen – Sr. Specialist Solutions Architect, EC2 Spot

TensorFlow (TF) is a popular choice for machine learning research and application development. It’s a machine learning (ML) platform used to build (train) and deploy (serve) machine learning models. TF Serving is part of the TF framework and is used for deploying ML models in production environments. TF Serving can be containerized using Docker and deployed in a cluster with Kubernetes. It is easy to run production-grade workloads on Kubernetes using Amazon Elastic Kubernetes Service (Amazon EKS), a managed service for creating and managing Kubernetes clusters. To cost optimize TF Serving workloads, you can use Amazon EC2 Spot Instances. Spot Instances are spare EC2 capacity available at up to a 90% discount compared to On-Demand Instance prices.

In this post I will illustrate deployment of TensorFlow Serving using Kubernetes via Amazon EKS and Spot Instances to build a scalable, resilient, and cost optimized machine learning inference service.


About TensorFlow Serving (TF Serving)

TensorFlow Serving is the recommended way to serve TensorFlow models. A flexible, high-performance system for serving models, TF Serving enables users to quickly deploy models to production environments. It provides out-of-the-box integration with TF models and can be extended to serve other kinds of models and data. TF Serving deploys a model server with gRPC/REST endpoints and can be used to serve multiple models (or versions). Requests can be served in two ways: one by one, or by batching individual requests. Batching is often used to unlock the high throughput of hardware accelerators (if used for inference) for offline high-volume inference jobs.
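
As a rough illustration of the REST endpoint, the sketch below assembles the URL and JSON body of a predict call. The host name, the helper name build_predict_request, and the fake image bytes are assumptions made for this example; the path and port match the ones used later in this post.

```python
import base64
import json


def build_predict_request(host, model_name, image_bytes, port=8500):
    """Return (url, body) for a TF Serving REST predict call.

    The host is a placeholder; 8500 is the REST port used in this post.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    # Binary inputs are sent base64-encoded under a "b64" key.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    payload = {"instances": [{"b64": encoded}]}
    return url, json.dumps(payload)


url, body = build_predict_request("localhost", "resnet", b"fake-image-bytes")
print(url)
```

The body would be sent with an HTTP POST. Whether a model accepts base64-encoded images depends on its serving signature, so treat the payload shape as an assumption for the ResNet example.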

Amazon EC2 Spot Instances

Spot Instances are spare Amazon EC2 capacity that enables customers to save up to 90% over On-Demand Instance prices. The price of Spot Instances is determined by long-term trends in supply and demand of spare capacity pools. A capacity pool is a group of EC2 instances belonging to a particular instance family, size, and Availability Zone (AZ). If EC2 needs capacity back for On-Demand usage, Spot Instances can be interrupted by EC2 with a two-minute notification. There are many graceful ways to handle the interruption so that the application remains resilient and fault tolerant, and the handling can be automated in the application and/or the infrastructure deployment. Spot Instances are ideal for stateless, fault-tolerant, loosely coupled, and flexible workloads that can handle interruptions.
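
To make the two-minute notification concrete, here is a minimal sketch that parses an interruption notice and computes how much time is left. The notice is published through the instance metadata service; the sample JSON and the helper name seconds_until_interruption are illustrative assumptions, and no network call is made.

```python
import json
from datetime import datetime, timezone

# Instance metadata path that returns the notice (HTTP 404 when none is pending).
INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_interruption(notice_json, now):
    """Return the seconds left before the action in the notice takes effect."""
    notice = json.loads(notice_json)
    # The time field is expressed in UTC, for example "2020-06-01T12:02:00Z".
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Hypothetical notice body, used here instead of querying the metadata service.
sample = '{"action": "terminate", "time": "2020-06-01T12:02:00Z"}'
now = datetime(2020, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(sample, now))  # 120.0
```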

TensorFlow Serving (TF Serving) and Kubernetes

Each pod in a Kubernetes cluster runs a TF Docker image with a TF Serving-based server and a model. The model contains the architecture of the TensorFlow graph, the model weights, and assets. This is a deployment setup with a configurable number of replicas. The replicas are exposed externally by a service and an External Load Balancer that helps distribute the requests to the service endpoints. To keep up with the demands of the service, Kubernetes can help scale the number of replicated pods using the Kubernetes Replication Controller.


There are several goals that we want to achieve through this solution.

  • Cost optimization – By using EC2 Spot Instances
  • High throughput – By using Application Load Balancer (ALB) created by Ingress Controller
  • Resilience – Ensuring high availability by replenishing nodes and gracefully handling the Spot interruptions
  • Elasticity – By using Horizontal Pod Autoscaler, Cluster Autoscaler, and EC2 Auto Scaling

This can be achieved by using the following components.

Component | Role | Details | Deployment Method
Cluster Autoscaler | Scales EC2 instances automatically according to pods running in the cluster | Open source | A deployment on On-Demand Instances
EC2 Auto Scaling group | Provisions and maintains EC2 instance capacity | AWS | AWS CloudFormation via eksctl
AWS Node Termination Handler | Detects EC2 Spot interruptions and automatically drains nodes | Open source | A DaemonSet on Spot and On-Demand Instances
AWS ALB Ingress Controller | Provisions and maintains Application Load Balancer | Open source | A deployment on On-Demand Instances

You can find more details about each component in this AWS blog. Let’s go through the steps that allow the deployment to be elastic.

  1. HTTP requests flow in through the ALB and Ingress object.
  2. Horizontal Pod Autoscaler (HPA) monitors the metrics (CPU / RAM) and once the threshold is breached a Replica (pod) is launched.
  3. If there are sufficient cluster resources, the pod starts running, else it goes into pending state.
  4. If one or more pods are in pending state, the Cluster Autoscaler (CA) triggers a scale up request to Auto Scaling group.
    1. If HPA tries to schedule pods more than the current size of what the cluster can support, CA can add capacity to support that.
  5. The Auto Scaling group provisions a new node and the application scales up.
  6. A scale down happens in the reverse fashion when requests start tapering down.
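
The scaling steps above can be sketched with two small formulas. The first follows the replica calculation documented for the Horizontal Pod Autoscaler; the second is a deliberate simplification of the Cluster Autoscaler, which in practice simulates scheduling rather than dividing pod counts. The function names and sample numbers are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Replica count the HPA aims for, given average utilization percentages."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


def extra_nodes_needed(pending_pods, pods_per_node):
    """Simplified estimate of the nodes required to place the pending pods."""
    return math.ceil(pending_pods / pods_per_node)


# 20 replicas running at 80% average CPU against a 50% target.
replicas = desired_replicas(current_replicas=20, current_utilization=80, target_utilization=50)
print(replicas)  # 32

# If the 12 additional pods stay pending and each new node fits 3 of them.
print(extra_nodes_needed(pending_pods=replicas - 20, pods_per_node=3))  # 4
```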


AWS ALB Ingress controller and ALB

We will be using an ALB along with an Ingress resource instead of the default External Load Balancer created by the TF Serving deployment. The open source AWS ALB Ingress controller triggers the creation of an ALB and the necessary supporting AWS resources whenever a Kubernetes user declares an Ingress resource in the cluster. The Ingress resource uses the ALB to route HTTP(S) traffic to different endpoints within the cluster. ALB is ideal for advanced load balancing of HTTP and HTTPS traffic. ALB provides advanced request routing targeted at delivery of modern application architectures, including microservices and container-based applications. This allows the deployment to maintain a high throughput and improve load balancing.

Spot Instance interruptions

To gracefully handle interruptions, we will use the AWS Node Termination Handler. The handler runs as a DaemonSet (one pod per node) to monitor each host and react accordingly. When it receives the Spot Instance two-minute interruption notification, it uses the Kubernetes API to cordon the node, which taints it so that no new pods are scheduled there. It then drains the node, removing any existing pods from the ALB.
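
The cordon-then-drain sequence can be illustrated without a cluster. The sketch below is not the handler's actual code: it only shows the node patch that cordoning amounts to and a simplified selection of the pods a drain would evict. The pod dictionaries and function names are hypothetical.

```python
def cordon_patch():
    """Patch body that marks a node unschedulable, which is what cordoning sets."""
    return {"spec": {"unschedulable": True}}


def pods_to_evict(pods):
    """Select the pods a drain would evict, skipping DaemonSet-managed pods."""
    evict = []
    for pod in pods:
        owners = pod.get("metadata", {}).get("ownerReferences", [])
        if any(owner.get("kind") == "DaemonSet" for owner in owners):
            continue  # DaemonSet pods would simply be recreated on the same node
        evict.append(pod["metadata"]["name"])
    return evict


# Hypothetical pods running on the node that received the notice.
pods = [
    {"metadata": {"name": "resnet-v1-abc", "ownerReferences": [{"kind": "ReplicaSet"}]}},
    {"metadata": {"name": "node-termination-handler-xyz", "ownerReferences": [{"kind": "DaemonSet"}]}},
]
print(cordon_patch())
print(pods_to_evict(pods))  # ['resnet-v1-abc']
```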

One of the best practices for using Spot is diversification, where instances are chosen from across different instance types, sizes, and Availability Zones. The capacity-optimized allocation strategy for EC2 Auto Scaling provisions Spot Instances from the most-available Spot Instance pools by analyzing capacity metrics, thus lowering the chance of interruptions.


Set up the cluster

We are using eksctl to create an Amazon EKS cluster with the name TensorFlowServingCluster in combination with a managed node group. The managed node group has three On-Demand t3.medium nodes and it will bootstrap with the labels lifecycle=OnDemand and intent=control-apps. Be sure to replace <YOUR REGION> with the Region you are launching your cluster into.

eksctl create cluster --name=TensorFlowServingCluster --node-private-networking --managed --nodes=3 --alb-ingress-access --region=<YOUR REGION> --node-type t3.medium --node-labels="lifecycle=OnDemand,intent=control-apps" --asg-access

Check the nodes provisioned by using kubectl get nodes.

Next, create the Spot node groups. Start with the eksctl configuration file: copy the node group configuration below into a file named spot_nodegroups.yml. Then run the eksctl command that follows it to add the new Spot nodes to the cluster.

kind: ClusterConfig
metadata:
  name: TensorFlowServingCluster
  region: <YOUR REGION>
nodeGroups:
  - name: prod-4vcpu-16gb-spot
    minSize: 0
    maxSize: 15
    desiredCapacity: 10
    instancesDistribution:
      instanceTypes: ["m5.xlarge", "m5d.xlarge", "m4.xlarge", "t3.xlarge", "t3a.xlarge", "m5a.xlarge", "t2.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    labels:
      lifecycle: Ec2Spot
      intent: apps
      "true"
    tags:
      Ec2Spot
      apps
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true
  - name: prod-8vcpu-32gb-spot
    minSize: 0
    maxSize: 15
    desiredCapacity: 10
    instancesDistribution:
      instanceTypes: ["m5.2xlarge", "m5n.2xlarge", "m5d.2xlarge", "m5dn.2xlarge", "m5a.2xlarge", "m4.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    labels:
      lifecycle: Ec2Spot
      intent: apps
      "true"
    tags:
      Ec2Spot
      apps
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true

eksctl create nodegroup -f spot_nodegroups.yml

A few points to note here. For more technical details, refer to the EC2 Spot workshop.

  • There are two diversified node groups created with a fixed vCPU:Memory ratio. This adheres to the Spot best practice of diversifying instances, and helps the Cluster Autoscaler function properly.
  • Capacity-optimized Spot allocation strategy is used in both the node groups.
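
The fixed vCPU:Memory ratio noted above can be checked with a few lines of code. The sketch below hard-codes the vCPU and memory (GiB) figures for the instance types listed in spot_nodegroups.yml, as published in the EC2 instance specifications, and confirms that each node group contains a single instance shape. The helper name distinct_shapes is an assumption for this example.

```python
# (vCPU, memory in GiB) per instance type, from the published EC2 specifications.
SPECS = {
    "m5.xlarge": (4, 16), "m5d.xlarge": (4, 16), "m4.xlarge": (4, 16),
    "t3.xlarge": (4, 16), "t3a.xlarge": (4, 16), "m5a.xlarge": (4, 16),
    "t2.xlarge": (4, 16),
    "m5.2xlarge": (8, 32), "m5n.2xlarge": (8, 32), "m5d.2xlarge": (8, 32),
    "m5dn.2xlarge": (8, 32), "m5a.2xlarge": (8, 32), "m4.2xlarge": (8, 32),
}

NODE_GROUPS = {
    "prod-4vcpu-16gb-spot": ["m5.xlarge", "m5d.xlarge", "m4.xlarge", "t3.xlarge",
                             "t3a.xlarge", "m5a.xlarge", "t2.xlarge"],
    "prod-8vcpu-32gb-spot": ["m5.2xlarge", "m5n.2xlarge", "m5d.2xlarge",
                             "m5dn.2xlarge", "m5a.2xlarge", "m4.2xlarge"],
}


def distinct_shapes(instance_types):
    """Return the set of (vCPU, GiB) shapes present in a node group."""
    return {SPECS[instance_type] for instance_type in instance_types}


for name, instance_types in NODE_GROUPS.items():
    print(name, distinct_shapes(instance_types))
```

A single shape per group matters because the Cluster Autoscaler estimates how many pods a new node can hold from one template node per group.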

Once the nodes are created, you can check the number of instances provisioned using the command below. It should display 20 as we configured each of our two node groups with a desired capacity of 10 instances.

kubectl get nodes --selector=lifecycle=Ec2Spot | expr $(wc -l) - 1

The cluster setup is complete.

Install the AWS Node Termination Handler

kubectl apply -f

This installs the Node Termination Handler on both Spot Instance and On-Demand Instance nodes, which lets the handler respond to both EC2 maintenance events and Spot Instance interruptions.

Deploy Cluster Autoscaler

For additional detail, see the Amazon EKS page here. Next, export the Cluster Autoscaler into a configuration file:

curl -o cluster_autoscaler.yml

Open the file you just created and edit it.

Add AWS Region and the cluster name as depicted in the screenshot below.


Run the command below to deploy Cluster Autoscaler.

kubectl apply -f cluster_autoscaler.yml

Use the command below to view the Cluster Autoscaler (CA) logs and confirm that the node groups were auto-discovered. Press Ctrl + C to exit the log view.

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=10

Deploy TensorFlow Serving

TensorFlow Model Server is deployed in pods and loads the model stored in Amazon S3.

Amazon S3 access

We are using Kubernetes Secrets to store and manage the AWS Credentials for S3 Access.

Copy the following and create a file called kustomization.yml. Add the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY details in the file.

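namespace: default
secretGenerator:
- name: s3-credentials
  literals:
  - AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
  - AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
generatorOptions:
  disableNameSuffixHash: true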

Create the secret file and deploy.

kubectl kustomize . > secret.yaml
kubectl apply -f secret.yaml

We recommend using Sealed Secrets for production workloads. Sealed Secrets provides a mechanism to encrypt a Secret object, making it more secure. For further details please take a look at the AWS workshop here.
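
To see why encryption matters here, note that the values in a generated Secret manifest are base64-encoded, not encrypted. The minimal sketch below shows that the encoding is trivially reversible; the key ID is a made-up placeholder and the helper names are assumptions.

```python
import base64


def encode_secret_value(value):
    """Encode a string the way values appear in a Secret's data field."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded):
    """Reverse the encoding; no key or password is required."""
    return base64.b64decode(encoded).decode("utf-8")


# Placeholder value, not a real credential.
encoded = encode_secret_value("EXAMPLE-KEY-ID")
print(encoded)
print(decode_secret_value(encoded))  # EXAMPLE-KEY-ID
```

Anyone who can read secret.yaml can recover the credentials this way, so the file should not be committed to source control in plain form.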

ALB Ingress Controller

Deploy RBAC Roles and RoleBindings needed by the AWS ALB Ingress controller.

kubectl apply -f

Download the AWS ALB Ingress controller YAML into a local file.

curl -sS "" > alb-ingress-controller.yaml

Change the --cluster-name flag to ‘TensorFlowServingCluster’ and add the Region details under --aws-region. Also add the lines below just before the ‘serviceAccountName’.

nodeSelector:
  lifecycle: OnDemand


Deploy the AWS ALB Ingress controller and verify that it is running.

kubectl apply -f alb-ingress-controller.yaml
kubectl logs -n kube-system $(kubectl get po -n kube-system | grep alb-ingress | awk '{print $1}')

Deploy the application

Next, download a model as explained in the official TF documentation, then upload it to Amazon S3.

mkdir /tmp/resnet
curl -s | \
tar --strip-components=2 -C /tmp/resnet -xvz
RANDOM_SUFFIX=$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 10 | head -n 1)
S3_BUCKET="resnet-model-k8serving-${RANDOM_SUFFIX}"
aws s3 mb s3://${S3_BUCKET}
aws s3 sync /tmp/resnet/1538687457/ s3://${S3_BUCKET}/resnet/1/
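
The sync command above renames the version directory to 1 because TF Serving expects numbered version directories under the model base path, each holding a saved_model.pb file. A small sketch for checking a local copy before uploading follows; the helper name find_model_versions is hypothetical, and the example builds a throwaway directory rather than reading /tmp/resnet.

```python
import os
import tempfile


def find_model_versions(base_path):
    """Return the numeric version directories that contain a saved_model.pb file."""
    versions = []
    for entry in os.listdir(base_path):
        version_dir = os.path.join(base_path, entry)
        has_model = os.path.isfile(os.path.join(version_dir, "saved_model.pb"))
        if entry.isdigit() and has_model:
            versions.append(int(entry))
    return sorted(versions)


# Build a temporary layout that mimics <base>/1/saved_model.pb and <base>/1/variables/.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "1", "variables"))
    open(os.path.join(base, "1", "saved_model.pb"), "wb").close()
    print(find_model_versions(base))  # [1]
```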

Copy the following code and create a file named tf_deployment.yml. Don’t forget to replace <AWS_REGION> with the AWS Region you plan to use.

A few things to note here:

  • NodeSelector is used to route the TF Serving replica pods to Spot Instance nodes.
  • ServiceType LoadBalancer is used.
  • model_base_path points to Amazon S3. Replace <S3_BUCKET> with the name of the S3 bucket you created in the previous step.

apiVersion: v1
kind: Service
metadata:
  labels:
    app: resnet-service
  name: resnet-service
spec:
  ports:
  - name: grpc
    port: 9000
    targetPort: 9000
  - name: http
    port: 8500
    targetPort: 8500
  selector:
    app: resnet-service
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: resnet-service
  name: resnet-v1
spec:
  replicas: 25
  selector:
    matchLabels:
      app: resnet-service
  template:
    metadata:
      labels:
        app: resnet-service
        version: v1
    spec:
      nodeSelector:
        lifecycle: Ec2Spot
      containers:
      - args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=resnet
        - --model_base_path=s3://<S3_BUCKET>/resnet/
        command:
        - /usr/bin/tensorflow_model_server
        env:
        - name: AWS_REGION
          value: <AWS_REGION>
        - name: S3_ENDPOINT
          value: s3.<AWS_REGION>
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: AWS_SECRET_ACCESS_KEY
        image: tensorflow/serving
        imagePullPolicy: IfNotPresent
        name: resnet-service
        ports:
        - containerPort: 9000
        - containerPort: 8500
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "2"
            memory: 2Gi

Deploy the application.

kubectl apply -f tf_deployment.yml

Copy the code below and create a file named ingress.yml.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "resnet-service"
  namespace: "default"
  annotations:
    alb
    internet-facing
  labels:
    app: resnet-service
spec:
  rules:
  - http:
      paths:
      - path: "/v1/models/resnet:predict"
        backend:
          serviceName: "resnet-service"
          servicePort: 8500

Deploy the ingress.

kubectl apply -f ingress.yml

Deploy the Metrics Server and the Horizontal Pod Autoscaler, which scales up the deployment when average CPU utilization exceeds 50% of the CPU requested by the containers.

kubectl apply -f
kubectl autoscale deployment resnet-v1 --cpu-percent=50 --min=20 --max=100
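
As a back-of-the-envelope check, the sketch below compares the CPU requested by the TF Serving pods with the raw vCPU count of the two Spot node groups, using the values from tf_deployment.yml, spot_nodegroups.yml, and the autoscale command. It ignores the CPU that Kubernetes reserves on each node and the other pods in the cluster, so the real headroom is smaller; the function names are assumptions.

```python
CPU_REQUEST_PER_POD = 2  # resources.requests.cpu in tf_deployment.yml

# vCPUs per node plus desiredCapacity and maxSize from spot_nodegroups.yml.
NODE_GROUPS = [
    {"vcpus": 4, "desired": 10, "max": 15},
    {"vcpus": 8, "desired": 10, "max": 15},
]


def requested_vcpus(replicas):
    """Total CPU requested by the given number of TF Serving replicas."""
    return replicas * CPU_REQUEST_PER_POD


def cluster_vcpus(size_key):
    """Raw vCPU count across both Spot node groups at the given size."""
    return sum(group["vcpus"] * group[size_key] for group in NODE_GROUPS)


print(requested_vcpus(25), cluster_vcpus("desired"))  # 50 120
print(requested_vcpus(100), cluster_vcpus("max"))     # 200 180
```

Under these assumptions the initial 25 replicas fit comfortably, while the HPA maximum of 100 replicas would request more vCPU than both node groups provide at maxSize, so some pods would likely stay pending unless maxSize is raised.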

Load testing

Download the Python helper file written for testing the deployed application.

curl -o

Get the address of the Ingress using the command below.

kubectl get ingress resnet-service

Create a Python virtual environment and install the required libraries.

pip3 install virtualenv
virtualenv venv
source venv/bin/activate
pip3 install tqdm
pip3 install requests

Run the following command to warm up the cluster after replacing the Ingress address. It runs a Python application that predicts the class of a downloaded image against the ResNet model served by the TF Serving REST API. Multiple parallel processes are used for that purpose. Here “p” is the number of processes and “r” the number of requests for each process.

python -p 100 -r 100 -u 'http://<INGRESS ADDRESS>:80/v1/models/resnet:predict'
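
The helper script itself is not reproduced in this post. As a hypothetical sketch of the same pattern, the code below runs a number of parallel workers that each send a fixed number of requests and record the latency of every request. It uses threads instead of processes and a stand-in function instead of a real HTTP call, so it illustrates the approach and is not the helper's actual code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_worker(send_request, requests_per_worker):
    """Send requests sequentially and return the latency of each one in seconds."""
    latencies = []
    for _ in range(requests_per_worker):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


def run_load_test(send_request, workers, requests_per_worker):
    """Run the workers in parallel and return all recorded latencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_worker, send_request, requests_per_worker)
            for _ in range(workers)
        ]
        return [latency for future in futures for latency in future.result()]


def fake_request():
    """Stand-in for an HTTP POST to the predict endpoint."""
    time.sleep(0.001)


latencies = run_load_test(fake_request, workers=4, requests_per_worker=5)
print(len(latencies))  # 20
print(sum(latencies) / len(latencies))  # mean latency in seconds
```

With -p 100 and -r 100, the equivalent call would use workers=100 and requests_per_worker=100, for 10,000 requests in total.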

You can use the command below to observe the scaling of the cluster.

kubectl get hpa -w

We ran the above again with 10,000 requests per process so as to send 1 million requests to the application. The results are below:

The deployment was able to serve ~400 requests per second with an average latency of ~200 ms per request.
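
These figures can be related to each other with simple arithmetic. The sketch below derives the approximate duration of the run and, using Little's law (average concurrency = throughput × latency), the average number of requests in flight. The inputs are the rounded numbers reported above, so the outputs are estimates and not measurements.

```python
def duration_seconds(total_requests, requests_per_second):
    """Time needed to serve all requests at a steady throughput."""
    return total_requests / requests_per_second


def requests_in_flight(requests_per_second, latency_ms):
    """Little's law: average concurrency = throughput x latency."""
    return requests_per_second * latency_ms / 1000


total = 100 * 10_000  # 100 processes x 10,000 requests each
print(total)                         # 1000000
print(duration_seconds(total, 400))  # 2500.0
print(requests_in_flight(400, 200))  # 80.0
```

Roughly 80 requests in flight on average is in the same range as the 100 client processes, which each send one request at a time.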



Now that you’ve successfully deployed and run TensorFlow Serving using EC2 Spot, it’s time to clean up your environment. Remove the ingress, the deployment, and the ingress controller.

kubectl delete -f ingress.yml
kubectl delete -f tf_deployment.yml
kubectl delete -f alb-ingress-controller.yaml

Remove the model files from Amazon S3.

aws s3 rb s3://${S3_BUCKET}/ --force 

Delete the node groups and the cluster.

eksctl delete nodegroup -f spot_nodegroups.yml --approve
eksctl delete cluster --name TensorFlowServingCluster


In this blog, we demonstrated how TensorFlow Serving can be deployed onto Spot Instances based on a Kubernetes cluster, achieving both resilience and cost optimization. There are multiple optimizations that can be implemented on TensorFlow Serving that will further optimize the performance. This deployment can be extended and used for serving multiple models with different versions. We hope you consider running TensorFlow Serving using EC2 Spot Instances to cost optimize the solution.