Scalable Serving with Kubernetes and Seldon Core: A tutorial


Overview

In most ML applications, deploying trained models to production is a crucial stage. It’s where the models demonstrate their value by serving predictions to customers or other systems.

Deploying a model can be as simple as implementing a Flask server and exposing its endpoints for users to call. Yet it’s not easy to build a system that robustly and reliably serves a large number of requests under strict response time or throughput requirements.
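
For instance, a bare-bones Flask server for an image classifier might look like the sketch below (illustrative only, not part of this tutorial’s repo; it assumes a TorchScript file model.ts like the one we export later in this post):

import torch
from flask import Flask, jsonify, request

app = Flask(__name__)
model = torch.jit.load("model.ts").eval()  # load the serialized model once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"image": [...]} holding a (3, 32, 32) image;
    # adjust the dtype/normalization to whatever your exported model expects.
    image = torch.tensor(request.json["image"], dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        scores = model(image)  # raw class scores (logits)
    return jsonify({"scores": scores.squeeze(0).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)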

For medium and large businesses, the system must be able to scale to heavier workloads without significant changes to the codebase. Perhaps the company is expanding and needs to handle a growing number of requests (scalability), or it needs the system to adapt to traffic fluctuations (elasticity). Both characteristics can be achieved if the system is capable of autoscaling based on traffic volume.

In this tutorial, we’re going to learn how to deploy ML models in Kubernetes clusters with Seldon Core. We’ll also learn to implement autoscaling for our deployment with HPA and KEDA. The code for this tutorial can be found in this repo.

Train a PyTorch model

To go through the deployment process, we’ll need a model. We’ll use the model from this tutorial on the official PyTorch website. It’s a simple image classification model that runs easily on a CPU, so we can test the whole deployment process on a local machine such as your laptop.

Assume you’re in the toy-model folder of this repo. You can train the model on the CIFAR10 dataset with:

python train.py
# Output: model.pth -> the trained weights of the model
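
For reference, the network defined in that tutorial is a small CNN along these lines (reproduced roughly from the PyTorch tutorial; the exact definition lives in the repo’s training code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)   # 3 input channels (RGB), 6 output channels
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)      # 10 CIFAR10 classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)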

Seldon Core uses Triton Inference Server to serve PyTorch models, so we need to prepare the model in a format that Triton can serve. First, we export the model to TorchScript (it’s also possible to serve PyTorch models with Triton’s Python backend, but that’s generally less efficient and more complicated to deploy).

Tracing and scripting are the two methods for exporting a model to TorchScript. The choice between them is still debatable; this article explores the benefits and drawbacks of both. We’ll use tracing to export the model:

python export_to_ts.py -c model.pth -o model.ts
# Output: model.ts 
# -> serialized model containing both trained weights and the model's architecture
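
Under the hood, export_to_ts.py boils down to something like the following sketch (the repo’s script is the authoritative version and may differ, e.g. it may wrap preprocessing so that the served model accepts raw uint8 images, as the Triton config below suggests):

import torch
from train import Net  # hypothetical import; use the model class from your training code

model = Net()
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

example_input = torch.rand(1, 3, 32, 32)        # dummy CIFAR10-sized input for tracing
traced = torch.jit.trace(model, example_input)  # record the forward pass as a graph
traced.save("model.ts")                         # weights + architecture in one file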

Triton loads models from a model repository. It must contain the information the server needs to serve a model, such as the model’s input/output specification and the backend to use. A model repository must follow this structure:

<model-repository-path>/
    <model-name>/
      [config.pbtxt]
      [<output-labels-file> ...]
      <version>/
        <model-definition-file>
      <version>/
        <model-definition-file>
      ...
    <model-name>/
      [config.pbtxt]
      [<output-labels-file> ...]
      <version>/
        <model-definition-file>
      <version>/
        <model-definition-file>
      ...
    ...

In our case, we only have one model. Let’s call it cifar10-pytorch; our model repo then has the following structure:

cifar10-model
└── cifar10-pytorch
    ├── 1
    │   └── model.ts
    └── config.pbtxt

cifar10-model is the repository’s name, cifar10-pytorch is the model name, and model.ts is the TorchScript model we just exported. config.pbtxt defines how the model should be served with Triton:

platform: "pytorch_libtorch"
default_model_filename: "model.ts"
max_batch_size: 0
input [
  {
    name: "image__0"
    data_type: TYPE_UINT8
    dims: [-1, 3, -1, -1]
  }
]
output [
  {
    name: "probs__0"
    data_type: TYPE_FP32
    dims: [-1, 10]
  }
]

You can find the final repository here. Triton supports several features that may be used to tune the model’s performance. You can also group multiple steps or multiple models into an inference pipeline to implement your business logic. However, I deliberately keep the model config simple to illustrate the entire process of model deployment instead of focusing on performance.

If you want to see a more realistic example of how to export and serve a PyTorch model with Triton, have a look at this post. It demonstrates how to use Triton to serve the MaskRCNN model from Detectron2, a popular instance segmentation model used in many real-world computer vision systems.

Triton can access models from the local filesystem or from cloud storage services (e.g. S3, Google Cloud Storage, or Azure Storage). Since we’re going to deploy the model in Kubernetes, using a cloud storage service is more convenient because all nodes in the cluster can access the same models. We’ll use AWS S3 for the model repository in this tutorial. Assuming you already have an AWS account, let’s create an S3 bucket and upload the folder we have prepared:

aws s3 cp --recursive cifar10-model s3://<YOUR_BUCKET>/cifar10-model

Replace <YOUR_BUCKET> with the name of your bucket. Now that the model repository is on AWS S3, we can start deploying the model.

Deploy models with Seldon Core

We’ll deploy the model to a Kubernetes cluster with Seldon Core, a framework specializing in ML model deployment and monitoring. Let’s create a local Kubernetes cluster, so we can test the deployment process using our local machine.

Kind can be used to create local clusters. At the time of writing, Seldon Core has an issue with k8s ≥ 1.25, so we have to use version 1.24 or older. To specify the k8s version with Kind, simply choose the node image with the corresponding version. The following command creates a local cluster named seldon (with kubectl context kind-seldon) running k8s 1.24.7:

kind create cluster --name seldon --image kindest/node:v1.24.7

Also make sure you have docker, kubectl, and helm installed on your local machine. Kind switches the kubectl context to kind-seldon automatically, so kubectl connects to the newly created cluster by default. You can verify the connection with:

kubectl cluster-info --context kind-seldon

Install Seldon Core

We’ll use Istio as the cluster’s ingress gateway and Seldon Core as the serving platform. You can find the installation instructions here.
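
In short, the installation boils down to something like the following (a condensed sketch; refer to the linked instructions for the exact, up-to-date commands):

# Install Istio with its default profile
istioctl install -y

# Install the Seldon Core operator with Istio support enabled
kubectl create namespace seldon-system
helm install seldon-core seldon-core-operator \
    --repo https://storage.googleapis.com/seldon-charts \
    --set usageMetrics.enabled=true \
    --set istio.enabled=true \
    --namespace seldon-system

After installing Istio and Seldon Core, run these commands to check that everything is correctly installed: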

kubectl get svc -n istio-system
# NAME                   TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                                      AGE
# istio-egressgateway    ClusterIP      10.96.90.103   <none>        80/TCP,443/TCP                                                               3m29s
# istio-ingressgateway   LoadBalancer   10.96.229.8    <pending>     15021:30181/TCP,80:32431/TCP,443:30839/TCP,31400:32513/TCP,15443:32218/TCP   3m28s
# istiod                 ClusterIP      10.96.195.7    <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP                                        8m48s

Check if the Istio gateway is running:

kubectl get gateway -n istio-system
# NAME             AGE
# seldon-gateway   5m17s

Check if the Seldon controller is running:

kubectl get pods -n seldon-system
# NAME                                        READY   STATUS    RESTARTS   AGE
# seldon-controller-manager-b74d66684-qndf6   1/1     Running   0          4m18s

Create an Istio gateway to manage the cluster’s traffic.
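
A minimal manifest for the seldon-gateway we checked above looks roughly like this (adapted from the Seldon Core installation docs); save it to a file and apply it with kubectl apply -f:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: seldon-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway  # use Istio's default ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"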

If you haven’t done so already, make sure the istio-injection label is enabled on the default namespace:

kubectl label namespace default istio-injection=enabled

The Istio ingress gateway listens on port 80 inside the cluster, so we need to forward a local port to it to access it from outside the cluster:

kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80

Serving with Seldon Core

If your model repository is stored in a private bucket, you need to grant access to the bucket from within the cluster. This can be done by creating a secret and then referring to it when creating a deployment. Here is a template for creating secrets for S3 buckets:

apiVersion: v1
kind: Secret
metadata:
  name: seldon-rclone-secret
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: aws
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: "<AWS_ACCESS_KEY_ID>"
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: "<AWS_SECRET_ACCESS_KEY>"

Replace <AWS_ACCESS_KEY_ID> and <AWS_SECRET_ACCESS_KEY> with your actual AWS access key ID and secret access key, then create the secret:

kubectl apply -f secret.yaml

We can deploy the model with this manifest. Notice that the secret we just created is referenced in the manifest via the envSecretRefName key, and make sure that spec.predictors[].graph.name matches the model name you uploaded to your model repository.
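
For reference, the manifest looks roughly like the sketch below (the exact file is in the repo; <YOUR_BUCKET> is a placeholder and the structure follows the Seldon Core v1 API):

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: cifar10
spec:
  protocol: v2  # Triton speaks the V2 (KServe) inference protocol
  predictors:
  - name: default
    replicas: 1
    graph:
      name: cifar10-pytorch            # must match the model name in the repository
      implementation: TRITON_SERVER
      modelUri: s3://<YOUR_BUCKET>/cifar10-model
      envSecretRefName: seldon-rclone-secret

Apply the manifest to create a deployment: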

kubectl apply -f cifar10-deploy.yaml

If this is your first deployment in this cluster, it’ll take a while to download the necessary docker images. Check if the model is successfully deployed:

kubectl get deploy
# NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
# cifar10-default-0-cifar10-pytorch   1/1     1            1           41m

I created a script using Locust to load-test the deployed model. First, install the requirements it needs:

pip install -r requirements.txt

Given that localhost:8080 has been port-forwarded to the cluster’s gateway, run the following command to send requests to models deployed with Seldon:

locust -f test.py --headless -u 100 -r 10 --run-time 180 -H http://localhost:8080

If your deployment or model names are different, adjust the URL in the script accordingly. The URL for a deployed model follows Seldon Core’s inference protocol:

/seldon/{namespace}/{deployment_name}/v2/models/{model_name}/versions/{model_version}/infer
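
For example, a single request to our deployment might look like this (a sketch assuming the SeldonDeployment is named cifar10 in the default namespace and that localhost:8080 is port-forwarded to the Istio gateway as shown earlier):

import numpy as np
import requests

# /seldon/{namespace}/{deployment_name}/v2/models/{model_name}/versions/{model_version}/infer
url = (
    "http://localhost:8080/seldon/default/cifar10"
    "/v2/models/cifar10-pytorch/versions/1/infer"
)

image = np.random.randint(0, 256, size=(1, 3, 32, 32), dtype=np.uint8)  # dummy CIFAR10-sized image
payload = {
    "inputs": [{
        "name": "image__0",              # must match the input name in config.pbtxt
        "shape": list(image.shape),
        "datatype": "UINT8",
        "data": image.flatten().tolist(),
    }]
}

response = requests.post(url, json=payload)
print(response.json())  # the "outputs" entry holds the 10 class scores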

We’ve deployed our custom model with Seldon Core and tested it by sending inference requests to the model. In the next section, we’ll explore how to scale the deployment to handle more users or higher traffic.

Pod Autoscaling with HPA

When it comes to scalability, Kubernetes offers HPA (Horizontal Pod Autoscaler). When certain resource metrics (e.g. CPU or memory) reach their thresholds, HPA adds more pods to process the heavier workload.

Install Metrics Server

HPA needs to fetch metrics from an aggregated API, which is usually provided through a Metrics Server. You can install a metrics server for your cluster with:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

If you’re running a local cluster, you also need to disable certificate validation by passing the --kubelet-insecure-tls argument to the server:

kubectl patch -n kube-system deployment metrics-server --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

Deploy models with HPA

We can enable HPA in the deployment manifest by adding hpaSpec for the corresponding component:

hpaSpec:
  maxReplicas: 2
  metrics:
  - resource:
      name: cpu  # This can be either "cpu" or "memory"
      targetAverageUtilization: 50
    type: Resource
  minReplicas: 1

The HPA spec tells the deployment to scale up when the current metric value (average CPU utilization in this case) exceeds the 50% target, and caps the deployment at a maximum of 2 replicas.

Apply this manifest to create a deployment with HPA. Make sure you have replaced <YOUR_BUCKET> with your bucket name and that the secret for accessing the bucket (as described in the previous section) has been created:

kubectl apply -f cifar10-deploy-hpa.yaml

You can see the current metric value with:

kubectl get hpa
# NAME                                REFERENCE                                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# cifar10-default-0-cifar10-pytorch   Deployment/cifar10-default-0-cifar10-pytorch   0%/50%    1         2         1          98m

Let’s check the running pods. You should see a running pod for the deployed model:

kubectl get pods
# NAME                                                 READY   STATUS    RESTARTS   AGE
# cifar10-default-0-cifar10-pytorch-7744fdc4dd-vzx4x   3/3     Running   0          101m

Now we can load-test the deployed model with the test script from the previous section:

locust -f test.py --headless -u 100 -r 10 --run-time 180 -H http://localhost:8080

Monitor the current metric value with kubectl get hpa -w. After a while, the metric value exceeds the threshold and HPA triggers the creation of a new pod:

kubectl get pods
# NAME                                                 READY   STATUS    RESTARTS   AGE
# cifar10-default-0-cifar10-pytorch-7744fdc4dd-pgpxm   3/3     Running   0          10s
# cifar10-default-0-cifar10-pytorch-7744fdc4dd-vzx4x   3/3     Running   0          108m

If the current metric value stays below the target for a certain period (5 minutes by default), HPA scales the deployment back down. The period can be configured with the --horizontal-pod-autoscaler-downscale-stabilization flag of kube-controller-manager:

kubectl get pods
# NAME                                                 READY   STATUS        RESTARTS   AGE
# cifar10-default-0-cifar10-pytorch-7744fdc4dd-pgpxm   3/3     Terminating   0          7m3s
# cifar10-default-0-cifar10-pytorch-7744fdc4dd-vzx4x   3/3     Running       0          114m

In this section, we’ve learned how to scale the number of pods up and down based on CPU usage. In the next section, we’ll use KEDA to scale our deployment more flexibly based on custom metrics.

Pod Autoscaling with KEDA

KEDA can fetch metrics from many sources (called scalers); see the list of supported scalers here. We’ll set up KEDA to fetch metrics from a Prometheus server and monitor them to trigger pod scaling. The Prometheus server collects metrics from the Seldon deployments in the cluster.

Install Seldon Monitoring and KEDA

Follow these instructions to install Seldon’s monitoring stack, which includes a Prometheus server. The following pods should then be present in the seldon-monitoring namespace:

kubectl get pods -n seldon-monitoring
# NAME                                                    READY   STATUS    RESTARTS     AGE
# alertmanager-seldon-monitoring-alertmanager-0           2/2     Running   3 (8h ago)   26h
# prometheus-seldon-monitoring-prometheus-0               2/2     Running   2 (8h ago)   26h
# seldon-monitoring-blackbox-exporter-dbbcd845d-qszj8     1/1     Running   1 (8h ago)   26h
# seldon-monitoring-kube-state-metrics-7588b77796-nrd9g   1/1     Running   1 (8h ago)   26h
# seldon-monitoring-node-exporter-fmlh6                   1/1     Running   1 (8h ago)   26h
# seldon-monitoring-operator-6dc8898f89-fkwx8             1/1     Running   1 (8h ago)   26h

Also check if the pod monitor for Seldon Core has been created:

kubectl get PodMonitor -n seldon-monitoring
# NAME                AGE
# seldon-podmonitor   26h

Run the following command to enable KEDA in Seldon Core:

helm upgrade seldon-core seldon-core-operator \
    --repo https://storage.googleapis.com/seldon-charts \
    --set keda.enabled=true \
    --set usageMetrics.enabled=true \
    --set istio.enabled=true \
    --namespace seldon-system

Install KEDA in the cluster, making sure that any previously installed version is completely removed first:

kubectl delete -f https://github.com/kedacore/keda/releases/download/v2.9.1/keda-2.9.1.yaml
kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.9.1/keda-2.9.1.yaml

Deploy models with KEDA

We now have everything set up, so let’s create a Seldon deployment with KEDA. As with HPA, to enable KEDA in a deployment we only need to include a kedaSpec in the deployment’s manifest. Consider the following spec:

kedaSpec:
  pollingInterval: 15
  minReplicaCount: 1
  maxReplicaCount: 2
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://seldon-monitoring-prometheus.seldon-monitoring.svc.cluster.local:9090
      metricName: access_frequency
      threshold: '20'
      query: avg(rate(seldon_api_executor_client_requests_seconds_count{deployment_name=~"cifar10"}[1m]))

serverAddress is the in-cluster address of the Prometheus server; it should be the URL of the Prometheus service, which you can find with kubectl get svc -n seldon-monitoring. When the metric value surpasses threshold, scaling is triggered. The query computes the average number of requests per second across running replicas, which is the metric we want to monitor.
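
Before deploying, you can sanity-check that Prometheus is scraping the Seldon metrics by port-forwarding the Prometheus service (the one referenced in serverAddress) and running the query above in its UI:

kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
# then open http://localhost:9090 and paste the query above into the expression box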

Apply this manifest to deploy the model:

kubectl apply -f cifar10-deploy-keda.yaml

Let’s trigger the autoscaling by sending requests to the deployment:

locust -f test.py --headless -u 100 -r 10 --run-time 180 -H http://localhost:8080

After a few seconds, you can see a new pod created:

kubectl get pods
# NAME                                                 READY   STATUS     RESTARTS      AGE
# cifar10-default-0-cifar10-pytorch-5dc484599c-2zrv8   3/3     Running    3 (18m ago)   35h
# cifar10-default-0-cifar10-pytorch-5dc484599c-ljk74   0/3     Init:0/2   0             9s

Similar to HPA, downscaling is triggered after a certain period (5 minutes by default) of low traffic:

kubectl get pods -w
# NAME                                                 READY   STATUS    RESTARTS      AGE
# cifar10-default-0-cifar10-pytorch-5dc484599c-2zrv8   3/3     Running   3 (22m ago)   35h
# cifar10-default-0-cifar10-pytorch-5dc484599c-ljk74   3/3     Running   0             3m55s
# cifar10-default-0-cifar10-pytorch-5dc484599c-ljk74   3/3     Terminating   0             5m49s
# cifar10-default-0-cifar10-pytorch-5dc484599c-ljk74   2/3     Terminating   0             6m
# cifar10-default-0-cifar10-pytorch-5dc484599c-ljk74   1/3     Terminating   0             6m1s
# cifar10-default-0-cifar10-pytorch-5dc484599c-ljk74   0/3     Terminating   0             6m3s

Conclusion

We’ve learned how to deploy machine learning models to Kubernetes clusters with Seldon Core. Although we mainly focused on deploying PyTorch models, the procedures shown in this guide may be used to deploy models from other frameworks.

We’ve also made the deployments scalable using HPA and KEDA. Compared to HPA, KEDA provides more flexible ways to scale the system based on Prometheus metrics (or other supported scalers from KEDA). Technically, we can implement any scaling rules from metrics that can be fetched from the Prometheus server.