Deploy a model with Kubernetes and Helm
Scalability is an essential part of deploying a model. You need to make sure that your application has the resources it needs to meet the demands of incoming inference requests.
This is where MAX comes in. MAX includes a state-of-the-art graph compiler and runtime library that executes models from frameworks such as PyTorch with high inference speed on a wide range of hardware.
In this tutorial, you'll deploy a model using Amazon Elastic Kubernetes Service (EKS), a managed Kubernetes service provided by Amazon Web Services (AWS). You'll build this deployment using Helm, a package manager for Kubernetes. At the end of the tutorial, you'll have a complete deployment stack that combines MAX Engine with Amazon EKS.
Previous experience with Kubernetes and Helm is not required; we've created a template specifically for this tutorial. We'll guide you through each step!
About Kubernetes
Kubernetes is an open-source container orchestration platform designed to automate the deployment, scaling, and management of containerized applications. Kubernetes allows you to efficiently manage clusters of containers, ensuring high availability and fault tolerance. Kubernetes provides features such as load balancing, service discovery, automated rollouts and rollbacks, and secret and configuration management, making it a powerful tool for maintaining robust and scalable microservices architectures.
About Helm
Helm is a package manager for Kubernetes, which simplifies the deployment and management of applications on Kubernetes clusters. Often referred to as the "Kubernetes package manager," Helm allows users to define, install, and upgrade even the most complex Kubernetes applications. It uses a packaging format called charts, which are collections of files that describe a related set of Kubernetes resources. Helm helps manage Kubernetes applications by streamlining the configuration process, enabling version control, and making it easier to share and reuse Kubernetes applications across different environments.
Prerequisites
To complete this tutorial, make sure you have the following utilities installed.
Utility | Description | Homebrew Command | Link |
---|---|---|---|
kubectl | Kubernetes command-line tool used for interacting with Kubernetes clusters. | brew install kubectl | |
awscli | Command-line interface for Amazon Web Services (AWS), enabling users to manage various AWS services. | brew install awscli | https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html |
eksctl | Command-line utility for managing Amazon Elastic Kubernetes Service (EKS) clusters. | brew install eksctl | |
helm | Package manager for Kubernetes, facilitating the deployment and management of applications on Kubernetes clusters through charts. | brew install helm | |
Get started
Your first step in deploying a model is to define your deployment environment. For this tutorial, this environment includes:
- the name of your AWS region
- the name of your Kubernetes cluster
- the namespace of your Kubernetes cluster
- the name of the service account that your deployment uses to manage resources
Let's make things easier for ourselves and create the following environment variables.
AWS_REGION=us-east-1
CLUSTER_NAME=max-deploy-demo
NAMESPACE_NAME=max-deploy-demo
SERVICE_ACCOUNT_NAME=max-deploy-demo-sa
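Kubernetes requires namespace names to be RFC 1123 DNS labels: lowercase alphanumerics and hyphens, starting and ending with an alphanumeric character, at most 63 characters. A name that passes this check is also acceptable as a service account name. If you change the values above, a quick check like the following sketch can catch an invalid name before any AWS resources are created. The helper name `is_valid_dns_label` is ours, not part of any Kubernetes tooling.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', must start and end
# with an alphanumeric character. The length limit is checked separately.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_dns_label(name: str) -> bool:
    """Return True if `name` is usable as a Kubernetes namespace name."""
    return len(name) <= 63 and DNS_LABEL.fullmatch(name) is not None


# Values copied from the environment variables in this tutorial.
settings = {
    "NAMESPACE_NAME": "max-deploy-demo",
    "SERVICE_ACCOUNT_NAME": "max-deploy-demo-sa",
}

for key, value in settings.items():
    status = "ok" if is_valid_dns_label(value) else "invalid"
    print(f"{key}={value}: {status}")
```

Both tutorial values pass; names with uppercase letters, underscores, or a leading hyphen would be rejected.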
Your next task is to sign in to AWS using the AWS CLI tool. We're using this tool because, as this is a tutorial, we aren't exposing any endpoints to the internet.
To sign in to AWS, use the following command:
aws sso login
Configure the Kubernetes cluster
Now you're ready to create an Amazon Elastic Kubernetes Service (EKS) cluster. The following command creates a managed Kubernetes cluster with a single c5.4xlarge worker node.
eksctl create cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION \
--node-type c5.4xlarge \
--nodes 1
Before you deploy to your cluster, you need to associate the cluster's OpenID Connect (OIDC) provider with AWS Identity and Access Management (IAM). This step sets up the authentication needed so that pods in your EKS cluster can assume IAM roles and access AWS APIs.
eksctl utils associate-iam-oidc-provider \
--region $AWS_REGION \
--cluster $CLUSTER_NAME \
--approve
Next, create a Kubernetes namespace inside your EKS cluster. This namespace keeps the resources for this deployment organized and separate from anything else running in the cluster.
kubectl create namespace $NAMESPACE_NAME
Last, let's create an AWS IAM role and associate it with your Kubernetes service account. With this IAM service account, your Kubernetes pods gain read-only access to Amazon S3.
eksctl create iamserviceaccount \
--name $SERVICE_ACCOUNT_NAME \
--namespace $NAMESPACE_NAME \
--cluster $CLUSTER_NAME \
--region $AWS_REGION \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
--approve \
--override-existing-serviceaccounts
Deploy the model using a Helm chart
You can now deploy your model! You'll use Helm to install a pre-built chart.
helm install max-deploy oci://public.ecr.aws/modular/max-serving-chart \
--version 24.4.0 \
--namespace $NAMESPACE_NAME \
--set serviceAccountName=$SERVICE_ACCOUNT_NAME \
--set image.modelRepositoryPath=s3://max-serving-models-$AWS_REGION-public/kubernetes/bert/model-repository \
--wait \
--timeout 15m
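Note that the model repository path embeds the AWS region, so the bucket name changes with `$AWS_REGION`. The sketch below assembles the same command in Python to show how the values fit together; it only prints the command and does not run Helm. The function name `helm_install_command` is hypothetical, and all flags and values are copied from the command above.

```python
import shlex


def helm_install_command(region: str, namespace: str, service_account: str) -> list[str]:
    """Build the tutorial's `helm install` command as an argument list."""
    # The shell expands $AWS_REGION inside the bucket name in the same way.
    model_repo = (
        f"s3://max-serving-models-{region}-public"
        "/kubernetes/bert/model-repository"
    )
    return [
        "helm", "install", "max-deploy",
        "oci://public.ecr.aws/modular/max-serving-chart",
        "--version", "24.4.0",
        "--namespace", namespace,
        "--set", f"serviceAccountName={service_account}",
        "--set", f"image.modelRepositoryPath={model_repo}",
        "--wait",
        "--timeout", "15m",
    ]


command = helm_install_command("us-east-1", "max-deploy-demo", "max-deploy-demo-sa")
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting mistakes if you later pass it to `subprocess.run`.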
This command takes between 5 and 10 minutes to complete. When the deployment finishes, you should see output similar to the following.
NAME: max-deploy
LAST DEPLOYED: Tue Apr 16 15:51:24 2024
NAMESPACE: max-deploy-demo
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods --namespace $NAMESPACE_NAME -l "app.kubernetes.io/name=max-serving-chart,app.kubernetes.io/instance=max-deploy" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace $NAMESPACE_NAME $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "The application is available at the following DNS name from within your cluster:"
echo "max-deploy.max-deploy-demo.svc.cluster.local:$CONTAINER_PORT"
echo "Or use the following command to forward ports and visit it locally at http://127.0.0.1:8000"
echo "kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT --namespace max-deploy-demo"
To access your deployment, set the following environment variables:
export POD_NAME=$(kubectl get pods --namespace $NAMESPACE_NAME -l "app.kubernetes.io/name=max-serving-chart,app.kubernetes.io/instance=max-deploy" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace $NAMESPACE_NAME $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
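The two `jsonpath` expressions pull single fields out of the JSON that `kubectl` returns. As a rough illustration, the following sketch performs the same lookups on a heavily trimmed, hypothetical sample of that JSON. The pod name is made up; real pod names include generated suffixes.

```python
import json

# Hypothetical, heavily trimmed sample of `kubectl get pods -o json` output.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "max-deploy-example-pod"},
      "spec": {"containers": [{"ports": [{"containerPort": 8000}]}]}
    }
  ]
}
""")


def first_pod_name(pod_list: dict) -> str:
    # Same lookup as jsonpath {.items[0].metadata.name}
    return pod_list["items"][0]["metadata"]["name"]


def first_container_port(pod: dict) -> int:
    # Same lookup as jsonpath {.spec.containers[0].ports[0].containerPort}
    return pod["spec"]["containers"][0]["ports"][0]["containerPort"]


pod = sample["items"][0]
print(first_pod_name(sample), first_container_port(pod))
```

If the label selector matches no pods, `items` is empty and the first lookup fails, which usually means the Helm release has not finished starting.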
Now run the following command:
kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT --namespace $NAMESPACE_NAME
The following message appears on your terminal.
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
This command uses port forwarding so you can access your cluster from your local machine.
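Before you send inference requests, you may want to confirm that the forwarded port is accepting connections. The following sketch uses only the Python standard library; the helper name `port_is_open` is ours. It checks only that something is listening on the local port, not that the model itself has finished loading.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# 127.0.0.1:8000 is the local side of the port-forward started above.
print("port-forward reachable:", port_is_open("127.0.0.1", 8000))
```

Run it from a second terminal while `kubectl port-forward` is still running in the first.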
Test your deployment
You are now ready to test your deployment. This tutorial uses NVIDIA’s Triton client to send text to a BERT model.
- Open a new terminal window.
- Install the required dependencies for the test script.
python3 -m venv venv && source venv/bin/activate
pip install transformers "tritonclient[http]"
- Create the following Python script, client.py.
# suppress extraneous logging
import os
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer
text = "Paris is the [MASK] of France."
# Create a triton client
triton_client = httpclient.InferenceServerClient(url="127.0.0.1:8000")
# Preprocess input statement
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(
    text,
    return_tensors="np",
    return_token_type_ids=True,
    padding="max_length",
    truncation=True,
    max_length=128,
)
# Set the input data
triton_inputs = [
    httpclient.InferInput("input_ids", inputs["input_ids"].shape, "INT32"),
    httpclient.InferInput("attention_mask", inputs["attention_mask"].shape, "INT32"),
    httpclient.InferInput("token_type_ids", inputs["token_type_ids"].shape, "INT32"),
]
triton_inputs[0].set_data_from_numpy(inputs["input_ids"].astype(np.int32))
triton_inputs[1].set_data_from_numpy(inputs["attention_mask"].astype(np.int32))
triton_inputs[2].set_data_from_numpy(inputs["token_type_ids"].astype(np.int32))
# Executing
output = triton_client.infer("bert-base-uncased", triton_inputs)
# Post-processing
masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[1]
logits = output.as_numpy("result0")[0, masked_index, :]
predicted_token_ids = logits.argmax(axis=-1)
predicted_text = tokenizer.decode(predicted_token_ids)
output_text = text.replace("[MASK]", predicted_text)
print(output_text)
- Run the example script to see its output.
python client.py
The script sends the text "Paris is the [MASK] of France." The output of the script reads:
Paris is the capital of France.
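The post-processing step in client.py finds the position of the [MASK] token and picks the vocabulary entry with the highest logit at that position. The following toy sketch shows the same idea in plain Python. The five-word vocabulary and the logit values are invented for illustration; the real model scores every token in the bert-base-uncased vocabulary.

```python
# Hypothetical toy data: a 5-word vocabulary and logits for a 4-token input.
vocab = ["paris", "is", "capital", "city", "[MASK]"]
mask_token_id = vocab.index("[MASK]")

input_ids = [0, 1, 4, 3]  # "paris is [MASK] city"
logits = [
    [2.0, 0.1, 0.3, 0.2, 0.0],
    [0.1, 3.0, 0.2, 0.1, 0.0],
    [0.2, 0.1, 4.5, 1.5, 0.0],  # scores at the masked position
    [0.1, 0.2, 0.9, 2.5, 0.0],
]

# 1. Locate the masked position in the input.
masked_index = input_ids.index(mask_token_id)

# 2. Take the vocabulary entry with the highest logit at that position
#    (the argmax over the last axis in client.py).
scores = logits[masked_index]
predicted_id = max(range(len(scores)), key=scores.__getitem__)

print(vocab[predicted_id])  # prints: capital
```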
Clean up
We've now wrapped up the tasks we wanted to accomplish in this tutorial! To avoid incurring additional costs for AWS resources, we recommend you delete the resources you’ve built.
To delete tutorial resources:
- Uninstall the MAX Serving release.
helm uninstall max-deploy --namespace $NAMESPACE_NAME
- Delete the Kubernetes namespace.
kubectl delete namespace $NAMESPACE_NAME
- Delete the service account.
eksctl delete iamserviceaccount \
--name $SERVICE_ACCOUNT_NAME \
--namespace $NAMESPACE_NAME \
--cluster $CLUSTER_NAME \
--region $AWS_REGION
- Delete the Kubernetes cluster.
eksctl delete cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION
Next steps
In this tutorial, you used a Helm chart to deploy MAX Engine to an Amazon EKS cluster. This deployment used MAX Engine to handle inference requests for a BERT model. The deployment took a text input containing a masked word and returned the model's prediction for that word.
We encourage you to use what you learned here to deploy other models, and extend this tutorial as needed to explore other MAX features.