Deploy a model with Amazon SageMaker and AWS CloudFormation

Dave Shevitz

Technical Writer

aws

cloudformation

sagemaker

MAX runs on GPU!

MAX continues to evolve and we have new tutorials to help you experience its power and capabilities firsthand. Check out Deploy Llama3 with MAX Serve on GPU and Deploy a PyTorch model from Hugging Face. Be sure to let us know what you think!

The point of a trained model is to put it to use, to connect its inferencing power to the rest of your application and put its capabilities in the hands of your users.

To help you achieve that goal, we built MAX. MAX includes a state-of-the-art graph compiler and runtime library that executes PyTorch models with incredible inference speed on a wide range of hardware.

In this tutorial, you'll explore firsthand how to combine MAX Engine with AWS SageMaker. You'll use MAX Engine to handle inference requests using a previously trained BERT model, and you'll use AWS SageMaker to deploy the model. You'll then test the deployment by sending an inference request to an AWS SageMaker endpoint.

About Amazon SageMaker

Amazon SageMaker is a fully managed service provided by Amazon Web Services (AWS) that enables you to build, train, and deploy machine learning models at scale. It simplifies the machine learning workflow by offering integrated tools for every step of the process, from data preparation and model building to training and deployment. SageMaker supports a variety of algorithms and frameworks, making it versatile for different use cases. Additionally, it provides features for model monitoring and automatic scaling, ensuring robust and efficient operations.

About AWS CloudFormation

AWS CloudFormation is a service from AWS that enables you to model, provision, and manage AWS and third-party resources by treating infrastructure as code. It allows you to create and update a collection of related AWS resources in a predictable and orderly fashion through templates. This approach simplifies the orchestration of complex environments, ensuring consistent configuration and deployment. CloudFormation also supports automated rollbacks and dependency management, enhancing reliability and ease of use.

Prerequisites

Before you get started with this tutorial, you should make sure you have the appropriate credentials to log into your AWS account. In addition, you need to have an Identity and Access Management (IAM) role and policy that allows you to create and deploy resources using AWS SageMaker.

Build the AWS CloudFormation stack

Your first step is to use a previously created AWS CloudFormation template to define the AWS resources you need. A set of resources built from a CloudFormation template is referred to as a stack.

Sign in to the AWS Console.
In a separate browser tab, open this link to create a stack using our example template.
Check the I acknowledge that AWS CloudFormation might create IAM resources checkbox.
Click Create stack.

AWS builds out the resources defined in the CloudFormation template. This process takes up to 10 minutes to complete.

When AWS finishes building the stack, it displays an event in the Events tab that says CREATE_COMPLETE.
Click the Outputs tab and copy the EndpointName value.
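
If you prefer to read the stack output from a script instead of the console, the same value is available through the CloudFormation API. The following is a minimal sketch, not part of the original tutorial: the helper names and the stack name placeholder are hypothetical, and it assumes your stack exposes an output whose key is EndpointName, as shown in the Outputs tab.

```python
def find_stack_output(describe_stacks_response, output_key):
    """Return the OutputValue for output_key, or None if it is absent."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output.get("OutputValue")
    return None


def get_endpoint_name(stack_name, region_name="us-east-1"):
    """Wait for the stack to finish creating, then read its EndpointName output."""
    import boto3  # imported lazily so find_stack_output works without boto3

    cloudformation = boto3.client("cloudformation", region_name=region_name)
    # Blocks until the stack reaches CREATE_COMPLETE; raises if creation fails.
    cloudformation.get_waiter("stack_create_complete").wait(StackName=stack_name)
    response = cloudformation.describe_stacks(StackName=stack_name)
    return find_stack_output(response, "EndpointName")


# Example (requires AWS credentials; the stack name is a placeholder):
# print(get_endpoint_name("YOUR-STACK-NAME"))
```

Keeping the response parsing in its own function makes it easy to check against a saved response without calling AWS.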

Test the deployment endpoint

You have now created a deployment that connects a model using MAX Engine to AWS SageMaker. This deployment includes a number of AWS compute and network resources that AWS SageMaker creates automatically to handle inference requests. To test this deployment, you'll create a small Python application that sends an inference request to an AWS SageMaker endpoint, then processes and displays the response.
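
Before sending a request, it can help to confirm that the endpoint is ready. CloudFormation typically reports CREATE_COMPLETE only once the endpoint is in service, so treat the following as an optional sanity check rather than a required step. It is a sketch under assumptions: the function name and polling defaults are hypothetical, and it uses the SageMaker control-plane client ("sagemaker"), which is separate from the "sagemaker-runtime" client used to invoke the endpoint.

```python
import time


def wait_until_in_service(sagemaker_client, endpoint_name,
                          poll_seconds=30, max_polls=40):
    """Poll describe_endpoint until the endpoint reports InService."""
    for _ in range(max_polls):
        description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} was not ready in time")


# Example (requires AWS credentials and boto3):
# import boto3
# sagemaker = boto3.client("sagemaker", region_name="us-east-1")
# wait_until_in_service(sagemaker, "YOUR-ENDPOINT-GOES-HERE")
```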

For this tutorial, you need to sign in to AWS using the AWS CLI. This step is necessary because the AWS SageMaker configuration you've created does not expose the endpoint to the public internet, so requests must carry your AWS credentials.

To sign in to AWS from the command line, we recommend you use the AWS SSO token provider configuration. You can create this configuration by running aws configure sso. This command requires an SSO start URL and an SSO region. The values for these parameters depend on your AWS configuration. To learn more, see Configure the AWS CLI to use AWS IAM Identity Center.

Open a terminal.
Sign in to AWS.
```
aws sso login
```

Create a Python virtual environment and install the required dependencies.

python3 -m venv max-aws-deploy && source max-aws-deploy/bin/activate

pip install boto3 transformers

pip install torch

Create a file called client.py and paste in the following code.

caution

Make sure to update the endpoint_name variable with the name of your actual endpoint.

If you didn't write down the endpoint name, you can find it by opening your AWS Console, selecting CloudFormation, opening the stack you created, and then clicking the Outputs tab.

# suppress extraneous logging
import os
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"

import json
import boto3
import transformers
from botocore.config import Config
import numpy as np

config = Config(region_name="us-east-1")
client = boto3.client("sagemaker-runtime", config=config)

# NOTE: Paste your endpoint here
endpoint_name = "YOUR-ENDPOINT-GOES-HERE"

text = "The quick brown fox jumped over the lazy dog."

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(text, padding="max_length", max_length=128, return_tensors="pt")

# Convert tensor inputs to list for payload
input_ids = inputs["input_ids"].tolist()[0]
attention_mask = inputs["attention_mask"].tolist()[0]
token_type_ids = inputs["token_type_ids"].tolist()[0]

payload = {
   "inputs": [
      {
            "name": "input_ids",
            "shape": [1, 128],
            "datatype": "INT32",
            "data": input_ids,
      },
      {
            "name": "attention_mask",
            "shape": [1, 128],
            "datatype": "INT32",
            "data": attention_mask,
      },
      {
            "name": "token_type_ids",
            "shape": [1, 128],
            "datatype": "INT32",
            "data": token_type_ids,
      },
   ]
}

http_response = client.invoke_endpoint(
   EndpointName=endpoint_name,
   ContentType="application/octet-stream",
   Body=json.dumps(payload),
)
response = json.loads(http_response["Body"].read().decode("utf8"))
outputs = response["outputs"]

def softmax(logits):
   exp_logits = np.exp(logits - np.max(logits))
   return exp_logits / exp_logits.sum(axis=-1, keepdims=True)

# Process the output
for output in outputs:
   logits = output['data']
   logits = np.array(logits).reshape(output['shape'])

   print(f"Logits shape: {logits.shape}")

   if len(logits.shape) == 3:  # Shape [batch_size, sequence_length, num_classes]
      token_probabilities = softmax(logits)
      predicted_classes = np.argmax(token_probabilities, axis=-1)

      print(f"Predicted classes shape: {predicted_classes.shape}")
      print(f"Predicted class indices range: {np.min(predicted_classes)}, {np.max(predicted_classes)}")

      # Map predicted indices to tokens
      predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_classes[0])

      # Pair each input token with its predicted token
      input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
      token_pairs = list(zip(input_tokens, predicted_tokens))

      print("Predicted Token Pairs:")
      print("-" * 45)
      print("| {:<20} | {:<18} |".format("Input Token", "Predicted Token"))
      print("-" * 45)
      for input_token, predicted_token in token_pairs:
            if input_token != '[PAD]':  # Exclude padding tokens
               print("| {:<20} | {:<18} |".format(input_token, predicted_token))
      print("-" * 45)

Run the script.
```
python client.py
```

You should see output similar to the following.

Logits shape: (1, 128, 30522)
Predicted classes shape: (1, 128)
Predicted class indices range: 1010, 13971
Predicted Token Pairs:
---------------------------------------------
| Input Token          | Predicted Token      |
---------------------------------------------
| [CLS]                | .                    |
| the                  | the                  |
| quick                | quick                |
| brown                | brown                |
| fox                  | fox                  |
| jumped               | jumped               |
| over                 | over                 |
| the                  | the                  |
| lazy                 | lazy                 |
| dog                  | dog                  |
| .                    | .                    |
| [SEP]                | .                    |
---------------------------------------------
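
To make the post-processing in client.py easier to follow, here is the same arithmetic on a single made-up row of four logits, written in plain Python so each step is visible. The script does this with NumPy across all 128 positions and all 30522 vocabulary entries. The logit values are invented for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the largest logit keeps exp() from overflowing
    # without changing the resulting probabilities.
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 0.5, -1.0, 3.5]  # made-up scores for a 4-entry vocabulary
probabilities = softmax(logits)

print("Probabilities:", [round(p, 4) for p in probabilities])
print("Predicted index from probabilities:", argmax(probabilities))
print("Predicted index from raw logits:   ", argmax(logits))
```

Because softmax preserves the ordering of its inputs, the predicted index is the same whether you take the argmax of the probabilities or of the raw logits. The softmax step matters only when you want the probabilities themselves.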

Clean up

That's it! You've now deployed a model using MAX Engine, AWS CloudFormation, and Amazon SageMaker! To avoid incurring additional costs for AWS resources, we recommend you delete the resources you've built.

To delete tutorial resources:

From the CloudFormation console, select Stacks.
Select the stack that you created for this tutorial.
Click Delete.
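
The same cleanup can be scripted. The following is a minimal sketch, assuming you know the name of the stack you created. The function name and the stack name placeholder are hypothetical, and the client is passed in so you can create it with whatever region and credentials you use.

```python
def delete_stack_and_wait(cloudformation_client, stack_name):
    """Request stack deletion and block until CloudFormation finishes."""
    cloudformation_client.delete_stack(StackName=stack_name)
    # Raises an exception if the stack ends up in DELETE_FAILED.
    waiter = cloudformation_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)


# Example (requires AWS credentials and boto3):
# import boto3
# cloudformation = boto3.client("cloudformation", region_name="us-east-1")
# delete_stack_and_wait(cloudformation, "YOUR-STACK-NAME")
```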

Next steps

In this tutorial, you used an AWS CloudFormation template to build out a complete AWS SageMaker deployment. This deployment used MAX Engine to handle inference requests for a BERT model. The deployment took a text input, analyzed each token in the input, and returned the token the model predicted at each position.

We encourage you to use what you learned here to deploy other models, and extend this tutorial as needed to explore other MAX features.
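
If you adapt client.py for other inputs, one small refactor is to build the request payload with a helper instead of repeating each tensor entry by hand. The sketch below mirrors the payload layout used in client.py (name, shape, datatype, data) and derives the shape from the data, so it stays correct if you change max_length. The helper names are hypothetical, and whether a different model accepts the same layout and INT32 datatype is an assumption you would need to verify for that model.

```python
import json


def make_tensor(name, values, datatype="INT32"):
    """Describe one batch-of-one input tensor in the layout client.py uses."""
    values = list(values)
    return {
        "name": name,
        "shape": [1, len(values)],  # batch size 1, sequence length from the data
        "datatype": datatype,
        "data": values,
    }


def build_payload(named_inputs):
    """Build the request body from a mapping of tensor name to token list."""
    return {"inputs": [make_tensor(name, values) for name, values in named_inputs.items()]}


# Short made-up token lists stand in for the tokenizer output.
payload = build_payload({
    "input_ids": [101, 1996, 102],
    "attention_mask": [1, 1, 1],
    "token_type_ids": [0, 0, 0],
})
print(json.dumps(payload, indent=2))
```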

Here are some other topics to explore next:

Deploy a model with Kubernetes and Helm

Learn how to deploy a model using MAX Engine and Kubernetes.

Modular pricing

Learn about the licensing and support options for developers and enterprises.
