Run inference with Python

The Python API for MAX Engine lets you accelerate your PyTorch and ONNX models on a wide range of hardware with just three lines of code (not counting the import):

from max import engine

# Load your model:
session = engine.InferenceSession()
model = session.load(model_path)

# Prepare the inputs, then run an inference:
outputs = model.execute(**inputs)

# Process the output here.

That's all you need! Everything else is the usual code to prepare your inputs and process the outputs.

But, it's always nice to see a fully working example. So the rest of this page shows how to run an inference using a version of RoBERTa from Cardiff NLP, which is a language model trained on tweets to perform sentiment analysis.

This example uses a PyTorch model (converted to TorchScript format), but it's just as easy to load a model from ONNX.

Set up the project environment

After you install Magic, create a new Python project and install the dependencies:

magic init roberta-project && cd roberta-project

Add MAX and NumPy from conda:

magic add max "numpy<2.0"

Add PyTorch and Transformers from PyPI:

magic add --pypi "torch==2.2.2" "transformers==4.40.1"

Now you can start a shell in the environment and see your MAX version:

magic shell
python3 -c 'from max import engine; print(engine.__version__)'

Import Python modules

To start coding, we need the libraries that help us get the model and process the input/output data.

from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

from max import engine
from max.dtype import DType

Download the model

Now we download the RoBERTa model from Hugging Face and save it in the PyTorch TorchScript format.

HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
hf_model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)

# Converting model to TorchScript
model_path = Path("roberta.torchscript")
batch = 1
seqlen = 128
inputs = {
    "input_ids": torch.zeros((batch, seqlen), dtype=torch.int64),
    "attention_mask": torch.zeros((batch, seqlen), dtype=torch.int64),
}
with torch.no_grad():
    traced_model = torch.jit.trace(
        hf_model, example_kwarg_inputs=dict(inputs), strict=False
    )

torch.jit.save(traced_model, model_path)
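As a quick sanity check (not part of the original walkthrough), you can reload the TorchScript file with plain PyTorch and run it on the placeholder inputs used for tracing. This is a minimal sketch that assumes the model_path and inputs variables defined above:

# Optional: reload the traced model and confirm it runs (sketch only)
reloaded = torch.jit.load(str(model_path))
with torch.no_grad():
    check = reloaded(**inputs)
print(check["logits"].shape)  # expect one row of logits per batch item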

Load the model

Then, we load and compile the model in MAX Engine using an InferenceSession.

Define input specs (TorchScript only)

If you're using a PyTorch model (it must be in TorchScript format), you need to specify the input specifications for each of the model inputs before you can compile the model.

To define the input specs, you need to create a list of TorchInputSpec objects (one for each input tensor), and pass the list to InferenceSession.load().

For example, here's how to declare the input specs for the RoBERTa TorchScript model:

# We use the same `inputs` that we used above to trace the model
input_spec_list = [
    engine.TorchInputSpec(shape=tensor.size(), dtype=DType.int64)
    for tensor in inputs.values()
]

Then pass this list to load() as the input_specs argument, along with the model path, as shown below.

Load and compile the model

Now we instantiate an InferenceSession and load the model (if you're loading an ONNX model, you don't need the input_specs argument):

session = engine.InferenceSession()
model = session.load(model_path, input_specs=input_spec_list)

That's two lines down, just one to go.
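If your model is in ONNX format instead, loading is the same call minus the input specs. Here's a minimal sketch (the roberta.onnx filename is hypothetical; you'd export or download that file yourself):

# Hypothetical ONNX variant: no input_specs argument is needed
onnx_model = session.load("roberta.onnx")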

Prepare the input

This part is your usual pre-processing. For the RoBERTa model, we need to process the text input into a sequence of tokens, so we'll do that with transformers.AutoTokenizer.

First, let's take a look at the model's inputs:

for tensor in model.input_metadata:
    print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')
name: input_ids, shape: [1, 128], dtype: DType.int64
name: attention_mask, shape: [1, 128], dtype: DType.int64

This tells us the model needs two inputs. (If your model shows a dimension size of None, that means it's dynamic.)

INPUT="There are many exciting developments in the field of AI Infrastructure!"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
inputs = tokenizer(INPUT, return_tensors="pt", padding='max_length', truncation=True, max_length=seqlen)
print(inputs)
INPUT="There are many exciting developments in the field of AI Infrastructure!"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
inputs = tokenizer(INPUT, return_tensors="pt", padding='max_length', truncation=True, max_length=seqlen)
print(inputs)
{'input_ids': tensor([[ 0, 970, 32, 171, 3571, 5126, 11, 5, 882, 9, 4687, 13469, 328, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

Run an inference

Now for that third line of code, we pass the inputs to execute(). This function requires all inputs as keyword arguments, so we'll unpack the inputs dictionary as we pass it through:

outputs = model.execute(**inputs)
print(outputs)
{'result0': {'logits': array([[-3.7987795 , 0.49929366, -4.2877274 , -2.586396 , 2.9503963 , -2.112092 , 2.507424 , -4.4121118 , -4.9013515 , -2.147359 , -0.5741746 ]], dtype=float32)}}

That's it!

The output from execute() is a dictionary of output tensors, each as a NumPy ndarray. Let's now figure out what they say.

Process the outputs

Again, we'll use some help from the transformers library to convert the output ids to labels:

# Extract class prediction from output
predicted_class_id = outputs["result0"]["logits"].argmax(axis=-1)[0]
classification = hf_model.config.id2label[predicted_class_id]

print(f"The sentiment is: {classification}")
The sentiment is: joy

Ta-da! 🎉
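If you also want a confidence score for that prediction, not just the label, you can normalize the logits yourself. Here's a small sketch using NumPy (a plain softmax; the example above only takes the argmax):

import numpy as np

# Turn the logits into per-class probabilities (softmax; sketch only)
logits = outputs["result0"]["logits"][0]
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(f"Confidence for '{classification}': {probs[predicted_class_id]:.2f}")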

For more details about the inferencing API, see the Python API reference.

For more example code, see our GitHub repo.