Model inference

The foundation of the MAX platform is its ability to accelerate AI models using a next-generation model compiler and runtime called MAX Engine. Using our Python API, you can dramatically speed up your PyTorch, ONNX, and other GenAI models with just a few lines of code.

The procedure to execute a model (run inference) is pretty simple:

  1. Create an InferenceSession with the max.engine API.
  2. Load the model into the InferenceSession.
  3. Pass your input into the loaded model.

With our Python API, that's just 3 lines of code. You can also run inference with our C or Mojo APIs.
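For example, here's a minimal sketch of those three steps using the max.engine Python API. It assumes an ONNX model file on disk and a single input tensor named "input"; the actual path, input names, shapes, and dtypes depend on your model.

```python
import numpy as np
from max import engine

# 1. Create an InferenceSession with the max.engine API.
session = engine.InferenceSession()

# 2. Load the model into the InferenceSession.
#    "path/to/model.onnx" is a placeholder for your own model file.
model = session.load("path/to/model.onnx")

# 3. Pass your input into the loaded model.
#    The input name ("input"), shape, and dtype are assumptions for this
#    sketch; use the names and shapes your model actually expects.
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = model.execute(input=input_data)
print(outputs)
```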

How it works

When you load your model into MAX, the MAX model compiler inspects, analyzes, and optimizes the model's graph to provide the best performance on a wide range of hardware. MAX compiles the model on the same hardware where it will execute ("just in time" or JIT), which allows MAX to optimize the model in ways that are specialized for that hardware's capabilities.

Figure 1. A simplified diagram of how your application uses MAX APIs to accelerate your AI models.

But MAX is much more than just a compiler and runtime—it's a complete toolkit for AI developers and deployers. MAX also includes MAX Graph, a Python API for building and optimizing GenAI pipelines, and MAX Serve, a serving interface for deploying your models on the cloud provider of your choice.

Also notice that Figure 1 shows Mojo as a layer below MAX Engine. That's because we built MAX with Mojo, a new programming language that gives you a native interface for writing custom ops and GPU kernels for your models.
