Model inference
The foundation of the MAX platform is its ability to accelerate AI models using a next-generation model compiler and runtime called MAX Engine. Using our Python API, you can drastically speed up your PyTorch, ONNX, and other GenAI models with just a few lines of code.
The procedure to execute a model (run inference) is pretty simple:
- Create an `InferenceSession` with the `max.engine` API.
- Load the model into the `InferenceSession`.
- Pass your input into the loaded model.
With our Python API, that's just 3 lines of code. You can also run inference with our C or Mojo APIs.
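Here's a minimal sketch of those three steps with the `max.engine` Python API. The model path, input name, and shape below are placeholders for illustration; check the `max.engine` reference for the exact signatures in your MAX release.

```python
import numpy as np
from max import engine

# Step 1: create an inference session.
session = engine.InferenceSession()

# Step 2: load the model; MAX compiles it for the hardware this code runs on.
# "resnet50.onnx" is a hypothetical local model file.
model = session.load("resnet50.onnx")

# Step 3: pass your input to the loaded model.
# The input name and shape here are placeholders for the model's real inputs.
input_batch = np.zeros((1, 3, 224, 224), dtype=np.float32)
outputs = model.execute(pixel_values=input_batch)
print(outputs)
```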
How it works
When you load your model into MAX, the MAX model compiler inspects, analyzes, and optimizes the model's graph to provide the best performance on a wide range of hardware. MAX compiles the model on the same hardware where it will execute ("just in time" or JIT), which allows MAX to optimize the model in ways that are specialized for that hardware's capabilities.
But MAX is much more than just a compiler and runtime—it's a complete toolkit for AI developers and deployers. MAX also includes a Python API to build and optimize GenAI pipelines, called MAX Graph, and a serving interface to deploy your models on the cloud provider of your choice, called MAX Serve.
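For a sense of what building with MAX Graph looks like, here is a rough sketch based on the MAX Graph quickstart pattern. The constructor signatures are an assumption and can vary between MAX releases, so treat this as an outline rather than a drop-in snippet.

```python
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Describe the graph's inputs: two float32 tensors of shape (1,).
# (Signature details, such as device arguments, may differ by release.)
input_type = TensorType(DType.float32, (1,))

# Build a tiny graph that adds its two inputs.
with Graph("simple_add", input_types=(input_type, input_type)) as graph:
    lhs, rhs = graph.inputs
    graph.output(ops.add(lhs, rhs))

# The resulting graph can then be loaded into an InferenceSession
# and executed like any other model.
```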
Also notice that figure 1 shows Mojo as a layer below MAX Engine. That's because we built MAX with Mojo, a new programming language that provides a native interface for writing custom ops and GPU kernels for your models.