
Model serving

MAX simplifies the process of owning your AI endpoint. With MAX, you don't have to worry about which combination of model, runtime, serving framework, cloud provider, and hardware provides the best performance-to-cost tradeoff. MAX is designed to deliver state-of-the-art inference speeds for a wide range of models on a wide range of hardware, coupled with a quick-to-deploy serving layer called MAX Serve.

MAX Serve works with PyTorch, ONNX, and GenAI models, so you can deploy off-the-shelf models or highly optimized GenAI models built with MAX Graph. In either case, our MAX Serve container includes everything you need to serve your model on the cloud provider of your choice, with a familiar endpoint API.

How it works

MAX Serve is available as a ready-to-deploy container with MAX Engine and our high-performance serving layer. MAX Engine provides our next-generation graph compiler and runtime to accelerate PyTorch, ONNX, and other GenAI models, while the serving layer provides a high-speed interface for large language models (LLMs) that's compatible with the OpenAI REST API.
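Because the serving layer is OpenAI-compatible, a client can talk to it with a standard chat completion request. The following is a minimal sketch using Python's `requests` library; the host, port, and model name are assumptions and depend on how you deploy the container.

```python
import requests

# Hypothetical local MAX Serve endpoint; the actual host and port depend on
# how you deploy the container.
BASE_URL = "http://localhost:8000/v1"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        # Assumed Hugging Face model id; use whichever model you deployed.
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Summarize what MAX Serve does in one sentence."},
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

Because the request and response bodies follow the OpenAI chat completions format, existing OpenAI client code can typically be pointed at the MAX Serve endpoint by changing only the base URL.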

Figure 1. A simplified diagram of how your application can request inference from the MAX Serve container.

The container is pre-configured, so all you need to do is specify the model you want to serve. You can pass the name of a PyTorch or ONNX LLM from Hugging Face or, for state-of-the-art performance, select one of our LLMs built with MAX Graph.
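As a quick sanity check after deployment, you can ask the server which model it loaded. This sketch uses the official `openai` Python client against an assumed local address, and it assumes the container exposes the standard OpenAI `/v1/models` route.

```python
from openai import OpenAI

# Hypothetical endpoint address; point the client at wherever you deployed
# the MAX Serve container. A local deployment typically ignores the API key,
# but the client constructor requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the server reports to confirm it is serving the model you
# specified at deployment time.
for model in client.models.list():
    print(model.id)
```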

Get started

Learn more