What is MAX
The Modular Accelerated Xecution (MAX) platform simplifies the process of building and deploying your own GenAI endpoint. It includes our high-performance inference engine, serving library, and hardware-agnostic GPU programming library. You can use your own data and your own model on the hardware of your choice, with the best performance-to-cost tradeoff on both CPUs and GPUs.
The foundation of MAX is our next-generation graph compiler and runtime that accelerates your GenAI models without vendor-specific hardware libraries, enabling it to scale effortlessly across a wide range of hardware. On top of that is our serving framework that delivers consistent and reliable performance at scale with an OpenAI-compatible endpoint.
MAX also includes a fully programmable interface for model development and GPU programming, so you can customize and optimize your GenAI model pipeline.


What MAX offers
- High-speed GenAI inference: When you need to scale your workloads and reduce your costs, MAX provides out-of-the-box acceleration for GenAI models on CPUs and GPUs.
- Hardware portability: We built MAX from the ground up to be independent of vendor-specific hardware libraries, enabling it to scale effortlessly across a wide range of CPUs and GPUs so you can select the best hardware for your use case.
- Model extensibility: When off-the-shelf models don't provide the performance you need, MAX includes a Python API to build high-performance GenAI models such as large language models (LLMs) and to extend other models with custom operations.
- Seamless deployment: MAX minimizes your migration effort because it integrates with existing cloud infrastructure, and our REST endpoint supports OpenAI APIs so you don't have to rewrite your client code. It's all available in a ready-to-deploy container that works with Kubernetes out of the box.
MAX enables all of this with a rich set of Python APIs, backed by vendor-agnostic GPU kernels written in Mojo. Mojo is a new GPU programming language that looks and feels like Python and integrates with Python code, but it provides the performance, control, and safety of languages like C++, Rust, and Swift.
How to use MAX
To create an OpenAI-compatible endpoint for your GenAI model, you can use max-pipelines to immediately start a local endpoint, or use our MAX container to deploy MAX to the cloud provider of your choice. Either way, simply provide a model name from Hugging Face and MAX will optimize it for execution on a wide range of hardware.
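For example, once you have a local endpoint running with max-pipelines, existing OpenAI client code can talk to it without changes beyond the base URL. The host, port, and model name below are assumptions for illustration; use the address the server prints and the Hugging Face model you actually served. A minimal sketch with the openai Python package:

```python
from openai import OpenAI

# Point a standard OpenAI client at the local MAX endpoint.
# The base_url and model name are assumptions; substitute the address printed
# by max-pipelines and the Hugging Face model you served.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # a local MAX endpoint doesn't require a real API key
)

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",  # hypothetical example model
    messages=[{"role": "user", "content": "Summarize what MAX does in one sentence."}],
)

print(response.choices[0].message.content)
```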
To optimize your model's performance, you can write custom ops that MAX can analyze, optimize, and fuse into the graph (coming soon). Or, you can build your model with the MAX Graph Python API and load pre-trained weights from Hugging Face to unlock even more performance for GenAI models.
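To make that concrete, here is a rough sketch of building and running a trivial graph with the MAX Graph Python API. The module paths, class names, and call signatures below are assumptions based on published MAX Graph examples and may differ between releases, so treat it as an illustration rather than copy-paste code:

```python
import numpy as np

# NOTE: the module paths and signatures below are assumptions drawn from
# published MAX Graph examples and may vary across MAX versions; check the
# API reference for your installed release.
from max import engine
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Describe the inputs: two single-element float32 tensors.
input_type = TensorType(dtype=DType.float32, shape=(1,))

# Build a tiny graph that adds its two inputs.
with Graph("simple_add", input_types=(input_type, input_type)) as graph:
    lhs, rhs = graph.inputs
    graph.output(ops.add(lhs, rhs))

# Compile and run the graph with the MAX inference engine.
# (The exact execute() signature may differ by version.)
session = engine.InferenceSession()
model = session.load(graph)
result = model.execute(np.array([1.0], dtype=np.float32),
                       np.array([2.0], dtype=np.float32))
print(result)
```

The same Graph API scales from toy examples like this up to full LLM pipelines, where you load pre-trained Hugging Face weights into the graph instead of building it from a handful of ops.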
For more information about how MAX accelerates GenAI models from Hugging Face, read Model support. You can also browse the most popular models that work with MAX in the MAX model repository.
Get started