What is MAX
The Modular Accelerated Xecution (MAX) platform is a unified set of APIs and tools that simplify the process of building and deploying your own high-performance AI endpoint. MAX provides complete flexibility, so you can use your own data and your own model on the hardware of your choice, with the best performance-to-cost tradeoff.
The foundation of MAX is our next-generation graph compiler and runtime that accelerates your GenAI models without vendor-specific hardware libraries, enabling it to scale effortlessly across a wide range of CPUs and GPUs. But MAX is much more than a fast and portable runtime. It also includes a quick-to-deploy serving layer—called MAX Serve—that orchestrates inference inputs and outputs between your model and client application.
MAX also includes a fully programmable interface for model development and GPU programming, so you can customize and optimize your GenAI models.
We built MAX because there wasn't a single solution that could support all the needs of AI developers today while also helping them scale into the future. Developers need a solution that supports their full inference workflow, from exploration of new use cases to deployment of high-performance cloud services. This requires a tool that provides world-class out-of-the-box performance and portability, that's tightly integrated with AI ecosystem tools such as Python, PyTorch, and Hugging Face, and that's fully extensible for new ideas.
What MAX offers
- Unparalleled GenAI performance: When you need to scale your workloads and reduce your costs, MAX provides unparalleled out-of-the-box speed-ups for PyTorch and GenAI models on CPUs and GPUs.
- Hardware portability: We built MAX from the ground up to be independent of vendor-specific hardware libraries, enabling it to scale effortlessly across a wide range of CPUs and GPUs so you can select the best hardware for your use case.
- Model extensibility: When off-the-shelf models don't provide the performance you need, MAX includes a Python API to build high-performance GenAI models such as large language models (LLMs) and to extend other models with custom operations.
- Seamless deployment: MAX integrates with existing tools and cloud infrastructure to minimize migration effort. Our serving library, MAX Serve, is available as a ready-to-deploy container, provides an OpenAI-compatible API endpoint, and works with Kubernetes.
MAX enables all of this with a rich set of Python APIs, backed by vendor-agnostic GPU kernels written in Mojo. Mojo is a new GPU programming language that looks and feels like Python and integrates with Python code, but it provides the performance, control, and safety of languages like C++, Rust, and Swift.
How to use MAX
To try MAX Serve and create an OpenAI-compatible API endpoint for an LLM, you have a couple of options. You can use max-pipelines to immediately start a local endpoint for your LLM, or use our MAX container to deploy MAX Serve to the cloud provider of your choice. Either way, simply provide the model name from Hugging Face and MAX will optimize it for execution on a wide range of hardware.
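For example, once a local endpoint is running (started with the max-pipelines serve command, whose exact flags may vary by release), you can query it with any OpenAI-compatible client. The following is a minimal sketch; the port, model name, and startup command shown in the comments are illustrative assumptions, so check the MAX documentation for the current details:

```python
# Query a local MAX Serve endpoint through its OpenAI-compatible API.
# Assumes a server was started locally, e.g. with something like:
#   max-pipelines serve --model-path=<hugging-face-model-id>
# The port and model name below are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # MAX Serve's OpenAI-compatible endpoint
    api_key="EMPTY",  # no key is required for a local server
)

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",  # hypothetical model ID
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI API, existing client code and tooling can usually be pointed at MAX Serve by changing only the base URL.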
To optimize your model's performance, you can write custom ops that MAX can analyze, optimize, and fuse into the graph (coming soon). Or, you can build your model with the MAX Graph Python API and load pre-trained weights from Hugging Face to unlock even more performance for GenAI models.
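To give a rough sense of what building with the MAX Graph Python API looks like, here is a minimal sketch that defines and runs a tiny graph. The module and class names (max.graph, Graph, TensorType, ops, and the engine's InferenceSession) follow published MAX examples, but treat the exact signatures as assumptions and consult the current API reference:

```python
# Minimal sketch of the MAX Graph Python API.
# Names and signatures follow published MAX examples but may differ
# between releases; verify against the API reference.
import numpy as np

from max import engine
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Define a graph that adds two float32 vectors.
input_type = TensorType(DType.float32, shape=(4,))
with Graph("vector_add", input_types=(input_type, input_type)) as graph:
    lhs, rhs = graph.inputs
    graph.output(ops.add(lhs, rhs))

# Compile and execute the graph with the MAX engine.
session = engine.InferenceSession()
model = session.load(graph)
result = model.execute(np.ones(4, np.float32), np.full(4, 2.0, np.float32))
print(result)
```

Real GenAI models follow the same pattern at a larger scale: compose operations into a graph, load pre-trained weights (for example, from Hugging Face), and let the MAX graph compiler optimize the whole thing for your target hardware.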
For more information about how MAX accelerates GenAI models from Hugging Face, read Model support. You can also browse the most popular models that work with MAX in the MAX model repository.
Get started