What's new in MAX
Here's everything you should know about what's changed in each release.
v25.1.1 (2025-02-19)
Fix performance issues in autoregressive models with paged attention by setting sensible, platform-specific default values for --max-num-steps.
v25.1 (2025-02-13)
✨ Highlights
-
Custom ops for GPUs
Our new custom op API allows you to extend MAX Engine with new graph operations written in Mojo that execute on either CPU or GPU, providing full composability and extensibility for your models. See more in the section about GPU programming.
-
Enhanced support for agentic workflows
MAX Serve now supports function calling, which allows you to instruct your model to interact with other systems, such as retrieving data and executing external tasks. Learn more about function calling and tool use.
MAX Serve now supports structured output (also known as constrained decoding) for MAX models on GPU. This allows you to enforce the output format from a model using an input schema that defines the output structure. Learn more about structured output.
-
Extended model architecture support
-
MAX Serve now supports multimodal models that take both text and image inputs. For example, see how to deploy Llama 3.2 Vision.
-
MAX Serve now supports text embedding models. Learn how to deploy a text embedding model.
-
New max-pipelines CLI tool
Instead of cloning our GitHub repo to access our latest GenAI models, you can install the max-pipelines CLI tool and quickly run an inference or deploy an endpoint. Learn more in the max-pipelines docs.
Documentation
New tutorials:
Other docs:
MAX Serve
-
The /v1/completions REST endpoint now supports:
-
Pre-tokenized prompts.
-
Image inputs for multimodal models such as Llama-3.2-11B-Vision-Instruct. For an example, see how to generate image descriptions with Llama 3.2 Vision.
Known issue: You might receive faulty results because some parts of the text prompt get ignored for certain input combinations. We've identified the problem and will have a fix in a subsequent nightly release.
-
Function calling and tool use, which allows you to instruct your model to interact with other systems, such as retrieving data and executing external tasks. Learn more about function calling and tool use.
-
Structured output (also known as constrained decoding), which allows you to enforce the output format from a model using a JSON schema and the response_format field. To enable constrained decoding, pass --enable-structured-output when running the server. However, this feature currently works for MAX models on GPU only (support for PyTorch models and CPU is in progress). Learn more about structured output.
-
-
Added support for the /v1/embeddings API endpoint, allowing you to generate vector representations using embedding models. See how to deploy a text embedding model.
-
MAX Serve can evict requests when the number of available pages in the PagedAttention KVCache is limited. Previously, the KV manager would throw an OOM error when a batch that couldn't fit in the cache was scheduled.
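The function calling and structured output entries above both work through fields in the JSON request body. As a minimal sketch, assuming the server follows the OpenAI request conventions (a `tools` list with JSON-schema `parameters`, and a `response_format` of type `json_schema`), the bodies might be built like this. The model name, the `get_weather` tool, and the `city_answer` schema are hypothetical, and the exact fields and endpoint MAX Serve accepts should be confirmed in its docs.

```python
import json

# Hypothetical model identifier; substitute the model your server is running.
MODEL = "my-model"


def tool_call_request(question):
    """Build a request body that offers the model one callable tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        # OpenAI-style tool definition; `get_weather` is made up for illustration.
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }


def structured_output_request(question):
    """Build a request body that constrains the reply to a JSON schema."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        # Assumed OpenAI-style shape for the response_format field.
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "city_answer",
                "schema": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "population": {"type": "integer"},
                    },
                    "required": ["city", "population"],
                },
            },
        },
    }


if __name__ == "__main__":
    # Print the serialized bodies; sending them is left to your HTTP client.
    print(json.dumps(tool_call_request("What's the weather in Paris?"), indent=2))
    print(json.dumps(structured_output_request("Name a large city."), indent=2))
```

Per the entry above, structured output also requires starting the server with --enable-structured-output.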
MAX models
-
Added the max-pipelines CLI tool that simplifies the process to run inference with GenAI models (specified with a Hugging Face repo ID) and deploy them to a local endpoint with MAX Serve.
Previously, running or serving these models required cloning the modular/max GitHub repo and then running commands such as magic run llama3.
Model-specific commands such as llama3 and replit have been removed. They're now standardized and subsumed by flags like --model-path in the max-pipelines tool. Arguments such as --max-length and --weight-path are also still supported by max-pipelines.
To view a list of supported model architectures from Hugging Face, run max-pipelines list.
-
Added support for PagedAttention, which improves memory efficiency by partitioning the KV cache into smaller blocks, reducing fragmentation and enabling larger inference batches. You can enable it with --cache-strategy=paged and --kv-cache-page-size with a value that's a multiple of 128.
-
Added support for prefix caching in all cases where PagedAttention is supported. This allows for more efficient usage of KVCache and improved prefill performance for workloads with common prefixes. You can enable it by setting --enable-prefix-caching. For more information, see Prefix caching with PagedAttention.
-
Batch size and max length are now inferred from available memory and the Hugging Face model's default max length, respectively. If a configuration leads to an out-of-memory (OOM) error, we provide recommendations (to the best of our ability) to help you fit the model into memory.
-
Added support for heterogeneous KV caches for multi-modal models, such as Llama Vision, which cache different KV states for self and cross attention layers.
-
Added support for embedding models, starting with MPNet. For example:
max-pipelines generate \
--model-path=sentence-transformers/all-mpnet-base-v2 \
--prompt="Encode this sentence."
Also see how to deploy a text embedding model.
-
Added support for image and text multimodal models:
-
max-pipelines generate now accepts image input with --image_url.
-
Added an experimental Pixtral pipeline you can run as follows:
max-pipelines generate \
--model-path=mistral-community/pixtral-12b \
--prompt="What is in this image? [IMG]" \
--image_url=/images/artwork/max-serve-cloud.png
The pipeline is automatically used for all models implementing the LlavaForConditionalGeneration architecture.
The implementation currently has a limit of one image. We plan to support an arbitrary number of images of mixed sizes soon.
-
Added an experimental Llama Vision pipeline you can run as follows:
max-pipelines generate \
--model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt="<|image|><|begin_of_text|>What is in this image?" \
--image_url=/images/artwork/max-serve-cloud.png
The pipeline is automatically used for all models implementing the MllamaForConditionalGeneration architecture.
Note: This model is gated and requires that you set the HF_TOKEN environment variable. See Llama-3.2-11B-Vision-Instruct.
-
See how to generate image descriptions with Llama 3.2 Vision.
-
-
Added support for the Qwen2ForCausalLM model architecture (such as Qwen/Qwen2.5-7B-Instruct). For example:
max-pipelines generate \
--model-path=Qwen/Qwen2.5-7B-Instruct \
--prompt="Write bubble sort in python" \
--quantization-encoding bfloat16
-
Added support for offline batched inference for text-based LLMs, allowing you to load a model and run inference with a batch of inputs directly from Python, instead of relying on an HTTP interface. For an example, see examples/offline-inference/basic.py.
-
The --max-cache-batch-size flag has been deprecated in favor of --max-batch-size. Using --max-cache-batch-size now emits a deprecation warning and will stop working in a future release.
-
The --use-gpu flag has been deprecated in favor of --devices=cpu, --devices=gpu, or --devices=gpu-0,gpu-1,.... If the device isn't specified, the model runs on the first available GPU, or CPU if no GPUs are available.
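The PagedAttention entry in the list above describes partitioning the KV cache into fixed-size blocks. The following is a toy sketch of that bookkeeping only, not MAX's implementation: the `PagedKVCache` class and its methods are hypothetical names, and the page size of 128 simply mirrors the "multiple of 128" constraint mentioned for --kv-cache-page-size.

```python
class PagedKVCache:
    """Toy page allocator: maps each request to a list of fixed-size pages."""

    def __init__(self, num_pages, page_size=128):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # request id -> list of page indices
        self.lengths = {}       # request id -> number of cached tokens

    def append_tokens(self, request_id, num_tokens):
        """Reserve room for num_tokens more tokens; return False if out of pages."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # Ceiling division gives the total pages needed for the new length.
        total_pages = -(-(length + num_tokens) // self.page_size)
        needed_pages = total_pages - len(table)
        if needed_pages > len(self.free_pages):
            return False
        for _ in range(needed_pages):
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + num_tokens
        return True

    def release(self, request_id):
        """Return all of a request's pages to the free list."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_pages=4)
    print(cache.append_tokens("a", 130))  # needs two pages
    print(cache.append_tokens("b", 257))  # needs three pages, only two are free
```

Because pages don't need to be contiguous, a request only ever wastes part of its last page. A scheduler could react to a `False` result by releasing (evicting) another request and retrying, which is one plausible reading of the eviction behavior described under MAX Serve.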
MAX Engine
-
Improved internal kernel compilation speed by 1.5x to 4x across different models.
We've revamped our GPU compilation process so that all kernels in a program are compiled together into a single LLVM module, then split into separate kernels afterward. This ensures shared code between kernel entry points is only compiled once. For example, we observe a 3.7x speedup for Llama3.1-8b GPU startup time.
-
Improved initial model execution speed on NVIDIA GPUs.
Instead of compiling to PTX and performing just-in-time compilation during runtime, we now generate CUBIN binaries directly. While this increases initial compilation time, it significantly improves execution speed.
-
The kernels have been further tuned for performance on NVIDIA A100 GPUs.
Graph APIs
-
You can now write custom operations (ops) in Mojo, and add them to a graph constructed in Python, using custom() and inplace_custom().
For more detail, see the section below about GPU programming.
-
Cached compiled MAX graphs that make use of custom operations now get invalidated when the implementation of the custom operations changes.
-
Graph.add_weight() now takes an explicit device argument. This enables explicitly passing GPU-resident weights to session.load() via the weights registry to initialize the model.
-
max.graph.Weight now inherits from TensorValue, allowing you to call weight.cast() or weight.T. As such, TensorValue no longer accepts Weight for the value argument.
Pipeline APIs
-
TextTokenizer.new_context() now supports tool definitions passed through its request argument (via TokenGeneratorRequest.tools).
It also now supports JSON schemas passed through its request argument (via TokenGeneratorRequest.response_format).
-
Removed the default num_steps value for TokenGenerator.next_token(), ensuring users pass a value, reducing the potential for silent errors.
-
KVCacheStrategy now defaults to MODEL_DEFAULT.
As opposed to the previous setting, which always used the "continuous" caching strategy, the KV caching strategy now defaults on an architecture-specific basis so that the most optimized caching strategy is used.
-
The Linear layer now has a create() class method that automatically creates specializations of Linear for non-quantized, k-quant, or GPTQ layers.
-
Added nn.Conv1D for audio models like Whisper.
GPU programming
This release includes all-new APIs to program on GPUs. The way to write code for GPUs is to create custom operations with GPU functions that you can load into a MAX graph. This foundational API includes a few key components:
-
Mojo APIs to write custom op functions:
-
The @compiler.register decorator is applied to a Mojo struct that implements a custom op in an execute() function (for either CPU or GPU) and a shape() function that defines the custom op's output tensor.
-
The max.tensor package adds essential Mojo APIs for writing custom ops, such as:
-
The foreach() function, which efficiently executes an element-wise computation in parallel on either a GPU or CPU.
-
The ManagedTensorSlice type, which defines the input and output tensors for the custom op.
-
-
-
Python APIs to load custom ops into a model:
-
The custom() and inplace_custom() functions allow you to add the previously defined Mojo custom op to a MAX graph written in Python.
-
The InferenceSession constructor accepts the custom op implementation as a Mojo package in the custom_extensions argument.
-
For more detail, see the tutorial to build custom ops for GPUs, or check out this simple example of a custom op.
Additionally, we've added a new gpu package to the Mojo standard library that provides low-level programming constructs for working with GPUs. These APIs let you do things that you can't currently do with the high-level foreach() abstraction above. The Mojo gpu APIs allow you to manually manage interaction between the CPU host and GPU device, manage memory between devices, synchronize threads, and more. For some examples, see vector_addition.mojo and top_k.mojo.
Mojo
Mojo is a crucial component of the MAX stack that enables all of MAX’s performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.
v24.6 (2024-12-17)
This is a huge update that offers a first look at our serving library for MAX on GPUs!
Also check out our blog post introducing MAX 24.6.
✨ Highlights
-
MAX Engine on GPUs preview
We’re excited to share a preview of MAX Engine on GPUs. We’ve created a few tutorials that demonstrate MAX’s ability to run GenAI models with our next-generation MAX graph compiler on NVIDIA GPU architectures (including A100, A10, L4, and L40 GPUs). You can experience it today by deploying Llama 3 on an A100 GPU.
-
MAX Serve preview
This release also includes an all-new serving interface called MAX Serve. It's a Python-based serving layer that supports both native MAX models when you want a high-performance deployment, and off-the-shelf PyTorch LLMs from Hugging Face when you want to explore and experiment, all with GPU support. It provides an OpenAI-compatible REST endpoint for inference requests, and a Prometheus-compatible metrics endpoint. You can use a magic command to start a local server, or use our ready-to-deploy MAX container to start an endpoint in the cloud. Try it now with an LLM from Hugging Face.
-
Upgraded MAX models
As we continue to build our Python-based MAX Graph API that allows you to build high-performance GenAI models, we’ve made a ton of performance improvements to the existing models and added a few new models to our GitHub repo. All the Python-based MAX models now support GPUs and broad model architectures. For example, llama3 adds compatibility for the LlamaForCausalLM family, which includes over 20,000 model variants and weights on Hugging Face.
Documentation
New tutorials:
Other new docs:
Also, our documentation is now available for MAX nightly builds! If you’re building with a MAX nightly release, you can switch to see the nightly docs using a toggle to the right of the search bar.
MAX Serve
This release includes a preview of our Python-based serving library called MAX Serve. It simplifies the process to deploy your own inference server with consistent and reliable performance.
MAX Serve currently includes the following features:
-
Deploys locally and to the cloud with our MAX container image, or with the magic CLI.
-
An OpenAI-compatible server with streaming /chat/completion and /completion endpoints for LLM inference requests.
-
Prometheus-compatible metrics endpoint with LLM KPIs (time to first token, TTFT, and inter-token latency, ITL) for monitoring and evaluating performance.
-
Supports most TextGeneration Hugging Face Hub models.
-
Multiprocess HTTP/model worker architecture to maximize CPU core utilization by distributing multiple incoming requests across multiple processes, ensuring both high throughput and responsiveness.
-
Continuous heterogeneous batching to combine multiple incoming requests into a single inference (no waiting to fill a batch size) and improve total throughput.
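The metrics entry in the list above names two LLM KPIs, TTFT and ITL. As a rough illustration of what those numbers mean, here is a minimal sketch that derives them from client-side timestamps. The `latency_metrics` helper is hypothetical and uses the common definitions (time to first token, and the mean gap between consecutive tokens); how MAX Serve computes its own metrics isn't stated here.

```python
from statistics import mean


def latency_metrics(request_time, token_times):
    """Compute TTFT and mean ITL (both in seconds) from arrival timestamps.

    request_time: when the request was sent.
    token_times: when each generated token arrived, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    # Time to first token: delay between the request and the first token.
    ttft = token_times[0] - request_time
    # Inter-token latency: average gap between consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return {"ttft": ttft, "itl": itl}


if __name__ == "__main__":
    print(latency_metrics(0.0, [0.5, 0.75, 1.0]))
```

TTFT is dominated by prefill (processing the prompt), while ITL reflects steady-state decoding speed, so the two are usually tracked separately.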
There’s much more still in the works for MAX Serve, but you can try it today with our tutorials to Deploy Llama 3 on GPU with MAX Serve and Deploy a PyTorch model from Hugging Face.
Known issues:
-
While this release is enough to support typical chatbot applications, this release does not yet support the function-calling portion of the OpenAI API specification needed to enable robust agentic workflows.
-
Sampling is still limited and doesn’t currently respect temperature or other sampling-related API request input.
-
Structured generation is not supported.
-
Support for multi-modal models is still nascent.
MAX models
All of our Python-based GenAI models on GitHub now support GPUs!
As we add more models, we’re also building a robust set of libraries and infrastructure that make it easier to build and deploy a growing library of LLMs. Some of this is available in a new max.pipelines package and some of it is alongside the models on GitHub.
Here are just some of the highlights:
-
Deep integration with the Hugging Face ecosystem for a quick-to-deploy experience, such as using HF Model Hub tools to fetch config files, support for weights in safetensor format, support for HF tokenizers, and more. (We also support GGUF weight formats.)
-
Expanded set of model abstractions for use by different LLM architectures:
-
Attention layers (including highly optimized implementations with configurable masking, like AttentionWithRope). The optimized attention layers include variants that accept an attention mask. More memory-efficient variants that don’t take a mask instead take a “mask functor” argument to the kernel, which implements masking without materializing a mask by computing a mask value from input coordinates on the fly.
-
Transformers such as Transformer and TransformerBlock. These include an initial implementation of ragged tensors (tensors for which each dimension can have a different size), avoiding the use of padding tokens by flattening a batch of sequences of differing lengths.
-
Common layers such as RMSNorm, Embedding, and Sequential.
-
KV cache management helpers, like ContinuousBatchingKVCacheManager.
-
Low-level wrappers over optimized kernels like fused_qk_ragged_rope. These are custom fused kernels that update the KV cache in place. Although they are custom, they reuse the underlying kernel implementation by passing in lambda functions used to retrieve inputs and write to outputs in place.
-
-
Added generalized interfaces for text generation such as TokenGenerator and PipelineModel, which provide modularity within the models and serving infrastructure. Also added a plug-in mechanism (PipelineRegistry) to more quickly define new models, tokenizers, and other reusable components. For example, anything that conforms to TokenGenerator can be served using the LLM infrastructure within MAX Serve. We then used this interface to create the following:
-
An optimized TextGenerationPipeline that can be combined with any compatible graph and has powerful performance features like graph-based multi-step scheduling, sampling, KV cache management, ragged tensor support, and more.
-
A generic HFTextGenerationPipeline that can run, in eager mode, any Hugging Face model for which we don’t yet have an optimized implementation.
-
-
Models now accept weights via a weights registry, which is passed to the session.load() method’s weights_registry argument. The decoupling of weights and model architecture allows implementing all of the different fine-tunes for a given model with the same graph. Furthermore, because the underlying design is decoupled, we can later expose the ability to compile a model once and swap weights out on the fly, without re-compiling the model.
-
Added generic implementations of common kernels, which allow you to plug in different batching strategies (ragged or padded), KV cache management approaches (continuous batching), masking (causal, sliding window, etc.), and position encoding (RoPE or ALiBi) without having to rewrite any kernel code. (More about this in a future release.)
-
Multi-step scheduling to run multiple token-generation steps on GPU before synchronizing to the CPU.
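The list above mentions ragged tensors, which avoid padding tokens by flattening a batch of sequences of differing lengths. One common way to represent such a batch, sketched here with plain Python lists, is a flat token buffer plus row offsets. The `pack_ragged` and `unpack_ragged` helpers are hypothetical and are not part of the MAX APIs.

```python
def pack_ragged(sequences):
    """Flatten variable-length sequences into (flat tokens, row offsets).

    offsets has len(sequences) + 1 entries; sequence i occupies
    flat[offsets[i]:offsets[i + 1]].
    """
    flat = []
    offsets = [0]
    for seq in sequences:
        flat.extend(seq)
        offsets.append(len(flat))
    return flat, offsets


def unpack_ragged(flat, offsets):
    """Recover the original sequences from the flat buffer and offsets."""
    return [flat[start:end] for start, end in zip(offsets, offsets[1:])]


if __name__ == "__main__":
    batch = [[1, 2, 3], [4], [5, 6]]
    flat, offsets = pack_ragged(batch)
    print(flat, offsets)
    print(unpack_ragged(flat, offsets))
```

A padded layout for the same batch would store 3 x 3 = 9 slots, while the ragged layout stores only the 6 real tokens plus 4 offsets; the saving grows with the spread in sequence lengths.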
Updated models:
- Significant performance upgrades for Llama 3, and expanded compatibility with the LlamaForCausalLM model family. For example, it also supports Llama 3.2 1B and 3B text models.
New models:
-
Mistral NeMo (and other MistralForCausalLM models)
Known issues:
-
The Q4 quantized models currently work on CPU only.
-
Using a large setting for top-k with the Llama 3.1 model may lead to segmentation faults for certain workloads when run on NVIDIA GPUs. This should be resolved in the latest nightly MAX builds.
-
The models currently use a smaller default context window than the max_seq_len specified in the Hugging Face configuration files for a given model. This can be manually adjusted by setting the --max-length parameter to the desired context length when serving a model.
-
Some variants of the supported core models (like LlamaForCausalLM with a different number of heads, head sizes, etc.) might not be fully optimized yet. We plan to fully generalize our implementations in a future release.
MAX Engine
MAX Engine includes a lot of the core infrastructure that enables MAX to accelerate AI models on any hardware, such as the graph compiler, runtime, kernels, and the APIs to interact with it all, and it all works without external dependencies such as PyTorch or CUDA.
This release includes a bunch of performance upgrades to our graph compiler and runtime. We’ve added support for NVIDIA GPU architectures (including A100, A10, L4, and L40 GPUs), and built out new infrastructure so we can quickly add support for other GPU hardware.
Engine API changes:
-
InferenceSession now accepts a custom_extensions constructor argument, same as load(), to specify model extension libraries.
-
The Model object is now callable to run an inference.
Breaking changes:
-
Model.execute() signature changed to support GPUs.
-
The execute() function currently doesn’t accept keyword arguments. Instead you can pass tensors as a driver.Tensor, int, float, bool, np.generic, or DLPackArray (DLPack). Note that both PyTorch and NumPy arrays implement the DLPack protocol, which means you can also pass either of those types to execute().
-
execute_legacy() preserves the semantics of execute() with support for keyword arguments to help with migration, but will be removed in a future release. execute_legacy() doesn't support GPUs.
-
Calling execute() with positional arguments still works the same.
-
Driver APIs
MAX Driver (the max.driver module) is a new component of MAX Engine that’s still a work in progress. It provides primitives for working with heterogeneous hardware systems (GPUs and CPUs), such as allocating on-device memory, transferring data between host and device, querying device stats, and more. It’s a foundation on which other components of MAX Engine operate (for example, InferenceEngine now uses driver.Tensor to handle model inputs and outputs).
Driver API changes:
-
Added CUDA() device to open an NVIDIA GPU.
-
Added support for fp16 and bfloat16 dtypes.
-
Expanded functionality for max.driver.Device, with new class methods and properties. We are still working on building this out to support more accelerator features.
-
driver.Tensor (and the InferenceSession.load() argument weights_registry) now supports zero-copy interoperability with NumPy arrays and PyTorch tensors, using DLPack / DLPackArray.
-
driver.Tensor has new methods, such as from_dlpack(), element_size(), to(), to_numpy(), view(), zeros(), and more.
MAX Driver APIs are still changing rapidly and not yet ready for general use. We’ll publish more documentation in a future release.
Known issues:
-
MAX Driver is currently limited to managing just one NVIDIA GPU at a time (it does not yet support multi-GPU). It also does not yet support remote devices.
-
DLPack support is not complete. For example, streams are not yet supported.
Graph compiler
When you load a model into MAX Engine, the graph compiler is the component that inspects and optimizes all graph operations (ops) to deliver the best run time performance on each device.
This release includes various graph compiler improvements:
-
Major extensions to support NVIDIA GPUs (and other devices in the future), including async copies and caching of JIT’d kernels.
-
The runtime now performs scheduling to enable GPU compute overlap with the CPU.
-
New transformations to the Mojo kernels to enable a number of optimizations, including specialization on tensor dimensions, specialization on target hardware, specialization on non-tensor dimension input to kernels, automatic kernel fusion between operators, and more.
-
New algebraic simplifications and algorithms for ops such as horizontal fusion of matrix multiplications.
-
New CPU-side primitives for device management that are automatically transformed and optimized to reduce overhead (MAX does not need to use things like CUDA Graphs).
-
Updated memory planning to preallocate device memory (hoist computation from inference runtime to initialization time) and reduce per-inference overhead.
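The list above mentions horizontal fusion of matrix multiplications. One common form of that idea is that two multiplications sharing the same left-hand input can be replaced by a single multiplication against the column-wise concatenation of the two weight matrices. The sketch below demonstrates the identity with plain nested lists; it is an illustration of the general technique, not a description of what the MAX graph compiler does internally, and the function names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fused_matmul(x, w1, w2):
    """Compute x @ w1 and x @ w2 with one multiply against [w1 | w2]."""
    n1 = len(w1[0])
    # Concatenate the weight matrices column-wise (both have k rows).
    combined = [r1 + r2 for r1, r2 in zip(w1, w2)]
    out = matmul(x, combined)
    # Split the fused result back into the two original outputs.
    return [row[:n1] for row in out], [row[n1:] for row in out]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w1 = [[1, 0], [0, 1]]
    w2 = [[2], [3]]
    print(fused_matmul(x, w1, w2))
    print(matmul(x, w1), matmul(x, w2))
```

The fused version reads `x` once instead of twice and launches one larger multiplication, which is typically friendlier to hardware than two small ones.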
Graph APIs
The graph compiler is also exposed through the MAX Graph APIs (the max.graph package), which allow you to build high-performance GenAI models in Python.
Graph API changes:
-
Python stack traces from model execution failures now include a trace to the original op-creation, allowing for easier debugging during development.
-
The max.graph APIs now include preliminary support for symbolic algebraic expressions using AlgebraicDim, enabling more powerful support for checked dynamic shapes. This allows expressions such as -Dim("x") - 4. Furthermore, the algebraic expressions simplify to a canonical form, so that, for example, -Dim("x") - 4 == -(Dim("x") + 4) holds.
-
More advanced dtype promotion now allows TensorValue math operators to just work when used with NumPy arrays and Python primitives.
-
TensorValue has new methods, such as broadcast_to(), cast(), flatten(), permute(), and more.
-
Added BufferValue, which allows for device-resident tensors that are read and mutated within the graph.
-
DType has new methods and properties: align, size_in_bytes, and is_float().
-
Value constructor accepts more types for value.
-
TensorValue constructor accepts more types for value.
-
TensorValue.rebind() accepts a new message argument.
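The AlgebraicDim entry in the list above says that algebraic expressions simplify to a canonical form, so that two differently written expressions compare equal. The sketch below illustrates that idea for linear expressions only. `LinearDim` is a hypothetical stand-in written for this example; it is not the max.graph `Dim` or `AlgebraicDim` API, and the real implementation may canonicalize differently.

```python
class LinearDim:
    """Hypothetical symbolic dimension: sum(coeff * name) + constant."""

    def __init__(self, coeffs=None, const=0):
        # Drop zero coefficients so equal expressions share one representation.
        self.coeffs = {k: v for k, v in (coeffs or {}).items() if v != 0}
        self.const = const

    @classmethod
    def symbol(cls, name):
        """Create the expression consisting of a single named dimension."""
        return cls({name: 1})

    @staticmethod
    def _coerce(other):
        """Treat plain integers as constant expressions."""
        return other if isinstance(other, LinearDim) else LinearDim(const=int(other))

    def __neg__(self):
        return LinearDim({k: -v for k, v in self.coeffs.items()}, -self.const)

    def __add__(self, other):
        other = self._coerce(other)
        merged = dict(self.coeffs)
        for name, coeff in other.coeffs.items():
            merged[name] = merged.get(name, 0) + coeff
        return LinearDim(merged, self.const + other.const)

    def __sub__(self, other):
        return self + (-self._coerce(other))

    def __eq__(self, other):
        # Canonical form makes equality a simple structural comparison.
        other = self._coerce(other)
        return self.coeffs == other.coeffs and self.const == other.const

    def __repr__(self):
        return f"LinearDim({self.coeffs}, {self.const})"


if __name__ == "__main__":
    x = LinearDim.symbol("x")
    print(-x - 4 == -(x + 4))
    print(x - 4 == x + 4)
```

Storing every expression as a coefficient map plus a constant means there is only one way to write each linear expression, which is what lets shape checks compare dimensions without a general-purpose symbolic solver.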
Breaking changes:
-
Graph.add_weight() now accepts Weight and returns TensorValue. Weight is essentially a named placeholder for a tensor that knows its name, dtype, shape, and optionally device and quantization encoding. Graph.add_weight() stages an op in the graph that is populated by a named weight in the weights registry passed to session.load.
-
The Weight constructor arguments changed; added align, dtype, and shape; removed assign, filepath, offset, and value.
-
The ops.scalar() method was removed along with the is_static() and is_symbolic() methods from all graph.type objects.
-
Instead of ops.scalar(), use ops.constant().
-
Instead of is_static() and is_symbolic(), use isinstance(dim, StaticDim) and isinstance(dim, SymbolicDim), respectively.
-
The MAX Graph APIs are not ready for general use, but you can experiment with them now by following this tutorial. We'll add more documentation when we finish some API redesigns.
Custom op registration
Although the APIs to write custom operators (ops) aren’t ready for general use, this release includes a significant redesign that lays the groundwork. You might notice some associated APIs in this release and more APIs in the nightlies, so here’s a little about the work in progress:
-
The custom op APIs will allow you to extend MAX Engine with new ops written in Mojo, providing full composability and extensibility for your models. It’s the exact same API we use to write MAX Engine’s built-in ops such as matmul. That means your custom ops can benefit from all our compiler optimization features such as kernel fusion; your ops are treated the same as all the ops included “in the box.”
-
The new API requires far less adornment at the definition site to enable the MAX model compiler to optimize custom ops along with the rest of the graph (compared to our previous version that used NDBuffer).
-
Custom ops support “destination passing style” for tensors.
-
The design composes on top of Mojo’s powerful metaprogramming, as well as the kernel library's abstractions for composable kernels.
We’ll publish more documentation when the custom op API is ready for general use. Check out the MAX repo’s nightly branch to see the latest custom op examples.
Known issues:
- Custom ops don't have type or lifetime checking. They also don't reason about mutability. Expect lots of sharp corners and segfaults if you hold them wrong while we improve this!
Numeric kernels
The GPU kernels for MAX Engine are built from the ground up in Mojo with no dependencies on external vendor code or libraries. This release includes the following kernel improvements:
-
AttenGen: a novel way to express attention patterns, including different attention masks, score functions, and caching strategies.
-
State-of-the-art matrix multiplication algorithms with optimizations such as the following:
-
Pipelining and double-buffering to overlap data transfer and computation and to hide memory access latency (for both global and shared memory).
-
Thread swizzling to avoid shared memory bank conflicts associated with tensor core layouts.
-
Block swizzling to increase L2 cache locality.
-
-
SplitK/StreamK GEMM algorithms: divide the computation along the shared K dimension into smaller matrices, which can then be executed independently on streaming multiprocessors (such as CUDA cores). These algorithms are ideal for matrices with a large K dimension but a small M dimension.
-
Large context length MHA: uses SplitK/StreamK to implement the attention mechanism and eliminate the need for a huge score matrix, which drastically reduces memory usage/traffic to enable large context length.
-
DualGemm: accelerates the multi-layer perceptron (MLP) layers where the left-hand side (LHS) is shared between two matrix multiplications.
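The SplitK entry in the list above describes dividing a matrix multiplication along the shared K dimension into independent pieces. The sketch below shows the arithmetic behind that idea in plain Python: each K slice yields a partial product, and the partial products are summed. The `split_k_matmul` function is a hypothetical illustration that runs the slices sequentially; it says nothing about how the MAX kernels schedule or reduce them on a GPU.

```python
def split_k_matmul(a, b, num_splits):
    """Multiply a (m x k) by b (k x n) by summing partial products over K slices."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // num_splits)  # ceil(k / num_splits)
    result = [[0] * n for _ in range(m)]
    for start in range(0, k, chunk):
        end = min(start + chunk, k)
        # Each slice is an independent partial GEMM over columns start..end of a
        # and rows start..end of b; the += below is the final reduction step.
        for i in range(m):
            for j in range(n):
                result[i][j] += sum(a[i][p] * b[p][j] for p in range(start, end))
    return result


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8]]
    b = [[1, 0], [0, 1], [1, 0], [0, 1]]
    print(split_k_matmul(a, b, num_splits=2))
```

When M is small, a conventional tiling over the output produces too few tiles to keep the device busy; splitting along K creates more independent work at the cost of an extra reduction.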
Known issues:
-
The MAX kernels are optimized for bfloat16 on GPUs.
-
Convolution on GPU is not performance optimized yet.
-
Although v24.6 technically runs on H100, it doesn’t include performance-optimized kernels for that device yet and it isn’t recommended.
Mojo
Mojo is a crucial component of the MAX stack that enables all of MAX’s performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the Mojo changelog.
v24.5 (2024-09-13)
✨ Highlights
-
Mojo and MAX are magical! We've created a new package and virtual environment manager, magic, for MAX and Mojo. Check it out!
-
New Llama3.1 pipeline built with the new MAX Graph Python API.
-
We have not one, but two new Python APIs that we're introducing in this release:
⭐️ New
-
Added repeat_interleave graph op.
-
Added caching for MAX graph models. This means that graph compilation is cached and the executable model is retrieved from cache on the 2nd and subsequent runs. Note that the model cache is architecture specific and isn't portable across different targets.
-
Support for Python 3.12.
MAX Graph Python API
This Python API will ultimately provide the same low-level programming interface for high-performance inference graphs as the Mojo API. As with the Mojo API, it's an API for graph-building only, and it does not implement support for training.
You can take a look at how the API works in the MAX Graph Python API reference.
MAX Driver Python API
The MAX Driver API allows you to interact with devices (such as CPUs and GPUs) and allocate memory directly onto them. With this API, you interact with this memory as tensors.
Note that this API is still under development, with support for non-host devices, such as GPUs, planned for a future release.
To learn more, check out the MAX Driver Python API reference.
MAX C API
New APIs for adding torch metadata libraries:
M_setTorchMetadataLibraryPath
M_setTorchMetadataLibraryPtr
🦋 Changed
MAX Engine performance
- Compared to v24.4, MAX Engine v24.5 generates tokens for Llama an average of 15%-48% faster.
MAX C API
Simplified the API for adding torch library paths, which now only takes one path per API call, but can be called multiple times to add paths to the config:
M_setTorchLibraries -> M_setTorchLibraryPath
⚠️ Deprecated
- The max command line tool is no longer supported and will be removed in a future release.
❌ Removed
- Dropped support for Ubuntu 20.04. If you're using Ubuntu, we currently support Ubuntu 22.04 LTS only.
- Dropped support for Python 3.8.
- Removed built-in PyTorch libraries from the max package. See the FAQ for information on supported torch versions.
v24.4 (2024-06-07)
🔥 Legendary
-
MAX is now available on macOS! Try it now.
-
New quantization APIs for MAX Graph. You can now build high-performance graphs in Mojo that use the latest quantization techniques, enabling even faster performance and more system compatibility for large models.
Learn more in the guide to quantize your graph weights.
⭐️ New
MAX Mojo APIs
-
Added AI pipeline examples in the max repo, with Mojo implementations for common transformer layers, including quantization support.
-
New Llama3 pipeline built with MAX Graph.
-
New Replit Code pipeline built with MAX Graph.
-
New TinyStories pipeline (based on TinyLlama) that offers a simple demo of the MAX Graph quantization API.
-
-
Added Mojo API inference example with the TorchScript BERT model.
-
Added `max.graph.checkpoint` package to save and load model weights.
All weights are stored in a `TensorDict`. You can save and load a `TensorDict` to disk with `save()` and `load()` functions.
-
Added MAX Graph quantization APIs:
- Added quantization encodings `BFloat16Encoding`, `Q4_0Encoding`, `Q4_KEncoding`, and `Q6_KEncoding`.
- Added the `QuantizationEncoding` trait so you can build custom quantization encodings.
- Added `Graph.quantize()` to create a quantized tensor node.
- Added `qmatmul()` to perform matrix-multiplication with a float32 and a quantized matrix.
-
Added some MAX Graph ops:
-
Added a `layer()` context manager and `current_layer()` function to aid in debugging during graph construction. For example:

    with graph.layer("foo"):
        with graph.layer("bar"):
            print(graph.current_layer()) # prints "foo.bar"
            x = graph.constant[DType.int64](1)
            graph.output(x)

This adds a path `foo.bar` to the added nodes, which will be reported during errors.
-
Added `format_system_stack()` function to format the stack trace, which we use to print better error messages from `error()`.
-
Added `TensorMap.keys()` to get all the tensor key names.
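The nesting behavior of `layer()` and `current_layer()` described above can be pictured with a short Python sketch. The `LayerTracker` class is hypothetical and is not part of MAX; it only assumes that layer names are kept on a stack and joined with dots to form the reported path.

```python
from contextlib import contextmanager
from typing import Iterator, List


class LayerTracker:
    """Hypothetical stand-in for a graph that tracks nested layer names."""

    def __init__(self) -> None:
        self._stack: List[str] = []

    @contextmanager
    def layer(self, name: str) -> Iterator["LayerTracker"]:
        # Push the name on entry and always pop it on exit, even on error.
        self._stack.append(name)
        try:
            yield self
        finally:
            self._stack.pop()

    def current_layer(self) -> str:
        # Join the active names into a dotted path such as "foo.bar".
        return ".".join(self._stack)
```

With this sketch, entering `layer("foo")` and then `layer("bar")` makes `current_layer()` return `"foo.bar"`, and the path shrinks again as each `with` block exits.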
MAX C API
Miscellaneous new APIs:
M_cloneCompileConfig()
M_copyAsyncTensorMap()
M_tensorMapKeys() and M_deleteTensorMapKeys()
M_setTorchLibraries()
🦋 Changed
MAX Mojo API
-
`EngineNumpyView.data()` and `EngineTensorView.data()` functions that return a type-erased pointer were renamed to `unsafe_ptr()`.
-
`TensorMap` now conforms to the `CollectionElement` trait to be copyable and movable.
-
`custom_nv()` was removed, and its functionality moved into `custom()` as a function overload, so it can now output a list of tensor symbols.
v24.3 (2024-05-02)
🔥 Legendary
-
You can now write custom ops for your models with Mojo!
Learn more about MAX extensibility.
🦋 Changed
-
Added support for named dynamic dimensions. This means you can specify when two or more dimensions in your model's input are dynamic but their sizes at run time must match each other. By specifying each of these dimension sizes with a name (instead of using `None` to indicate a dynamic size), the MAX Engine compiler can perform additional optimizations. See the notes below for the corresponding API changes that support named dimensions.
-
Simplified all the APIs to load input specs for models, making them more consistent.
MAX Engine performance
- Compared to v24.2, MAX Engine v24.3 shows an average speedup of 10% on PyTorch models, and an average 20% speedup on dynamically quantized ONNX transformers.
MAX Graph API
The `max.graph` APIs are still changing rapidly, but starting to stabilize.
See the updated guide to build a graph with MAX Graph.
-
`AnyMoType` renamed to `Type`, `MOTensor` renamed to `TensorType`, and `MOList` renamed to `ListType`.
-
Removed `ElementType` in favor of using `DType`.
-
Removed `TypeTuple` in favor of using `List[Type]`.
-
Removed the `Module` type so you can now start building a graph by directly instantiating a `Graph`.
-
Some new ops in `max.ops`, including support for custom ops.
See how to create a custom op in MAX Graph.
MAX Engine Python API
-
Redesigned `InferenceSession.load()` to replace the confusing `options` argument with a `custom_ops_path` argument for use when loading a custom op, and an `input_specs` argument for use when loading TorchScript models.
As a result, `CommonLoadOptions`, `TorchLoadOptions`, and `TensorFlowLoadOptions` have all been removed.
-
`TorchInputSpec` now supports named dynamic dimensions (previously, dynamic dimension sizes could be specified only as `None`). This lets you tell MAX which dynamic dimensions are required to have the same size, which helps MAX better optimize your model.
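As a rough illustration of what a named dynamic dimension constrains, the following Python sketch checks run-time shapes against specs in which each dimension is an `int` (fixed size), `None` (dynamic and unconstrained), or a `str` name (dynamic, but equal wherever the same name appears). The `check_named_dims` helper is hypothetical and is not part of the MAX API; it only mirrors the matching rule described above.

```python
from typing import Dict, Optional, Sequence, Union

# A dimension spec: fixed size, named dynamic size, or anonymous dynamic size.
Dim = Union[int, str, None]


def check_named_dims(
    specs: Sequence[Sequence[Dim]],
    shapes: Sequence[Sequence[int]],
) -> Dict[str, int]:
    """Return the size bound to each name, or raise ValueError on a mismatch."""
    bound: Dict[str, int] = {}
    for spec, shape in zip(specs, shapes):
        if len(spec) != len(shape):
            raise ValueError(f"rank mismatch: spec {spec} vs shape {shape}")
        for dim, size in zip(spec, shape):
            if dim is None:
                continue  # anonymous dynamic dimension: any size is accepted
            if isinstance(dim, int):
                if dim != size:
                    raise ValueError(f"expected fixed size {dim}, got {size}")
            elif bound.setdefault(dim, size) != size:
                # The same name was already bound to a different size.
                raise ValueError(
                    f"dimension '{dim}' is {bound[dim]} elsewhere, got {size}"
                )
    return bound
```

Two inputs that both use the name `"batch"` must therefore agree on that size at run time, which is the kind of guarantee a compiler can rely on when optimizing.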
MAX Engine Mojo API
-
`InferenceSession.load_model()` was renamed to `load()`.
-
Redesigned `InferenceSession.load()` to replace the confusing `config` argument with a `custom_ops_path` argument for use when loading a custom op, and an `input_specs` argument for use when loading TorchScript models.
Doing so removed `LoadOptions` and introduced the new `InputSpec` type to define the input shape/type of a model (instead of `LoadOptions`).
-
New `ShapeElement` type to allow for named dynamic dimensions (in `InputSpec`).
-
`max.engine.engine` module was renamed to `max.engine.info`.
MAX Engine C API
`M_newTorchInputSpec()` now supports named dynamic dimensions (via new `dimNames` argument).
❌ Removed
-
Removed TensorFlow support in the MAX SDK, so you can no longer load a TensorFlow SavedModel for inference. However, TensorFlow is still available for enterprise customers.
We removed TensorFlow because industry-wide TensorFlow usage has declined significantly, especially for the latest AI innovations. Removing TensorFlow also cuts our package size by over 50% and accelerates the development of other customer-requested features. If you have a production use-case for a TensorFlow model, please contact us.
-
Removed the Python `CommonLoadOptions`, `TorchLoadOptions`, and `TensorFlowLoadOptions` classes. See note above about `InferenceSession.load()` changes.
-
Removed the Mojo `LoadOptions` type. See the note above about `InferenceSession.load()` changes.
v24.2.1 (2024-04-11)
-
You can now import more MAX Graph functions from `max.graph.ops` instead of using `max.graph.ops.elementwise`. For example:

    from max.graph import ops
    var relu = ops.relu(matmul)
v24.2 (2024-03-28)
-
MAX Engine now supports TorchScript models with dynamic input shapes.
No matter what the input shapes are, you still need to specify the input specs for all TorchScript models.
-
The Mojo standard library is now open source!
Read more about it in this blog post.
-
And, of course, lots of Mojo updates, including implicit traits, support for keyword arguments in Python calls, a new `List` type (previously `DynamicVector`), some refactoring that might break your code, and much more.
For details, see the Mojo changelog.
v24.1.1 (2024-03-18)
This is a minor release that improves error reports.
v24.1 (2024-02-29)
The first release of the MAX platform is here! 🚀
This is a preview version of the MAX platform. That means it is not ready for production deployment and is designed only for local development and evaluation.
Because this is a preview, some API libraries are still in development and subject to change, and some features that we previously announced are not quite ready yet. But there is a lot that you can do in this release!
This release includes our flagship developer tools, currently for Linux only:
-
MAX Engine: Our state-of-the-art graph compiler and runtime library that executes models from PyTorch and ONNX, with incredible inference speed on a wide range of hardware.
-
API libraries in Python, C, and Mojo to run inference with your existing models. See the API references.
-
The `max benchmark` tool, which runs MLPerf benchmarks on any compatible model without writing any code.
-
The `max visualize` tool, which allows you to visualize your model in Netron after partially lowering in MAX Engine.
-
An early look at the MAX Graph API, our low-level library for building high-performance inference graphs.
-
MAX Serving: A preview of our serving wrapper for MAX Engine that provides full interoperability with existing AI serving systems (such as Triton) and that seamlessly deploys within existing container infrastructure (such as Kubernetes).
- A Docker image that runs MAX Engine as a backend for NVIDIA Triton Inference Server. Try it now.
-
Mojo: The world's first programming language built from the ground-up for AI developers, with cutting-edge compiler technology that delivers unparalleled performance and programmability for any hardware.
-
The latest version of Mojo, the standard library, and the `mojo` command line tool. These are always included in MAX, so you don't need to download any separate packages.
-
The Mojo changes in each release are often quite long, so we're going to continue sharing those in the existing Mojo changelog.
-
Additionally, we've started a new GitHub repo for MAX, where we currently share a bunch of code examples for our API libraries, including some large model pipelines. You can also use this repo to report issues with MAX.