Quantization
MAX allows you to load and run pre-quantized models through both its Python API and CLI. This guide explains quantization concepts and how to work with quantized models in your applications.
Understanding quantization
Quantization reduces the numeric precision of model weights to decrease memory usage and increase inference speed. For example, models originally trained with float32 weights can be represented using lower-precision types like int8 or int4, reducing each scalar value from 32 bits to 8 or 4 bits.
When used properly, quantization does not significantly affect model accuracy. There are several quantization encodings that provide different levels of precision and different storage formats, each with trade-offs that may work well for some models or graph operations ("ops") but not others. Some models also work well with a mixture of quantization types, so that only certain ops perform low-precision calculations while others retain high precision.
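To get a sense of the memory savings, here's a back-of-the-envelope calculation (plain Python, not a MAX API) for the weight memory of a hypothetical 8-billion-parameter model at different precisions. Real quantized formats also store per-block scaling factors, so actual files are somewhat larger:
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    # Convert a parameter count and per-weight bit width into GiB of weight storage.
    return num_params * bits_per_weight / 8 / (1024 ** 3)

params = 8_000_000_000
for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_memory_gib(params, bits):.1f} GiB")
# float32: ~29.8 GiB, bfloat16: ~14.9 GiB, int8: ~7.5 GiB, int4: ~3.7 GiB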
How to load pre-quantized models with MAX
You can load pre-quantized models using two primary approaches:
- By specifying a path to a quantized weight file
- By specifying the quantization encoding format for compatible models
When you have a quantized weight file, you can load it directly using the --weight-path argument:
max serve --model-path=meta-llama/Llama-3.1-8B-Instruct \
--weight-path=bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
MAX automatically detects the quantization format from the weight file. This approach works for models with standard quantization formats like GGUF and AWQ.
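After the server starts, you can send requests to the quantized model the same way as to any other served model. The following is a minimal sketch that assumes MAX Serve's default http://localhost:8000 address and its OpenAI-compatible /v1/chat/completions endpoint; adjust the URL and model name for your setup:
import requests

# Assumes the server started by the `max serve` command above is listening on
# the default http://localhost:8000 address with an OpenAI-compatible API.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is the meaning of life?"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])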
For models that have been quantized using specific techniques but don't use a separate weight file format, you can specify the quantization encoding directly with the --quantization-encoding flag:
max generate --model-path=hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
--quantization-encoding=gptq \
--prompt "What is the meaning of life?"
The --quantization-encoding flag accepts the following values:
- float32: Full-precision 32-bit floating point.
- bfloat16: Brain floating point 16-bit format.
- q4_0: 4-bit quantization format.
- q4_k: 4-bit quantization with K-means clustering.
- q6_k: 6-bit quantization with K-means clustering.
- gptq: Specialized quantization optimized for transformer-based models.
For more information on the max CLI, see the MAX CLI documentation or the MAX Serve API reference.
Quantized layer implementation
If you're building custom models with the MAX Graph API, you can implement custom quantized layers. This is useful when:
- You're building a model from scratch using the MAX Graph API
- You need precise control over how quantization is implemented
- You're implementing specialized model architectures that require custom quantized operations
To implement a quantized layer in Python, you'll need to make a few key changes compared to a standard linear layer. Let's look at the differences.
A standard linear layer in MAX might look like this:
from max import nn
from max.dtype import DType
from max.graph import DeviceRef, Weight

class Linear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = Weight(
            name="weight",
            dtype=DType.float32,
            shape=[in_dim, out_dim],
            device=DeviceRef.CPU(),
        )
        self.bias = Weight(name="bias", dtype=DType.float32, shape=[out_dim])

    def __call__(self, x):
        # With an [in_dim, out_dim] weight, no transpose is needed for x @ weight.
        return x @ self.weight.to(x.device) + self.bias.to(x.device)
To enable support for GGUF quantization encodings like Q4_0, Q4_K, or other encodings, you need to:
- Load weights from the quantized model checkpoint as uint8 with the appropriate shape.
- Replace the standard matrix multiplication (@) with the qmatmul operation.
- Specify the quantization encoding to use.
Here's how you might implement a quantized linear layer:
from max import nn
from max.dtype import DType
from max.graph import DeviceRef, Weight, ops
from max.graph.quantization import QuantizationEncoding

class QuantizedLinear(nn.Module):
    def __init__(self, in_dim, out_dim, quantization_encoding):
        super().__init__()
        self.weight = Weight(
            name="weight",
            # The DType must be uint8.
            dtype=DType.uint8,
            # This shape must be updated to match the quantized shape.
            shape=[in_dim, out_dim],
            device=DeviceRef.CPU(),
            quantization_encoding=quantization_encoding,
        )
        self.bias = Weight(name="bias", dtype=DType.float32, shape=[out_dim])

    def __call__(self, x):
        # qmatmul dequantizes the weight as part of the matrix multiplication.
        return ops.qmatmul(
            self.weight.quantization_encoding, None, x, self.weight.to(x.device)
        ) + self.bias.to(x.device)

quantized_linear = QuantizedLinear(in_dim, out_dim, QuantizationEncoding.Q4_0)
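The uint8 shape referenced in the comment above depends on the encoding's block layout and on how the checkpoint stores the tensor. As an illustrative sketch, assuming the GGUF/GGML Q4_0 layout, where weights are packed along the last axis in blocks of 32 values and each block is stored as a 2-byte float16 scale plus 16 bytes of packed 4-bit values (18 bytes per block), you could derive the quantized shape from the original float shape like this:
# Illustrative only: assumes the GGML Q4_0 block layout (32 weights per block,
# 2-byte scale + 16 bytes of packed 4-bit values = 18 bytes per block), packed
# along the last axis of the stored weight tensor.
Q4_0_BLOCK_SIZE = 32
Q4_0_BLOCK_BYTES = 18

def q4_0_shape(float_shape: list[int]) -> list[int]:
    *leading, last = float_shape
    assert last % Q4_0_BLOCK_SIZE == 0, "last axis must be a multiple of 32"
    return [*leading, last // Q4_0_BLOCK_SIZE * Q4_0_BLOCK_BYTES]

# For example, a [4096, 4096] float weight becomes a [4096, 2304] uint8 tensor.
print(q4_0_shape([4096, 4096]))
The exact shape and packing axis depend on the checkpoint you load, so verify the result against the shapes of the loaded weights.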
The QuantizationEncoding class in max.graph.quantization defines the quantization formats supported by MAX. These encodings include:
- Q4_0: 4-bit quantization format
- Q4_K: 4-bit quantization with K-means clustering
- Q5_K: 5-bit quantization with K-means clustering
- Q6_K: 6-bit quantization with K-means clustering
- GPTQ: Specialized quantization optimized for transformer-based models
With this implementation, you can add quantized weights to your MAX models. The qmatmul operation handles the dequantization process during inference, giving you the performance benefits of quantization without having to manage the low-level details.