
Python module

quantization

APIs to quantize graph tensors.

This package includes a generic quantization encoding interface and some quantization encodings that conform to it, such as bfloat16 and Q4_0 encodings.

The main interface for defining a new quantized type is QuantizationEncoding.quantize(). This takes a full-precision tensor represented as float32 and quantizes it according to the encoding. The resulting quantized tensor is represented as a bytes tensor. For that reason, the QuantizationEncoding must know how to translate between the tensor shape and its corresponding quantized buffer shape.
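To make the float32-to-bytes step concrete, here is a minimal pure-Python sketch of a Q4_0-style block encoder. It is an illustration, not the MAX implementation: the layout (32 elements per block, stored as a 2-byte float16 scale followed by 16 bytes of packed 4-bit codes) is assumed from the GGML convention for Q4_0, and the names `quantize_block` and `quantize_row` are hypothetical.

```python
import struct

# Assumed Q4_0-style layout (GGML convention, not confirmed by this page):
# 32 elements per block -> 2-byte float16 scale + 16 bytes of 4-bit codes.
ELEMENTS_PER_BLOCK = 32
BLOCK_SIZE = 18  # bytes per encoded block


def quantize_block(values: list[float]) -> bytes:
    """Encode one block of 32 floats into 18 bytes (illustrative only)."""
    assert len(values) == ELEMENTS_PER_BLOCK
    # The element with the largest magnitude determines the block scale.
    extreme = max(values, key=abs)
    scale = extreme / -8.0
    inverse = 1.0 / scale if scale != 0.0 else 0.0
    # Map every value to an unsigned 4-bit code in [0, 15].
    codes = [min(15, max(0, int(v * inverse + 8.5))) for v in values]
    half = ELEMENTS_PER_BLOCK // 2
    # Element i goes in the low nibble, element i + 16 in the high nibble.
    packed = bytes(codes[i] | (codes[i + half] << 4) for i in range(half))
    # "<e" is little-endian float16; it raises OverflowError for scales
    # outside the float16 range.
    return struct.pack("<e", scale) + packed


def quantize_row(row: list[float]) -> bytes:
    """Encode a row whose length is a multiple of the block length."""
    assert len(row) % ELEMENTS_PER_BLOCK == 0
    return b"".join(
        quantize_block(row[i : i + ELEMENTS_PER_BLOCK])
        for i in range(0, len(row), ELEMENTS_PER_BLOCK)
    )


row = [0.01 * (i - 32) for i in range(64)]
encoded = quantize_row(row)
print(len(encoded))  # 64 floats -> 2 blocks -> 36 bytes
```

Under these assumptions, a row of 64 float32 values (256 bytes) becomes a 36-byte buffer, which is why the encoding has to map between the logical tensor shape and the shape of the bytes buffer.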

Quantization support for MAX Graph.

BlockParameters

class max.graph.quantization.BlockParameters(elements_per_block: int, block_size: int)

block_size

block_size: int

elements_per_block

elements_per_block: int

QuantizationEncoding

class max.graph.quantization.QuantizationEncoding(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Quantization encodings supported by MAX Graph.

Q4_0

Q4_0 = 'Q4_0'

Q4_K

Q4_K = 'Q4_K'

Q5_K

Q5_K = 'Q5_K'

Q6_K

Q6_K = 'Q6_K'
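The constructor signature above is the standard one for a Python `Enum`, so the members should support the usual lookups by value and by name. The sketch below uses a hypothetical stand-in enum (`Encoding`) that mirrors the four members listed here, so it runs without MAX installed; with MAX available, the same patterns would presumably apply to `QuantizationEncoding` itself.

```python
from enum import Enum


class Encoding(Enum):
    """Hypothetical stand-in mirroring the members listed above."""

    Q4_0 = "Q4_0"
    Q4_K = "Q4_K"
    Q5_K = "Q5_K"
    Q6_K = "Q6_K"


# Look up a member from a string, e.g. one read from a config file.
from_value = Encoding("Q4_K")
# Look up a member by attribute name.
from_name = Encoding["Q6_K"]
# Iterate over all members in definition order.
all_names = [member.name for member in Encoding]

print(from_value, from_name.value, all_names)
```

Looking up an unknown value such as `Encoding("Q8_0")` raises `ValueError`, which is a convenient way to reject unsupported encodings early.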

block_parameters

property block_parameters: BlockParameters

block_size

property block_size: int

Number of bytes in the encoded representation of a block.

All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of bytes resulting after encoding a single block.

elements_per_block

property elements_per_block: int

Number of elements per block.

All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of elements gathered into a block.
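The two block values are enough to work out the size of a quantized buffer and the effective bits per element. The following is a sketch under stated assumptions: `BlockParams` is a hypothetical stand-in for `BlockParameters` (same two fields), blocks are assumed to be formed along the last dimension, and the example numbers (32 elements, 18 bytes) follow the GGML layout for Q4_0 rather than anything stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockParams:
    """Hypothetical stand-in with the same two fields as BlockParameters."""

    elements_per_block: int
    block_size: int  # bytes per encoded block


def quantized_shape(shape: tuple[int, ...], params: BlockParams) -> tuple[int, ...]:
    """Shape of the bytes buffer, assuming blocks run along the last axis."""
    *leading, last = shape
    if last % params.elements_per_block != 0:
        raise ValueError("last dimension must be a multiple of elements_per_block")
    blocks = last // params.elements_per_block
    return (*leading, blocks * params.block_size)


def bits_per_element(params: BlockParams) -> float:
    """Average storage cost per element, including per-block metadata."""
    return params.block_size * 8 / params.elements_per_block


# Assumed Q4_0-like parameters: 32 elements packed into 18 bytes.
q4_0_like = BlockParams(elements_per_block=32, block_size=18)

print(quantized_shape((4096, 4096), q4_0_like))  # (4096, 2304)
print(bits_per_element(q4_0_like))  # 4.5
```

The 4.5 bits per element, rather than exactly 4, reflects the per-block scale stored alongside the 4-bit codes in this assumed layout.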