Python module

quantization

APIs to quantize graph tensors.

This package provides a generic quantization encoding interface and several encodings that conform to it, such as bfloat16 and Q4_0.

The main interface for defining a new quantized type is QuantizationEncoding.quantize(). It takes a full-precision tensor represented as float32 and quantizes it according to the encoding. The resulting quantized tensor is represented as a bytes tensor, so the QuantizationEncoding must know how to translate between the tensor shape and the corresponding quantized buffer shape.
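The shape translation described above can be sketched in plain Python. This is only an illustration, not the MAX implementation: `BlockParams` and `quantized_shape` are hypothetical stand-ins, the sketch assumes blocks are formed along the last dimension, and the Q4_0 values (32 elements packed into 18 bytes) are assumed from the GGML-style Q4_0 layout rather than read from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockParams:
    """Hypothetical stand-in for BlockParameters; not the MAX class itself."""

    elements_per_block: int
    block_size: int  # bytes per encoded block


def quantized_shape(shape, params):
    """Map a float32 tensor shape to the shape of its bytes buffer.

    Assumption: blocks are formed along the last dimension only.
    """
    *leading, last = shape
    if last % params.elements_per_block != 0:
        raise ValueError(
            f"last dimension {last} is not a multiple of "
            f"{params.elements_per_block}"
        )
    blocks = last // params.elements_per_block
    return (*leading, blocks * params.block_size)


# Assumed values: GGML-style Q4_0 packs 32 elements into 18 bytes.
Q4_0_PARAMS = BlockParams(elements_per_block=32, block_size=18)

print(quantized_shape((4096, 4096), Q4_0_PARAMS))  # (4096, 2304)
```

Under these assumptions, a row of 4096 float32 values (16384 bytes) becomes 128 blocks of 18 bytes each, or 2304 bytes.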

Quantization support for MAX Graph.

`BlockParameters`

class max.graph.quantization.BlockParameters(elements_per_block: int, block_size: int)

`block_size`

block_size*: int*

Number of bytes in the encoded representation of a block.

`elements_per_block`

elements_per_block*: int*

Number of elements per block.

`QuantizationEncoding`

class max.graph.quantization.QuantizationEncoding(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Quantization encodings supported by MAX Graph.

`Q4_0`

Q4_0 = 'Q4_0'

`Q4_K`

Q4_K = 'Q4_K'

`Q5_K`

Q5_K = 'Q5_K'

`Q6_K`

Q6_K = 'Q6_K'

`block_parameters`

property block_parameters*: BlockParameters*

The block parameters (elements per block and bytes per block) for this encoding.

`block_size`

property block_size*: int*

Number of bytes in the encoded representation of a block.

All quantization types currently supported by MAX Graph are block-based: elements are grouped into blocks of a fixed size, and each group is quantized together into a fixed-size output block. This value is the number of bytes produced by encoding a single block.
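To make the block-based idea concrete, here is a toy encoder written as a minimal sketch. It is not the actual Q4_0 byte layout or the MAX implementation: the functions `encode_block` and `encode` are hypothetical, and the scheme (a float16 scale followed by 32 symmetric 4-bit values) is an assumption chosen only to show how a fixed number of elements maps to a fixed number of bytes.

```python
import struct


def encode_block(values):
    """Encode 32 floats into an 18-byte block (toy symmetric 4-bit scheme)."""
    if len(values) != 32:
        raise ValueError("expected exactly 32 elements per block")
    max_abs = max(abs(v) for v in values)
    # One scale per block; fall back to 1.0 for an all-zero block.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round each value to an integer in [-7, 7], then shift to [1, 15]
    # so it fits an unsigned 4-bit nibble.
    nibbles = [max(-7, min(7, round(v / scale))) + 8 for v in values]
    # Pack two 4-bit values per byte: 32 elements -> 16 bytes.
    packed = bytes(
        lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2])
    )
    # Prefix the float16 scale (2 bytes): 2 + 16 = 18 bytes per block.
    # Note: struct raises OverflowError if the scale exceeds float16 range.
    return struct.pack("<e", scale) + packed


def encode(values):
    """Encode a flat list whose length is a multiple of 32."""
    if len(values) % 32 != 0:
        raise ValueError("length must be a multiple of 32")
    return b"".join(
        encode_block(values[i:i + 32]) for i in range(0, len(values), 32)
    )


data = [i / 10 for i in range(64)]  # two blocks' worth of elements
print(len(encode(data)))  # 36
```

In this toy scheme, the number of elements per block is 32 and the block size is 18 bytes, so 64 input elements always encode to exactly 36 bytes regardless of their values.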

`elements_per_block`

property elements_per_block*: int*

Number of elements per block.
