Python module
quantization
APIs to quantize graph tensors.
This package includes a generic quantization encoding interface and several encodings that conform to it, such as bfloat16 and Q4_0.
The main interface for defining a new quantized type is QuantizationEncoding.quantize(). It takes a full-precision tensor represented as float32 and quantizes it according to the encoding. The resulting quantized tensor is represented as a bytes tensor, so the QuantizationEncoding must know how to translate between the tensor shape and the shape of its corresponding quantized buffer.
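The shape translation described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `quantized_shape` and `dequantized_shape` are hypothetical, the sketch assumes blocks are formed along the last axis, and the example values (32 elements and 18 bytes per block) assume the GGML-style Q4_0 layout.

```python
def quantized_shape(shape, elements_per_block, block_size):
    """Map a float32 tensor shape to the shape of its bytes buffer.

    Hypothetical helper: assumes blocks are formed along the last axis.
    """
    *leading, last = shape
    if last % elements_per_block != 0:
        raise ValueError(
            f"last dimension {last} is not a multiple of {elements_per_block}"
        )
    # Each group of `elements_per_block` values becomes `block_size` bytes.
    return (*leading, last // elements_per_block * block_size)


def dequantized_shape(shape, elements_per_block, block_size):
    """Inverse mapping: recover the float32 shape from the bytes shape."""
    *leading, last = shape
    if last % block_size != 0:
        raise ValueError(f"last dimension {last} is not a multiple of {block_size}")
    return (*leading, last // block_size * elements_per_block)


# Assumed Q4_0-style parameters: 32 elements are packed into 18 bytes.
packed = quantized_shape((4096, 4096), elements_per_block=32, block_size=18)
restored = dequantized_shape(packed, elements_per_block=32, block_size=18)
```

With the real package, the two parameters would come from the encoding's elements_per_block and block_size properties rather than being hard-coded.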
Quantization support for MAX Graph.
BlockParameters
class max.graph.quantization.BlockParameters(elements_per_block: int, block_size: int)
block_size
block_size: int
elements_per_block
elements_per_block: int
QuantizationEncoding
class max.graph.quantization.QuantizationEncoding(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Quantization encodings supported by MAX Graph.
Q4_0
Q4_0 = 'Q4_0'
Q4_K
Q4_K = 'Q4_K'
Q5_K
Q5_K = 'Q5_K'
Q6_K
Q6_K = 'Q6_K'
block_parameters
property block_parameters: BlockParameters
block_size
property block_size: int
Number of bytes in the encoded representation of a block.
All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of bytes resulting after encoding a single block.
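To make the idea of a fixed-size output block concrete, here is a hedged sketch that encodes one group of 32 floats into 18 bytes. It follows the Q4_0 scheme as commonly described for GGML (one float16 scale followed by 16 bytes of packed 4-bit values); whether MAX Graph uses exactly this layout is an assumption, and `encode_block` is a hypothetical name, not part of the package.

```python
import struct

ELEMENTS_PER_BLOCK = 32  # assumed number of values grouped into one block
BLOCK_SIZE = 18          # 2-byte float16 scale + 16 bytes of packed nibbles


def encode_block(values):
    """Encode 32 floats into one 18-byte block (illustrative Q4_0-style scheme)."""
    if len(values) != ELEMENTS_PER_BLOCK:
        raise ValueError(f"expected {ELEMENTS_PER_BLOCK} values, got {len(values)}")
    # Take the element with the largest magnitude, keeping its sign,
    # and choose the scale so that this element maps to the integer -8.
    extreme = max(values, key=abs)
    scale = extreme / -8.0
    inverse = 1.0 / scale if scale != 0.0 else 0.0
    # Shift into the unsigned range and clamp to 4 bits: each result is in [0, 15].
    quants = [min(15, int(v * inverse + 8.5)) for v in values]
    # Pack two 4-bit values per byte: element i in the low nibble,
    # element i + 16 in the high nibble.
    half = ELEMENTS_PER_BLOCK // 2
    packed = bytes(quants[i] | (quants[i + half] << 4) for i in range(half))
    return struct.pack("<e", scale) + packed


block = encode_block([float(i) for i in range(ELEMENTS_PER_BLOCK)])
```

Every call returns exactly `BLOCK_SIZE` bytes regardless of the input values, which is what lets an encoding report a single block_size.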
elements_per_block
property elements_per_block: int
Number of elements per block.
All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of elements gathered into a block.
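Taken together, the two block properties determine the effective storage cost per element. The sketch below shows the arithmetic and the grouping step; the function names are hypothetical, and the 32-element, 18-byte figures are assumed Q4_0-style values used only as an example.

```python
def bits_per_element(elements_per_block, block_size):
    """Average number of stored bits per original element."""
    return block_size * 8 / elements_per_block


def split_into_blocks(values, elements_per_block):
    """Group a flat sequence into consecutive fixed-length blocks."""
    if len(values) % elements_per_block != 0:
        raise ValueError("element count must be a multiple of the block length")
    return [
        values[start:start + elements_per_block]
        for start in range(0, len(values), elements_per_block)
    ]


# Assumed Q4_0-style parameters: 32 elements per block, 18 bytes per block.
cost = bits_per_element(elements_per_block=32, block_size=18)
groups = split_into_blocks(list(range(64)), elements_per_block=32)
```

Under these assumed parameters the cost works out to 4.5 bits per element: 4 bits for each value plus the per-block scale amortized over the 32 elements.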