Mojo struct
BFloat16Encoding
The bfloat16 quantization encoding.
Like float32, the bfloat16 encoding uses 8 bits to store the exponent value, so it has the same numeric range as float32. However, it has just 7 bits for the mantissa (compared to 23 bits available in float32), so it has less precision for the fractional part. This is often a better trade-off for ML applications, compared to traditional float16, which has less numeric range because it uses only 5 bits to store the exponent (though it has better precision with 10 bits for the mantissa).
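The range-versus-precision trade-off described above can be checked with a few lines of bit manipulation. The following is a minimal sketch in plain Python, not the encoder this struct uses. The helper names `float32_to_bfloat16_bits` and `bfloat16_bits_to_float` are hypothetical, and round-to-nearest-even is an assumption, since the page does not state which rounding mode the cast applies.

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (hypothetical helper).

    Assumes round-to-nearest-even; the real cast may round differently.
    """
    # Reinterpret the float32 value as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: keep the upper half and force a mantissa bit so it stays a NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Round to nearest, ties to even, then drop the low 16 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bfloat16_bits_to_float(b: int) -> float:
    """Widen a bfloat16 bit pattern back to a float by zero-filling 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

# Range: 3.0e38 is far beyond float16's largest finite value (65504),
# but it fits in bfloat16 because the exponent field is still 8 bits wide.
big = bfloat16_bits_to_float(float32_to_bfloat16_bits(3.0e38))

# Precision: with 7 mantissa bits the spacing just above 1.0 is 2**-7,
# so 1.001 is not representable and collapses to 1.0.
small = bfloat16_bits_to_float(float32_to_bfloat16_bits(1.001))
```

Here `big` stays within a relative error of about 2^-8 of the input, while `small` is exactly 1.0, which is the loss of fractional precision the description refers to.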
Because the quantized data is held in a special packed format, it currently does not print as float values at runtime; it is just a bag of bits in uint8 format.
Implemented traits
AnyType, QuantizationEncoding
Methods
quantize
static quantize(tensor: Tensor[float32]) -> Tensor[uint8]
Quantizes the full-precision input tensor to bfloat16.
Only supports quantizing from float32, using a direct elementwise cast.
Args:
- tensor (Tensor[float32]): Full-precision tensor to quantize to bfloat16.
Returns:
Quantized bfloat16 tensor. The tensor datatype is uint8 because this is simply a byte buffer. Each scalar is actually encoded into two bytes (16 bits).
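To make the "byte buffer" return value concrete, here is a rough sketch in plain Python of how bfloat16 scalars can occupy two uint8 slots each. It is an illustration only: `pack_bfloat16` and `unpack_bfloat16` are hypothetical names, the little-endian byte order is an assumption, and the cast simply truncates the low 16 bits of each float32 for brevity. None of this is confirmed as the layout the library actually produces.

```python
import struct

def pack_bfloat16(values):
    """Encode floats as bfloat16, two bytes per scalar (hypothetical helper)."""
    out = bytearray()
    for v in values:
        bits32 = struct.unpack("<I", struct.pack("<f", v))[0]
        # Truncating cast: keep only the upper 16 bits of the float32 pattern.
        # Byte order is assumed little-endian for this sketch.
        out += struct.pack("<H", bits32 >> 16)
    return bytes(out)

def unpack_bfloat16(buf):
    """Decode a byte buffer produced by pack_bfloat16 back into floats."""
    count = len(buf) // 2
    halves = struct.unpack(f"<{count}H", buf)
    # Zero-fill the low 16 bits to widen each value back to float32.
    return [struct.unpack("<f", struct.pack("<I", h << 16))[0] for h in halves]

values = [1.0, -2.5, 0.15625]
buf = pack_bfloat16(values)      # 3 scalars -> 6 bytes
decoded = unpack_bfloat16(buf)   # these values are exactly representable
```

Under these assumptions, a tensor of n float32 elements maps to a uint8 buffer of 2n bytes, and reading that buffer directly shows raw bytes rather than float values.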
id
static id() -> String
Identifier for the bfloat16 quantized encoding.