Mojo function

load_matrix_b_amd

load_matrix_b_amd[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[Float32], tile_row: Int, tile_col: Int, ldm: Int) -> Float32

Loads a tile of matrix B from memory to registers for AMD FP32 tensor core operations.

Parameters:

m (Int): Number of rows in the output matrix tile.
n (Int): Number of columns in the output matrix tile.
k (Int): Inner dimension for matrix multiplication.

Args:

b_ptr (UnsafePointer): Pointer to matrix B data in memory (FP32 format).
tile_row (Int): Starting row index of the tile.
tile_col (Int): Starting column index of the tile.
ldm (Int): Leading dimension of matrix B (stride between rows).

Returns:

Float32: A single FP32 value loaded from matrix B (a SIMD vector of width 1).
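
The `tile_row`, `tile_col`, and `ldm` arguments together locate a tile inside a flat buffer. The following is a minimal Python sketch of that addressing, assuming matrix B is stored row-major so that `ldm` is the distance in elements between the starts of consecutive rows. The helper name `tile_element_offset` and the 8x6 example matrix are hypothetical. The sketch does not model how `load_matrix_b_amd` distributes elements across the threads of a warp.

```python
def tile_element_offset(tile_row: int, tile_col: int, ldm: int,
                        r: int = 0, c: int = 0) -> int:
    """Flat offset of element (r, c) within the tile starting at (tile_row, tile_col).

    Assumes row-major storage, where ldm is the stride between rows.
    """
    return (tile_row + r) * ldm + (tile_col + c)


# Hypothetical 8x6 row-major matrix B stored as a flat list.
# With no padding between rows, ldm equals the column count.
rows, cols = 8, 6
ldm = cols
b = [float(i) for i in range(rows * cols)]

# Top-left element of the tile that starts at row 2, column 3.
offset = tile_element_offset(2, 3, ldm)
value = b[offset]

# Moving one row down within the tile advances the offset by exactly ldm.
next_row_offset = tile_element_offset(2, 3, ldm, r=1)
```

If rows are padded in memory, `ldm` is larger than the logical column count, and the same formula still applies.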

load_matrix_b_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](b_ptr: UnsafePointer[Float16], tile_row: Int, tile_col: Int, ldm: Int, tile_loops: Int = 1) -> SIMD[DType.float16, 4]

Loads a tile of matrix B from memory to registers for AMD FP16 tensor core operations.

This function loads 4 consecutive FP16 values per thread from matrix B in a pattern optimized for AMD GPU tensor core operations. Each thread loads values based on its position within the warp.

Performance:

Optimized for AMD GPU memory access patterns.
Uses thread ID to determine which elements to load.
Loads 4 consecutive elements per thread for efficient vectorization.

Parameters:

m (Int): Number of rows in the output matrix tile.
n (Int): Number of columns in the output matrix tile.
k (Int): Inner dimension for matrix multiplication.
n_blocks (Int): Number of blocks. Defaults to 1.

Args:

b_ptr (UnsafePointer): Pointer to matrix B data in memory (FP16 format).
tile_row (Int): Starting row index of the tile.
tile_col (Int): Starting column index of the tile.
ldm (Int): Leading dimension of matrix B (stride between rows).
tile_loops (Int): Number of tile loops across matrix B's row dimension. Defaults to 1.

Returns:

SIMD: SIMD vector containing 4 FP16 values loaded from matrix B.

load_matrix_b_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](b_ptr: UnsafePointer[BFloat16], tile_row: Int, tile_col: Int, ldm: Int, tile_loops: Int = 1) -> SIMD[DType.bfloat16, 4]

Loads a tile of matrix B from memory to registers for AMD BF16 tensor core operations.

This function loads 4 consecutive BF16 values per thread from matrix B in a pattern optimized for AMD GPU tensor core operations. Each thread loads values based on its position within the warp.

Performance:

Optimized for AMD GPU memory access patterns.
Uses thread ID to determine which elements to load.
Loads 4 consecutive elements per thread for efficient vectorization.

Parameters:

m (Int): Number of rows in the output matrix tile.
n (Int): Number of columns in the output matrix tile.
k (Int): Inner dimension for matrix multiplication.
n_blocks (Int): Number of blocks. Defaults to 1.

Args:

b_ptr (UnsafePointer): Pointer to matrix B data in memory (BF16 format).
tile_row (Int): Starting row index of the tile.
tile_col (Int): Starting column index of the tile.
ldm (Int): Leading dimension of matrix B (stride between rows).
tile_loops (Int): Number of tile loops across matrix B's row dimension. Defaults to 1.

Returns:

SIMD: SIMD vector containing 4 BF16 values loaded from matrix B.