Mojo module
mma_util
This module provides abstractions for performing matrix-multiply-accumulate (MMA) operations using tensor cores.
PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-1688
AMD documentation: https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/
Functions
- load_matrix_a: For shape m16n8k8 and type tf32, loads a matrix A tile from memory into registers in the specific order that tensor cores require to perform a warp-synchronous MMA operation.
- load_matrix_a_amd
- load_matrix_b: For shape m16n8k8 and type tf32, loads a matrix B tile from memory into registers in the specific order that tensor cores require to perform a warp-synchronous MMA operation.
- load_matrix_b_amd
- store_matrix_d: Stores a matrix D tile from registers to memory in the specific order produced by a tensor-core-based warp-synchronous MMA operation.
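The "specific order" mentioned for the m16n8k8 tf32 shape can be made concrete with a small model of the fragment layout described in the linked PTX documentation. The Python sketch below is not this module's API and does not call Mojo. It only computes which (row, column) elements each of the 32 lanes in a warp would hold for A, B, and D, assuming the functions in this module follow the PTX layout. All helper names are hypothetical.

```python
# Illustrative model of the PTX m16n8k8 (tf32) fragment layout.
# Assumption: this mirrors the layout in the PTX documentation; it is not
# the Mojo implementation, and every name here is a hypothetical helper.

WARP_SIZE = 32
M, N, K = 16, 8, 8  # A is M x K, B is K x N, D is M x N


def lane_ids(lane):
    """Split a lane index into (groupID, threadID_in_group)."""
    return lane >> 2, lane % 4


def a_fragment_coords(lane):
    """(row, col) of registers a0..a3 held by `lane` for the 16x8 A tile."""
    group, tid = lane_ids(lane)
    return [(group, tid), (group + 8, tid), (group, tid + 4), (group + 8, tid + 4)]


def b_fragment_coords(lane):
    """(row, col) of registers b0..b1 held by `lane` for the 8x8 B tile."""
    group, tid = lane_ids(lane)
    return [(tid, group), (tid + 4, group)]


def d_fragment_coords(lane):
    """(row, col) of registers d0..d3 held by `lane` for the 16x8 D tile."""
    group, tid = lane_ids(lane)
    return [
        (group, 2 * tid),
        (group, 2 * tid + 1),
        (group + 8, 2 * tid),
        (group + 8, 2 * tid + 1),
    ]


def covers_exactly_once(coords_fn, rows, cols):
    """Check that the warp as a whole holds every tile element exactly once."""
    seen = [c for lane in range(WARP_SIZE) for c in coords_fn(lane)]
    expected = {(r, c) for r in range(rows) for c in range(cols)}
    return len(seen) == len(expected) and set(seen) == expected


if __name__ == "__main__":
    for lane in (0, 5, 31):
        print(lane, a_fragment_coords(lane), b_fragment_coords(lane), d_fragment_coords(lane))
```

Under this model each lane holds four elements of A, two of B, and four of D, and the warp as a whole covers each tile exactly once. That is why the load and store helpers must place elements in a fixed per-lane order rather than copying the tile row by row.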