
Mojo function

matmul_kernel

matmul_kernel[
    c_type: DType,
    a_type: DType,
    b_type: DType,
    tile_size: Int,
    elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](Index[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](Index[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}),
    s_type: DType = get_accum_type[::DType,::DType]()
](
    c_ptr: UnsafePointer[SIMD[c_type, 1]],
    a_ptr: UnsafePointer[SIMD[a_type, 1]],
    b_ptr: UnsafePointer[SIMD[b_type, 1]],
    m: Int,
    n: Int,
    k: Int
)

Matrix multiplication using shared memory. This version loads tile_size x tile_size blocks from A and B and updates a tile_size x tile_size tile of C. The thread block should have shape (tile_size, tile_size, 1), with each thread mapped to one element of C. The grid should have shape (N/tile_size, M/tile_size, 1); N is the first grid dimension so that accesses to C are coalesced.
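
The tiling scheme described above is the classic shared-memory GEMM pattern: each thread block stages one tile_size x tile_size block of A and one of B into shared memory, accumulates a partial dot product, then advances along K. The following CUDA sketch illustrates the same pattern; it is not the Mojo implementation, and the fixed tile size of 16 plus the assumption that m, n, and k are multiples of the tile size are simplifications for illustration.

```cuda
#define TILE 16

// One thread per output element of C (m x n). A is m x k, B is k x n,
// all row-major. Assumes m, n, and k are multiples of TILE for brevity.
__global__ void matmul_tiled(float* C, const float* A, const float* B,
                             int m, int n, int k) {
    __shared__ float a_tile[TILE][TILE];
    __shared__ float b_tile[TILE][TILE];

    // blockIdx.x walks N (columns of C) so that consecutive threads in a
    // warp touch consecutive addresses of C, giving coalesced access.
    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;

    float acc = 0.0f;
    for (int t = 0; t < k; t += TILE) {
        // Cooperatively stage one TILE x TILE block of A and of B.
        a_tile[threadIdx.y][threadIdx.x] = A[row * k + (t + threadIdx.x)];
        b_tile[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();

        // Accumulate the partial dot product for this K-slab.
        for (int i = 0; i < TILE; ++i)
            acc += a_tile[threadIdx.y][i] * b_tile[i][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Launch shape mirrors the docstring: grid (n / TILE, m / TILE), block (TILE, TILE).
// matmul_tiled<<<dim3(n / TILE, m / TILE), dim3(TILE, TILE)>>>(C, A, B, m, n, k);
```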