
Mojo function

matmul_kernel

matmul_kernel[
    c_type: DType,
    a_type: DType,
    b_type: DType,
    tile_size: Int,
    elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](Index[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](Index[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}),
    s_type: DType = get_accum_type[::DType,::DType]()
](
    c_ptr: UnsafePointer[SIMD[c_type, 1]],
    a_ptr: UnsafePointer[SIMD[a_type, 1]],
    b_ptr: UnsafePointer[SIMD[b_type, 1]],
    m: Int,
    n: Int,
    k: Int
)

Matrix multiplication using shared memory. This version loads tile_size x tile_size blocks from A and B and updates a tile_size x tile_size tile of C. The thread block should have shape (tile_size, tile_size, 1), with each thread mapped to one element of C. The grid should have shape (N/tile_size, M/tile_size, 1); N is the first grid dimension so that accesses to C are coalesced.
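
The tiling scheme described above is the classic shared-memory GEMM pattern: each thread block stages one tile_size x tile_size block of A and one of B into shared memory, accumulates a partial dot product, then advances along K. The following CUDA sketch illustrates the same pattern; it is not the Mojo implementation, and the fixed tile size of 16 plus the assumption that m, n, and k are multiples of the tile size are simplifications for illustration.

```cuda
#define TILE 16

// One thread per output element of C (m x n). A is m x k, B is k x n,
// all row-major. Assumes m, n, and k are multiples of TILE for brevity.
__global__ void matmul_tiled(float* C, const float* A, const float* B,
                             int m, int n, int k) {
    __shared__ float a_tile[TILE][TILE];
    __shared__ float b_tile[TILE][TILE];

    // blockIdx.x walks N (columns of C) so that consecutive threads in a
    // warp touch consecutive addresses of C, giving coalesced access.
    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;

    float acc = 0.0f;
    for (int t = 0; t < k; t += TILE) {
        // Cooperatively stage one TILE x TILE block of A and of B.
        a_tile[threadIdx.y][threadIdx.x] = A[row * k + (t + threadIdx.x)];
        b_tile[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();

        // Accumulate the partial dot product for this K-slab.
        for (int i = 0; i < TILE; ++i)
            acc += a_tile[threadIdx.y][i] * b_tile[i][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Launch shape mirrors the docstring: grid (n / TILE, m / TILE), block (TILE, TILE).
// matmul_tiled<<<dim3(n / TILE, m / TILE), dim3(TILE, TILE)>>>(C, A, B, m, n, k);
```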