Mojo function
all_reduce_naive
```mojo
all_reduce_naive[type: DType, rank: Int, ngpus: Int, //](ctxs: List[DeviceContext], list_of_in_bufs: StaticTuple[NDBuffer[type, rank], ngpus], list_of_out_bufs: StaticTuple[NDBuffer[type, rank], ngpus])
```
Performs all-reduce across GPUs without using peer-to-peer access.
This implementation copies all data to each GPU and performs the reduction locally. It is used as a fallback when P2P access is not available.

Arguments:
- ctxs: List of device contexts for the participating GPUs.
- list_of_in_bufs: Input buffers from each GPU.
- list_of_out_bufs: Output buffers for each GPU.
Parameters:
- type (DType): The data type of tensor elements.
- rank (Int): Number of dimensions in input tensors.
- ngpus (Int): Number of GPUs participating in all-reduce.
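As a usage sketch, the call might be driven from host code along the lines below. The import path for all_reduce_naive, the DeviceContext(device_id=...) construction, and the buffer setup are assumptions inferred from the signature above, not details taken from this page.

```mojo
# Hedged sketch: run all_reduce_naive across two GPUs.
from buffer import NDBuffer
from gpu.host import DeviceBuffer, DeviceContext
from utils import StaticTuple
from utils.index import Index

# Import path for all_reduce_naive is an assumption; use the module this
# page documents.
from gpu.comm.allreduce import all_reduce_naive


def main():
    alias dtype = DType.float32
    alias rank = 1
    alias ngpus = 2
    alias length = 1024

    # One device context per participating GPU (assumes device ids 0..ngpus-1).
    var ctxs = List[DeviceContext]()
    for i in range(ngpus):
        ctxs.append(DeviceContext(device_id=i))

    # Per-GPU input/output views; the device allocations must stay alive for
    # the duration of the collective, so they are kept in `storage`.
    var in_bufs = StaticTuple[NDBuffer[dtype, rank], ngpus]()
    var out_bufs = StaticTuple[NDBuffer[dtype, rank], ngpus]()
    var storage = List[DeviceBuffer[dtype]]()
    for i in range(ngpus):
        var in_dev = ctxs[i].enqueue_create_buffer[dtype](length)
        var out_dev = ctxs[i].enqueue_create_buffer[dtype](length)
        # ... enqueue writes of this GPU's partial values into in_dev ...
        in_bufs[i] = NDBuffer[dtype, rank](in_dev.unsafe_ptr(), Index(length))
        out_bufs[i] = NDBuffer[dtype, rank](out_dev.unsafe_ptr(), Index(length))
        storage.append(in_dev)
        storage.append(out_dev)

    # Every GPU's output receives the elementwise reduction of all GPUs'
    # inputs, computed without peer-to-peer access.
    all_reduce_naive(ctxs, in_bufs, out_bufs)
    for i in range(ngpus):
        ctxs[i].synchronize()
```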