Mojo function
all_reduce_naive
```mojo
all_reduce_naive[type: DType, rank: Int, ngpus: Int, //](ctxs: List[DeviceContext], list_of_in_bufs: StaticTuple[NDBuffer[type, rank], ngpus], list_of_out_bufs: StaticTuple[NDBuffer[type, rank], ngpus])
```
Performs all-reduce across GPUs without using peer-to-peer access.
This implementation copies all data to each GPU and performs the reduction locally. It is used as a fallback when P2P access is not available.

Arguments:
- ctxs: List of device contexts for the participating GPUs.
- list_of_in_bufs: Input buffers from each GPU.
- list_of_out_bufs: Output buffers for each GPU.
Parameters:
- type (DType): The data type of tensor elements.
- rank (Int): Number of dimensions in input tensors.
- ngpus (Int): Number of GPUs participating in all-reduce.
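As a usage sketch, the call might be driven from host code along the lines below. The import path for all_reduce_naive, the DeviceContext(device_id=...) construction, and the buffer setup are assumptions inferred from the signature above, not details taken from this page.

```mojo
# Hedged sketch: run all_reduce_naive across two GPUs.
from buffer import NDBuffer
from gpu.host import DeviceBuffer, DeviceContext
from utils import StaticTuple
from utils.index import Index

# Import path for all_reduce_naive is an assumption; use the module this
# page documents.
from gpu.comm.allreduce import all_reduce_naive


def main():
    alias dtype = DType.float32
    alias rank = 1
    alias ngpus = 2
    alias length = 1024

    # One device context per participating GPU (assumes device ids 0..ngpus-1).
    var ctxs = List[DeviceContext]()
    for i in range(ngpus):
        ctxs.append(DeviceContext(device_id=i))

    # Per-GPU input/output views; the device allocations must stay alive for
    # the duration of the collective, so they are kept in `storage`.
    var in_bufs = StaticTuple[NDBuffer[dtype, rank], ngpus]()
    var out_bufs = StaticTuple[NDBuffer[dtype, rank], ngpus]()
    var storage = List[DeviceBuffer[dtype]]()
    for i in range(ngpus):
        var in_dev = ctxs[i].enqueue_create_buffer[dtype](length)
        var out_dev = ctxs[i].enqueue_create_buffer[dtype](length)
        # ... enqueue writes of this GPU's partial values into in_dev ...
        in_bufs[i] = NDBuffer[dtype, rank](in_dev.unsafe_ptr(), Index(length))
        out_bufs[i] = NDBuffer[dtype, rank](out_dev.unsafe_ptr(), Index(length))
        storage.append(in_dev)
        storage.append(out_dev)

    # Every GPU's output receives the elementwise reduction of all GPUs'
    # inputs, computed without peer-to-peer access.
    all_reduce_naive(ctxs, in_bufs, out_bufs)
    for i in range(ngpus):
        ctxs[i].synchronize()
```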