Mojo function

all_reduce_p2p

all_reduce_p2p[type: DType, rank: Int, ngpus: Int, //](ctxs: List[DeviceContext], list_of_in_bufs: StaticTuple[NDBuffer[type, rank], ngpus], list_of_out_bufs: StaticTuple[NDBuffer[type, rank], ngpus], rank_sigs: StaticTuple[UnsafePointer[Signal], 8])

Performs all-reduce using peer-to-peer access between GPUs.

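Arguments:

ctxs (List[DeviceContext]): List of device contexts for the participating GPUs.
list_of_in_bufs (StaticTuple[NDBuffer[type, rank], ngpus]): Input buffers from each GPU.
list_of_out_bufs (StaticTuple[NDBuffer[type, rank], ngpus]): Output buffers for each GPU.
rank_sigs (StaticTuple[UnsafePointer[Signal], 8]): Signal pointers used for synchronization.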

Launches a P2P reduction kernel on each GPU to perform the reduction directly.

Parameters:

type (DType): Data type of tensor elements.
rank (Int): Number of dimensions in tensors.
ngpus (Int): Number of GPUs participating.
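
The data flow described above can be illustrated with a small CPU-only simulation. This is a Python sketch, not the Mojo API: the name all_reduce_p2p_sim is hypothetical, plain lists stand in for NDBuffer, and a sum reduction is assumed (the reference text does not state the reduction operator). Each simulated GPU reads every peer's input buffer and writes the reduced result to its own output buffer. Signal-based synchronization is not modeled.

```python
def all_reduce_p2p_sim(in_bufs):
    """Simulate a peer-to-peer all-reduce on the CPU (hypothetical helper).

    in_bufs: one flat list of numbers per simulated GPU.
    Returns one output list per simulated GPU, each holding the
    element-wise sum over all inputs (sum is an assumption).
    """
    ngpus = len(in_bufs)
    if ngpus == 0:
        return []

    length = len(in_bufs[0])
    if any(len(buf) != length for buf in in_bufs):
        raise ValueError("all input buffers must have the same length")

    out_bufs = []
    for _gpu in range(ngpus):
        # Each simulated GPU reads all peers' inputs directly and
        # reduces them into its own output buffer.
        out = [
            sum(in_bufs[peer][i] for peer in range(ngpus))
            for i in range(length)
        ]
        out_bufs.append(out)
    return out_bufs


if __name__ == "__main__":
    inputs = [[1, 2], [3, 4], [5, 6]]
    for gpu, out in enumerate(all_reduce_p2p_sim(inputs)):
        print(f"gpu {gpu}: {out}")
```

After the call, every output buffer holds the same reduced values, which is the defining property of an all-reduce.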