Mojo struct
TMABarrier
@register_passable(trivial)
struct TMABarrier
A memory barrier for synchronizing Tensor Memory Accelerator (TMA) operations.
TMABarrier coordinates asynchronous memory operations between threads in a CUDA context. Threads can wait on it to ensure that all outstanding TMA operations have completed before proceeding.
This struct wraps NVIDIA's memory barrier primitives, providing a higher-level interface for common synchronization patterns in GPU tensor operations.
Fields
- mbar (UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3)]): Pointer to the shared memory location used for the barrier state.
Implemented traits
AnyType, CollectionElement, Copyable, ExplicitlyCopyable, Movable, UnknownDestructibility
Methods
__init__
__init__() -> Self
Initialize a TMABarrier with a new stack-allocated barrier.
Allocates an 8-byte aligned memory location in shared memory for the barrier state, following NVIDIA's PTX documentation requirements.
__init__(addr: UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3), alignment=8]) -> Self
Initialize a TMABarrier with an existing shared memory location.
Args:
- addr (UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3), alignment=8]): Pointer to an 8-byte aligned shared memory location to use for the barrier state.
init
init(self, num_threads: SIMD[int32, 1] = 1)
Initialize the barrier state with the expected number of threads.
Sets up the barrier to expect arrivals from the specified number of threads before it can be satisfied.
Args:
- num_threads (SIMD[int32, 1]): Number of threads that must arrive at the barrier before it is satisfied. Defaults to 1.
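The arrival counting described above can be pictured with a small host-side model. This is a sketch in plain Python, not Mojo GPU code: the class ArrivalBarrierModel and its satisfied() helper are hypothetical names introduced only for illustration, and the model assumes that each arrival reduces the number of arrivals still outstanding.

```python
class ArrivalBarrierModel:
    """Host-side toy model of arrival counting; not the Mojo API."""

    def __init__(self):
        self.expected = 0  # arrivals required before the barrier is satisfied
        self.pending = 0   # arrivals still outstanding

    def init(self, num_threads=1):
        # Mirrors init(num_threads); the default of 1 matches the documented default.
        self.expected = num_threads
        self.pending = num_threads

    def arrive(self):
        # Assumption: one arrival reduces the outstanding count by one.
        self.pending -= 1

    def satisfied(self):
        # Hypothetical helper: True once every expected arrival has happened.
        return self.pending == 0


bar = ArrivalBarrierModel()
bar.init(num_threads=4)
for _ in range(3):
    bar.arrive()
print(bar.satisfied())  # three of four arrivals: not yet satisfied
bar.arrive()
print(bar.satisfied())  # the fourth arrival satisfies the barrier
```

With the default num_threads of 1, a single arrival is enough to satisfy the modeled barrier.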
expect_bytes
expect_bytes(self, bytes: SIMD[int32, 1])
Configure the barrier to expect a specific number of bytes to be transferred.
Used with TMA operations to indicate the expected size of data transfer. The barrier will be satisfied when the specified number of bytes has been transferred.
Args:
- bytes (SIMD[int32, 1]): Number of bytes expected to be transferred.
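A common question is what value to pass as bytes. The sketch below is a host-side Python model, not Mojo GPU code, that computes the size of a tile and tracks how many announced bytes are still in flight. The class ByteBarrierModel, the complete_bytes method (a stand-in for the hardware reporting finished transfers), and the 64 x 64 tile of 2-byte elements are all assumptions made for illustration. The model tracks only bytes and ignores thread arrivals.

```python
class ByteBarrierModel:
    """Host-side toy model of byte accounting; not the Mojo API."""

    def __init__(self):
        self.armed = False      # True once expect_bytes has been called
        self.pending_bytes = 0  # bytes announced but not yet transferred

    def expect_bytes(self, nbytes):
        # Mirrors expect_bytes(bytes): announce how much data is in flight.
        self.armed = True
        self.pending_bytes += nbytes

    def complete_bytes(self, nbytes):
        # Hypothetical stand-in for the hardware reporting transferred bytes.
        self.pending_bytes -= nbytes

    def satisfied(self):
        # Hypothetical helper: True once every announced byte has arrived.
        return self.armed and self.pending_bytes == 0


# Assumed tile: 64 x 64 elements of 2 bytes each (for example float16).
tile_bytes = 64 * 64 * 2

bar = ByteBarrierModel()
bar.expect_bytes(tile_bytes)
bar.complete_bytes(tile_bytes // 2)
print(bar.satisfied())  # half the tile is still in flight
bar.complete_bytes(tile_bytes // 2)
print(bar.satisfied())  # the whole tile has landed
```

The point of the model is that the byte count passed to expect_bytes should match the total size of the transfers issued against the barrier. If it is larger than what actually arrives, the modeled barrier is never satisfied.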
wait
wait(self, phase: SIMD[uint32, 1] = 0)
Wait until the barrier is satisfied.
Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing.
Note: This method minimizes thread divergence during synchronization.
Args:
- phase (SIMD[uint32, 1]): The phase value to check against. Defaults to 0.
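When a barrier is reused across loop iterations, the phase argument is typically alternated between 0 and 1. The sketch below is a host-side Python model, not Mojo GPU code. It assumes the parity semantics of the underlying PTX mbarrier wait, where a wait on a given parity returns once the phase with that parity has completed. That this wrapper behaves the same way is an assumption. PhaseBarrierModel and would_wait_return are hypothetical names, and the helper only reports whether a real wait would return instead of blocking.

```python
class PhaseBarrierModel:
    """Host-side toy model of phase parity; not the Mojo API."""

    def __init__(self, num_threads=1):
        self.expected = num_threads
        self.pending = num_threads
        self.phase = 0  # parity of the phase currently in progress

    def arrive(self):
        self.pending -= 1
        if self.pending == 0:
            # Phase complete: flip the parity and re-arm for the next phase.
            self.phase ^= 1
            self.pending = self.expected

    def would_wait_return(self, phase):
        # Assumed semantics: wait(phase) returns once the phase with this
        # parity has completed, so the barrier is now on the other parity.
        return self.phase != phase


bar = PhaseBarrierModel(num_threads=2)
phase = 0
log = []
for step in range(3):
    bar.arrive()
    log.append(bar.would_wait_return(phase))  # one arrival still missing
    bar.arrive()
    log.append(bar.would_wait_return(phase))  # the phase has completed
    phase ^= 1  # the caller flips the parity it will wait on next
print(log)
```

In this model, a caller that forgets to flip phase between iterations would see the wait return immediately on the next iteration, before the new phase has completed.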
arrive_cluster
arrive_cluster(self, cta_id: SIMD[uint32, 1], count: SIMD[uint32, 1] = 1)
Signal arrival at the barrier from a specific CTA (Cooperative Thread Array) in a cluster.
This method is used in multi-CTA scenarios to coordinate barrier arrivals across different CTAs within a cluster.
Args:
- cta_id (SIMD[uint32, 1]): The ID of the CTA (Cooperative Thread Array) that is arriving.
- count (SIMD[uint32, 1]): The number of arrivals to signal. Defaults to 1.
arrive
arrive(self) -> Int
Signal arrival at the barrier and return the arrival count.
This method increments the arrival count at the barrier and returns the updated count.
Returns:
The updated arrival count after this thread's arrival.