Mojo struct

TMABarrier

@register_passable(trivial) struct TMABarrier

A memory barrier for synchronizing Tensor Memory Accelerator (TMA) operations.

TMABarrier provides a mechanism for coordinating asynchronous memory operations between threads in a CUDA context. It implements a memory barrier that can be used to ensure all TMA operations have completed before proceeding.

This struct wraps NVIDIA's memory barrier primitives, providing a higher-level interface for common synchronization patterns in GPU tensor operations.

Fields

mbar (UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3)]): Pointer to shared memory location used for the barrier state.

Implemented traits

AnyType, CollectionElement, Copyable, ExplicitlyCopyable, Movable, UnknownDestructibility

Methods

`__init__`

__init__() -> Self

Initialize a TMABarrier with a new stack-allocated barrier.

Allocates an 8-byte aligned memory location in shared memory for the barrier state, following NVIDIA's PTX documentation requirements.

__init__(addr: UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3), alignment=8]) -> Self

Initialize a TMABarrier with an existing shared memory location.

Args:

addr (UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3), alignment=8]): Pointer to an 8-byte aligned shared memory location to use for the barrier state.

`init`

init(self, num_threads: SIMD[int32, 1] = 1)

Initialize the barrier state with the expected number of threads.

Sets up the barrier to expect arrivals from the specified number of threads before it can be satisfied.

Args:

num_threads (SIMD[int32, 1]): Number of threads that must arrive at the barrier before it is satisfied. Defaults to 1.
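
The setup step might look like the sketch below. It assumes the struct can be imported from `layout.tma_async` (an assumption, not stated on this page) and that the code runs inside a GPU kernel. `make_barrier`, `is_leader`, and `num_arrivals` are hypothetical names used only for illustration.

```mojo
# Assumed import path; adjust to wherever TMABarrier lives in your build.
from layout.tma_async import TMABarrier


fn make_barrier(is_leader: Bool, num_arrivals: Int32) -> TMABarrier:
    """Hypothetical helper: create a barrier and initialize it once."""
    # Every thread constructs the handle; the no-argument constructor
    # allocates the 8-byte barrier state in shared memory.
    var barrier = TMABarrier()

    # Only one thread should write the initial state, so the caller
    # decides which thread is the leader (for example, thread 0).
    if is_leader:
        barrier.init(num_arrivals)

    return barrier
```

The other threads in the block presumably need a block-level synchronization after this call and before they touch the barrier, so that they observe the initialized state. That step is left out of the sketch.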

`expect_bytes`

expect_bytes(self, bytes: SIMD[int32, 1])

Configure the barrier to expect a specific number of bytes to be transferred.

Used with TMA operations to indicate the expected size of data transfer. The barrier will be satisfied when the specified number of bytes has been transferred.

Args:

bytes (SIMD[int32, 1]): Number of bytes expected to be transferred.

`wait`

wait(self, phase: SIMD[uint32, 1] = 0)

Wait until the barrier is satisfied.

Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing.

Note: Minimizes thread divergence during synchronization.

Args:

phase (SIMD[uint32, 1]): The phase value to check against. Defaults to 0.
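
The `phase` argument matters once a barrier is reused. The sketch below assumes the phase parity alternates between 0 and 1 each time the barrier completes, as it does for the underlying PTX `mbarrier` objects, and that the barrier was already initialized for a single arriving thread. The import path, `wait_for_tiles`, `tile_bytes`, `num_tiles`, and `is_leader` are assumptions made for illustration, and the TMA copy itself is elided.

```mojo
# Assumed import path; adjust to wherever TMABarrier lives in your build.
from layout.tma_async import TMABarrier


fn wait_for_tiles(
    barrier: TMABarrier, tile_bytes: Int32, num_tiles: Int, is_leader: Bool
):
    """Hypothetical helper: reuse one barrier for several TMA loads."""
    # Assumption: a freshly initialized barrier starts in phase 0.
    var phase: UInt32 = 0

    for _ in range(num_tiles):
        if is_leader:
            # Tell the barrier how many bytes the next copy will deliver.
            barrier.expect_bytes(tile_bytes)
            # ... issue the asynchronous TMA copy that signals `barrier` ...

        # All threads block here until the current phase completes.
        barrier.wait(phase)

        # ... consume the tile; synchronize the block before the next
        # copy overwrites it (elided) ...

        # The barrier has moved to its next phase, so flip the parity.
        phase ^= 1
```

Tracking the parity in a local variable keeps each `wait` tied to the transfer it was meant for, so a thread cannot mistake an earlier completion for the current one.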

`arrive_cluster`

arrive_cluster(self, cta_id: SIMD[uint32, 1], count: SIMD[uint32, 1] = 1)

Signal arrival at the barrier from a specific CTA (Cooperative Thread Array) in a cluster.

This method is used in multi-CTA scenarios to coordinate barrier arrivals across different CTAs within a cluster.

Args:

cta_id (SIMD[uint32, 1]): The ID of the CTA (Cooperative Thread Array) that is arriving.

count (SIMD[uint32, 1]): The number of arrivals to signal. Defaults to 1.

`arrive`

arrive(self) -> Int

Signal arrival at the barrier and return the arrival count.

This method increments the arrival count at the barrier and returns the updated count.

Returns:

The updated arrival count after this thread's arrival.
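
Without any TMA transfer involved, `arrive` and `wait` can be paired into a plain thread rendezvous. The sketch below assumes one thread has already called `init` with the number of participating threads, and that the caller tracks the current phase. The import path and the name `rendezvous` are hypothetical.

```mojo
# Assumed import path; adjust to wherever TMABarrier lives in your build.
from layout.tma_async import TMABarrier


fn rendezvous(barrier: TMABarrier, phase: UInt32):
    """Hypothetical helper: arrive, then wait for the rest of the group."""
    # Record this thread's arrival; the returned count is not needed here.
    _ = barrier.arrive()

    # Block until every expected thread has arrived in this phase.
    barrier.wait(phase)
```

Splitting arrival from waiting also lets a thread do unrelated work between the two calls, which a single combined synchronization call would not allow.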
