Mojo struct

DeviceContext

@register_passable struct DeviceContext

Represents a single stream of execution on a particular accelerator (GPU). A DeviceContext serves as the low-level interface to the accelerator inside a MAX custom operation. It provides methods for allocating buffers on the device, copying data between host and device, and compiling and running functions (also known as kernels) on the device.

The device context can be used as a context manager. For example:

from gpu.host import DeviceContext
from gpu import thread_idx

fn kernel():
    print("hello from thread:", thread_idx.x, thread_idx.y, thread_idx.z)

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
    ctx.synchronize()

A custom operation receives an opaque DeviceContextPtr, which provides a get_device_context() method to retrieve the device context:

from runtime.asyncrt import DeviceContextPtr

@register("custom_op")
struct CustomOp:
    @staticmethod
    fn execute(ctx_ptr: DeviceContextPtr) raises:
        var ctx = ctx_ptr.get_device_context()
        ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
        ctx.synchronize()

Aliases

  • device_info = from_name[::StringLiteral](): gpu.info.Info object for the default accelerator.
  • device_api = from_name[::StringLiteral]().api: Device API for the default accelerator (for example, "cuda" or "hip").

Implemented traits

AnyType, CollectionElement, Copyable, Movable, UnknownDestructibility

Methods

__init__

__init__(out self, device_id: Int = 0, *, api: String = String(from_name[::StringLiteral]()), buffer_cache_size: UInt = UInt(0))

Constructs a DeviceContext for the specified device.

Args:

  • device_id (Int): ID of the accelerator device. If not specified, uses the default accelerator.
  • api (String): Device API, for example, "cuda" for an NVIDIA GPU, or "gpu" for the currently available accelerator.
  • buffer_cache_size (UInt): Amount of space to pre-allocate for device buffers, in bytes.

__copyinit__

__copyinit__(existing: Self) -> Self

Copy the DeviceContext.

__del__

__del__(owned self)

copy

copy(self) -> Self

Explicitly construct a copy of self.

Returns:

A copy of this value.

__enter__

__enter__(owned self) -> Self

name

name(self) -> String

Returns the device name, an ASCII string identifying this device, defined by the native device API.

api

api(self) -> String

Returns the name of the API used to program the device.

Possible values are:

  • "cpu": Generic host device (CPU).
  • "cuda": NVIDIA GPUs.
  • "hip": AMD GPUs.
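
As a minimal sketch of querying these two properties, assuming a supported accelerator is present and that the calls are made from a raising function such as def main():

```mojo
from gpu.host import DeviceContext


def main():
    with DeviceContext() as ctx:
        # name() is defined by the native device API; api() returns one of
        # the strings listed above.
        print("device:", ctx.name())
        print("api:", ctx.api())
```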

malloc_host

malloc_host[type: AnyType](self, size: Int) -> UnsafePointer[type]

Allocates a block of pinned memory on the host.

Pinned memory is guaranteed to remain resident in the host's RAM; it is never paged or swapped out to disk. Memory allocated normally (for example, using UnsafePointer.alloc()) is pageable: individual pages of memory can be moved to secondary storage (disk/SSD) when main memory fills up.

Using pinned memory allows devices to make fast transfers between host memory and device memory, because they can use direct memory access (DMA) to transfer data without relying on the CPU.

Allocating too much pinned memory can cause performance issues, since it reduces the amount of memory available for other processes.

Parameters:

  • type (AnyType): The data type to be stored in the allocated memory.

Args:

  • size (Int): The number of elements of type to allocate memory for.

Returns:

A pointer to the newly-allocated memory.

free_host

free_host[type: AnyType](self, ptr: UnsafePointer[type])

Frees a previously-allocated block of pinned memory.

Parameters:

  • type (AnyType): The data type stored in the allocated memory.

Args:

  • ptr (UnsafePointer[type]): Pointer to the data block to free.

enqueue_create_buffer

enqueue_create_buffer[type: DType](self, size: Int) -> DeviceBuffer[type]

Enqueues a buffer creation using the DeviceBuffer constructor.

For GPU devices, the space is allocated in the device's global memory.

Parameters:

  • type (DType): The data type to be stored in the allocated memory.

Args:

  • size (Int): The number of elements of type to allocate memory for.

Returns:

The allocated buffer.

create_buffer_sync

create_buffer_sync[type: DType](self, size: Int) -> DeviceBuffer[type]

Creates a buffer synchronously using the DeviceBuffer constructor.

Parameters:

  • type (DType): The data type to be stored in the allocated memory.

Args:

  • size (Int): The number of elements of type to allocate memory for.

Returns:

The allocated buffer.

enqueue_create_host_buffer

enqueue_create_host_buffer[type: DType](self, size: Int) -> DeviceBuffer[type]

Enqueues the creation of a host memory DeviceBuffer.

compile_function

compile_function[func_type: AnyTrivialRegType, //, func: $0, *, dump_asm: Variant[Bool, Path, fn() capturing -> Path] = __init__[::CollectionElement](False), dump_llvm: Variant[Bool, Path, fn() capturing -> Path] = __init__[::CollectionElement](False)](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, target=from_name[::StringLiteral]().target[::Int]()])

Compiles the provided function for execution on this device.

Parameters:

  • func_type (AnyTrivialRegType): Type of the function.
  • func ($0): The function to compile.
  • dump_asm (Variant[Bool, Path, fn() capturing -> Path]): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
  • dump_llvm (Variant[Bool, Path, fn() capturing -> Path]): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.

Args:

  • func_attribute (OptionalReg[FuncAttribute]): An attribute to use when compiling the code (such as maximum shared memory size).

Returns:

The compiled function.

enqueue_function

enqueue_function[func_type: AnyTrivialRegType, //, func: $0, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, fn() capturing -> Path] = __init__[::CollectionElement](False), dump_llvm: Variant[Bool, Path, fn() capturing -> Path] = __init__[::CollectionElement](False)](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List(), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

If you reuse the same function and parameters multiple times, passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. To remove that overhead, compile the function first:

with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

  • func_type (AnyTrivialRegType): The type of the function to launch.
  • func ($0): The function to launch.
  • *Ts (AnyType): The types of the arguments being passed to the function.
  • dump_asm (Variant[Bool, Path, fn() capturing -> Path]): Pass True or a Path to dump the assembly.
  • dump_llvm (Variant[Bool, Path, fn() capturing -> Path]): Pass True or a Path to dump the LLVM IR.

enqueue_function[*Ts: AnyType](self, f: DeviceFunction[func, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List())

Enqueues a compiled function for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

If you reuse the same function and parameters multiple times, passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. To remove that overhead, compile the function first:

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

  • *Ts (AnyType): Argument types.

Args:

  • f (DeviceFunction[func, target=target, _ptxas_info_verbose=_ptxas_info_verbose]): The compiled function to execute.
  • *args (*Ts): Arguments to pass to the function.
  • grid_dim (Dim): Dimensions of the compute grid, made up of thread blocks.
  • block_dim (Dim): Dimensions of each thread block in the grid.
  • cluster_dim (OptionalReg[Dim]): Dimensions of clusters (if the thread blocks are grouped into clusters).
  • shared_mem_bytes (OptionalReg[Int]): Amount of shared memory per thread block.
  • attributes (List[LaunchAttribute]): Launch attributes.
  • constant_memory (List[ConstantMemoryMapping]): Constant memory mapping.

execution_time

execution_time[: origin.set, //, func: fn(DeviceContext) raises capturing -> None](self, num_iters: Int) -> Int

execution_time_iter

execution_time_iter[: origin.set, //, func: fn(DeviceContext, Int) raises capturing -> None](self, num_iters: Int) -> Int

enqueue_copy_to_device

enqueue_copy_to_device[type: DType](self, dst_buf: DeviceBuffer[type], src_ptr: UnsafePointer[SIMD[type, 1]])

Enqueues an async copy from the host to the provided device buffer. The number of bytes copied is determined by the size of the device buffer.

Parameters:

  • type (DType): Type of the data being copied.

Args:

  • dst_buf (DeviceBuffer[type]): Device buffer to copy to.
  • src_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy from.

enqueue_copy_from_device

enqueue_copy_from_device[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: DeviceBuffer[type])

Enqueues an async copy from the device to the host. The number of bytes copied is determined by the size of the device buffer.

Parameters:

  • type (DType): Type of the data being copied.

Args:

  • dst_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy to.
  • src_buf (DeviceBuffer[type]): Device buffer to copy from.

enqueue_copy_from_device[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_ptr: UnsafePointer[SIMD[type, 1]], size: Int)

Enqueues an async copy of size elements from the device pointer to the host pointer.

Parameters:

  • type (DType): Type of the data being copied.

Args:

  • dst_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy to.
  • src_ptr (UnsafePointer[SIMD[type, 1]]): Device pointer to copy from.
  • size (Int): Number of elements (of the specified DType) to copy.
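
The asynchronous copy methods above can be combined into a host-to-device-to-host round trip. The following is a minimal sketch, not a definitive pattern: it assumes a supported accelerator is present, runs inside def main() so that errors can propagate, and uses a hypothetical element count named length.

```mojo
from gpu.host import DeviceContext


def main():
    alias length = 8  # hypothetical element count, chosen for illustration

    with DeviceContext() as ctx:
        # Pinned host memory for the source and destination of the copies.
        var src = ctx.malloc_host[Float32](length)
        var dst = ctx.malloc_host[Float32](length)
        for i in range(length):
            src[i] = i
            dst[i] = 0

        # Device buffer with room for `length` float32 elements.
        var dev_buf = ctx.enqueue_create_buffer[DType.float32](length)

        # Both copies are only enqueued here; their size is taken from the
        # size of the device buffer.
        ctx.enqueue_copy_to_device(dev_buf, src)
        ctx.enqueue_copy_from_device(dst, dev_buf)

        # Block until the enqueued work has completed before reading dst.
        ctx.synchronize()

        for i in range(length):
            print(dst[i])

        ctx.free_host(src)
        ctx.free_host(dst)
```

The call to synchronize() is needed here because the host reads dst directly; inside a custom operation this is normally not required.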

enqueue_copy_device_to_device

enqueue_copy_device_to_device[type: DType](self, dst_buf: DeviceBuffer[type], src_buf: DeviceBuffer[type])

Enqueues an async copy from one device buffer to another. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

  • type (DType): Type of the data being copied.

Args:

  • dst_buf (DeviceBuffer[type]): Device buffer to copy to.
  • src_buf (DeviceBuffer[type]): Device buffer to copy from. Must be at least as large as dst_buf.

enqueue_copy_device_to_device[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_ptr: UnsafePointer[SIMD[type, 1]], size: Int)

Enqueues an async copy of size elements from a device pointer to another device pointer.

Parameters:

  • type (DType): Type of the data being copied.

Args:

  • dst_ptr (UnsafePointer[SIMD[type, 1]]): Device pointer to copy to.
  • src_ptr (UnsafePointer[SIMD[type, 1]]): Device pointer to copy from.
  • size (Int): Number of elements (of the specified DType) to copy.

copy_to_device_sync

copy_to_device_sync[type: DType](self, dst_buf: DeviceBuffer[type], src_ptr: UnsafePointer[SIMD[type, 1]])

Copies data from the host to the provided device buffer. The number of bytes copied is determined by the size of the device buffer.

Parameters:

  • type (DType): Type of the data being copied.

Args:

  • dst_buf (DeviceBuffer[type]): Device buffer to copy to.
  • src_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy from.

copy_from_device_sync

copy_from_device_sync[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: DeviceBuffer[type])

Copies data from the device to the host. The number of bytes copied is determined by the size of the device buffer.

Parameters:

  • type (DType): Type of the data being copied.

Args:

  • dst_ptr (UnsafePointer[SIMD[type, 1]]): Host pointer to copy to.
  • src_buf (DeviceBuffer[type]): Device buffer to copy from.

copy_device_to_device_sync

copy_device_to_device_sync[type: DType](self, dst_buf: DeviceBuffer[type], src_buf: DeviceBuffer[type])

Copies data from one device buffer to another. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

  • type (DType): Type of the data being copied.

Args:

  • dst_buf (DeviceBuffer[type]): Device buffer to copy to.
  • src_buf (DeviceBuffer[type]): Device buffer to copy from. Must be at least as large as dst_buf.

enqueue_memset

enqueue_memset[type: DType](self, dst: DeviceBuffer[type], val: SIMD[type, 1])

Enqueues an async memset operation, setting all of the elements in the destination device buffer to the specified value.

Parameters:

  • type (DType): Type of the data stored in the buffer.

Args:

  • dst (DeviceBuffer[type]): Destination buffer.
  • val (SIMD[type, 1]): Value to set all elements of dst to.

memset_sync

memset_sync[type: DType](self, dst: DeviceBuffer[type], val: SIMD[type, 1])

Synchronously sets all of the elements in the destination device buffer to the specified value.

Parameters:

  • type (DType): Type of the data stored in the buffer.

Args:

  • dst (DeviceBuffer[type]): The destination buffer.
  • val (SIMD[type, 1]): Value to set all elements of dst to.
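
For comparison with the enqueue_ variants, the synchronous methods can be chained without an explicit synchronize() call. This is a minimal sketch under the assumptions that a supported accelerator is present and that the code runs inside def main(); the element count length and the fill value are hypothetical.

```mojo
from gpu.host import DeviceContext


def main():
    alias length = 4  # hypothetical element count, chosen for illustration

    with DeviceContext() as ctx:
        # Allocate the device buffer and fill it, both synchronously.
        var dev_buf = ctx.create_buffer_sync[DType.float32](length)
        ctx.memset_sync(dev_buf, Float32(1))

        # Copy the contents back to pinned host memory; the call returns
        # only once the copy has finished.
        var host = ctx.malloc_host[Float32](length)
        ctx.copy_from_device_sync(host, dev_buf)

        for i in range(length):
            print(host[i])

        ctx.free_host(host)
```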

memset

memset[type: DType](self, dst: DeviceBuffer[type], val: SIMD[type, 1])

Enqueues an async memset operation, setting all of the elements in the destination device buffer to the specified value.

Parameters:

  • type (DType): Type of the data stored in the buffer.

Args:

  • dst (DeviceBuffer[type]): Destination buffer.
  • val (SIMD[type, 1]): Value to set all elements of dst to.

synchronize

synchronize(self)

Blocks until all asynchronous calls on the stream associated with this device context have completed.

This should never be necessary when writing a custom operation.

enqueue_wait_for

enqueue_wait_for(self, other: Self)

Enqueues a wait for other to be processed.

get_driver_version

get_driver_version(self) -> Int

Returns the driver version associated with this device.

get_attribute

get_attribute(self, attr: DeviceAttribute) -> Int

Returns the specified attribute for this device.

Args:

  • attr (DeviceAttribute): The device attribute to query.

Returns:

The value for attr on this device.

is_compatible

is_compatible(self)

Returns True if this device is compatible with MAX.

id

id(self) -> SIMD[int64, 1]

Returns the ID associated with this device.

get_memory_info

get_memory_info(self) -> Tuple[UInt, UInt]

Returns the free and total memory size for this device.

Returns:

A tuple of (free memory, total memory) in bytes.
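
A minimal sketch of reading the returned tuple, assuming a supported accelerator is present and that the call is made inside def main():

```mojo
from gpu.host import DeviceContext


def main():
    with DeviceContext() as ctx:
        var info = ctx.get_memory_info()
        # Element 0 is the free memory and element 1 the total, in bytes.
        print("free bytes: ", info[0])
        print("total bytes:", info[1])
```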

can_access

can_access(self, peer: Self) -> Bool

Returns True if this device can access the identified peer device.

Args:

  • peer (Self): The peer device.

enable_peer_access

enable_peer_access(self, peer: Self)

Enables access to the peer device.

Args:

  • peer (Self): The peer device.

number_of_devices

static number_of_devices(*, api: String = String(from_name[::StringLiteral]())) -> Int

Returns the number of devices available that support the specified API.

Args:

  • api (String): Requested device API (for example, "cuda" or "hip").
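
The device count can be used to guard multi-device setup. The sketch below is illustrative only: it assumes device IDs 0 and 1 refer to the first two devices of the default API, and it runs inside def main() so that any errors propagate.

```mojo
from gpu.host import DeviceContext


def main():
    # Count the devices that support the default API.
    var count = DeviceContext.number_of_devices()
    print("devices found:", count)

    if count >= 2:
        var ctx0 = DeviceContext(0)
        var ctx1 = DeviceContext(1)
        # Enable peer access only if device 0 can reach device 1.
        if ctx0.can_access(ctx1):
            ctx0.enable_peer_access(ctx1)
            print("peer access enabled from device 0 to device 1")
        else:
            print("device 0 cannot access device 1")
```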

map_to_host

map_to_host[type: DType](self, buf: DeviceBuffer[type]) -> _HostMappedBuffer[type]

Allows for temporary access to the device buffer by the host from within a with statement.

var in_dev = ctx.enqueue_create_buffer[DType.float32](length)
var out_dev = ctx.enqueue_create_buffer[DType.float32](length)

# Initialize the input and output with known values.
with ctx.map_to_host(in_dev) as in_host, ctx.map_to_host(out_dev) as out_host:
    for i in range(length):
        in_host[i] = i
        out_host[i] = 255

Values modified inside the with statement are updated on the device when the with statement exits.