Mojo struct

DeviceContext

@register_passable struct DeviceContext

Represents a single stream of execution on a particular accelerator (GPU).

A DeviceContext serves as the low-level interface to the accelerator inside a MAX custom operation. It provides methods for allocating buffers on the device, copying data between host and device, and compiling and running functions (also known as kernels) on the device.

The device context can be used as a context manager. For example:

from gpu.host import DeviceContext
from gpu import thread_idx

fn kernel():
    print("hello from thread:", thread_idx.x, thread_idx.y, thread_idx.z)

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
    ctx.synchronize()

A custom operation receives an opaque DeviceContextPtr, which provides a get_device_context() method to retrieve the device context:

from runtime.asyncrt import DeviceContextPtr

@register("custom_op")
struct CustomOp:
    @staticmethod
    fn execute(ctx_ptr: DeviceContextPtr) raises:
        var ctx = ctx_ptr.get_device_context()
        ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
        ctx.synchronize()

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, Movable, UnknownDestructibility

Aliases

`__copyinit__is_trivial`

alias __copyinit__is_trivial = False

`__del__is_trivial`

alias __del__is_trivial = False

`__moveinit__is_trivial`

alias __moveinit__is_trivial = True

`default_device_info`

alias default_device_info = GPUInfo.from_name[_accelerator_arch()]()

GPUInfo object for the default accelerator.
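As a minimal sketch, the alias can be read without constructing a context and compared with what a live context reports. This assumes the program is compiled for a target with an accelerator; the `api` field is taken from the `__init__` default shown on this page, and the `def main()` wrapper is only there to make the snippet self-contained.

```mojo
from gpu.host import DeviceContext

def main():
    # Default API recorded for the target accelerator, read from the alias.
    print("default API:", DeviceContext.default_device_info.api)

    # API reported by an actual context for device 0.
    var ctx = DeviceContext()
    print("runtime API:", ctx.api())
```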

Methods

`__init__`

__init__(out self, device_id: Int = 0, *, var api: String = GPUInfo.from_name[_accelerator_arch()]().api)

Constructs a DeviceContext for the specified device.

This initializer creates a new device context for the specified accelerator device. The device context provides an interface for interacting with the GPU, including memory allocation, data transfer, and kernel execution.

Example:

from gpu.host import DeviceContext

# Create a context for the default GPU
var ctx = DeviceContext()

# Create a context for a specific GPU (device 1)
var ctx2 = DeviceContext(1)

Args:

device_id (Int): ID of the accelerator device. If not specified, uses the default accelerator (device 0).
api (String): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the current target accelerator.

Raises:

If device initialization fails or the specified device is not available.

`__copyinit__`

__copyinit__(existing: Self) -> Self

Creates a copy of an existing device context by incrementing its reference count.

This copy constructor creates a new reference to the same underlying device context by incrementing the reference count of the native context object. Both the original and the copy will refer to the same device context.

Args:

existing (Self): The device context to copy.
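The reference-counted copy described above can be sketched as follows. It relies on DeviceContext being ImplicitlyCopyable, as listed under the implemented traits; the variable names are illustrative only.

```mojo
from gpu.host import DeviceContext

def main():
    var ctx = DeviceContext()

    # Implicit copy: no new device context is created. The copy holds
    # another reference to the same native context object.
    var ctx_copy = ctx

    # Both handles query the same underlying device.
    print("original:", ctx.name())
    print("copy:    ", ctx_copy.name())
```

Because both values refer to the same native context, the underlying resources are released only after the last of them is destroyed.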

`__del__`

__del__(var self)

Releases resources associated with this device context.

This destructor decrements the reference count of the native device context. When the reference count reaches zero, the underlying resources are released, including any cached memory buffers and compiled device functions.

`__enter__`

__enter__(var self) -> Self

Enables the use of DeviceContext as a context manager in a 'with' statement.

This method allows DeviceContext to be used with Python-style context managers, which ensures proper resource management and cleanup when the context exits.

Example:

from gpu.host import DeviceContext

# Using DeviceContext as a context manager
with DeviceContext() as ctx:
    # Perform GPU operations here; resources are released
    # automatically when the block exits.
    print("Running on device:", ctx.name())

Returns:

Self: The DeviceContext instance to be used within the context manager block.

`name`

name(self) -> String

Returns the device name, an ASCII string identifying this device, defined by the native device API.

This method queries the underlying GPU device for its name, which typically includes the model and other identifying information. This can be useful for logging, debugging, or making runtime decisions based on the specific GPU hardware.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext()
print("Running on device:", ctx.name())

Returns:

String: A string containing the device name.

`api`

api(self) -> String

Returns the name of the API used to program the device.

This method queries the underlying device context to determine which GPU programming API is being used for the current device. This information is useful for writing code that can adapt to different GPU architectures and programming models.

Possible values are:

"cpu": Generic host device (CPU).
"cuda": NVIDIA GPUs.
"hip": AMD GPUs.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext()
var api_name = ctx.api()
print("Using device API:", api_name)

# Conditionally execute code based on the API
if api_name == "cuda":
    print("Running on NVIDIA GPU")
elif api_name == "hip":
    print("Running on AMD GPU")

Returns:

String: A string identifying the device API.

`enqueue_create_buffer`

enqueue_create_buffer[dtype: DType](self, size: Int) -> DeviceBuffer[dtype]

Enqueues a buffer creation using the DeviceBuffer constructor.

For GPU devices, the space is allocated in the device's global memory.

Parameters:

dtype (DType): The data type to be stored in the allocated memory.

Args:

size (Int): The number of elements of type dtype to allocate memory for.

Returns:

DeviceBuffer: The allocated buffer.

`create_buffer_sync`

create_buffer_sync[dtype: DType](self, size: Int) -> DeviceBuffer[dtype]

Creates a buffer synchronously using the DeviceBuffer constructor.

Parameters:

dtype (DType): The data type to be stored in the allocated memory.

Args:

size (Int): The number of elements of type dtype to allocate memory for.

Returns:

DeviceBuffer: The allocated buffer.
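A minimal sketch contrasting the two allocation methods, `enqueue_create_buffer` and `create_buffer_sync`. The use of `len()` on a DeviceBuffer is an assumption not confirmed on this page; it is used here only to show that the element count matches the requested size.

```mojo
from gpu.host import DeviceContext

def main():
    with DeviceContext() as ctx:
        # Enqueued allocation of 1024 float32 elements.
        var enqueued = ctx.enqueue_create_buffer[DType.float32](1024)

        # Synchronous allocation of the same size.
        var immediate = ctx.create_buffer_sync[DType.float32](1024)

        # Wait for all enqueued work before inspecting the buffers.
        ctx.synchronize()

        # Assumption: DeviceBuffer supports len() and reports elements.
        print(len(enqueued), len(immediate))
```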

`enqueue_create_host_buffer`

enqueue_create_host_buffer[dtype: DType](self, size: Int) -> HostBuffer[dtype]

Enqueues the creation of a HostBuffer.

This function allocates memory on the host that is accessible by the device. The memory is page-locked (pinned) for efficient data transfer between host and device.

Pinned memory is guaranteed to remain resident in the host's RAM and is never paged or swapped out to disk. Memory allocated normally (for example, using UnsafePointer.alloc()) is pageable: individual pages of memory can be moved to secondary storage (disk/SSD) when main memory fills up.

Using pinned memory allows devices to make fast transfers between host memory and device memory, because they can use direct memory access (DMA) to transfer data without relying on the CPU.

Allocating too much pinned memory can cause performance issues, since it reduces the amount of memory available for other processes.

Example:

from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Allocate host memory accessible by the device
    var host_buffer = ctx.enqueue_create_host_buffer[DType.float32](1024)

    # Use the host buffer for device operations
    # ...

Parameters:

dtype (DType): The data type to be stored in the allocated memory.

Args:

size (Int): The number of elements of type dtype to allocate memory for.

Returns:

HostBuffer: A HostBuffer object that wraps the allocated host memory.

Raises:

If memory allocation fails or if the device context is invalid.

`compile_function`

compile_function[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, target: target = GPUInfo.from_name[_accelerator_arch()]().target(), compile_options: StringSlice[StaticConstantOrigin] = CompilationTarget.default_compile_options[target](), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *, func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceFunction[func, None, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose])

Compiles the provided function for execution on this device.

Parameters:

func_type (AnyTrivialRegType): Type of the function.
func (func_type): The function to compile.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
target (target): Use a different target device type than the one associated with this DeviceContext.
compile_options (StringSlice): Change the compile options to different options than the ones associated with this DeviceContext.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Args:

func_attribute (OptionalReg): An attribute to use when compiling the code (such as maximum shared memory size).

Returns:

DeviceFunction: The compiled function.
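As a small sketch of the parameters listed above, a kernel can be compiled once with `dump_asm=True` and then launched through the returned DeviceFunction. The kernel itself is a placeholder, and where the assembly is dumped depends on the toolchain.

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

def main():
    with DeviceContext() as ctx:
        # Compile ahead of time and ask for the generated assembly to be
        # dumped. dump_asm is a compile-time parameter, so it goes inside
        # the square brackets.
        var compiled = ctx.compile_function[kernel, dump_asm=True]()

        # Launch the already-compiled function.
        ctx.enqueue_function(compiled, grid_dim=1, block_dim=1)
        ctx.synchronize()
```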

`compile_function_unchecked`

compile_function_unchecked[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, target: target = GPUInfo.from_name[_accelerator_arch()]().target(), compile_options: StringSlice[StaticConstantOrigin] = CompilationTarget.default_compile_options[target](), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *, func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceFunction[func, None, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose])

Compiles the provided function for execution on this device.

Parameters:

func_type (AnyTrivialRegType): Type of the function.
func (func_type): The function to compile.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
target (target): Use a different target device type than the one associated with this DeviceContext.
compile_options (StringSlice): Change the compile options to different options than the ones associated with this DeviceContext.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Returns:

DeviceFunction: The compiled function.

`compile_function_checked`

compile_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, target: target = GPUInfo.from_name[_accelerator_arch()]().target(), compile_options: StringSlice[StaticConstantOrigin] = CompilationTarget.default_compile_options[target](), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *, func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose])

Compiles the provided function for execution on this device.

Parameters:

func_type (AnyTrivialRegType): Type of the function.
declared_arg_types (Variadic): Types of the arguments to pass to the device function.
func (func_type): The function to compile.
signature_func (fn(*args: *declared_arg_types) -> None): The function to compile, passed in again. Used for checking argument types later. Note: This parameter will be removed in a future version.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
target (target): Use a different target device type than the one associated with this DeviceContext.
compile_options (StringSlice): Change the compile options to different options than the ones associated with this DeviceContext.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Returns:

DeviceFunction: The compiled function.
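A hedged sketch of the checked path, based only on the signatures on this page: the kernel is passed twice, once as `func` and once as `signature_func`, and the result is launched with `enqueue_function_checked`. The kernel is a placeholder with no arguments.

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

def main():
    with DeviceContext() as ctx:
        # The function is given twice: first as the function to compile,
        # then again as the signature used for argument type checking.
        var compiled = ctx.compile_function_checked[kernel, kernel]()

        # Launch through the checked enqueue method.
        ctx.enqueue_function_checked(compiled, grid_dim=1, block_dim=1)
        ctx.synchronize()
```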

compile_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) capturing -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, target: target = GPUInfo.from_name[_accelerator_arch()]().target(), compile_options: StringSlice[StaticConstantOrigin] = CompilationTarget.default_compile_options[target](), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *, func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose])

Compiles the provided function for execution on this device.

Parameters:

func_type (AnyTrivialRegType): Type of the function.
declared_arg_types (Variadic): Types of the arguments to pass to the device function.
func (func_type): The function to compile.
signature_func (fn(*args: *declared_arg_types) capturing -> None): The function to compile, passed in again. Used for checking argument types later. Note: This parameter will be removed in a future version.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
target (target): Use a different target device type than the one associated with this DeviceContext.
compile_options (StringSlice): Change the compile options to different options than the ones associated with this DeviceContext.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Returns:

DeviceFunction: The compiled function.

`compile_function_experimental`

compile_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, target: target = GPUInfo.from_name[_accelerator_arch()]().target(), compile_options: StringSlice[StaticConstantOrigin] = CompilationTarget.default_compile_options[target](), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *, func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose])

Compiles the provided function for execution on this device.

Parameters:

declared_arg_types (Variadic): Types of the arguments to pass to the device function.
func (fn(*args: *declared_arg_types) -> None): The function to compile.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
target (target): Use a different target device type than the one associated with this DeviceContext.
compile_options (StringSlice): Change the compile options to different options than the ones associated with this DeviceContext.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Returns:

DeviceFunction: The compiled function.

compile_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) capturing -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, target: target = GPUInfo.from_name[_accelerator_arch()]().target(), compile_options: StringSlice[StaticConstantOrigin] = CompilationTarget.default_compile_options[target](), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *, func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose])

Compiles the provided function for execution on this device.

Parameters:

declared_arg_types (Variadic): Types of the arguments to pass to the device function.
func (fn(*args: *declared_arg_types) capturing -> None): The function to compile.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
target (target): Use a different target device type than the one associated with this DeviceContext.
compile_options (StringSlice): Change the compile options to different options than the ones associated with this DeviceContext.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Returns:

DeviceFunction: The compiled function.

`load_function`

load_function[func_type: AnyTrivialRegType, //, func: func_type](self, *, function_name: StringSlice[origin], asm: StringSlice[origin], func_attribute: OptionalReg[FuncAttribute] = None, out result: DeviceExternalFunction)

Loads a pre-compiled device function from assembly code.

This method loads an external GPU function from provided assembly code (PTX/SASS) rather than compiling it from Mojo source. This is useful for integrating with existing CUDA/HIP code or for using specialized assembly optimizations.

Example:

from gpu.host import DeviceContext
from gpu.host.device_context import DeviceExternalFunction

fn func_signature(
    # Arguments being passed to the assembly code
    # e.g. two pointers and a length
    input: UnsafePointer[Float32],
    output: UnsafePointer[Float32],
    len: Int,
):
    # No body because that is passed as assembly code below.
    pass

var ctx = DeviceContext()
var ptx_code = "..."  # PTX assembly code
var ext_func = ctx.load_function[func_signature](
    function_name="my_kernel",
    asm=ptx_code,
)

Parameters:

func_type (AnyTrivialRegType): The type of the function to load.
func (func_type): The function reference.

Args:

function_name (StringSlice): The name of the function in the assembly code.
asm (StringSlice): The assembly code (PTX/SASS) containing the function.
func_attribute (OptionalReg): Optional attribute to apply to the function (such as maximum shared memory size).

Returns:

DeviceExternalFunction: The loaded function is stored in the result parameter.

Raises:

If loading the function fails or the assembly code is invalid.

`enqueue_function`

enqueue_function[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, target: target = GPUInfo.from_name[_accelerator_arch()]().target(), compile_options: StringSlice[StaticConstantOrigin] = CompilationTarget.default_compile_options[target](), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), func_attribute: OptionalReg[FuncAttribute] = None, location: OptionalReg[_SourceLocation] = None)

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile it first to remove that overhead:

with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

func_type (AnyTrivialRegType): The type of the function to launch.
func (func_type): The function to launch.
*Ts (AnyType): The types of the arguments being passed to the function.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
target (target): Use a different target device type than the one associated with this DeviceContext.
compile_options (StringSlice): Change the compile options to different options than the ones associated with this DeviceContext.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Args:

*args (*Ts): Variadic arguments which are passed to the func.
grid_dim (Dim): The grid dimensions.
block_dim (Dim): The block dimensions.
cluster_dim (OptionalReg): The cluster dimensions.
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block, shared between the threads of that block.
attributes (List): A List of launch attributes.
constant_memory (List): A List of constant memory mappings.
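func_attribute (OptionalReg): An attribute to use when compiling the code (such as maximum shared memory size); corresponds to the CUfunction_attribute enum.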
location (OptionalReg): Source location for the function call.
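The following sketch shows how positional kernel arguments and launch dimensions fit together. The kernel name and the value passed are illustrative, and `block_idx` is assumed to be importable from `gpu` alongside `thread_idx`.

```mojo
from gpu.host import DeviceContext
from gpu import block_idx, thread_idx

fn tagged_kernel(tag: Int):
    # Every thread prints the host-supplied tag with its own indices.
    print("tag:", tag, "block:", block_idx.x, "thread:", thread_idx.x)

def main():
    with DeviceContext() as ctx:
        # Kernel arguments come first, followed by the keyword-only launch
        # configuration. 2 blocks of 4 threads each run the kernel 8 times.
        ctx.enqueue_function[tagged_kernel](7, grid_dim=2, block_dim=4)

        # Enqueue is asynchronous, so wait for the kernel to finish.
        ctx.synchronize()
```

The order in which the threads print is not guaranteed.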

enqueue_function[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), location: OptionalReg[_SourceLocation] = None)

Enqueues a compiled function for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile the function first to remove that overhead:

from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

*Ts (AnyType): Argument types.

Args:

f (DeviceFunction): The compiled function to execute.
*args (*Ts): Arguments to pass to the function.
grid_dim (Dim): Dimensions of the compute grid, made up of thread blocks.
block_dim (Dim): Dimensions of each thread block in the grid.
cluster_dim (OptionalReg): Dimensions of clusters (if the thread blocks are grouped into clusters).
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block.
attributes (List): Launch attributes.
constant_memory (List): Constant memory mapping.
location (OptionalReg): Source location for the function call.

enqueue_function[*Ts: AnyType](self, f: DeviceExternalFunction, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), location: OptionalReg[_SourceLocation] = None)

Enqueues an external device function for asynchronous execution on the GPU.

This method schedules an external device function to be executed on the GPU with the specified execution configuration. The function and its arguments are passed to the underlying GPU runtime, which will execute them when resources are available.

Example:

from gpu.host import DeviceContext, Dim
from gpu.host.device_context import DeviceExternalFunction

# Create a device context and load an external function
with DeviceContext() as ctx:
    var ext_func = DeviceExternalFunction("my_kernel")

    # Enqueue the external function with execution configuration
    ctx.enqueue_function(
        ext_func,
        grid_dim=Dim(16),
        block_dim=Dim(256)
    )

    # Wait for completion
    ctx.synchronize()

Parameters:

*Ts (AnyType): The types of the arguments to be passed to the device function.

Args:

f (DeviceExternalFunction): The external device function to execute.
*args (*Ts): The arguments to pass to the device function.
grid_dim (Dim): The dimensions of the grid (number of thread blocks).
block_dim (Dim): The dimensions of each thread block (number of threads per block).
cluster_dim (OptionalReg): Optional dimensions for thread block clusters (for newer GPU architectures).
shared_mem_bytes (OptionalReg): Optional amount of dynamic shared memory to allocate per block.
attributes (List): Optional list of launch attributes for fine-grained control.
constant_memory (List): Optional list of constant memory mappings to use during execution.
location (OptionalReg): Source location for the function call.

Raises:

If there's an error enqueuing the function or if the function execution fails.

`enqueue_function_unchecked`

enqueue_function_unchecked[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), func_attribute: OptionalReg[FuncAttribute] = None, location: OptionalReg[_SourceLocation] = None)

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile it first to remove that overhead:

with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

func_type (AnyTrivialRegType): The type of the function to launch.
func (func_type): The function to launch.
*Ts (AnyType): The types of the arguments being passed to the function.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Args:

*args (*Ts): Variadic arguments which are passed to the func.
grid_dim (Dim): The grid dimensions.
block_dim (Dim): The block dimensions.
cluster_dim (OptionalReg): The cluster dimensions.
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block, shared between the threads of that block.
attributes (List): A List of launch attributes.
constant_memory (List): A List of constant memory mappings.
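func_attribute (OptionalReg): An attribute to use when compiling the code (such as maximum shared memory size); corresponds to the CUfunction_attribute enum.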
location (OptionalReg): Source location for the function call.

enqueue_function_unchecked[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), location: OptionalReg[_SourceLocation] = None)

Enqueues a compiled function for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile the function first to remove that overhead:

from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

*Ts (AnyType): Argument types.

Args:

f (DeviceFunction): The compiled function to execute.
*args (*Ts): Arguments to pass to the function.
grid_dim (Dim): Dimensions of the compute grid, made up of thread blocks.
block_dim (Dim): Dimensions of each thread block in the grid.
cluster_dim (OptionalReg): Dimensions of clusters (if the thread blocks are grouped into clusters).
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block.
attributes (List): Launch attributes.
constant_memory (List): Constant memory mapping.
location (OptionalReg): Source location for the function call.

`enqueue_function_checked`

enqueue_function_checked[*Ts: DevicePassable](self, f: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), location: OptionalReg[_SourceLocation] = None)

Enqueues a compiled function for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile the function first to remove that overhead:

from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

*Ts (DevicePassable): Argument types.

Args:

f (DeviceFunction): The compiled function to execute.
*args (*Ts): Arguments to pass to the function.
grid_dim (Dim): Dimensions of the compute grid, made up of thread blocks.
block_dim (Dim): Dimensions of each thread block in the grid.
cluster_dim (OptionalReg): Dimensions of clusters (if the thread blocks are grouped into clusters).
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block.
attributes (List): Launch attributes.
constant_memory (List): Constant memory mapping.
location (OptionalReg): Source location for the function call.

enqueue_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), func_attribute: OptionalReg[FuncAttribute] = None, location: OptionalReg[_SourceLocation] = None)

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile it first to remove that overhead:

with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

func_type (AnyTrivialRegType): The type of the function to launch.
declared_arg_types (Variadic): Types of the arguments to pass to the device function.
func (func_type): The function to compile and launch.
signature_func (fn(*args: *declared_arg_types) -> None): The function to compile and launch, passed in again. Used for checking argument types later. Note: This parameter will be removed in a future version.
*actual_arg_types (DevicePassable): The types of the arguments being passed to the function.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Args:

*args (*actual_arg_types): Variadic arguments which are passed to the func.
grid_dim (Dim): The grid dimensions.
block_dim (Dim): The block dimensions.
cluster_dim (OptionalReg): The cluster dimensions.
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block, in bytes. Shared memory is shared among the threads of a block, not between blocks.
attributes (List): A List of launch attributes.
constant_memory (List): A List of constant memory mappings.
func_attribute (OptionalReg): CUfunction_attribute enum.
location (OptionalReg): Source location for the function call.

enqueue_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) capturing -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), func_attribute: OptionalReg[FuncAttribute] = None, location: OptionalReg[_SourceLocation] = None)

Compiles and enqueues a kernel for execution on this device. This overload takes in a function that's capturing.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile it first to remove that overhead:

with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

func_type (AnyTrivialRegType): The type of the function to launch.
declared_arg_types (Variadic): Types of the arguments to pass to the device function.
func (func_type): The function to compile and launch.
signature_func (fn(*args: *declared_arg_types) capturing -> None): The function to compile and launch, passed in again. Used for checking argument types later. Note: This parameter will be removed in a future version.
*actual_arg_types (DevicePassable): The types of the arguments being passed to the function.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Args:

*args (*actual_arg_types): Variadic arguments which are passed to the func.
grid_dim (Dim): The grid dimensions.
block_dim (Dim): The block dimensions.
cluster_dim (OptionalReg): The cluster dimensions.
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block, in bytes. Shared memory is shared among the threads of a block, not between blocks.
attributes (List): A List of launch attributes.
constant_memory (List): A List of constant memory mappings.
func_attribute (OptionalReg): CUfunction_attribute enum.
location (OptionalReg): Source location for the function call.

`enqueue_function_experimental`

enqueue_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), func_attribute: OptionalReg[FuncAttribute] = None, location: OptionalReg[_SourceLocation] = None)

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile it first to remove that overhead:

with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

declared_arg_types (Variadic): Types of the arguments to pass to the device function.
func (fn(*args: *declared_arg_types) -> None): The function to compile and launch.
*actual_arg_types (DevicePassable): The types of the arguments being passed to the function.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Args:

*args (*actual_arg_types): Variadic arguments which are passed to the func.
grid_dim (Dim): The grid dimensions.
block_dim (Dim): The block dimensions.
cluster_dim (OptionalReg): The cluster dimensions.
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block, in bytes. Shared memory is shared among the threads of a block, not between blocks.
attributes (List): A List of launch attributes.
constant_memory (List): A List of constant memory mappings.
func_attribute (OptionalReg): CUfunction_attribute enum.
location (OptionalReg): Source location for the function call.

enqueue_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) capturing -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), func_attribute: OptionalReg[FuncAttribute] = None, location: OptionalReg[_SourceLocation] = None)

Compiles and enqueues a kernel for execution on this device. This overload takes in a function that's capturing.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile it first to remove that overhead:

with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

declared_arg_types (Variadic): Types of the arguments to pass to the device function.
func (fn(*args: *declared_arg_types) capturing -> None): The function to compile and launch.
*actual_arg_types (DevicePassable): The types of the arguments being passed to the function.
dump_asm (Variant): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
dump_llvm (Variant): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
_dump_sass (Variant): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
_ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).

Args:

*args (*actual_arg_types): Variadic arguments which are passed to the func.
grid_dim (Dim): The grid dimensions.
block_dim (Dim): The block dimensions.
cluster_dim (OptionalReg): The cluster dimensions.
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block, in bytes. Shared memory is shared among the threads of a block, not between blocks.
attributes (List): A List of launch attributes.
constant_memory (List): A List of constant memory mappings.
func_attribute (OptionalReg): CUfunction_attribute enum.
location (OptionalReg): Source location for the function call.

enqueue_function_experimental[func_type: AnyTrivialRegType, //, func: func_type, declared_arg_types: Optional[Variadic[AnyType]], *Ts: DevicePassable](self, f: DeviceFunction[func, declared_arg_types, target=target, compile_options=compile_options, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List[LaunchAttribute](, Tuple[]()), var constant_memory: List[ConstantMemoryMapping] = List[ConstantMemoryMapping](, Tuple[]()), location: OptionalReg[_SourceLocation] = None)

Enqueues a compiled function for execution on this device.

You can pass the function directly to enqueue_function without compiling it first:

from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()

Passing the function directly incurs 50-500 nanoseconds of overhead per enqueue. If you reuse the same function and parameters multiple times, you can compile the function first to remove that overhead:

from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()

Parameters:

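func_type (AnyTrivialRegType): The type of the function that was compiled.
func (func_type): The function that was compiled into f.
declared_arg_types (Optional): The declared argument types of the device function, if available.
*Ts (DevicePassable): Argument types.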

Args:

f (DeviceFunction): The compiled function to execute.
*args (*Ts): Arguments to pass to the function.
grid_dim (Dim): Dimensions of the compute grid, made up of thread blocks.
block_dim (Dim): Dimensions of each thread block in the grid.
cluster_dim (OptionalReg): Dimensions of clusters (if the thread blocks are grouped into clusters).
shared_mem_bytes (OptionalReg): Amount of shared memory per thread block.
attributes (List): Launch attributes.
constant_memory (List): Constant memory mapping.
location (OptionalReg): Source location for the function call.

`execution_time`

execution_time[func: fn(DeviceContext) raises capturing -> None](self, num_iters: Int) -> Int

Measures the execution time of a function that takes a DeviceContext parameter.

This method times the execution of a provided function that requires the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.

Example:

from gpu.host import DeviceContext

fn gpu_operation(ctx: DeviceContext) raises capturing [_] -> None:
    # Perform some GPU operation using ctx
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function that uses the context
    var time_ns = ctx.execution_time[gpu_operation](10)
    print("Execution time for 10 iterations:", time_ns, "ns")

Parameters:

func (fn(DeviceContext) raises capturing -> None): A function that takes a DeviceContext parameter to execute and time.

Args:

num_iters (Int): The number of iterations to run the function.

Returns:

Int: The total elapsed time in nanoseconds for all iterations.

Raises:

If the timer operations fail or if the function raises an exception.

execution_time[func: fn() raises capturing -> None](self, num_iters: Int) -> Int

Measures the execution time of a function over multiple iterations.

This method times the execution of a provided function that doesn't require the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.

Example:

from gpu.host import DeviceContext

fn some_gpu_operation() raises capturing [_] -> None:
    # Perform some GPU operation
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function
    var time_ns = ctx.execution_time[some_gpu_operation](10)
    print("Execution time:", time_ns, "ns")

Parameters:

func (fn() raises capturing -> None): A function with no parameters to execute and time.

Args:

num_iters (Int): The number of iterations to run the function.

Returns:

Int: The total elapsed time in nanoseconds for all iterations.

Raises:

If the timer operations fail or if the function raises an exception.

`push_context`

push_context(self) -> _DeviceContextScope

Returns a context manager that ensures this device's driver context is active.

This method returns a context manager that pushes this device's driver context as the current context on entry and restores the previous context on exit. This is useful for operations that require a specific GPU context to be active, such as cuDNN operations on multi-GPU systems.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext(device_id=1)
# Ensure GPU 1's context is active for these operations.
with ctx.push_context():
    # All GPU operations here will use GPU 1's context.
    ...  # Call external stateful APIs, such as cuDNN.
# The previous context is automatically restored here.

Returns:

_DeviceContextScope: A context manager that manages the driver context stack.

Raises:

If there's an error switching contexts.

`set_as_current`

set_as_current(self)

Sets the current device to the one associated with this DeviceContext. This is intended for use with libraries that require a specific GPU context to be active.

Example:

from gpu.host import DeviceContext
var ctx = DeviceContext(device_id=1)
ctx.set_as_current()

Raises:

If there's an error setting the current device.

`execution_time_iter`

execution_time_iter[func: fn(DeviceContext, Int) raises capturing -> None](self, num_iters: Int) -> Int

Measures the execution time of a function that takes iteration index as input.

This method times the execution of a provided function that requires both the DeviceContext and the current iteration index as parameters. It runs the function for the specified number of iterations, passing the iteration index to each call, and returns the total elapsed time in nanoseconds.

Example:

from gpu.host import DeviceContext, DeviceFunction, Dim

var my_kernel = DeviceFunction(...)

fn benchmark_kernel(ctx: DeviceContext, i: Int) raises capturing [_] -> None:
    # Run the compiled kernel with a grid size that depends on the iteration.
    # Using i + 1 avoids launching an empty grid if the index starts at 0.
    ctx.enqueue_function(my_kernel, grid_dim=Dim(i + 1), block_dim=Dim(256))

with DeviceContext() as ctx:
    # Measure execution time with iteration awareness
    var time_ns = ctx.execution_time_iter[benchmark_kernel](10)
    print("Total execution time:", time_ns, "ns")

Parameters:

func (fn(DeviceContext, Int) raises capturing -> None): A function that takes the DeviceContext and an iteration index.

Args:

num_iters (Int): The number of iterations to run the function.

Returns:

Int: The total elapsed time in nanoseconds for all iterations.

Raises:

If the timer operations fail or if the function raises an exception.

`enqueue_copy`

enqueue_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_ptr: UnsafePointer[Scalar[dtype]])

Enqueues an async copy from the host to the provided device buffer. The number of bytes copied is determined by the size of the device buffer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_buf (DeviceBuffer): Device buffer to copy to.
src_ptr (UnsafePointer): Host pointer to copy from.

enqueue_copy[dtype: DType](self, dst_buf: HostBuffer[dtype], src_ptr: UnsafePointer[Scalar[dtype]])

Enqueues an async copy from a host pointer to the provided host buffer. The number of bytes copied is determined by the size of the host buffer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_buf (HostBuffer): Host buffer to copy to.
src_ptr (UnsafePointer): Host pointer to copy from.

enqueue_copy[dtype: DType](self, dst_ptr: UnsafePointer[Scalar[dtype]], src_buf: DeviceBuffer[dtype])

Enqueues an async copy from the device to the host. The number of bytes copied is determined by the size of the device buffer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_ptr (UnsafePointer): Host pointer to copy to.
src_buf (DeviceBuffer): Device buffer to copy from.

enqueue_copy[dtype: DType](self, dst_ptr: UnsafePointer[Scalar[dtype]], src_buf: HostBuffer[dtype])

Enqueues an async copy from the provided host buffer to a host pointer. The number of bytes copied is determined by the size of the host buffer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_ptr (UnsafePointer): Host pointer to copy to.
src_buf (HostBuffer): Host buffer to copy from.

enqueue_copy[dtype: DType](self, dst_ptr: UnsafePointer[Scalar[dtype]], src_ptr: UnsafePointer[Scalar[dtype]], size: Int)

Enqueues an async copy of size elements from a device pointer to another device pointer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_ptr (UnsafePointer): Device pointer to copy to.
src_ptr (UnsafePointer): Device pointer to copy from.
size (Int): Number of elements (of the specified DType) to copy.

enqueue_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_buf: DeviceBuffer[dtype])

Enqueues an async copy from one device buffer to another. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_buf (DeviceBuffer): Device buffer to copy to.
src_buf (DeviceBuffer): Device buffer to copy from. Must be at least as large as dst_buf.

enqueue_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_buf: HostBuffer[dtype])

Enqueues an async copy from a host buffer to a device buffer. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_buf (DeviceBuffer): Device buffer to copy to.
src_buf (HostBuffer): Host buffer to copy from. Must be at least as large as dst_buf.

enqueue_copy[dtype: DType](self, dst_buf: HostBuffer[dtype], src_buf: DeviceBuffer[dtype])

Enqueues an async copy from a device buffer to a host buffer. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_buf (HostBuffer): Host buffer to copy to.
src_buf (DeviceBuffer): Device buffer to copy from. Must be at least as large as dst_buf.

enqueue_copy[dtype: DType](self, dst_buf: HostBuffer[dtype], src_buf: HostBuffer[dtype])

Enqueues an async copy from one host buffer to another. The amount of data transferred is determined by the size of the destination buffer.

Parameters:

dtype (DType): Type of the data being copied.

Args:

dst_buf (HostBuffer): Host buffer to copy to.
src_buf (HostBuffer): Host buffer to copy from. Must be at least as large as dst_buf.
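
As a rough illustration of the buffer-to-buffer overloads, the sketch below stages values in a host buffer, copies them to a device buffer, and copies them back into a second host buffer. It assumes the enqueue_create_host_buffer() and enqueue_create_buffer() allocation methods on DeviceContext (not described in this section), element indexing on HostBuffer, and an available accelerator. The element count and values are arbitrary.

```mojo
from gpu.host import DeviceContext


def main():
    var n = 16  # arbitrary element count for illustration
    with DeviceContext() as ctx:
        # Assumed allocation helpers (not covered in this section).
        var host_in = ctx.enqueue_create_host_buffer[DType.float32](n)
        var host_out = ctx.enqueue_create_host_buffer[DType.float32](n)
        var dev = ctx.enqueue_create_buffer[DType.float32](n)

        # Wait for the allocations before touching host memory.
        ctx.synchronize()
        for i in range(n):
            host_in[i] = Float32(i)

        # Host buffer -> device buffer, then device buffer -> host buffer.
        ctx.enqueue_copy(dst_buf=dev, src_buf=host_in)
        ctx.enqueue_copy(dst_buf=host_out, src_buf=dev)

        # The copies are asynchronous, so synchronize before reading.
        ctx.synchronize()
        print(host_out[n - 1])
```

Because each copy is only enqueued, the final synchronize() is what makes it safe to read host_out on the host.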

`enqueue_memset`

enqueue_memset[dtype: DType](self, dst: DeviceBuffer[dtype], val: Scalar[dtype])

Enqueues an async memset operation, setting all of the elements in the destination device buffer to the specified value.

Parameters:

dtype (DType): Type of the data stored in the buffer.

Args:

dst (DeviceBuffer): Destination buffer.
val (Scalar): Value to set all elements of dst to.

enqueue_memset[dtype: DType](self, dst: HostBuffer[dtype], val: Scalar[dtype])

Enqueues an async memset operation, setting all of the elements in the destination host buffer to the specified value.

Parameters:

dtype (DType): Type of the data stored in the buffer.

Args:

dst (HostBuffer): Destination buffer.
val (Scalar): Value to set all elements of dst to.
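
A minimal sketch of the device-buffer overload follows. It fills a device buffer with a constant and copies the result back for inspection. The enqueue_create_buffer() and enqueue_create_host_buffer() allocation methods and element indexing on HostBuffer are assumptions not covered in this section, and the size and fill value are arbitrary.

```mojo
from gpu.host import DeviceContext


def main():
    var n = 8  # arbitrary element count for illustration
    with DeviceContext() as ctx:
        # Assumed allocation helpers (not covered in this section).
        var dev = ctx.enqueue_create_buffer[DType.int32](n)
        var host = ctx.enqueue_create_host_buffer[DType.int32](n)

        # Fill every element of the device buffer with the same value.
        ctx.enqueue_memset(dev, Int32(7))

        # Copy the result back and wait for the queued work to finish.
        ctx.enqueue_copy(dst_buf=host, src_buf=dev)
        ctx.synchronize()

        print(host[0], host[n - 1])
```

The memset and the copy are enqueued on the same context, so they are expected to run in order. Only the final synchronize() blocks the host.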

`create_event`

create_event[*, blocking_sync: Bool = False, disable_timing: Bool = True, interprocess: Bool = False](self) -> DeviceEvent

Creates a new event for synchronization between streams.

The default settings provide the best performance by disabling timing and blocking synchronization. To time kernels, use DeviceContext.execution_time() instead: it takes a closure and is functionally equivalent to recording start and end events and then calculating the elapsed time.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext()

var default_stream = ctx.stream()
var new_stream = ctx.create_stream()

# Create an event
var event = ctx.create_event()

# Record the event on the default stream
default_stream.record_event(event)

# Make new_stream wait until the recorded event has completed
new_stream.enqueue_wait_for(event)

# Block the host until the default stream's work is done
default_stream.synchronize()

Parameters:

blocking_sync (Bool): Enable event.synchronize() to block until the event has been recorded. Incurs overhead compared to stream.enqueue_wait_for(event) (default: False).
disable_timing (Bool): Remove timing overhead (default: True).
interprocess (Bool): Enable interprocess synchronization; currently unimplemented (default: False).

Returns:

DeviceEvent: A DeviceEvent that can be used for synchronization.

Raises:

If event creation fails.

`stream_priority_range`

stream_priority_range(self) -> StreamPriorityRange

Returns the range of stream priorities supported by this device context.

Returns:

StreamPriorityRange: A StreamPriorityRange object containing the minimum and maximum stream priorities.

`create_stream`

create_stream(self, *, blocking: Bool = True) -> DeviceStream

Creates a new stream associated with this device context.

Args:

blocking (Bool): Whether the stream should be blocking.

Returns:

DeviceStream: The newly created stream.

Raises:

If stream creation fails.

create_stream(self, *, priority: Int, blocking: Bool = True) -> DeviceStream

Creates a new stream with the specified priority, associated with this device context.

To create a non-blocking stream with the highest priority, use:

from gpu.host import DeviceContext
var ctx = DeviceContext()
var priority = ctx.stream_priority_range().largest
var stream = ctx.create_stream(priority=priority, blocking=False)

Args:

priority (Int): The priority of the stream.
blocking (Bool): Whether the stream should be blocking.

Returns:

DeviceStream: The newly created stream.

Raises:

If stream creation fails.

`synchronize`

synchronize(self)

Blocks until all asynchronous calls on the stream associated with this device context have completed.

This should never be necessary when writing a custom operation.

`enqueue_wait_for`

enqueue_wait_for(self, other: Self)

Enqueues a wait operation for another device context to complete its work.

This method creates a dependency between two device contexts, ensuring that operations in the current context will not begin execution until all previously enqueued operations in the other context have completed. This is useful for synchronizing work across multiple devices or streams.

Example:

from gpu.host import DeviceContext

# Create two device contexts
var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU

# Enqueue operations on ctx1
# ...

# Make ctx2 wait for ctx1 to complete before proceeding
ctx2.enqueue_wait_for(ctx1)

# Enqueue operations on ctx2 that depend on ctx1's completion
# ...

Args:

other (Self): The device context whose operations must complete before operations in this context can proceed.

Raises:

If there's an error enqueuing the wait operation or if the operation is not supported by the underlying device API.

`get_api_version`

get_api_version(self) -> Int

Returns the API version associated with this device.

This method retrieves the version number of the GPU driver currently installed on the system for the device associated with this context. The version is returned as an integer that can be used to check compatibility with specific features or to troubleshoot driver-related issues.

Example:

from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Get the API version
    var api_version = ctx.get_api_version()
    print("GPU API version:", api_version)

Returns:

Int: An integer representing the driver version.

Raises:

If the driver version cannot be retrieved or if the device context is invalid.

`get_attribute`

get_attribute(self, attr: DeviceAttribute) -> Int

Returns the specified attribute for this device.

Use the aliases defined by DeviceAttribute to specify attributes. For example:

from gpu.host import DeviceAttribute, DeviceContext

def main():
    var ctx = DeviceContext()
    var attr = DeviceAttribute.MAX_BLOCKS_PER_MULTIPROCESSOR
    var max_blocks = ctx.get_attribute(attr)
    print(max_blocks)

Args:

attr (DeviceAttribute): The device attribute to query.

Returns:

Int: The value for attr on this device.

`is_compatible`

is_compatible(self) -> Bool

Returns True if this device is compatible with MAX.

This method checks whether the current device is compatible with the Modular Accelerated Execution (MAX) runtime. It's useful for validating that the device can execute the compiled code before attempting operations.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext()
print("Device is compatible with MAX:", ctx.is_compatible())

Returns:

Bool: True if the device is compatible with MAX, False otherwise.

`id`

id(self) -> Int64

Returns the ID associated with this device.

This method retrieves the unique identifier for the current device. Device IDs are used to distinguish between multiple devices in a system and are often needed for multi-GPU programming.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext()
try:
    var device_id = ctx.id()
    print("Using device with ID:", device_id)
except:
    print("Failed to get device ID")

Returns:

Int64: The unique device ID.

Raises:

If there's an error retrieving the device ID.

`get_memory_info`

get_memory_info(self) -> Tuple[UInt, UInt]

Returns the free and total memory size for this device.

This method queries the current state of device memory, providing information about how much memory is available and the total memory capacity of the device. This is useful for memory management and determining if there's enough space for planned operations.

Example:

from gpu.host import DeviceContext

var ctx = DeviceContext()
try:
    (free, total) = ctx.get_memory_info()
    print("Free memory:", free / (1024*1024), "MB")
    print("Total memory:", total / (1024*1024), "MB")
except:
    print("Failed to get memory information")

Returns:

Tuple: A tuple of (free memory, total memory) in bytes.

Raises:

If there's an error retrieving the memory information.
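As a rough sketch of the "enough space for planned operations" check described above, the snippet below compares free memory against a planned allocation size. The `required_bytes` value is a hypothetical figure chosen for illustration, and reading the result by index assumes the returned Tuple supports ordinary indexing:

```mojo
from gpu.host import DeviceContext

def main():
    var ctx = DeviceContext()

    # get_memory_info() returns (free, total), both in bytes.
    var info = ctx.get_memory_info()
    var free = info[0]
    var total = info[1]

    # Hypothetical size of a planned allocation: 256 MiB.
    var required_bytes: UInt = 256 * 1024 * 1024

    if free >= required_bytes:
        print("Enough memory:", free, "of", total, "bytes are free")
    else:
        print("Not enough memory:", free, "bytes free,", required_bytes, "needed")
```

The free value is only a snapshot: other work on the same device may allocate memory between this check and the actual allocation.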

`can_access`

can_access(self, peer: Self) -> Bool

Returns True if this device can access the identified peer device.

This method checks whether the current device can directly access memory on the specified peer device. Peer-to-peer access allows for direct memory transfers between devices without going through host memory, which can significantly improve performance in multi-GPU scenarios.

Example:

from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU

try:
    if ctx1.can_access(ctx2):
        print("Direct peer access is possible")
        ctx1.enable_peer_access(ctx2)
    else:
        print("Direct peer access is not supported")
except:
    print("Failed to check peer access capability")

Args:

peer (Self): The peer device to check for accessibility.

Returns:

Bool: True if the current device can access the peer device, False otherwise.

Raises:

If there's an error checking peer access capability.

`enable_peer_access`

enable_peer_access(self, peer: Self)

Enables direct memory access to the peer device.

This method establishes peer-to-peer access from the current device to the specified peer device. Once enabled, the current device can directly read from and write to memory allocated on the peer device without going through host memory, which can significantly improve performance for multi-GPU operations.

Notes:

It's recommended to call can_access() first to check if peer access is possible.
Peer access is not always symmetric; you may need to enable access in both directions.

Example:

from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU

try:
    if ctx1.can_access(ctx2):
        ctx1.enable_peer_access(ctx2)
        print("Peer access enabled from device 0 to device 1")

        # For bidirectional access
        if ctx2.can_access(ctx1):
            ctx2.enable_peer_access(ctx1)
            print("Peer access enabled from device 1 to device 0")
    else:
        print("Peer access not supported between these devices")
except:
    print("Failed to enable peer access")

Args:

peer (Self): The peer device to enable access to.

Raises:

If there's an error enabling peer access or if peer access is not supported between the devices.

`supports_multicast`

supports_multicast(self) -> Bool

Returns True if this device supports multicast memory mappings.

Returns:

Bool: True if the current device supports multicast memory, False otherwise.

Raises:

If there's an error checking multicast support.
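Example (a minimal sketch that uses only the signature shown above; what to do in each branch is left to the caller):

```mojo
from gpu.host import DeviceContext

def main():
    var ctx = DeviceContext()

    # Query the capability before relying on multicast memory mappings.
    if ctx.supports_multicast():
        print("Multicast memory mappings are supported")
    else:
        print("Multicast memory mappings are not supported")
```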

`number_of_devices`

static number_of_devices(*, api: String = GPUInfo.from_name[_accelerator_arch()]().api) -> Int

Returns the number of devices available that support the specified API.

This function queries the system for available devices that support the requested API (such as CUDA or HIP). It's useful for determining how many accelerators are available before allocating resources or distributing work.

Example:

from gpu.host import DeviceContext

# Get number of CUDA devices
var num_cuda_devices = DeviceContext.number_of_devices(api="cuda")

# Get number of devices for the default API
var num_devices = DeviceContext.number_of_devices()

Args:

api (String): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the current target accelerator.

Returns:

Int: The number of available devices supporting the specified API.
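The device count can be combined with the DeviceContext constructor to set up one context per accelerator. The following is a sketch only; it assumes device IDs run from 0 to the returned count minus one, which the example for enqueue_wait_for suggests but this page does not state outright:

```mojo
from gpu.host import DeviceContext

def main():
    # Count devices for the default API of the current target.
    var num_devices = DeviceContext.number_of_devices()
    if num_devices == 0:
        print("No accelerators found")
        return

    # Assumption: valid device IDs are 0 .. num_devices - 1.
    for i in range(num_devices):
        var ctx = DeviceContext(i)
        print("Created context for device", ctx.id())
```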

`enable_all_peer_access`

static enable_all_peer_access()

Enables peer-to-peer memory access between all available accelerators.

This function detects all available accelerators in the system and enables peer-to-peer (P2P) memory access between every pair of devices.

When peer access is enabled, kernels running on one device can directly access memory allocated on another device without going through host memory. This is crucial for efficient multi-GPU operations like allreduce.

The function is a no-op when:

No accelerators are available
Only one accelerator is available
Peer access is already enabled between devices

Example:

from gpu.host import DeviceContext

# Enable P2P access between all GPUs
DeviceContext.enable_all_peer_access()

# Now GPUs can directly access each other's memory

Raises:

Error: If peer access cannot be enabled between any pair of devices. This can happen if the hardware doesn't support P2P access or if there's a configuration issue.
