
Run inference with C

Our C API allows you to integrate MAX Engine into your high-performance application code, and run inference with models from PyTorch and ONNX.

This tutorial shows how to use the MAX Engine C API to load a BERT model and run inference. We'll walk through a complete example that demonstrates loading a model, preparing inputs, and executing inference.

Create a virtual environment

Using a virtual environment ensures that you have the Python version and packages that are compatible with this project. We'll use the Magic CLI to create the environment and install the required packages.

Initialize the runtime context

The first step in using the MAX Engine C API is initializing the runtime context. This context manages resources like thread pools and memory allocators that are needed during inference.

Create a new file called main.c. We'll need to create three key objects:

// Helper macro for error checking
#define CHECK(x)                                                               \
  if (M_isError(x)) {                                                          \
    logError(M_getError(x));                                                   \
    return EXIT_FAILURE;                                                       \
  }

M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
CHECK(status);

M_RuntimeContext is an application-level object that sets up resources such as thread pools and allocators used during inference. We recommend creating one context and using it throughout your application.

M_RuntimeContext requires two objects:

  • M_RuntimeConfig: This configures details about the runtime context such as the number of threads to use and the logging level.
  • M_Status: This is the object through which MAX Engine passes all error messages.

Notice that the CHECK macro uses M_isError() to check whether the M_Status object holds an error, and exits if it does.
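The snippets in this tutorial also call a few local helpers that are not part of the MAX Engine API: logInfo() and logError() for logging, and readFileOrExit() (used later) for loading binary files. A minimal sketch of what they might look like:

#include <stdio.h>
#include <stdlib.h>

// Minimal logging helpers used by the snippets in this tutorial.
static void logInfo(const char *message) { printf("INFO: %s\n", message); }

static void logError(const char *message) {
  fprintf(stderr, "ERROR: %s\n", message);
}

// Reads an entire binary file into a malloc'd buffer and exits on failure.
// The caller owns (and must free) the returned buffer.
static void *readFileOrExit(const char *filename) {
  FILE *file = fopen(filename, "rb");
  if (!file) {
    printf("failed to open %s. Aborting.\n", filename);
    exit(EXIT_FAILURE);
  }
  fseek(file, 0, SEEK_END);
  long fileSize = ftell(file);
  rewind(file);

  void *buffer = malloc(fileSize);
  if (!buffer || fread(buffer, 1, fileSize, file) != (size_t)fileSize) {
    printf("failed to read %s. Aborting.\n", filename);
    exit(EXIT_FAILURE);
  }
  fclose(file);
  return buffer;
}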

Compile the model

After initializing the runtime, you'll need to compile your model. MAX Engine supports both PyTorch's TorchScript format and ONNX models. The process differs slightly depending on your model format.

To compile the model, pass your model path to M_setModelPath(), along with an M_CompileConfig object. Then call M_compileModel().

PyTorch models require additional input shape specifications, since these aren't included in the TorchScript format.

// Create compilation config and set model path
logInfo("Compiling Model");
M_CompileConfig *compileConfig = M_newCompileConfig();
const char *modelPath = argv[1];
M_setModelPath(compileConfig, /*path=*/modelPath);

// Define input specifications for the PyTorch model
// Input IDs specification
int64_t inputShape[] = {1, 512}; // Example shape for a BERT-like model
M_TorchInputSpec *inputSpec = M_newTorchInputSpec(
    inputShape,        // Shape array
    /*dimNames=*/NULL, // Dimension names (optional)
    /*rankSize=*/2,    // Number of dimensions
    /*dtype=*/M_INT32, // Data type
    status);
CHECK(status);

// Attention mask specification
int64_t maskShape[] = {1, 512};
M_TorchInputSpec *maskSpec = M_newTorchInputSpec(
    maskShape,
    /*dimNames=*/NULL,
    /*rankSize=*/2,
    /*dtype=*/M_INT32,
    status);
CHECK(status);

// Set input specifications for compilation
M_TorchInputSpec *inputSpecs[] = {inputSpec, maskSpec};
M_setTorchInputSpecs(compileConfig, inputSpecs, /*numInputs=*/2);

// Compile the model
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
CHECK(status);

The M_CompileConfig takes the model path set by M_setModelPath(), and M_setTorchInputSpecs() provides the specification for each input: its shape, rank, and data type.

Because the TorchScript format doesn't include metadata about the input shapes, you must specify them yourself. This example hardcodes a [1, 512] shape for each input; you can also generate the shape data when you export the model, as shown in our download-model.py script for bert-c-torchscript on GitHub.

MAX Engine now begins compiling the model asynchronously; M_compileModel() returns immediately.

Initialize the model

Now that compilation is underway, you can initialize the model.

Call M_initModel(), which returns an instance of M_AsyncModel.

This step prepares the compiled model for fast execution by running and initializing some of the graph operations that are input-independent.

M_AsyncModel *model = M_initModel(
    context,
    compiledModel,
    /*weightsRegistry=*/NULL,
    status);
CHECK(status);

// Wait for the model to be ready (initialization waits on compilation internally)
logInfo("Waiting for model compilation to finish");
M_waitForModel(model, status);
CHECK(status);

You don't need to wait for compilation to finish before calling M_initModel(), because M_initModel() internally waits for compilation to complete. If you want to wait explicitly, add a call to M_waitForCompilation() before M_initModel(), as shown below. This is the general pattern followed by all MAX Engine APIs that accept an asynchronous value as an argument.
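For example, an explicit wait would look like this (a short sketch; check the API reference for the exact signature of M_waitForCompilation()):

// Optionally block until compilation completes before initializing the model.
M_waitForCompilation(compiledModel, status);
CHECK(status);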

M_initModel() is also asynchronous and returns immediately. If you want to wait for it to finish, add a call to M_waitForModel().

Prepare input tensors

Before running inference, you need to prepare your input data in the format expected by the model. This involves creating an M_AsyncTensorMap and adding your input tensors:

// Define the tensor spec
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TensorSpec *inputIdsSpec =
    M_newTensorSpec(inputIdsShape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"input_ids");
free(inputIdsShape);

// Create the tensor map
M_AsyncTensorMap *inputToModel = M_newAsyncTensorMap(context);

// Add an input to the tensor map
int32_t *inputIdsTensor = (int32_t *)readFileOrExit("inputs/input_ids.bin");
M_borrowTensorInto(inputToModel, inputIdsTensor, inputIdsSpec, status);
CHECK(status);

Add each input by calling M_borrowTensorInto(), passing the input tensor and its corresponding M_TensorSpec, which describes the tensor's shape, data type, and name.
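The model was compiled with two inputs, so the attention mask must be added to the same tensor map. Here's a sketch that follows the same pattern; the file names and the "attention_mask" tensor name are assumptions for this example, not values defined by the tutorial:

// Attention mask: same pattern as input_ids.
int64_t *attentionMaskShape =
    (int64_t *)readFileOrExit("inputs/attention_mask_shape.bin");
M_TensorSpec *attentionMaskSpec =
    M_newTensorSpec(attentionMaskShape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"attention_mask");
free(attentionMaskShape);

int32_t *attentionMaskTensor =
    (int32_t *)readFileOrExit("inputs/attention_mask.bin");
M_borrowTensorInto(inputToModel, attentionMaskTensor, attentionMaskSpec, status);
CHECK(status);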

Run inference

With your input data prepared, you can now run inference with M_executeModelSync():

logInfo("Running Inference...");
M_AsyncTensorMap *outputs = M_executeModelSync(context, model, inputToModel, status);
CHECK(status);

M_AsyncValue *resultValue = M_getValueByNameFrom(outputs, "result0", status);
CHECK(status);
logInfo("Running Inference...");
M_AsyncTensorMap *outputs = M_executeModelSync(context, model, inputToModel, status);
CHECK(status);

M_AsyncValue *resultValue = M_getValueByNameFrom(outputs, "result0", status);
CHECK(status);

Process the output

After inference completes, you'll need to process the output tensors:

logInfo("Extracting output values");
M_AsyncTensor *result = M_getTensorFromValue(resultValue);
size_t numElements = M_getTensorNumElements(result);
printf("Tensor size: %ld\n", numElements);
M_Dtype dtype = M_getTensorType(result);

// Save output to file
const char *outputFilePath = "outputs.bin";
FILE *file = fopen(outputFilePath, "wb");
if (!file) {
printf("failed to open %s. Aborting.\n", outputFilePath);
return EXIT_FAILURE;
}
fwrite(M_getTensorData(result), M_sizeOf(dtype), numElements, file);
fclose(file);
logInfo("Extracting output values");
M_AsyncTensor *result = M_getTensorFromValue(resultValue);
size_t numElements = M_getTensorNumElements(result);
printf("Tensor size: %ld\n", numElements);
M_Dtype dtype = M_getTensorType(result);

// Save output to file
const char *outputFilePath = "outputs.bin";
FILE *file = fopen(outputFilePath, "wb");
if (!file) {
printf("failed to open %s. Aborting.\n", outputFilePath);
return EXIT_FAILURE;
}
fwrite(M_getTensorData(result), M_sizeOf(dtype), numElements, file);
fclose(file);

The output is returned in an M_AsyncTensorMap. You can look up an individual output by name with M_getValueByNameFrom(), as shown above, and then extract the tensor with M_getTensorFromValue().

If you don't know the tensor name, you can get it from M_getTensorNameAt().
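If you want to enumerate all outputs, a sketch like the following can help; it assumes the M_getTensorMapSize(), M_getTensorNamesFrom(), and M_freeTensorNameArray() functions listed in the API reference, so verify them against your MAX version:

// Print the name of every output tensor in the map.
size_t numOutputs = M_getTensorMapSize(outputs, status);
CHECK(status);
M_TensorNameArray *tensorNames = M_getTensorNamesFrom(outputs, status);
CHECK(status);
for (size_t i = 0; i < numOutputs; i++) {
  const char *tensorName = M_getTensorNameAt(tensorNames, i);
  printf("Output %zu: %s\n", i, tensorName);
}
M_freeTensorNameArray(tensorNames);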

Clean up

When you're finished, free the objects you created. Each MAX Engine type has a corresponding free function; see the types reference to find the right one for each object.
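For this tutorial, the teardown might look like the following sketch. The M_free* names are the ones listed in the types reference; double-check them against your MAX version:

// Free the objects created in this example.
M_freeTensor(result);
M_freeValue(resultValue);
M_freeAsyncTensorMap(outputs);
M_freeAsyncTensorMap(inputToModel);
M_freeTensorSpec(inputIdsSpec);
free(inputIdsTensor); // Borrowed tensor data remains owned by the application.
M_freeModel(model);
M_freeCompiledModel(compiledModel);
M_freeRuntimeContext(context);
M_freeRuntimeConfig(runtimeConfig);
M_freeStatus(status);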

Next steps

In this tutorial, you learned how to use the MAX Engine C API to run inference from a C application: initializing the runtime, compiling and initializing a model, preparing input tensors, executing inference, and processing the results.

For more example code, see our GitHub repo.