
Run inference with C

Our C API allows you to integrate MAX Engine into your high-performance application code, and run inference with models from PyTorch and ONNX.

This page shows how to use the MAX Engine C API to load a model and execute it with MAX Engine.

Create a runtime context

The first thing you need is an M_RuntimeContext, which is an application-level object that sets up various resources, such as threadpools and allocators, for inference. We recommend you create one context and use it throughout your application.

To create an M_RuntimeContext, you need two other objects:

  • M_RuntimeConfig: This configures details about the runtime context such as the number of threads to use and the logging level.
  • M_Status: This is the object through which MAX Engine passes all error messages.

Here's how you can create both of these objects and then create the M_RuntimeContext:

M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Notice that this code checks whether the M_Status object holds an error, using M_isError(), and exits if it does.
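
If you want to tune the runtime before creating the context, set the options on the M_RuntimeConfig first. Here's a minimal sketch, assuming the M_setNumThreads() setter (see the API reference for the full set of options):

// A minimal sketch; M_setNumThreads() is assumed here, check the API reference.
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_setNumThreads(runtimeConfig, /*numThreads=*/4);  // cap the inference threadpool
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);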

Compile the model

Now you can compile your PyTorch or ONNX model.

Generally, you do that by passing your model path to M_setModelPath(), along with an M_CompileConfig object, and then call M_compileModel().

However, the MAX Engine compiler needs to know the model's input shapes, which are not specified in a TorchScript file (ONNX files include them). So you need some extra code if you're loading a TorchScript model, as shown in the example below.

If you're using a PyTorch model (it must be in TorchScript format), the M_CompileConfig needs the model path, via M_setModelPath(), and the input specs (shape, rank, and types), via M_setTorchInputSpecs().

Here's an abbreviated example:

// Set the model path
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, /*path=*/modelPath);

// Create torch input specs
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TorchInputSpec *inputIdsInputSpec =
    M_newTorchInputSpec(inputIdsShape, /*dimNames=*/NULL, /*rankSize=*/2,
                        /*dtype=*/M_INT32, status);

// ... Similar code here to also create M_TorchInputSpec for
// attentionMaskInputSpec and tokenTypeIdsInputSpec

// Set the input specs
M_TorchInputSpec *inputSpecs[3] = {inputIdsInputSpec, attentionMaskInputSpec,
                                   tokenTypeIdsInputSpec};
M_setTorchInputSpecs(compileConfig, inputSpecs, 3);

// Compile the model
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Because the TorchScript model does not include metadata about the input specs, this code loads the input shapes from .bin files that were generated earlier. You can see an example of how to generate these files in our download-model.py script for bert-c-torchscript on GitHub.
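
The readFileOrExit() function used throughout these examples is not part of the MAX Engine API; it's a small helper from the example code that reads a whole binary file into a heap-allocated buffer. A minimal sketch, with simplified error handling (the real helper on GitHub may differ):

#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: read an entire binary file into a malloc'd buffer,
// or exit the program if the file can't be read.
void *readFileOrExit(const char *path) {
  FILE *file = fopen(path, "rb");
  if (!file) {
    fprintf(stderr, "failed to open %s\n", path);
    exit(EXIT_FAILURE);
  }
  fseek(file, 0, SEEK_END);
  long size = ftell(file);
  fseek(file, 0, SEEK_SET);
  void *buffer = malloc((size_t)size);
  if (!buffer || fread(buffer, 1, (size_t)size, file) != (size_t)size) {
    fprintf(stderr, "failed to read %s\n", path);
    exit(EXIT_FAILURE);
  }
  fclose(file);
  return buffer;
}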

MAX Engine now begins compiling the model asynchronously; M_compileModel() returns immediately. Note that an M_CompileConfig can only be used for a single compilation call. Any subsequent calls require a new M_CompileConfig.
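
For example, if you later compile a second model, create a fresh config rather than reusing the first one (for a TorchScript model you would also set the input specs again). The otherModelPath variable here is hypothetical:

// Each compilation call needs its own M_CompileConfig.
M_CompileConfig *otherConfig = M_newCompileConfig();
M_setModelPath(otherConfig, /*path=*/otherModelPath);
M_AsyncCompiledModel *otherCompiledModel =
    M_compileModel(context, &otherConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}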

Initialize the model

The M_AsyncCompiledModel returned by M_compileModel() is not ready for inference yet. You now need to initialize the model by calling M_initModel(), which returns an instance of M_AsyncModel.

This step prepares the compiled model for fast execution by running and initializing some of the graph operations that are input-independent.

M_AsyncModel *model = M_initModel(context, compiledModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

You don't need to wait for compilation to finish before calling M_initModel(), because M_initModel() internally waits for compilation to complete. If you do want to wait explicitly, add a call to M_waitForCompilation() before you call M_initModel(). This is the general pattern followed by all MAX Engine APIs that accept an asynchronous value as an argument.

M_initModel() is also asynchronous and returns immediately. If you want to wait for it to finish, add a call to M_waitForModel().
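
For example, to block on both steps explicitly (this sketch assumes each wait function takes the async value plus the status object; check the API reference for the exact signatures):

// Assumption: the wait calls take the async value and a status object.
M_waitForCompilation(compiledModel, status);  // block until compilation finishes
// ... call M_initModel() as shown above ...
M_waitForModel(model, status);                // block until initialization finishes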

Prepare input tensors

The last step before you run an inference is to add each input tensor to a single M_AsyncTensorMap. You add each input by calling M_borrowTensorInto(), passing it the input tensor and the corresponding tensor specification (shape, rank, type, and name) as an M_TensorSpec.

// Define the tensor spec
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TensorSpec *inputIdsSpec =
    M_newTensorSpec(inputIdsShape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"input_ids");
free(inputIdsShape);

// Create the tensor map
M_AsyncTensorMap *inputToModel = M_newAsyncTensorMap(context);

// Add an input to the tensor map
int32_t *inputIdsTensor = (int32_t *)readFileOrExit("inputs/input_ids.bin");
M_borrowTensorInto(inputToModel, inputIdsTensor, inputIdsSpec, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Run an inference

Now you're ready to run an inference with M_executeModelSync():

M_AsyncTensorMap *outputs =
    M_executeModelSync(context, model, inputToModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Process the output

The output is returned in an M_AsyncTensorMap, and you can get individual outputs from it with M_getTensorByNameFrom().

M_AsyncTensor *logits =
    M_getTensorByNameFrom(outputs,
                          /*tensorName=*/"logits", status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

If you don't know the tensor name, you can get it from M_getTensorNameAt().
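
For example, you could print the output names before looking one up. This sketch assumes an M_getOutputNames() accessor on the compiled model and that M_getTensorNameAt() takes the resulting name array and an index; verify both against the API reference:

// Assumption: M_getOutputNames(), M_getTensorNameAt(), and
// M_freeTensorNameArray() behave as sketched here.
M_TensorNameArray *outputNames = M_getOutputNames(compiledModel, status);
const char *firstOutputName = M_getTensorNameAt(outputNames, /*index=*/0);
printf("First output tensor: %s\n", firstOutputName);
M_freeTensorNameArray(outputNames);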

Clean up

That's it! When you're done, be sure to free all the objects you created; see the types reference to find the free function for each type.
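
As a rough sketch, cleanup for the objects created on this page could look like the following. The names follow the M_free* convention used by the API, but confirm the exact function for each type in the types reference (and note that some calls, such as M_compileModel(), may take ownership of objects you pass in):

// Assumption: these M_free* names match the types reference.
M_freeTensor(logits);
M_freeAsyncTensorMap(outputs);
M_freeAsyncTensorMap(inputToModel);
M_freeTensorSpec(inputIdsSpec);
M_freeTorchInputSpec(inputIdsInputSpec);
M_freeModel(model);
M_freeCompiledModel(compiledModel);
M_freeRuntimeContext(context);
M_freeRuntimeConfig(runtimeConfig);
M_freeStatus(status);
free(inputIdsTensor);  // buffers allocated with malloc() are freed with free()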

For more example code, see our GitHub repo.