Serving

Our high-performance serving library provides an OpenAI-compatible REST endpoint out of the box, so you can migrate from OpenAI services, or from other libraries such as vLLM and SGLang, with minimal code changes. The server handles the complete request lifecycle and has built-in support for function calling (which enables agentic workflows), structured output (for predictable JSON responses), and performance optimizations such as prefix caching and speculative decoding.
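
Because the endpoint speaks the OpenAI API, migration can be as small as changing the client's base URL. The sketch below uses the official `openai` Python client; the host, port, API key, and model name are assumptions that depend on how your server is launched.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
# The base URL, API key, and model name below are placeholders;
# substitute the values for your own deployment.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # many local servers ignore the key
)

response = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[
        {"role": "user", "content": "Summarize prefix caching in one sentence."}
    ],
)
print(response.choices[0].message.content)
```

Because the endpoint follows the OpenAI API, function-calling (`tools`) and structured-output (`response_format`) requests use this same client call.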

For scenarios that don't require a REST endpoint, you can use our Python API for offline inference instead, running models directly inside your application for batch processing or embedded use cases.
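
As a rough illustration of the offline workflow, the sketch below assumes a hypothetical `LLM` class with a `generate()` method; the actual import path, class, and method names come from the library's Python API and will differ.

```python
# Hypothetical offline-inference sketch: `serving_library`, `LLM`,
# and `generate()` are illustrative placeholders, not the real API.
from serving_library import LLM

# Load the model once, in-process -- no HTTP server involved.
llm = LLM("my-model")  # placeholder model name

# Batch processing: submit many prompts in a single call.
prompts = [
    "Translate 'hello world' to French.",
    "Name one benefit of speculative decoding.",
]
for output in llm.generate(prompts, max_tokens=64):
    print(output.text)
```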

Guides

Tutorials