PRODUCT · INFERENCE

Serverless inference, OpenAI-compatible.

Stream tokens from any model in the catalog with a median time-to-first-token under 15ms. Autoscale to zero, pay per second of compute, and keep the SDK you already use.

START BUILDING · READ THE DOCS
CAPABILITIES

One endpoint, the whole stack

Everything you need to serve production inference — without managing a single GPU.

OpenAI-compatible API

Point your existing SDK at our base URL. Same request and response shapes — no rewrites, no new client.

Token streaming

Server-sent tokens the moment they're generated. Median time-to-first-token under 15ms keeps chat UIs instant.
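For a rough picture of what arrives on the wire, the sketch below pulls token text out of a raw server-sent-events payload. It assumes the stream uses the same `data: {...}` and `data: [DONE]` framing as the OpenAI API; `extractTokens` and `ChatChunk` are hypothetical names, and the official SDK already does this parsing for you.

```typescript
// Hypothetical helper: extract token text from a raw SSE payload.
// Assumes OpenAI-style framing: one `data: <json>` line per event,
// ending with a literal `data: [DONE]` sentinel.
interface ChatChunk {
  choices: { delta?: { content?: string } }[];
}

function extractTokens(ssePayload: string): string[] {
  const tokens: string[] = [];
  for (const line of ssePayload.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank lines and comments
    const data = trimmed.slice("data:".length).trim();
    if (data === "[DONE]") break; // end-of-stream sentinel
    const chunk = JSON.parse(data) as ChatChunk;
    const content = chunk.choices[0]?.delta?.content;
    if (content) tokens.push(content);
  }
  return tokens;
}

const samplePayload = [
  'data: {"choices":[{"delta":{"content":"Hello"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":", world"}}]}',
  "",
  "data: [DONE]",
].join("\n");

console.log(extractTokens(samplePayload).join("")); // prints "Hello, world"
```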

Autoscale to zero

Endpoints scale up under load and back to nothing when idle. You pay per second of compute, never for headroom.

Global edge routing

Requests land at the nearest of 200+ regions and route to the warmest GPU pool, cutting cold starts and round-trips.

Structured outputs

Constrain generations to JSON Schema or call your tools directly. Valid, typed responses without a parsing layer.
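As a minimal sketch of a schema-constrained request: the `response_format` shape below mirrors the OpenAI Chat Completions API, and whether this endpoint accepts it unchanged is an assumption. The `Haiku` type, `haikuSchema`, `buildStructuredRequest`, and `readHaiku` are hypothetical names used for illustration.

```typescript
// Hypothetical sketch: a request body that constrains output to a JSON Schema.
// The `response_format` shape is assumed from OpenAI compatibility.
interface Haiku {
  title: string;
  lines: string[];
}

const haikuSchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    lines: { type: "array", items: { type: "string" } },
  },
  required: ["title", "lines"],
  additionalProperties: false,
} as const;

function buildStructuredRequest(prompt: string) {
  return {
    model: "fortis-l-70b", // model id taken from the quickstart
    messages: [{ role: "user" as const, content: prompt }],
    response_format: {
      type: "json_schema" as const,
      json_schema: { name: "haiku", strict: true, schema: haikuSchema },
    },
  };
}

// If the schema is enforced server-side, the message content should already
// be valid JSON for `Haiku`, so reading it is a single JSON.parse.
function readHaiku(messageContent: string): Haiku {
  return JSON.parse(messageContent) as Haiku;
}

const structuredRequest = buildStructuredRequest("Write a haiku about GPUs.");
console.log(structuredRequest.response_format.json_schema.name); // prints "haiku"
```

The request body can be passed to `client.chat.completions.create` on the SDK client from the quickstart; `readHaiku` would then take the first choice's message content.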

Any model, one endpoint

Swap the model id to move between text, vision, code, and audio models. Bring your own weights to a private registry.

HOW IT WORKS

Tokens, end to end

From request to streamed token in three hops — the platform handles routing, scaling, and capacity.

  1. Send a request

     POST a chat completion to the OpenAI-compatible endpoint with your model id and messages.

  2. We route it

     The request lands at the nearest region and routes to the warmest GPU pool for that model.

  3. Tokens stream back

     Tokens stream as they're generated. The endpoint scales with traffic and back to zero when it stops.
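The first step above can be sketched without the SDK, as a plain HTTP request. The base URL and model id come from the quickstart; the `/chat/completions` path and the Bearer authorization header are assumed from OpenAI compatibility, and `buildChatRequest` is a hypothetical helper.

```typescript
// Hypothetical helper: the HTTP request behind step 1, without the SDK.
// Path and auth header are assumptions based on OpenAI compatibility.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildChatRequest(apiKey: string, model: string, messages: ChatMessage[]) {
  return {
    url: "https://api.fortis.dev/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      // stream: true asks for tokens as they are generated (step 3)
      body: JSON.stringify({ model, messages, stream: true }),
    },
  };
}

const chatRequest = buildChatRequest("sk-example", "fortis-l-70b", [
  { role: "user", content: "Stream me a haiku about GPUs." },
]);
console.log(chatRequest.init.method, chatRequest.url);
// Passing these to fetch(chatRequest.url, chatRequest.init) would send the request.
```

Routing and scaling (steps 2 and 3) happen on the platform side, so nothing in the client changes for them.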

[Illustration: tokens streaming along a conveyor belt into the output bin]
QUICKSTART

Streaming in a few lines

Override the base URL on the OpenAI SDK and stream tokens — the rest of your code is unchanged.

TYPESCRIPT
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.FORTIS_API_KEY,
  baseURL: "https://api.fortis.dev/v1",
});

const stream = await client.chat.completions.create({
  model: "fortis-l-70b",
  messages: [{ role: "user", content: "Stream me a haiku about GPUs." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
PERFORMANCE

Built for production load

Measured across the global fleet, p50 unless noted.

12ms · MEDIAN TIME-TO-FIRST-TOKEN
3,200 · TOKENS / SEC PER STREAM
99.99% · MONTHLY UPTIME SLA
200+ · EDGE REGIONS
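To check time-to-first-token from your own client, a sketch like the one below times the gap between starting a stream and receiving its first chunk, then takes the median over several runs. `measureTtftMs` and `median` are hypothetical helpers; a client-side number includes your network round trip, so it will not necessarily match the fleet figures above.

```typescript
// Hypothetical helper: time-to-first-token for any async stream of chunks.
// `now` is injectable so the measurement can be tested deterministically.
async function measureTtftMs<T>(
  startStream: () => AsyncIterable<T> | Promise<AsyncIterable<T>>,
  now: () => number = Date.now,
): Promise<number> {
  const startedAt = now();
  const stream = await startStream();
  for await (const _chunk of stream) {
    return now() - startedAt; // stop at the first chunk
  }
  throw new Error("stream ended before any token arrived");
}

// Median (p50) of a list of samples.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

console.log(median([14, 9, 12])); // prints 12
```

With the quickstart client, `startStream` could be `() => client.chat.completions.create({ ..., stream: true })`, since the SDK's stream is an async iterable.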
FAQ

Questions, answered

Is it really drop-in compatible?

Yes. Change the baseURL and apiKey on the official openai SDK and existing requests, parsing, and tooling keep working. Streaming uses the same stream: true flag.

What happens when traffic spikes?

Endpoints autoscale across the fleet and route to the warmest GPU pool. Bursts return 503 only while capacity scales — retry with backoff and the request lands.
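A minimal sketch of retry with exponential backoff on 503 follows. `withBackoff`, the attempt limit, and the delay values are illustrative assumptions, not platform recommendations; the helper treats any thrown error with a `status` of 503 as retryable, which matches how the official openai SDK exposes HTTP status on its API errors.

```typescript
// Hypothetical helper: retry a call on HTTP 503 with exponential backoff.
// `sleep` is injectable so the schedule can be tested without waiting.
const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 200,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as { status?: number } | null)?.status;
      // Give up on anything other than 503, or once attempts are exhausted.
      if (status !== 503 || attempt >= maxAttempts) throw err;
      // 200ms, 400ms, 800ms, ... doubling after each failed attempt
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Usage would look like `await withBackoff(() => client.chat.completions.create({ ... }))`. The official SDK also has a `maxRetries` client option, so a wrapper like this mainly matters when you call the endpoint with plain HTTP or want a custom schedule.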

Can I serve my own model?

Enterprise plans include a private model registry. Upload your weights and call them through the same endpoint, with reserved H100 / B200 capacity.

Serve your first token in minutes.

Spin up an OpenAI-compatible endpoint and get your first 1M tokens free. No credit card required.