One endpoint, the whole stack
Everything you need to serve production inference — without managing a single GPU.
OpenAI-compatible API
Point your existing SDK at our base URL. Same request and response shapes — no rewrites, no new client.
Token streaming
Server-sent tokens arrive the moment they're generated. Median time-to-first-token under 15 ms keeps chat UIs instant.
Autoscale to zero
Endpoints scale up under load and back to nothing when idle. You pay per second of compute, never for headroom.
Global edge routing
Requests land at the nearest of 200+ regions and route to the warmest GPU pool, cutting cold starts and round-trips.
Structured outputs
Constrain generations to JSON Schema or call your tools directly. Valid, typed responses without a parsing layer.
Any model, one endpoint
Swap the model id to move between text, vision, code, and audio models. Bring your own weights to a private registry.
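As a rough illustration of the structured outputs feature above, the sketch below attaches a JSON Schema to a chat completion request and checks the reply against a matching type. It assumes the endpoint accepts the OpenAI-style response_format field. The Ticket shape, the schema, and the helper names are hypothetical and exist only for this example.

```typescript
// Sketch only: assumes the endpoint accepts the OpenAI-style `response_format`
// field with a JSON Schema. The Ticket shape and helper names are hypothetical.

interface Ticket {
  title: string;
  priority: "low" | "high";
}

// JSON Schema describing the Ticket shape the model is asked to produce.
const ticketSchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    priority: { type: "string", enum: ["low", "high"] },
  },
  required: ["title", "priority"],
  additionalProperties: false,
} as const;

// Build a chat completion request body with the schema constraint attached.
function buildStructuredRequest(model: string, prompt: string) {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    response_format: {
      type: "json_schema",
      json_schema: { name: "ticket", strict: true, schema: ticketSchema },
    },
  };
}

// Parse the message content and narrow it to Ticket; throws if the shape is off.
function parseTicket(content: string): Ticket {
  const { title, priority } = JSON.parse(content) as Partial<Record<keyof Ticket, unknown>>;
  if (typeof title !== "string" || (priority !== "low" && priority !== "high")) {
    throw new Error("Response does not match the Ticket schema");
  }
  return { title, priority };
}
```

The runtime check in parseTicket is a deliberate belt-and-braces choice: even when the server enforces the schema, a cheap guard on the client keeps the rest of the code honestly typed.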
Tokens, end to end
From request to streamed token in three hops — the platform handles routing, scaling, and capacity.
- 01 Send a request
POST a chat completion to the OpenAI-compatible endpoint with your model id and messages.
- 02 We route it
The request lands at the nearest region and routes to the warmest GPU pool for that model.
- 03 Tokens stream back
Tokens stream as they're generated. The endpoint scales with traffic and back to zero when it stops.
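From the client's side, only step 01 involves any code. A minimal sketch of that request is shown below, assuming the standard OpenAI chat completions path. The base URL, API key, and model id are placeholders, not real values, and buildChatRequest is a hypothetical helper.

```typescript
// Sketch only: BASE_URL, the API key, and the model id are hypothetical placeholders.
const BASE_URL = "https://api.example.com/v1";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Step 01: assemble the POST for an OpenAI-style chat completion.
function buildChatRequest(apiKey: string, model: string, messages: ChatMessage[]): HttpRequest {
  return {
    url: `${BASE_URL}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    // stream: true asks for tokens as they are generated (step 03).
    body: JSON.stringify({ model, messages, stream: true }),
  };
}

const helloRequest = buildChatRequest("YOUR_API_KEY", "example-model", [
  { role: "user", content: "Hello" },
]);
// Steps 02 and 03 happen on the platform; the client only sends the request:
// const response = await fetch(helloRequest.url, helloRequest);
```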
Streaming in five lines
Override the base URL on the OpenAI SDK and stream tokens — the rest of your code is unchanged.
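With the official openai package, this amounts to passing baseURL and apiKey to the client constructor and iterating the stream returned by chat.completions.create with stream: true. For a sense of what the SDK reads on the wire, here is a dependency-free sketch that pulls tokens out of OpenAI-style server-sent events. It assumes the endpoint emits the usual "data:" lines ending in a [DONE] sentinel; extractTokens and the sample payload are illustrative only.

```typescript
// Sketch only: assumes OpenAI-style SSE chunks ("data: {json}" lines, ending
// with "data: [DONE]"). This mirrors the wire format, not the SDK's internals.

// Pull the text deltas out of a decoded piece of the event stream.
function extractTokens(sseText: string): string[] {
  const tokens: string[] = [];
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank lines and comments
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
    if (typeof delta === "string") tokens.push(delta);
  }
  return tokens;
}

// Hand-written sample of what a short stream might look like.
const sampleStream = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  "",
  "data: [DONE]",
].join("\n");

console.log(extractTokens(sampleStream).join("")); // prints "Hello"
```

A real network stream can split an event across reads, so a production parser would buffer until it sees a complete line. The SDK takes care of that, which is a good reason to prefer it.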
Built for production load
Measured across the global fleet, p50 unless noted.
Questions, answered
Is it really drop-in compatible?
Yes. Change the baseURL and apiKey on the official openai SDK, and your existing requests, parsing, and tooling keep working. Streaming uses the same stream: true flag.
What happens when traffic spikes?
Endpoints autoscale across the fleet and route to the warmest GPU pool. A burst may see HTTP 503 responses while new capacity comes online; retry with backoff and the request goes through.
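One way to handle those 503s on the client is sketched below: exponential backoff with jitter. The retry count and delays are assumptions chosen for illustration, not documented platform recommendations, and retryOn503 is a hypothetical helper.

```typescript
// Sketch only: the retry policy (attempt count, delays) is an assumption,
// not a documented platform recommendation.

interface StatusResponse {
  status: number;
}

// Exponential delays: base, 2x base, 4x base, ... capped at maxMs.
function backoffDelays(retries: number, baseMs: number, maxMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => Math.min(baseMs * 2 ** i, maxMs));
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Call `send` until it returns something other than 503 or the retries run out.
async function retryOn503<T extends StatusResponse>(
  send: () => Promise<T>,
  retries = 4,
  baseMs = 250,
  maxMs = 4000,
): Promise<T> {
  let response = await send();
  for (const delay of backoffDelays(retries, baseMs, maxMs)) {
    if (response.status !== 503) break;
    // Random jitter spreads retries out so clients do not all return at once.
    await sleep(Math.random() * delay);
    response = await send();
  }
  return response; // may still be a 503 if every retry was exhausted
}
```

Usage would look like `await retryOn503(() => fetch(url, init))`, since a fetch Response carries the status field the helper inspects. Callers should still check the final status, because the helper returns the last response rather than throwing.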
Can I serve my own model?
Enterprise plans include a private model registry. Upload your weights and call them through the same endpoint, with reserved H100 / B200 capacity.
Serve your first token in minutes.
Spin up an OpenAI-compatible endpoint and get your first 1M tokens free. No credit card required.