For a chat UI, the number that matters isn’t tokens-per-second; it’s how long the user stares at an empty box before the first character appears. We obsess over time-to-first-token (TTFT), and over the last two quarters we took our p50 from 90ms down to 12ms. Here’s how.
Routing to the warmest pool
The biggest win came from routing. Instead of sending a request to the nearest region, we route to the nearest region that already has a warm worker for the requested model. A request that lands on a warm pool skips the entire model-load path.
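As a rough illustration of that routing rule, here is a minimal TypeScript sketch. The `Region` shape, the `pickRegion` name, and the sample data are hypothetical, and the fallback to the nearest region when no pool is warm is an assumption, not a description of our production router.

```typescript
// Hypothetical shape of what a router might know about each region.
interface Region {
  name: string;
  latencyMs: number;                 // measured latency from the client
  warmWorkers: Map<string, number>;  // model id -> number of warm workers
}

// Pick the lowest-latency region that already has a warm worker for the model.
// Assumption: if no region is warm, fall back to the nearest one and accept
// a cold start there.
function pickRegion(regions: Region[], model: string): Region | undefined {
  const byLatency = [...regions].sort((a, b) => a.latencyMs - b.latencyMs);
  const warm = byLatency.find((r) => (r.warmWorkers.get(model) ?? 0) > 0);
  return warm ?? byLatency[0];
}

// Illustrative data only: the nearest region is cold for "model-a".
const sampleRegions: Region[] = [
  { name: "us-east", latencyMs: 8, warmWorkers: new Map<string, number>() },
  { name: "us-west", latencyMs: 31, warmWorkers: new Map([["model-a", 2]]) },
];

const chosen = pickRegion(sampleRegions, "model-a");
```

With this sample data, `chosen` is the `us-west` entry: the request travels a little further but avoids the model-load path.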
Keeping pools warm
We predict demand per-model, per-region from recent traffic and keep a small number of workers resident ahead of it. When traffic drops, the pool drains back toward zero. The trick is sizing the buffer so it absorbs bursts without paying for idle capacity.
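One way to picture the sizing problem is the sketch below. It assumes an exponentially weighted moving average as the forecast and a fixed fractional buffer on top. The forecasting method, every name, and every parameter (`alpha`, `perWorkerRps`, `burstBuffer`, `idleCutoffRps`) are illustrative assumptions rather than the values or model we run.

```typescript
// Hypothetical tuning knobs for one (model, region) pool.
interface PoolConfig {
  alpha: number;         // EWMA smoothing factor in (0, 1]
  perWorkerRps: number;  // assumed requests/sec one warm worker can absorb
  burstBuffer: number;   // fractional headroom, e.g. 0.25 = 25% extra
  idleCutoffRps: number; // below this forecast, let the pool drain to zero
}

// Blend the latest observed traffic into the running forecast.
function updateForecast(prev: number, observedRps: number, alpha: number): number {
  return alpha * observedRps + (1 - alpha) * prev;
}

// Convert a forecast into a warm-worker target, with headroom for bursts.
function targetWarmWorkers(forecastRps: number, cfg: PoolConfig): number {
  if (forecastRps < cfg.idleCutoffRps) return 0;
  return Math.ceil((forecastRps * (1 + cfg.burstBuffer)) / cfg.perWorkerRps);
}

const cfg: PoolConfig = { alpha: 0.5, perWorkerRps: 10, burstBuffer: 0.25, idleCutoffRps: 1 };

// Traffic arrives at 40 rps, then stops; the forecast decays on each update.
let forecast = updateForecast(0, 40, cfg.alpha);   // 20
const busyTarget = targetWarmWorkers(forecast, cfg);
for (let i = 0; i < 5; i++) forecast = updateForecast(forecast, 0, cfg.alpha);
const idleTarget = targetWarmWorkers(forecast, cfg);
```

In this sketch, `burstBuffer` is the knob the trade-off hinges on: raising it absorbs larger bursts at the cost of more idle workers, and lowering it does the reverse.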
Speculative connection setup
We start setting up the downstream connection while authentication and rate-limit checks are still in flight. By the time the request is authorized, the path to the GPU is already open. In code terms: start the promises early, await them late.
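The pattern looks roughly like this in TypeScript. The three service functions are hypothetical stand-ins that simulate latency with timers, and closing the speculative connection on rejection is an assumption about how cleanup might be handled.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface Connection {
  model: string;
  closed: boolean;
  close(): void;
}

// Hypothetical stand-ins for the real services; delays are simulated.
async function authenticate(token: string): Promise<boolean> {
  await sleep(20);
  return token === "valid";
}

async function checkRateLimit(userId: string): Promise<boolean> {
  await sleep(20);
  return userId.length > 0;
}

async function openDownstream(model: string): Promise<Connection> {
  await sleep(30);
  const conn: Connection = { model, closed: false, close() { conn.closed = true; } };
  return conn;
}

async function handle(token: string, userId: string, model: string): Promise<Connection | null> {
  // Start all three operations before awaiting any of them.
  const connP = openDownstream(model);
  const authP = authenticate(token);
  const limitP = checkRateLimit(userId);

  // Await the checks first; connection setup keeps running in the background.
  const [authed, allowed] = await Promise.all([authP, limitP]);
  if (!authed || !allowed) {
    // Discard the speculative connection so rejected requests don't leak it.
    connP.then((c) => c.close()).catch(() => {});
    return null;
  }
  // Await late: by now most of the setup time has already elapsed.
  return await connP;
}
```

Run sequentially, the simulated steps would take about 70ms; overlapped, the total is bounded by the slowest one, about 30ms. The cost is that some connections are opened for requests that end up rejected, so the cleanup branch matters.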
The result
- p50 TTFT: 90ms → 12ms
- p99 TTFT: 410ms → 68ms
- Cold-start rate: down 94% during peak hours
None of these are exotic — they’re the unglamorous work of measuring the path a request actually takes and removing the waits one at a time.
Ready to serve your first token?
Spin up an OpenAI-compatible endpoint, with your first 1M tokens free.