How we cut time-to-first-token to 12ms

Time-to-first-token is the latency that users actually feel. Here's the routing and warm-pool work that took ours from 90ms to 12ms.

Marcus Webb · Infrastructure · 7 min read

For a chat UI, the number that matters isn’t tokens per second. It’s how long the user stares at an empty box before the first character appears. We obsess over time-to-first-token (TTFT), and over the last two quarters we took our p50 from 90ms down to 12ms. Here’s how.

Routing to the warmest pool

The biggest win came from routing. Instead of sending a request to the nearest region, we route to the nearest region that already has a warm worker for the requested model. A request that lands on a warm pool skips the entire model-load path.
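
The routing rule can be sketched in a few lines of TypeScript. This is an illustration only: the `Region` shape, the region names, the round-trip numbers, and the fallback to the nearest region when nothing is warm are all assumptions, not details of the production router.

```typescript
// Hypothetical data model: field names and values are illustrative.
interface Region {
  name: string;
  rttMs: number; // estimated network round trip from the client
  warmWorkers: Map<string, number>; // model id -> idle warm workers
}

// Pick the closest region that already has a warm worker for the model.
// Assumed fallback: if no region is warm, take the closest one and accept
// a cold start there.
function pickRegion(regions: Region[], model: string): Region | undefined {
  const byDistance = [...regions].sort((a, b) => a.rttMs - b.rttMs);
  const warm = byDistance.find((r) => (r.warmWorkers.get(model) ?? 0) > 0);
  return warm ?? byDistance[0];
}

const regions: Region[] = [
  { name: "us-east", rttMs: 8, warmWorkers: new Map([["model-a", 0]]) },
  { name: "us-west", rttMs: 31, warmWorkers: new Map([["model-a", 2]]) },
  { name: "eu-west", rttMs: 84, warmWorkers: new Map([["model-a", 5]]) },
];

console.log(pickRegion(regions, "model-a")?.name); // us-west
console.log(pickRegion(regions, "model-b")?.name); // us-east
```

In the first call the nearest region has no warm worker for `model-a`, so the request goes one hop further to a region that does. The second call shows the assumed fallback path, where every region is cold.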

Keeping pools warm

We predict demand per-model, per-region from recent traffic and keep a small number of workers resident ahead of it. When traffic drops, the pool drains back toward zero. The trick is sizing the buffer so it absorbs bursts without paying for idle capacity.
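
One simple way to express this sizing problem is sketched below. The forecasting method here is an assumption for illustration: it smooths recent requests per second with an exponentially weighted moving average, converts that to concurrent workers with Little's law, and adds a fractional burst buffer. The parameter names and values (`alpha`, `holdSeconds`, `headroom`, `maxWorkers`) are hypothetical.

```typescript
// Hypothetical parameters; none of these values come from a real deployment.
interface PoolConfig {
  alpha: number; // EWMA smoothing factor in (0, 1]; higher reacts faster
  holdSeconds: number; // average seconds one request keeps a worker busy
  headroom: number; // burst buffer as a fraction, e.g. 0.25 = 25% extra
  maxWorkers: number; // hard cap on resident workers
}

// Exponentially weighted moving average of requests per second.
function updateForecast(prevRps: number, observedRps: number, alpha: number): number {
  return alpha * observedRps + (1 - alpha) * prevRps;
}

// Little's law: busy workers = arrival rate x time each request holds a worker.
// The headroom multiplier is the buffer that absorbs bursts.
function targetWarmWorkers(forecastRps: number, cfg: PoolConfig): number {
  const steadyState = forecastRps * cfg.holdSeconds;
  const withBuffer = Math.ceil(steadyState * (1 + cfg.headroom));
  return Math.min(cfg.maxWorkers, Math.max(0, withBuffer));
}

const cfg: PoolConfig = { alpha: 0.5, holdSeconds: 2, headroom: 0.25, maxWorkers: 64 };
let forecast = 0;
const sizes: number[] = [];
for (const observedRps of [4, 8, 8, 0, 0, 0]) {
  forecast = updateForecast(forecast, observedRps, cfg.alpha);
  sizes.push(targetWarmWorkers(forecast, cfg));
}
console.log(sizes); // [ 5, 13, 17, 9, 5, 3 ]
```

The pool grows as traffic ramps and shrinks once it stops. Raising `headroom` buys more burst tolerance at the cost of idle workers, which is the tradeoff described above. Because of the `Math.ceil`, this sketch keeps at least one worker until the forecast reaches exactly zero, so a real system would likely add a cutoff below which the pool is fully drained.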

Speculative connection setup

We start setting up the downstream connection while authentication and rate-limit checks are still in flight. By the time the request is authorized, the path to the GPU is already open. In code terms: start the promises early, await them late.
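
A minimal TypeScript sketch of "start early, await late" follows. The functions `authenticate`, `checkRateLimit`, and `openUpstream`, the `Upstream` interface, and the delays are hypothetical stand-ins, not the real service calls. Nothing is sent on the speculative connection until both checks have passed.

```typescript
// Hypothetical stand-ins for real services; the delays are illustrative only.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface Upstream {
  send(prompt: string): string;
  close(): void;
}

async function authenticate(apiKey: string): Promise<boolean> {
  await sleep(20);
  return apiKey === "valid-key";
}

async function checkRateLimit(_apiKey: string): Promise<boolean> {
  await sleep(15);
  return true;
}

async function openUpstream(): Promise<Upstream> {
  await sleep(25);
  return { send: (prompt: string) => `first token for: ${prompt}`, close: () => {} };
}

async function handleRequest(apiKey: string, prompt: string): Promise<string> {
  // Start all three now. Nothing is awaited yet, so the waits overlap.
  const authP = authenticate(apiKey);
  const limitP = checkRateLimit(apiKey);
  const upstreamP = openUpstream();
  // If we bail out early, nobody awaits upstreamP; keep a failure there
  // from surfacing as an unhandled rejection.
  upstreamP.catch(() => {});

  const [authed, withinLimit] = await Promise.all([authP, limitP]);
  if (!authed || !withinLimit) {
    // The speculative connection was wasted work: release it once it opens.
    upstreamP.then((u) => u.close()).catch(() => {});
    throw new Error("unauthorized or rate limited");
  }

  // Await late: by now the connection is usually open or nearly so.
  const upstream = await upstreamP;
  return upstream.send(prompt);
}

handleRequest("valid-key", "hello").then((reply) => console.log(reply));
// first token for: hello
```

With these made-up delays, awaiting each step in sequence would cost roughly 20 + 15 + 25 = 60ms, while the overlapped version costs roughly as long as the slowest step. The price is an occasional connection opened for a request that is then rejected, so the rejection path has to clean it up.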

The result

  • p50 TTFT: 90ms → 12ms
  • p99 TTFT: 410ms → 68ms
  • Cold-start rate: down 94% during peak hours

None of these techniques is exotic. They’re the unglamorous work of measuring the path a request actually takes and removing the waits one at a time.
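
For readers who want to track the same numbers, here is a small sketch of the arithmetic in TypeScript. It uses the nearest-rank definition of a percentile, which is an assumption (other definitions interpolate between samples), and the latency samples are made up for illustration. Only the before and after figures passed to `reductionPct` come from the results above.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.min(sorted.length, Math.max(1, Math.ceil((p / 100) * sorted.length)));
  return sorted[rank - 1]!;
}

// Relative reduction from a before value to an after value, in percent.
function reductionPct(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Made-up TTFT samples in milliseconds, with one slow outlier.
const ttftSamples = [10, 12, 11, 70, 13, 9, 12, 14, 11, 12];
console.log(percentile(ttftSamples, 50)); // 12
console.log(percentile(ttftSamples, 99)); // 70

console.log(reductionPct(90, 12).toFixed(1)); // 86.7
console.log(reductionPct(410, 68).toFixed(1)); // 83.4
```

The single outlier barely moves p50 but entirely determines p99, which is why both percentiles are worth tracking.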
