ANNOUNCEMENT

Introducing Fortis Inference

A serverless, OpenAI-compatible inference layer that scales from your first request to billions — without managing a single GPU.

Priya Nair, Staff Engineer · June 10, 2026 · 4 min read

Today we’re opening up Fortis Inference to everyone. It’s the same platform we’ve been running internally for the last year: a serverless endpoint that takes an OpenAI-compatible request and streams tokens back with a median time-to-first-token under 15ms.

Why we built it

Running inference in production means wrestling with GPU pools, autoscaling, cold starts, and routing, none of which is the product you actually want to ship. Fortis collapses all of that into one endpoint. You change a baseURL and an apiKey, and the rest of your code stays exactly the same.
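To make the two-value swap concrete, here is a minimal sketch that builds the same OpenAI-style chat request against two different backends without sending anything. The buildChatRequest helper, the /chat/completions path (borrowed from the OpenAI convention), the "example-model" name, and the placeholder keys are all illustrative assumptions, not part of the Fortis documentation.

```typescript
// Connection settings: the only two values that differ between providers.
interface ClientConfig {
  baseURL: string;
  apiKey: string;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds an OpenAI-style chat request without sending it.
function buildChatRequest(
  config: ClientConfig,
  model: string,
  messages: ChatMessage[],
): PreparedRequest {
  // Strip trailing slashes so the joined URL has exactly one separator.
  const base = config.baseURL.replace(/\/+$/, "");
  return {
    url: `${base}/chat/completions`, // path assumed from the OpenAI convention
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify({ model, messages, stream: true }),
  };
}

const messages: ChatMessage[] = [{ role: "user", content: "Hello" }];

// Same call, two configs. "example-model" and both keys are placeholders.
const openaiRequest = buildChatRequest(
  { baseURL: "https://api.openai.com/v1", apiKey: "OPENAI_KEY_PLACEHOLDER" },
  "example-model",
  messages,
);
const fortisRequest = buildChatRequest(
  { baseURL: "https://api.fortis.dev/v1", apiKey: "FORTIS_KEY_PLACEHOLDER" },
  "example-model",
  messages,
);
```

The two prepared requests carry identical bodies. Only the URL and the Authorization header differ, which is the property that lets the rest of an application stay untouched.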

What you get on day one

OpenAI-compatible REST API for every model in the catalog
Token streaming with sub-15ms time-to-first-token
Autoscale to zero — you pay per second of compute, never for idle headroom
Global edge routing across 200+ regions
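If you want to check time-to-first-token from your own client, a small timer like the sketch below is one way to do it. FirstTokenTimer is a hypothetical helper, not part of any Fortis SDK. A client-side measurement also includes network round-trip time, so it may not match a figure measured at the server.

```typescript
// Records elapsed time between sending a request and the first streamed token.
class FirstTokenTimer {
  private now: () => number;
  private startedAt: number | null = null;
  private firstTokenAt: number | null = null;

  // `now` is injectable so the timer can be exercised without real delays.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Call immediately before the request is sent.
  start(): void {
    this.startedAt = this.now();
    this.firstTokenAt = null;
  }

  // Call for every streamed fragment; only the first non-empty one is recorded.
  onToken(fragment: string): void {
    if (this.startedAt !== null && this.firstTokenAt === null && fragment.length > 0) {
      this.firstTokenAt = this.now();
    }
  }

  // Milliseconds to first token, or null if no token has arrived yet.
  elapsedMs(): number | null {
    if (this.startedAt === null || this.firstTokenAt === null) return null;
    return this.firstTokenAt - this.startedAt;
  }
}

// Usage with a fake clock standing in for real time.
let fakeClock = 1000;
const timer = new FirstTokenTimer(() => fakeClock);
timer.start();
fakeClock = 1012;
timer.onToken("Hel"); // first token: recorded
fakeClock = 1040;
timer.onToken("lo"); // later tokens: ignored
const ttft = timer.elapsedMs(); // 12 with the fake clock above
```

In real use you would drop the fake clock, call start() right before issuing the request, and call onToken() from whatever loop consumes the stream.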

Getting started

Create a free account, grab an API key, and point your SDK at https://api.fortis.dev/v1. Your first 1M tokens are on us. The quickstart guide takes you from zero to a streaming completion in about five minutes.
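For a sense of what a streaming completion looks like on the wire, here is a rough sketch of a parser for OpenAI-style streamed responses. It assumes Fortis follows the OpenAI streaming format (server-sent "data:" lines carrying JSON with choices[0].delta.content, ending in "data: [DONE]"). StreamDeltaParser is a hypothetical name, and the chunks are simulated rather than fetched.

```typescript
// Incremental parser for OpenAI-style server-sent event streams.
class StreamDeltaParser {
  private buffer = "";
  done = false;

  // Feed one raw text chunk; returns the content fragments it completed.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element may be an incomplete line; keep it for the next call.
    this.buffer = lines.pop() ?? "";

    const fragments: string[] = [];
    for (const rawLine of lines) {
      const line = rawLine.trim();
      if (!line.startsWith("data:")) continue; // skip blank separator lines
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") {
        this.done = true; // end-of-stream sentinel in the OpenAI format
        break;
      }
      const event = JSON.parse(payload) as {
        choices?: { delta?: { content?: string } }[];
      };
      const text = event.choices?.[0]?.delta?.content;
      if (text) fragments.push(text);
    }
    return fragments;
  }
}

// Simulated network chunks; the second event is split across two chunks.
const chunks = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\ndata: {"choi',
  'ces":[{"delta":{"content":"lo"}}]}\n\n',
  "data: [DONE]\n\n",
];

const parser = new StreamDeltaParser();
const received: string[] = [];
for (const chunk of chunks) {
  received.push(...parser.push(chunk));
}
const fullText = received.join(""); // "Hello"
```

In practice an OpenAI-compatible SDK handles this parsing for you. The sketch is only meant to show that tokens arrive as small deltas you can render as soon as each one lands.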

We can’t wait to see what you build.


