OpenAI Voice Stack: Realtime-2 vs Translate vs Whisper

OpenAI split its voice API into three priced-separately models: GPT-Realtime-2 for reasoning, Translate for cross-language, Whisper for transcription. The routing decision is now the unit-economics decision.

OpenAI did not ship a voice model in May. It shipped three: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Each one is scoped to a single job, and the pricing pages list each one separately. That is the news. The marketing framing is "a complete voice stack." The engineering framing is "you now have a routing decision."

For anyone building a voice agent today, this matters more than the headline benchmark numbers. The cheapest model in this lineup costs a fraction of the most capable one per minute, and OpenAI built that gap on purpose. Send every utterance through GPT-Realtime-2 and the unit economics will collapse on the first ten thousand calls. Route by job and the same workload becomes viable.

Below is the at-a-glance shape, then a per-model breakdown, then a decision framework for when to mix them and when to look elsewhere — including at incumbents like Eleven Labs, Cartesia, and Assembly AI that already occupy parts of this stack.

The three-model voice stack at a glance

Here is the comparison readers actually want. Pricing reflects OpenAI's published rates as of May 2026. Latency and use-case rows summarise the published positioning and the practical implications.

Model	Primary job	Audio input / output pricing	Latency profile	Best fit
GPT-Realtime-2	Reasoning + tool-use voice agent	~$32 / 1M input, ~$64 / 1M output audio tokens	Tunable: minimal → xhigh reasoning levels	Agents that decide, plan, and call tools mid-conversation
GPT-Realtime-Translate	Voice-to-voice translation	Realtime-tier audio pricing	Sustained conversational latency across language pairs	Cross-language meetings, support, field operations
GPT-Realtime-Whisper	Streaming transcription only	Per-minute transcription tier (cheapest of the three)	Lowest — designed for live captioning	Captions, notes, listening front-ends for other agents

One immediate consequence: a well-designed voice product probably calls more than one of these. The captioning layer wants Whisper. The reasoning layer wants Realtime-2. The translation feature, if you have one, wants Translate. Treating the lineup as a single model is the mistake the pricing structure is trying to prevent.

Why OpenAI fragmented its own voice product

The previous Realtime API was one model trying to do everything — listen, think, talk back, occasionally translate, sometimes transcribe. That works for demos and breaks for production. Customers running high-volume support, captioning, or multilingual workflows were paying full reasoning-tier audio prices for jobs that did not need any reasoning at all.

Splitting the stack into three lets OpenAI price each tier against the workload it actually serves. It also lets them ship better single-purpose models. Whisper streaming gets to focus on word error rate and time-to-first-token. Translate gets to focus on language-pair quality. Realtime-2 gets to focus on tool use and agentic recovery without dragging transcription accuracy along for the ride.

The cost to developers is that you now have to think about routing. The benefit is that the routing decision is also the unit-economics decision.

GPT-Realtime-2

This is the headline model. It brings what OpenAI calls "GPT-5-class reasoning" to real-time voice — meaning planning, tool-calling, recovering from interruptions, and handling longer agentic workflows without falling out of conversational cadence. Developers pick from five reasoning levels (minimal through xhigh), with low as the default to keep latency tight on simple turns.

Strengths. Top-scoring on Big Bench Audio and Audio MultiChallenge versus its own predecessor. Handles tool calls inside a voice turn, which is the capability nothing else in the OpenAI lineup can do. Tunable reasoning depth means you can keep simple turns cheap and reserve xhigh for the moments that need it.

The honest limitation. Audio output at the realtime tier is expensive when you measure per-minute cost on long calls. Sixty seconds of assistant TTS is roughly 1,200 audio output tokens, which at $64 per million is meaningful when you run it across thousands of concurrent agents. Reasoning levels above "low" make this worse.

Pick Realtime-2 if your agent makes decisions, calls tools, or has to recover gracefully when a user interrupts mid-sentence.
Pick Realtime-2 if conversational fidelity matters more than per-minute cost — sales agents, premium support, healthcare intake.
Skip Realtime-2 if the workload is "listen and write down what was said." That is Whisper's job, not this one's.

GPT-Realtime-Translate

Translate is the model that quietly changes the calculus for global support and field-ops teams. It is built for the voice-to-voice case: speaker A talks in their language, speaker B hears it in theirs, the model holds context across code-switches and domain terms. OpenAI reported double-digit word-error-rate gains over alternatives on Hindi, Tamil, and Telugu evals, with latency low enough to sustain a real conversation.

Strengths. Purpose-built for the multilingual real-time path, which historically required stitching a transcription model, a translation model, and a TTS model together — each stitch adding latency and breaking on code-switches. Translate collapses that into one call.

The honest limitation. It is a translation model. It does not reason, it does not call tools, and you should not ask it to. If your product needs the translated content to also drive an agentic workflow, you are running Translate plus Realtime-2, not Translate alone.

Pick Translate if you run cross-language support, telemedicine, or field operations where two people need to speak to each other in different languages in real time.
Pick Translate if you previously assembled a transcription → translation → TTS pipeline and are tired of the latency budget.
Skip Translate if all speakers share a language. You are paying for a capability you are not using.

GPT-Realtime-Whisper

Whisper streaming is the workhorse. It is a transcription model, not a conversational one — it listens, it writes down what was said, it streams the tokens out as they arrive. The use cases OpenAI emphasises are captions for meetings and broadcasts, live notes during customer-support calls, and front-ending other agents that need a continuously updated text stream of what the user is saying.

Strengths. Cheapest of the three by a wide margin. Lowest latency, by design. The right primitive when you have other infrastructure handling the response side — for example, a custom voice from Eleven Labs or Resemble AI, or an existing LLM stack that does not need to live inside OpenAI's voice models.

The honest limitation. It does one thing. There is no built-in response generation, no tool use, no translation. If you want a complete voice agent, Whisper is the input stage and you are responsible for everything downstream.

Pick Whisper if you need live captions, meeting notes, or call transcripts at scale.
Pick Whisper if your stack already has a strong response layer and you only need a better listening layer than what you were using.
Pick Whisper if per-minute economics matter and the use case does not require reasoning inside the audio path.

What the routing decision actually looks like

The naive build wires every call through Realtime-2. It works in demos and ruins margins in production. A better default is to ask, for each turn of the conversation, what job is being done.

If the job is "understand what the user just said and write it down," route to Whisper. If the job is "respond in the same conversational turn, possibly with a tool call," route to Realtime-2. If the job is "convert speaker A's words into speaker B's language," route to Translate. The same call session can use more than one model — and most production-grade voice agents will.

This is the same routing logic that has driven LLM cost optimisation for the past two years, just applied to audio. You would not run every chat completion through GPT-4 if a Haiku-class model would do. The voice equivalent has now arrived.

How the OpenAI stack compares to the incumbents

OpenAI is not the only vendor with credible voice infrastructure. The decision is not just "which OpenAI model" but "which OpenAI model versus which specialist."

Eleven Labs remains the strongest standalone TTS option, especially when voice cloning, emotional range, or multilingual output matters more than tight LLM integration. If your product's voice identity is a brand asset, the OpenAI stack alone is not enough — you pair Whisper or Realtime-2 with Eleven Labs for the spoken response.

Cartesia competes hardest on latency for the synthesis layer, with a streaming architecture that gets first audio out faster than most alternatives. Cartesia plus Whisper is a real configuration for teams where every hundred milliseconds counts.

Assembly AI sits directly across from Whisper on the transcription side, with strong real-time accuracy and rich metadata (speaker diarisation, sentiment, summarisation). For teams that need post-call analytics as well as live captions, Assembly AI is often the easier integration than running Whisper plus a separate analytics pass.

Resemble AI and Typecast address the voice-cloning and synthetic-actor end of the market, which OpenAI's stack does not really serve.

Three honest scenarios to ground the decision

Reading model cards in isolation is not how the choice gets made. The choice gets made by mapping the model to the workload.

Scenario one: a customer-support voice agent that books appointments

The agent has to listen, understand intent, call calendar APIs, and respond. This is Realtime-2 territory, ideally at the "low" reasoning level by default with bumps to "medium" for trickier scheduling logic. Trying to do this with Whisper plus a separate LLM plus a separate TTS is technically possible and operationally painful — you give up the interruption handling that makes the conversation feel natural.

Scenario two: a live-captioning product for webinars

Pure Whisper. No reasoning required, no tool use, no response generation. Per-minute economics matter because the workload is continuous and the margins are thin. The cheapest of the three models is also the right one.

Scenario three: a multilingual telehealth platform

Translate for the patient-clinician conversation. Whisper for the recorded notes that need to be transcribed for the chart. Possibly Realtime-2 for the triage agent that screens patients before they reach the clinician. Three models, three jobs, one product.

The 90-second decision framework

Two questions narrow this fast.

One: does the voice path need to make decisions or call tools mid-conversation? If yes, Realtime-2 is your spine. If no, you almost certainly want Whisper as the input stage and something else (or nothing else) for output.

Two: do the speakers share a language? If no, Translate goes between them. If yes, skip it.

A third question, only if the first two do not settle it: is the response voice a brand asset? If yes, pair whichever OpenAI model handles the input with an external TTS like Eleven Labs or Cartesia. If the response voice is generic, Realtime-2's built-in TTS is fine.

What to do this week

If you already have a voice agent in production on the previous Realtime API, run a one-day audit: log every audio turn, classify it by job (transcription, reasoning, translation), and add up what each bucket would cost under the new three-model pricing. The savings are usually enough to justify the routing work within a quarter. If you are starting fresh, build the routing layer before you build the agent. Retrofitting it later costs more than designing for it on day one.

OpenAI's Three-Model Voice Stack Forces a Hard Routing Decision

The three-model voice stack at a glance

Why OpenAI fragmented its own voice product

GPT-Realtime-2

GPT-Realtime-Translate

GPT-Realtime-Whisper

What the routing decision actually looks like

How the OpenAI stack compares to the incumbents

Three honest scenarios to ground the decision

Scenario one: a customer-support voice agent that books appointments

Scenario two: a live-captioning product for webinars

Scenario three: a multilingual telehealth platform

The 90-second decision framework

What to do this week

Discussion

Author

Recent Posts

Qwen's Open-Source Bait-and-Switch: What the Max-Preview Pivot Costs Buyers

Sierra's $15B Valuation Is a Stress Test for AI Customer Support Buyers

More from the Blog