by OpenAI · GPT-5 family · best for frontier agentic coding and computer use
GPT-5.5 is OpenAI's flagship foundation model, released 2026-04-23 (API 2026-04-24) — the first fully retrained base since GPT-4.5 and the spine of ChatGPT's "super app" direction. It is the strongest single model OpenAI ships for end-to-end agentic work: top-of-leaderboard on SWE-bench Verified (88.7%) and Terminal-Bench 2.0 (82.7%), with a configurable reasoning effort dial spanning `none` through `xhigh`. The one-sentence buyer's take: if you need the best agentic coding and computer-use loop money can buy and can architect around roughly 2x the cost of GPT-5.4, this is the pick. - Provider: OpenAI - Release: 2026-04-23 (API 2026-04-24) - Status: GA - Context: 1,050,000 tokens (input pricing rises past 272K) - Max output: 128,000 tokens - Modalities: text + image in, text out - Knowledge cutoff: 2025-12 - Headline price: $5.00 in / $30.00 out per 1M tokens
| Benchmark | Score | Source |
|---|---|---|
| Humanity's Last Exam | 41.4% | llm-stats.com 2026-04-23T00:00:00.000Z |
| MMMU | 81.2% | llm-stats.com 2026-04-23T00:00:00.000Z |
| MMLU-Pro | 92.4% | tokenmix.ai 2026-04-23T00:00:00.000Z |
| TAU-bench | 98% | openai.com 2026-04-23T00:00:00.000Z |
| LMArena Elo | 1476 | presenc.ai 2026-05-01T00:00:00.000Z |
| GPQA Diamond | 93.6% | llm-stats.com 2026-04-23T00:00:00.000Z |
| Terminal-Bench | 82.7% | tech-insider.org 2026-04-23T00:00:00.000Z |
| MRCR Long Context | 74% | nipralo.com 2026-04-23T00:00:00.000Z |
| SWE-bench Verified | 88.7% | openai.com 2026-04-23T00:00:00.000Z |
| Artificial Analysis Index | 60 | artificialanalysis.ai 2026-05-01T00:00:00.000Z |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“The most defensible frontier bet in OpenAI's lineup — if you can swallow 2x the token cost and the Responses-API lock-in.”
GPT-5.5 is the safest choice today for frontier-quality agent loops with operational maturity. It is clearly differentiated from GPT-5.4 on agentic and coding evals, and ChatGPT's distribution gives OpenAI a moat no rival has. The risks are familiar: lock-in through the Responses API and bespoke tools (apply_patch, skills, MCP semantics) that do not port cleanly to Anthropic or Google, plus a pricing step that forces a cost conversation per workload. Roadmap confidence is high given the cadence (5.4 to 5.5 in six weeks). For most enterprises, standardize the escalation tier here and the default below.
“OpenAI's wedge is agentic coding plus computer use — that's where 5.5 wins the market, not on raw conversational charm.”
In market terms, GPT-5.5 wins the agentic-developer and coding-tool segment decisively at release — SWE-bench Verified and Terminal-Bench leadership plus the Codex-powered serving story are a credible moat. Where it does not win is conversational preference (LMArena puts Claude Opus 4.7 ahead) and native multimodal generation (Gemini 3 Pro is stronger). The differentiation is the end-to-end tool surface and ChatGPT distribution, not any single benchmark. Market timing is excellent: shipping six weeks after 5.4 kept OpenAI ahead of the Claude and Gemini cadence and reset the frontier narrative.
“List price is the trap; cached input and Batch are the only way the math survives at frontier-tier traffic.”
$5 in / $30 out is 2x GPT-5.4 and 4x the original GPT-5. The math only works if cached input (90% off to $0.50) and Batch (50% off to $2.50/$15) are designed in from day one. A well-architected app — prefix caching plus Batch for non-interactive work — can land effective spend near GPT-5.4 list. But uncached interactive traffic gets expensive fast, the hidden reasoning tokens inflate output billing, and the >272K full-session overage is a real budget trap. Predictability is decent because tiering is transparent; value-per-dollar is mediocre at frontier intelligence prices.
“Best DX in the family — one SDK call, a reasoning dial, and tool use that actually recovers from errors mid-loop.”
The Responses API is the clear primary surface, structured outputs are reliable, and the reasoning-effort dial means the same call covers chat-grade and deep-work-grade tasks. Tool use is materially more stable than GPT-5.4 — fewer dropped function calls and better recovery from tool errors in long agent loops. Hosted shell, apply_patch, and computer use are transformative for coding agents. Irritants: no fine-tuning, the 272K cliff must live in your token budgeting, and verbose reasoning can blow past expected output sizes. SDK coverage across Python/TS/Java/Go/.NET is best-in-class.
“Replies are crisper and it hallucinates less, but heavy requests stall while it thinks — that wait is the real cost.”
For ChatGPT users GPT-5.5 is the default flagship and the gap is felt most in agentic tasks — research, multi-step actions, code projects. Reply quality is noticeably crisper, factual hallucinations are down, and refusals on legitimate professional questions are less common than GPT-5.4. Latency at default reasoning is fine for chat; bumping to high or xhigh produces visible thinking time but better answers. Image understanding feels integrated. The main UX costs: heavier requests stall while the model reasons, and the free tier has no access at all.
“It's the best agent model, sure — but the '#1 index' headline hides that it's pricey, verbose, and not the conversational leader.”
The frontier claims hold up where OpenAI measures them — agentic coding and the AA Index — but the marketing elides three things. First, GPT-5.5 trails Claude Opus 4.7 on LMArena for actual human conversational preference, so "smartest model" is benchmark-conditional. Second, the "matches GPT-5.4 latency" line ignores a ~64s TTFT at xhigh and a 75M-token eval run that screams verbosity (and reasoning-token billing). Third, several flagship benchmarks (AIME, LiveCodeBench, MMLU) aren't separately published for 5.5, so the suite is thinner than it looks. The 60% hallucination-reduction figure is internal and unaudited. Real, but priced and hyped at a premium.
- Multi-step agentic loops mixing coding, browsing, and computer use in one session. - Repository-level code edits where SWE-bench-class performance is the gating constraint. - Long-document synthesis under 272K tokens where hallucination cost is high. - Enterprise assistants needing reliable structured output and tool routing. - Consolidating a GPT-5.4-plus-specialized-agent stack into a single model.
Exactly 2x on input ($5 vs $2.50) and 2x on output ($30 vs $15). Cached input and Batch each cut that substantially, so effective cost depends heavily on your caching ratio.
No. GPT-5.5 takes image input but outputs text only; image generation routes to gpt-image-2 and video to Sora-class models.
Hidden reasoning tokens count as output tokens even though they are not returned. At high/xhigh effort a short answer can bill several times its visible length.
No, not by API default. Opt-out and zero-retention options exist for enterprise.
Azure OpenAI and Azure AI Foundry, plus first-party OpenAI API. OpenRouter proxies it.
The entire session is billed at 2x input / 1.5x output. Chunk deliberately if you cross that line.
Not as of 2026-05-28.
Does not train on API inputs by default
Last verified 2026-05-27