GPT-5.5

GALatest Pro

by OpenAI · GPT-5 family · best for frontier agentic coding and computer use

FrontierReasoningCodingagenticLong-ContextMultimodal
8.7
AI Panel Score
Value 6.5/10

GPT-5.5 is OpenAI's flagship foundation model, released 2026-04-23 (API 2026-04-24) — the first fully retrained base since GPT-4.5 and the spine of ChatGPT's "super app" direction. It is the strongest single model OpenAI ships for end-to-end agentic work: top-of-leaderboard on SWE-bench Verified (88.7%) and Terminal-Bench 2.0 (82.7%), with a configurable reasoning effort dial spanning `none` through `xhigh`. The one-sentence buyer's take: if you need the best agentic coding and computer-use loop money can buy and can architect around roughly 2x the cost of GPT-5.4, this is the pick. - Provider: OpenAI - Release: 2026-04-23 (API 2026-04-24) - Status: GA - Context: 1,050,000 tokens (input pricing rises past 272K) - Max output: 128,000 tokens - Modalities: text + image in, text out - Knowledge cutoff: 2025-12 - Headline price: $5.00 in / $30.00 out per 1M tokens

What's new

  • GPT-5.5 is a clean generational step over GPT-5.4. The headline gains: SWE-bench Verified jumps from the low 80s to 88.7%, Terminal-Bench 2.0 from 75.1% to 82.7%, ARC-AGI-2 from 73.3% to 85.0%, and 1M-token long-context reasoning leaps from 36.6% to 74.0%. It adds an `xhigh` reasoning tier above the prior `none`/`low`/`medium`/`high` ladder, and OpenAI reports roughly a 60% reduction in hallucination rate versus GPT-5.4 on internal evals. The knowledge cutoff advances four months to December 2025. Notably, OpenAI used Codex to rewrite its own serving infrastructure before launch, so GPT-5.5 matches GPT-5.4 per-token latency despite far higher intelligence, and uses fewer tokens to complete the same task.

Benchmarks

BenchmarkScoreSource
Humanity's Last Exam41.4%llm-stats.com 2026-04-23T00:00:00.000Z
MMMU81.2%llm-stats.com 2026-04-23T00:00:00.000Z
MMLU-Pro92.4%tokenmix.ai 2026-04-23T00:00:00.000Z
TAU-bench98%openai.com 2026-04-23T00:00:00.000Z
LMArena Elo1476presenc.ai 2026-05-01T00:00:00.000Z
GPQA Diamond93.6%llm-stats.com 2026-04-23T00:00:00.000Z
Terminal-Bench82.7%tech-insider.org 2026-04-23T00:00:00.000Z
MRCR Long Context74%nipralo.com 2026-04-23T00:00:00.000Z
SWE-bench Verified88.7%openai.com 2026-04-23T00:00:00.000Z
Artificial Analysis Index60artificialanalysis.ai 2026-05-01T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker9/10
The most defensible frontier bet in OpenAI's lineup — if you can swallow 2x the token cost and the Responses-API lock-in.

GPT-5.5 is the safest choice today for frontier-quality agent loops with operational maturity. It is clearly differentiated from GPT-5.4 on agentic and coding evals, and ChatGPT's distribution gives OpenAI a moat no rival has. The risks are familiar: lock-in through the Responses API and bespoke tools (apply_patch, skills, MCP semantics) that do not port cleanly to Anthropic or Google, plus a pricing step that forces a cost conversation per workload. Roadmap confidence is high given the cadence (5.4 to 5.5 in six weeks). For most enterprises, standardize the escalation tier here and the default below.

Strategic Fit 9Vendor Risk 7Roadmap Confidence 9
Pros
  • leader on agentic coding
  • strong roadmap
  • deep tooling
Cons
  • lock-in
  • 2x cost
  • text-only output
Right for: enterprises building agentic products
Avoid if: you need portability across providers or open weights
Domain Strategist8.8/10
OpenAI's wedge is agentic coding plus computer use — that's where 5.5 wins the market, not on raw conversational charm.

In market terms, GPT-5.5 wins the agentic-developer and coding-tool segment decisively at release — SWE-bench Verified and Terminal-Bench leadership plus the Codex-powered serving story are a credible moat. Where it does not win is conversational preference (LMArena puts Claude Opus 4.7 ahead) and native multimodal generation (Gemini 3 Pro is stronger). The differentiation is the end-to-end tool surface and ChatGPT distribution, not any single benchmark. Market timing is excellent: shipping six weeks after 5.4 kept OpenAI ahead of the Claude and Gemini cadence and reset the frontier narrative.

Competitive Positioning 9Differentiation 8Market Timing 9
Pros
  • clear agentic-coding moat
  • distribution
  • cadence
Cons
  • not the conversational leader
  • no image-gen
Right for: dev-tool and agent platforms
Avoid if: your wedge is multimodal creation
Finance Lead7/10
List price is the trap; cached input and Batch are the only way the math survives at frontier-tier traffic.

$5 in / $30 out is 2x GPT-5.4 and 4x the original GPT-5. The math only works if cached input (90% off to $0.50) and Batch (50% off to $2.50/$15) are designed in from day one. A well-architected app — prefix caching plus Batch for non-interactive work — can land effective spend near GPT-5.4 list. But uncached interactive traffic gets expensive fast, the hidden reasoning tokens inflate output billing, and the >272K full-session overage is a real budget trap. Predictability is decent because tiering is transparent; value-per-dollar is mediocre at frontier intelligence prices.

Cost Efficiency 6Pricing Transparency 8Value per Dollar 6
Pros
  • deep cache/batch discounts
  • clear tiers
Cons
  • 2x predecessor
  • reasoning-token inflation
  • overage cliff
Right for: workloads that exploit caching
Avoid if: high-volume uncached interactive traffic
Domain Practitioner9.2/10
Best DX in the family — one SDK call, a reasoning dial, and tool use that actually recovers from errors mid-loop.

The Responses API is the clear primary surface, structured outputs are reliable, and the reasoning-effort dial means the same call covers chat-grade and deep-work-grade tasks. Tool use is materially more stable than GPT-5.4 — fewer dropped function calls and better recovery from tool errors in long agent loops. Hosted shell, apply_patch, and computer use are transformative for coding agents. Irritants: no fine-tuning, the 272K cliff must live in your token budgeting, and verbose reasoning can blow past expected output sizes. SDK coverage across Python/TS/Java/Go/.NET is best-in-class.

API Ergonomics 9Tool/Agent Support 10Reliability 9
Pros
  • top tool surface
  • reasoning dial
  • broad SDKs
Cons
  • no fine-tuning
  • 272K cliff
  • verbose
Right for: agent and coding-tool builders
Avoid if: you need fine-tuning or fixed-length outputs
Power User9/10
Replies are crisper and it hallucinates less, but heavy requests stall while it thinks — that wait is the real cost.

For ChatGPT users GPT-5.5 is the default flagship and the gap is felt most in agentic tasks — research, multi-step actions, code projects. Reply quality is noticeably crisper, factual hallucinations are down, and refusals on legitimate professional questions are less common than GPT-5.4. Latency at default reasoning is fine for chat; bumping to high or xhigh produces visible thinking time but better answers. Image understanding feels integrated. The main UX costs: heavier requests stall while the model reasons, and the free tier has no access at all.

Output Quality 9Speed 8Everyday Usefulness 9
Pros
  • crisp answers
  • fewer hallucinations
  • integrated vision
Cons
  • think-time stalls
  • no free access
Right for: paying ChatGPT power users
Avoid if: you want instant replies on every query
Skeptic7.8/10
It's the best agent model, sure — but the '#1 index' headline hides that it's pricey, verbose, and not the conversational leader.

The frontier claims hold up where OpenAI measures them — agentic coding and the AA Index — but the marketing elides three things. First, GPT-5.5 trails Claude Opus 4.7 on LMArena for actual human conversational preference, so "smartest model" is benchmark-conditional. Second, the "matches GPT-5.4 latency" line ignores a ~64s TTFT at xhigh and a 75M-token eval run that screams verbosity (and reasoning-token billing). Third, several flagship benchmarks (AIME, LiveCodeBench, MMLU) aren't separately published for 5.5, so the suite is thinner than it looks. The 60% hallucination-reduction figure is internal and unaudited. Real, but priced and hyped at a premium.

Claim Accuracy 8Weakness Severity 7Hype vs Reality 7
Pros
  • genuine agentic leadership
  • real coding gains
Cons
  • not conversational leader
  • verbose/expensive
  • thin public suite
Right for: buyers who verify on their own tasks
Avoid if: you take "smartest model" at face value

Strengths

  • Best-in-class agentic coding: top SWE-bench Verified (88.7%) and Terminal-Bench 2.0 (82.7%) at release.
  • #1 Artificial Analysis Intelligence Index (60) at launch.
  • 1M-token long-context retention nearly doubled vs GPT-5.4 (74.0%).
  • Reasoning effort dial covers cheap chat through deep multi-minute work in one SKU.
  • Mature, native tool surface (apply_patch, hosted shell, computer use, MCP, skills, tool search).
  • ~60% fewer hallucinations than GPT-5.4 on internal factual evals.

Limitations

  • Output is text-only — image/audio/video generation routes to separate models.
  • Roughly 2x GPT-5.4 list price; naive migration is expensive without caching/batch discipline.
  • Above 272K input tokens, full-session rates rise to 2x input / 1.5x output.
  • No fine-tuning support.
  • LMArena Elo trails the top Claude Opus 4.7 tier for general conversational preference.
  • High TTFT and verbose reasoning at xhigh effort (generated 75M tokens in AA's eval run).

Best use cases

- Multi-step agentic loops mixing coding, browsing, and computer use in one session. - Repository-level code edits where SWE-bench-class performance is the gating constraint. - Long-document synthesis under 272K tokens where hallucination cost is high. - Enterprise assistants needing reliable structured output and tool routing. - Consolidating a GPT-5.4-plus-specialized-agent stack into a single model.

Buyer questions

How much more does GPT-5.5 cost than GPT-5.4?

Exactly 2x on input ($5 vs $2.50) and 2x on output ($30 vs $15). Cached input and Batch each cut that substantially, so effective cost depends heavily on your caching ratio.

Does it generate images?

No. GPT-5.5 takes image input but outputs text only; image generation routes to gpt-image-2 and video to Sora-class models.

What is the reasoning-token billing gotcha?

Hidden reasoning tokens count as output tokens even though they are not returned. At high/xhigh effort a short answer can bill several times its visible length.

Is my data used for training?

No, not by API default. Opt-out and zero-retention options exist for enterprise.

Which clouds host it?

Azure OpenAI and Azure AI Foundry, plus first-party OpenAI API. OpenRouter proxies it.

What happens past 272K input tokens?

The entire session is billed at 2x input / 1.5x output. Chunk deliberately if you cross that line.

Can I fine-tune it?

Not as of 2026-05-28.

Comparable models

**Anthropic Claude Opus 4.7** — closest coding/agent peer; comparable SWE-bench, leads LMArena conversational preference, often preferred for long-form writing; loses on Terminal-Bench and OpenAI's tool ecosystem.
**Google Gemini 3 Pro** — comparable knowledge benchmarks, stronger native multimodality and image generation, weaker on agentic coding and tool-loop reliability.
**xAI Grok 4** — competitive on reasoning math and the strongest real-time-data wedge, but weaker tool-use reliability and a thinner enterprise compliance story.

Model specs

Input price
$5 / Mtok
Output price
$30 / Mtok
Cached input
$0.50 / Mtok
Batch (in/out)
$2.50 / $15
Context window
1.1M tokens
Max output
128K tokens
Knowledge cutoff
2025-12
Released
2026-04-22
Modalities
text, image → text
Output speed
~68.7 tok/s
License
Proprietary
Clouds
Azure OpenAI, Azure AI Foundry

Does not train on API inputs by default

Last verified 2026-05-27