GPT-5.4

by OpenAI · GPT-5 family · best for default production workhorse

FrontierReasoningCodingagenticLong-ContextMultimodal

8.7

AI Panel Score

Value 8.5/10

GPT-5.4 is OpenAI's mainstream workhorse, released 2026-03-05 — the cost-effective default that sits between the cheaper GPT-5.4 mini and the costlier GPT-5.5 flagship. It was the first OpenAI GA model with production-grade computer use (clearing human-expert level on OSWorld) and ships the full Responses API tool surface (apply_patch, hosted shell, tool search, skills). The one-sentence buyer's take: for most production builds this is the right default at half the price of GPT-5.5, with GPT-5.5 reserved as the targeted upgrade for the hardest agentic work.

Compare this model All GPT-5 versions

What's new

GPT-5.4 extended the context window to 1.05M tokens with a 272K break-point, added the none/low/medium/high/xhigh reasoning effort ladder (replacing the older reasoning-model split), and was the first OpenAI GA model with native computer use clearing >75% on OSWorld (above the ~72% human-expert baseline). It brought apply_patch, hosted shell, tool search, and skills to the Responses API. Versus its predecessor GPT-5, it is a clear capability step on coding, reasoning, and tool reliability, and it slotted in as the new mid-tier default. It has since been superseded on the hardest agentic/coding evals by GPT-5.5 but remains the value default.

Benchmarks

Benchmark	Score	Source
Humanity's Last Exam	39.8%	llm-stats.com 2026-03-05T00:00:00.000Z
MMMU	81.2%	llm-stats.com 2026-03-05T00:00:00.000Z
LMArena Elo	1484	presenc.ai 2026-05-01T00:00:00.000Z
GPQA Diamond	92.8%	llm-stats.com 2026-03-05T00:00:00.000Z
Terminal-Bench	75.1%	llm-stats.com 2026-03-05T00:00:00.000Z
MRCR Long Context	36.6%	nipralo.com 2026-03-05T00:00:00.000Z
Artificial Analysis Index	57	artificialanalysis.ai 2026-03-05T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker9/10

“The model most production architectures should standardize on — half the GPT-5.5 cost, inside the same envelope on non-frontier work.”

GPT-5.4 is the steady-state default. It is half the price of GPT-5.5 with capability that lands inside the same envelope on most non-frontier workloads, and production-grade computer use unlocks real RPA-replacement cases. The lock-in story is identical to the rest of OpenAI — Responses API, apply_patch, skills semantics — but at this price/quality point you accept it. The 272K pricing cliff is a real architectural constraint to design around with chunking. Treat GPT-5.4 as the default and GPT-5.5 as the targeted upgrade; roadmap confidence is high given the active cadence.

Strategic Fit 9Vendor Risk 7Roadmap Confidence 9

Pros

value default
production computer use
full tooling

Cons

lock-in
272K cliff
behind 5.5 on hardest tasks

Right for: most new production builds

Avoid if: you need frontier-max coding or portability

Domain Strategist8.7/10

“The volume sweet spot — OpenAI's distribution plus a price that makes agentic features affordable at scale beats rivals on TCO.”

In market terms, GPT-5.4 wins the high-volume agentic and coding segment on total cost of ownership: production computer use and the full tool surface at $2.50/$15 is a strong value position against Claude Sonnet 4.6 and Gemini 3 Flash. Differentiation is the tooling-plus-price combination and ChatGPT distribution, not raw benchmark leadership (GPT-5.5 and Claude Opus lead there). Market timing is excellent — it became the default the moment GPT-5.5's price doubled, so cost-aware teams gravitate here. Its risk is internal cannibalization as a future GPT-5.5 mini arrives.

Competitive Positioning 9Differentiation 8Market Timing 9

Pros

best TCO at tier
full tooling
distribution

Cons

not benchmark leader
future 5.5-mini risk

Right for: scale agentic deployments

Avoid if: you need the absolute frontier

Finance Lead9/10

“Where the unit economics finally work — half of GPT-5.5, 90% cache discount, and Batch take effective spend an order of magnitude below list.”

$2.50 in / $15.00 out is half GPT-5.5; cached input drops to $0.25 (-90%) and Batch to $1.25/$7.50. For backend pipelines with stable prompts, prefix caching plus Batch can land effective spend an order of magnitude below list. The 272K cliff is the budget trap — segment workloads so long-context calls are deliberate. Predictability is good because tiering is transparent and well-documented. This is the model to standardize on for cost-aware production traffic, with GPT-5.5 reserved for measured escalation. Value-per-dollar is the best among OpenAI's full-size models.

Cost Efficiency 9Pricing Transparency 9Value per Dollar 9

Pros

half 5.5 cost
deep cache/batch discounts
clear tiers

Cons

272K cliff
reasoning-token billing

Right for: cost-aware production

Avoid if: nothing material at this price/quality

Domain Practitioner9/10

“The model most developers actually ship on — mature tool calling, reliable structured output, apply_patch that makes code agents tractable.”

This is the workhorse developers ship on. Tool calling is mature, structured outputs are reliable, and the Responses API is the right primary surface. apply_patch makes code-edit agents tractable; computer use plus hosted shell means serious automation in a single SDK call. Versus GPT-5.5 the gap on hard coding tasks is real but most day-to-day tasks land identically. The reasoning-effort dial is the key DX upgrade — tune compute per request, not per model. Minor friction: no fine-tuning, the 272K context cliff to watch. SDK coverage across Python/TS/Java/Go/.NET is excellent.

API Ergonomics 9Tool/Agent Support 9Reliability 9

Pros

mature tooling
apply_patch
reasoning dial

Cons

no fine-tuning
272K cliff

Right for: most production developers

Avoid if: you need fine-tuning or frontier-max coding

Power User8.5/10

“On ChatGPT Plus this handles most queries fast and well — the gap to 5.5 only shows on the hardest agent and code tasks.”

For ChatGPT Plus users, GPT-5.4 is the workhorse that handles most queries fast. Latency at default reasoning is good, refusals are reasonable, and conversation quality is high. The gap versus GPT-5.5 shows up only on the hardest agentic and code tasks; for everyday drafting, research, image discussion, and light coding it is essentially indistinguishable from the flagship. The knowledge cutoff (2025-08) occasionally surfaces in time-sensitive questions, mitigated by web search.

Output Quality 8Speed 9Everyday Usefulness 9

Pros

fast
high everyday quality
web-search mitigates cutoff

Cons

behind 5.5 on hardest tasks
older knowledge

Right for: everyday ChatGPT Plus use

Avoid if: you constantly hit frontier-hard problems

Skeptic8/10

“A great value model, but 'first to beat human experts on OSWorld' is a narrow benchmark — and 1M context is mostly a spec, not a capability.”

The adversarial read: GPT-5.4 is genuinely the value pick, but two marketing claims deserve scrutiny. The "beats human experts on OSWorld" headline is one benchmark on a constrained task set — real-world computer use is far messier. And the 1.05M context is largely a spec: at 1M tokens retention is only 36.6%, so anything beyond the 272K standard tier is both expensive and degraded. The public benchmark trail is also thinner than GPT-5.5's (no separate MMLU-Pro, AIME, SWE-bench Verified figure). None of this undermines the value case — it just means the headline numbers oversell the long-context and computer-use stories.

Claim Accuracy 8Weakness Severity 7Hype vs Reality 8

Pros

real value
honest mid-tier positioning

Cons

oversold long-context
narrow OSWorld claim
thin public suite

Right for: buyers who size context to 272K

Avoid if: you need true 1M-token fidelity

Strengths

Mid-tier pricing with frontier-adjacent capability — the obvious default SKU.
Best-in-class computer use at GA (first to clear human-expert on OSWorld).
Full Responses API tool surface, including apply_patch and skills.
1.05M context handles long documents and repos (standard tier to 272K).
Reasoning-effort dial covers chat and deep work in one SKU.
Half the list price of GPT-5.5 with comparable quality on most non-frontier tasks.

Limitations

Superseded on the hardest agentic and coding benchmarks by GPT-5.5.
1M-token long-context retention (36.6%) is far below GPT-5.5 (74.0%).
Above 272K input tokens, full-session rates rise to 2x input / 1.5x output.
No fine-tuning support.
Text-only output; image/audio/video routed to separate models.
Knowledge cutoff (2025-08) is four months older than GPT-5.5.

Best use cases

Default production SKU for agents, coding, and analytical workloads where GPT-5.5 cost is unjustified.
Computer-use automation (RPA, browser agents) where OSWorld-class reliability matters.
Long-context document analysis under 272K tokens.
High-volume backend pipelines paired with Batch + cached input for unit economics.
Tiered routing: GPT-5.4 mini for cheap traffic, GPT-5.4 for default, GPT-5.5 for escalation.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture Capabilities Benchmark analysis Speed & latency Pricing analysis Deployment & access Safety & privacy Ecosystem & tooling

Architecture

OpenAI discloses no parameter counts, layer counts, or dense/MoE structure — all null. It is a unified reasoning model with text + image input and text-only output, o200k_base tokenizer, and a 1.05M context window with a 272K pricing break-point. The notable engineering disclosure is operational, not structural: GPT-5.4 was the first GA model to make computer use reliable at human-expert level on OSWorld.

Capabilities

GPT-5.4 is the mid-tier workhorse, justifying its 9.0 coding and 9.2 agentic scores — SWE-bench in the low 80s, full Responses API tool surface, and the first production-grade computer use (OSWorld ~75%). Reasoning (9.1) and math (9.0) are anchored by GPQA Diamond 92.8% and ARC-AGI-2 73.3%. Long-context (8.3) reflects the 1.05M window with a usable 272K standard tier, though 1M-token retention (36.6% MRCR-style) is materially weaker than GPT-5.5's 74.0% — a real gap at the extreme. Multilingual (8.8), instruction-following (9.1), and function-calling (9.3) are strong via the Responses API. Vision (8.4) and document/OCR (8.2) handle image input but not generation (routes to gpt-image-2). Real-time data (6.8) is web-search-dependent. The reasoning-effort dial lets one SKU cover chat and deep work.

Benchmark analysis

Benchmark	Score	vs Predecessor (GPT-5)	vs Top Competitor (GPT-5.5)	Source
GPQA Diamond	92.8%	up	-0.8pp vs 5.5 (93.6%)	llm-stats
SWE-bench Pro	57.7%	up	-0.9pp vs 5.5 (58.6%)	nxcode
Terminal-Bench 2.0	75.1%	up	-7.6pp vs 5.5 (82.7%)	llm-stats
ARC-AGI-2	73.3%	up	-11.7pp vs 5.5 (85.0%)	llm-stats
OSWorld	~75%	first GA model above human-expert	leader at release	nxcode
MMMU-Pro	81.2%	up	flat vs 5.5	llm-stats
HLE (no tools)	39.8%	up	-1.6pp vs 5.5 (41.4%)	llm-stats
1M-token long-context	36.6%	up	-37.4pp vs 5.5 (74.0%)	nipralo
LMArena Elo	1484 (high)	up	peer with 5.5-high (1484)	presenc
Artificial Analysis Index	57 (xhigh)	up	-3 vs 5.5 (60)	Artificial Analysis

MMLU-Pro, AIME 2025, MATH-500, SWE-bench Verified (exact), HumanEval, LiveCodeBench, Aider Polyglot, IFEval, BBH, Tau-bench, and SimpleQA have no separately-published GPT-5.4 figure as of 2026-05-28 and are recorded null.

Speed & latency

GPT-5.4 is faster than GPT-5.5 — top providers serve it at ~86–92 tokens/sec (Azure 91.8 t/s, OpenAI 86.3 t/s) at xhigh reasoning. At default/low effort, time-to-first-token is roughly 1–2 seconds and it feels interactive; at xhigh the TTFT climbs substantially because of the reasoning pass. latency_tier is medium overall — interactive for chat and low-effort tool calls, slow for deep reasoning runs. For high-volume backend pipelines, Batch removes latency from the equation entirely.

Pricing analysis

Surface	Cost	Notes
API input (standard)	$2.50 / 1M tok	up to 272K input
API output (standard)	$15.00 / 1M tok	up to 272K input
Cached input	$0.25 / 1M tok	90% discount
API input (> 272K)	$5.00 / 1M tok	2x overage rate
API output (> 272K)	$22.50 / 1M tok	1.5x overage rate
Batch (in/out)	$1.25 / $7.50	50% off, 24h SLA
Flex (in/out)	$1.25 / $7.50	variable latency
Priority (in/out)	$5.00 / $30.00	2x list, low queue
Direct UI	$20/mo (Plus), $100–$200/mo (Pro)
Free tier	none
Rate limits	15,000 RPM / 40M TPM	Tier 5

Reasoning-token note: hidden reasoning tokens bill as output tokens. At higher effort tiers, factor a multiple of visible output into the cost.

Deployment & access

API-only via the Responses API; not open-weights, license Proprietary, not self-hostable. Cloud-managed via Azure OpenAI and Azure AI Foundry; OpenRouter proxies it. Data residency covers US and EU. The full Responses API tool surface — including apply_patch, hosted shell, computer use, skills, and tool search — is native, which is both the strength and the lock-in surface.

Safety & privacy

Governed by OpenAI's Preparedness Framework. No training on API inputs by default; opt-out and zero-retention available for enterprise. Compliance covers SOC2, GDPR, CCPA, and HIPAA (BAA). Content moderation is built in. Refusal calibration is reasonable — false refusals on legitimate professional topics are uncommon.

Ecosystem & tooling

First-party SDKs in Python, TypeScript, Java, Go, and .NET, plus the OpenAI Agents SDK. Framework integrations span LangChain, LlamaIndex, Vercel AI SDK, and Pydantic AI. It is a default tier in ChatGPT and powers GitHub Copilot and Codex. Popularity tier: dominant.

Buyer questions

Should I default to GPT-5.4 or GPT-5.5?

Default to GPT-5.4 — it is half the price and comparable on most non-frontier work. Escalate to GPT-5.5 only for the hardest agentic/coding tasks.

Is the 1M context usable?

The standard tier to 272K is genuinely usable; beyond that, costs rise 2x/1.5x and retention drops to ~36.6%. Treat 1M as overflow, not a workhorse window.

Does it generate images?

No — image input only; generation routes to gpt-image-2.

How cheap can I get it?

Cached input is $0.25 (-90%) and Batch is $1.25/$7.50 (-50%). A well-cached pipeline lands far below list.

Can I fine-tune it?

Not as of 2026-05-28.

Is my data used for training?

No, not by API default; enterprise opt-out and zero-retention exist.

Comparable models

GPT-5.5 — OpenAI

frontier sibling, 2x price, clear edge on the hardest agent and coding tasks and far better 1M-token retention.

Claude Sonnet 4.6 — Anthropic

direct peer at this tier, often preferred for long-form writing and conversational quality; comparable pricing.

Gemini 3 Flash — Google

comparable pricing, stronger native multimodality, weaker agentic coding and tool-loop reliability.

Sources

Primary references used to verify this review.

Model specs

Input price: $2.50 / Mtok
Output price: $15 / Mtok
Cached input: $0.25 / Mtok
Batch (in/out): $1.25 / $7.50
Context window: 1.1M tokens
Max output: 128K tokens
Knowledge cutoff: 2025-08
Released: 2026-03-04
Modalities: text, image → text
Output speed: ~86.3 tok/s
License: Proprietary
Clouds: Azure OpenAI, Azure AI Foundry

Does not train on API inputs by default

Other GPT-5 versions

Last verified 2026-05-27