GPT-5.4

GA

by OpenAI · GPT-5 family · best for default production workhorse

FrontierReasoningCodingagenticLong-ContextMultimodal
8.7
AI Panel Score
Value 8.5/10

GPT-5.4 is OpenAI's mainstream workhorse, released 2026-03-05 — the cost-effective default that sits between the cheaper GPT-5.4 mini and the costlier GPT-5.5 flagship. It was the first OpenAI GA model with production-grade computer use (clearing human-expert level on OSWorld) and ships the full Responses API tool surface (apply_patch, hosted shell, tool search, skills). The one-sentence buyer's take: for most production builds this is the right default at half the price of GPT-5.5, with GPT-5.5 reserved as the targeted upgrade for the hardest agentic work. - Provider: OpenAI - Release: 2026-03-05 - Status: GA - Context: 1,050,000 tokens (input pricing rises past 272K) - Max output: 128,000 tokens - Modalities: text + image in, text out - Knowledge cutoff: 2025-08 - Headline price: $2.50 in / $15.00 out per 1M tokens

What's new

  • GPT-5.4 extended the context window to 1.05M tokens with a 272K break-point, added the `none`/`low`/`medium`/`high`/`xhigh` reasoning effort ladder (replacing the older reasoning-model split), and was the first OpenAI GA model with native computer use clearing >75% on OSWorld (above the ~72% human-expert baseline). It brought apply_patch, hosted shell, tool search, and skills to the Responses API. Versus its predecessor GPT-5, it is a clear capability step on coding, reasoning, and tool reliability, and it slotted in as the new mid-tier default. It has since been superseded on the hardest agentic/coding evals by GPT-5.5 but remains the value default.

Benchmarks

BenchmarkScoreSource
Humanity's Last Exam39.8%llm-stats.com 2026-03-05T00:00:00.000Z
MMMU81.2%llm-stats.com 2026-03-05T00:00:00.000Z
LMArena Elo1484presenc.ai 2026-05-01T00:00:00.000Z
GPQA Diamond92.8%llm-stats.com 2026-03-05T00:00:00.000Z
Terminal-Bench75.1%llm-stats.com 2026-03-05T00:00:00.000Z
MRCR Long Context36.6%nipralo.com 2026-03-05T00:00:00.000Z
Artificial Analysis Index57artificialanalysis.ai 2026-03-05T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker9/10
The model most production architectures should standardize on — half the GPT-5.5 cost, inside the same envelope on non-frontier work.

GPT-5.4 is the steady-state default. It is half the price of GPT-5.5 with capability that lands inside the same envelope on most non-frontier workloads, and production-grade computer use unlocks real RPA-replacement cases. The lock-in story is identical to the rest of OpenAI — Responses API, apply_patch, skills semantics — but at this price/quality point you accept it. The 272K pricing cliff is a real architectural constraint to design around with chunking. Treat GPT-5.4 as the default and GPT-5.5 as the targeted upgrade; roadmap confidence is high given the active cadence.

Strategic Fit 9Vendor Risk 7Roadmap Confidence 9
Pros
  • value default
  • production computer use
  • full tooling
Cons
  • lock-in
  • 272K cliff
  • behind 5.5 on hardest tasks
Right for: most new production builds
Avoid if: you need frontier-max coding or portability
Domain Strategist8.7/10
The volume sweet spot — OpenAI's distribution plus a price that makes agentic features affordable at scale beats rivals on TCO.

In market terms, GPT-5.4 wins the high-volume agentic and coding segment on total cost of ownership: production computer use and the full tool surface at $2.50/$15 is a strong value position against Claude Sonnet 4.6 and Gemini 3 Flash. Differentiation is the tooling-plus-price combination and ChatGPT distribution, not raw benchmark leadership (GPT-5.5 and Claude Opus lead there). Market timing is excellent — it became the default the moment GPT-5.5's price doubled, so cost-aware teams gravitate here. Its risk is internal cannibalization as a future GPT-5.5 mini arrives.

Competitive Positioning 9Differentiation 8Market Timing 9
Pros
  • best TCO at tier
  • full tooling
  • distribution
Cons
  • not benchmark leader
  • future 5.5-mini risk
Right for: scale agentic deployments
Avoid if: you need the absolute frontier
Finance Lead9/10
Where the unit economics finally work — half of GPT-5.5, 90% cache discount, and Batch take effective spend an order of magnitude below list.

$2.50 in / $15.00 out is half GPT-5.5; cached input drops to $0.25 (-90%) and Batch to $1.25/$7.50. For backend pipelines with stable prompts, prefix caching plus Batch can land effective spend an order of magnitude below list. The 272K cliff is the budget trap — segment workloads so long-context calls are deliberate. Predictability is good because tiering is transparent and well-documented. This is the model to standardize on for cost-aware production traffic, with GPT-5.5 reserved for measured escalation. Value-per-dollar is the best among OpenAI's full-size models.

Cost Efficiency 9Pricing Transparency 9Value per Dollar 9
Pros
  • half 5.5 cost
  • deep cache/batch discounts
  • clear tiers
Cons
  • 272K cliff
  • reasoning-token billing
Right for: cost-aware production
Avoid if: nothing material at this price/quality
Domain Practitioner9/10
The model most developers actually ship on — mature tool calling, reliable structured output, apply_patch that makes code agents tractable.

This is the workhorse developers ship on. Tool calling is mature, structured outputs are reliable, and the Responses API is the right primary surface. apply_patch makes code-edit agents tractable; computer use plus hosted shell means serious automation in a single SDK call. Versus GPT-5.5 the gap on hard coding tasks is real but most day-to-day tasks land identically. The reasoning-effort dial is the key DX upgrade — tune compute per request, not per model. Minor friction: no fine-tuning, the 272K context cliff to watch. SDK coverage across Python/TS/Java/Go/.NET is excellent.

API Ergonomics 9Tool/Agent Support 9Reliability 9
Pros
  • mature tooling
  • apply_patch
  • reasoning dial
Cons
  • no fine-tuning
  • 272K cliff
Right for: most production developers
Avoid if: you need fine-tuning or frontier-max coding
Power User8.5/10
On ChatGPT Plus this handles most queries fast and well — the gap to 5.5 only shows on the hardest agent and code tasks.

For ChatGPT Plus users, GPT-5.4 is the workhorse that handles most queries fast. Latency at default reasoning is good, refusals are reasonable, and conversation quality is high. The gap versus GPT-5.5 shows up only on the hardest agentic and code tasks; for everyday drafting, research, image discussion, and light coding it is essentially indistinguishable from the flagship. The knowledge cutoff (2025-08) occasionally surfaces in time-sensitive questions, mitigated by web search.

Output Quality 8Speed 9Everyday Usefulness 9
Pros
  • fast
  • high everyday quality
  • web-search mitigates cutoff
Cons
  • behind 5.5 on hardest tasks
  • older knowledge
Right for: everyday ChatGPT Plus use
Avoid if: you constantly hit frontier-hard problems
Skeptic8/10
A great value model, but 'first to beat human experts on OSWorld' is a narrow benchmark — and 1M context is mostly a spec, not a capability.

The adversarial read: GPT-5.4 is genuinely the value pick, but two marketing claims deserve scrutiny. The "beats human experts on OSWorld" headline is one benchmark on a constrained task set — real-world computer use is far messier. And the 1.05M context is largely a spec: at 1M tokens retention is only 36.6%, so anything beyond the 272K standard tier is both expensive and degraded. The public benchmark trail is also thinner than GPT-5.5's (no separate MMLU-Pro, AIME, SWE-bench Verified figure). None of this undermines the value case — it just means the headline numbers oversell the long-context and computer-use stories.

Claim Accuracy 8Weakness Severity 7Hype vs Reality 8
Pros
  • real value
  • honest mid-tier positioning
Cons
  • oversold long-context
  • narrow OSWorld claim
  • thin public suite
Right for: buyers who size context to 272K
Avoid if: you need true 1M-token fidelity

Strengths

  • Mid-tier pricing with frontier-adjacent capability — the obvious default SKU.
  • Best-in-class computer use at GA (first to clear human-expert on OSWorld).
  • Full Responses API tool surface, including apply_patch and skills.
  • 1.05M context handles long documents and repos (standard tier to 272K).
  • Reasoning-effort dial covers chat and deep work in one SKU.
  • Half the list price of GPT-5.5 with comparable quality on most non-frontier tasks.

Limitations

  • Superseded on the hardest agentic and coding benchmarks by GPT-5.5.
  • 1M-token long-context retention (36.6%) is far below GPT-5.5 (74.0%).
  • Above 272K input tokens, full-session rates rise to 2x input / 1.5x output.
  • No fine-tuning support.
  • Text-only output; image/audio/video routed to separate models.
  • Knowledge cutoff (2025-08) is four months older than GPT-5.5.

Best use cases

- Default production SKU for agents, coding, and analytical workloads where GPT-5.5 cost is unjustified. - Computer-use automation (RPA, browser agents) where OSWorld-class reliability matters. - Long-context document analysis under 272K tokens. - High-volume backend pipelines paired with Batch + cached input for unit economics. - Tiered routing: GPT-5.4 mini for cheap traffic, GPT-5.4 for default, GPT-5.5 for escalation.

Buyer questions

Should I default to GPT-5.4 or GPT-5.5?

Default to GPT-5.4 — it is half the price and comparable on most non-frontier work. Escalate to GPT-5.5 only for the hardest agentic/coding tasks.

Is the 1M context usable?

The standard tier to 272K is genuinely usable; beyond that, costs rise 2x/1.5x and retention drops to ~36.6%. Treat 1M as overflow, not a workhorse window.

Does it generate images?

No — image input only; generation routes to gpt-image-2.

How cheap can I get it?

Cached input is $0.25 (-90%) and Batch is $1.25/$7.50 (-50%). A well-cached pipeline lands far below list.

Can I fine-tune it?

Not as of 2026-05-28.

Is my data used for training?

No, not by API default; enterprise opt-out and zero-retention exist.

Comparable models

**OpenAI GPT-5.5** — frontier sibling, 2x price, clear edge on the hardest agent and coding tasks and far better 1M-token retention.
**Anthropic Claude Sonnet 4.6** — direct peer at this tier, often preferred for long-form writing and conversational quality; comparable pricing.
**Google Gemini 3 Flash** — comparable pricing, stronger native multimodality, weaker agentic coding and tool-loop reliability.

Model specs

Input price
$2.50 / Mtok
Output price
$15 / Mtok
Cached input
$0.25 / Mtok
Batch (in/out)
$1.25 / $7.50
Context window
1.1M tokens
Max output
128K tokens
Knowledge cutoff
2025-08
Released
2026-03-04
Modalities
text, image → text
Output speed
~86.3 tok/s
License
Proprietary
Clouds
Azure OpenAI, Azure AI Foundry

Does not train on API inputs by default

Last verified 2026-05-27