Claude Opus 4.7

GA

by Anthropic · Claude 4 family · best for agentic coding at the frontier

Frontier · Reasoning · Coding · Multimodal · Long-Context
AI Panel Score: 9.0/10 · Value: 7.5/10

Claude Opus 4.7 is Anthropic's most capable generally available model, released April 16, 2026 as the flagship of the Claude 4 family. Its headline is a step-change in agentic coding: SWE-bench Verified rises to 87.6% (from 80.8% on Opus 4.6) and SWE-bench Pro to 64.3%, and it tops the LMArena coding leaderboard. For a buyer, the one-sentence summary: if you are building long-horizon coding or computer-use agents and accuracy matters more than latency, Opus 4.7 is the default choice on Anthropic, at unchanged $5/$25 pricing.

  • Provider: Anthropic
  • Released: 2026-04-16
  • Status: GA (current flagship)
  • Context window: 1,000,000 tokens
  • Max output: 128,000 tokens (300k on Batch API with the output-300k beta header)
  • Modalities: text, image (vision up to 2,576 px long edge, ~3.75 MP)
  • Knowledge cutoff: January 2026 (both reliable and training cutoff)
  • Headline price: $5 input / $25 output per 1M tokens

What's new

  • Step-change in agentic coding: SWE-bench Verified 80.8% to 87.6%, SWE-bench Pro 53.4% to 64.3%, Terminal-Bench 2.0 65.4% to 69.4%, OSWorld-Verified 72.7% to 78.0%.
  • New tokenizer that can consume up to ~35% more tokens for identical text — a real cost-modeling change despite flat per-token pricing.
  • Vision pipeline accepts images roughly 3x larger than Opus 4.6 (2,576 px long edge), making PDF/slide/screenshot reasoning genuinely reliable; CharXiv with tools 84.7% to 91.0%.
  • New `xhigh` effort level and task budgets (beta) for finer control over thinking spend.
  • The model verifies its own outputs and follows instructions more literally; some prompts tuned for prior Opus generations may need updating.
  • State-of-the-art Finance Agent (64.4%) and strong cyber-defense eval (CyberGym 73.1%).
  • BrowseComp regressed slightly (83.7% on 4.6 to 79.3%), the one benchmark that did not improve.
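To see the size of each change at a glance, the before/after figures quoted in the list above can be turned into percentage-point deltas. This is only a small arithmetic sketch; the dictionary and function names are illustrative, and the scores are copied from the bullets, not independently verified.

```python
# Scores in percent as quoted in the list above: (Opus 4.6, Opus 4.7).
SCORES = {
    "SWE-bench Verified": (80.8, 87.6),
    "SWE-bench Pro": (53.4, 64.3),
    "Terminal-Bench 2.0": (65.4, 69.4),
    "OSWorld-Verified": (72.7, 78.0),
    "CharXiv (with tools)": (84.7, 91.0),
    "BrowseComp": (83.7, 79.3),
}


def deltas(scores):
    """Return {benchmark: change in percentage points}, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    # Largest improvement first; the single regression sorts last.
    ranked = sorted(deltas(SCORES).items(), key=lambda kv: kv[1], reverse=True)
    for name, delta in ranked:
        print(f"{name:<22} {delta:+.1f} pts")
```

Sorted this way, SWE-bench Pro shows the largest gain and BrowseComp is the only negative entry.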

Benchmarks

| Benchmark | Score | Source | Date |
| --- | --- | --- | --- |
| Humanity's Last Exam | 54.7% | anthropic.com | 2026-04-16 |
| MMMU | 91.5% | vellum.ai | 2026-04-16 |
| LMArena Elo | 1503 | openlm.ai | 2026-05-28 |
| GPQA Diamond | 94.2% | vellum.ai | 2026-04-16 |
| Terminal-Bench | 69.4% | vellum.ai | 2026-04-16 |
| LMArena Coding Elo | 1554 | openlm.ai | 2026-05-28 |
| SWE-bench Verified | 87.6% | anthropic.com | 2026-04-16 |
| Artificial Analysis Index | 57 | artificialanalysis.ai | 2026-05-28 |

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker · 9/10
Opus 4.7 is the safest frontier bet on Anthropic: flat price, real multi-cloud, and the agentic-coding lead that justifies the tier.

For a one-to-two-year platform bet, Opus 4.7 is the lowest-risk Anthropic landing pad. Pricing held at $5/$25, it is GA across Anthropic, Bedrock, Vertex, and Foundry, and the coding lead is defensible against GPT-5.5 and Gemini 3.1. The structural risk is lock-in: prompts tuned to Opus 4.7's literal instruction-following and new tokenizer do not port one-to-one to other vendors, and the tokenizer change resets cost models. Roadmap confidence is high given Anthropic's cadence. For frontier coding agents this is the default; chat surfaces should tier down to Sonnet or Haiku.

Strategic Fit 9 · Vendor Risk 7 · Roadmap Confidence 9
Pros
  • Frontier coding, flat price, multi-cloud failover, mature governance
Cons
  • Lock-in via tokenizer/prompt tuning
  • High latency
Right for: orgs standardizing serious agentic workloads on Anthropic
Avoid if: you need vendor-neutral portability or sub-second UX
Domain Strategist · 9/10
Anthropic owns the agentic-coding narrative, and Opus 4.7 is the proof point that keeps Cursor, Windsurf, and Copilot in the fold.

In market terms, Opus 4.7 cements Anthropic's moat in code and agents — the segment with the stickiest enterprise spend and the clearest ROI story. Topping the LMArena coding board and SWE-bench while holding price is a positioning win that competitors must answer on capability, not discounting. The differentiation is narrowing on raw intelligence (GPT-5.5 leads the AA Index at 60 vs 57), so Anthropic's edge is increasingly "best at doing work," not "highest IQ." Market timing is strong: launched into a cycle where agentic coding is the dominant enterprise use case.

Competitive Positioning 9 · Differentiation 8 · Market Timing 10
Pros
  • Clear category leadership in coding/agents
  • ecosystem gravity
Cons
  • Intelligence-index lead ceded to GPT-5.5
  • browse regressed
Right for: teams betting on the agentic-coding wave
Avoid if: your wedge is raw reasoning or real-time data
Finance Lead · 8/10
Headline price is unchanged, but the new tokenizer is a stealth line-item increase of up to 35% that you must model before you commit.

At $5/$25 the rate card matches Opus 4.6 and 4.5, and cache reads ($0.50) plus batch ($2.50/$12.50) cut stable-workload bills sharply. The trap is the tokenizer: identical text can bill up to 35% more tokens, so per-task TCO rises even though per-token price did not. The 1M context at flat pricing is a genuine win — no long-context premium. Fast Mode at 6x must be ring-fenced to UX-critical calls. Value-per-dollar is good for hard work, mediocre for routine work that Sonnet or Haiku would handle at a fraction of the cost.

Cost Efficiency 7 · Pricing Transparency 8 · Value per Dollar 7
Pros
  • Flat rate card, deep cache/batch discounts, flat 1M-context pricing
Cons
  • Tokenizer inflates effective cost
  • overkill for simple tasks
Right for: high-value agentic jobs with good caching
Avoid if: budget-sensitive high-volume traffic
Domain Practitioner · 9.5/10
The SDK is unchanged, tool use just works, and the diffs are cleaner — Opus 4.7 closes real PRs, not toy issues.

For a hands-on builder, Opus 4.7 is the smoothest agent target Anthropic ships. The bash, text-editor, and computer-use scaffolds from Opus 4.6 keep working; streaming, structured output, and prompt caching behave identically. SWE-bench Verified 87.6% and SWE-bench Pro 64.3% translate to noticeably fewer "almost right" diffs in real workflows, and the `xhigh` effort level and task budgets give finer control over thinking cost. The one sharp edge is the tokenizer: cost estimators and context budgeters built for 4.6 silently under-count. Docs are excellent and the Claude Agent SDK is first-class.

API Ergonomics 9 · Tool/Agent Support 10 · Reliability 9
Pros
  • Stable API surface, best-in-class agent tooling, fewer broken diffs
Cons
  • Tokenizer breaks old budgets
  • latency hurts tight loops
Right for: developers shipping autonomous coding agents
Avoid if: you need fast, cheap, simple completions
Power User · 8.5/10
Conversation quality is excellent and refusals are rare, but you feel every second of the thinking latency.

For a heavy daily user, Opus 4.7 is the most capable conversational Claude — concise when asked, coherent across very long sessions, and far less prone to over-refusal than 2024-era Claude. Vision genuinely works on pasted screenshots and PDFs. The trade-off is speed: time-to-first-token in max-effort mode is long and output streams at ~55 t/s, so it feels deliberate, not snappy. For deep work that is fine; for rapid back-and-forth it is frustrating, and most users will keep Sonnet 4.6 for everyday chat and reach for Opus 4.7 on hard problems.

Output Quality 9.5 · Speed 6 · Everyday Usefulness 8.5
Pros
  • Top-tier answers, low refusal, strong vision
Cons
  • Slow first token
  • verbose by default
Right for: power users on hard, high-stakes tasks
Avoid if: you want instant, lightweight replies
Skeptic · 7.5/10
A frontier coder that GPT-5.5 out-thinks on the index, slowed to a crawl, with a tokenizer that quietly raises your bill by up to 35%.

The agentic-coding numbers are real and independently corroborated, so the headline holds. But three claims deserve scrutiny. First, "most capable" is selective: GPT-5.5 xhigh leads the Artificial Analysis Index 60 to 57, and Gemini 3.1 Pro ties at 57, so Opus 4.7's lead is in coding and agents, not general intelligence. Second, the "unchanged pricing" framing hides an effective cost increase of up to ~35% from the new tokenizer. Third, BrowseComp regressed, and the absence of clean AIME/MMLU-Pro disclosures makes the reasoning story partly unverifiable. SWE-bench is also the most prompt-sensitive and scaffold-sensitive benchmark in the suite, so gains there partly reflect scaffold engineering rather than the model alone. It is excellent at code; treat the "frontier everything" gloss with caution.

Claim Accuracy 7.5 · Weakness Severity 6.5 · Hype vs Reality 7
Pros
  • Coding leadership is genuine
  • governance is real
Cons
  • Not the smartest by index
  • slow
  • stealth cost rise
  • one benchmark regressed
Right for: skeptics who need verifiable coding wins
Avoid if: you take "most capable model" at face value

Strengths

  • Best-in-class agentic coding on real-world SWE-bench tasks; sustains multi-hour engineering work with limited human-in-the-loop.
  • 1M-token context at standard pricing — whole-monorepo or whole-trial-transcript prompts are cost-competitive.
  • High-resolution vision unlocks reliable PDF/slide/screenshot reasoning without external OCR.
  • More literal, self-verifying instruction following reduces "creative interpretation" failures.
  • Near-frontier scientific reasoning (GPQA Diamond 94.2%) and SOTA Finance Agent.

Limitations

  • New tokenizer means Opus-4.6 cost and context budgets silently under-count by up to ~35%.
  • High time-to-first-token and ~55 t/s output make it the slowest-feeling current Claude; poor for snappy chat.
  • BrowseComp regressed vs 4.6 (79.3% vs 83.7%) — heavy browse-and-synthesize pipelines may not see uniform uplift.
  • No explicit extended-thinking toggle; effort levels are a different mental model for some teams.
  • No clean published AIME 2025 / MMLU-Pro figure, so those two standard reasoning benchmarks cannot be compared directly.
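The under-counting risk in the first limitation can be guarded against with a conservative pre-flight check. The sketch below is illustrative only: it assumes the worst case (35% more tokens) applies to the whole prompt and that the reserved output counts against the same 1M window. The function names are hypothetical, and a real budgeter should use token counts measured on the new tokenizer, not a flat multiplier.

```python
CONTEXT_WINDOW = 1_000_000  # tokens, from the spec sheet
MAX_OUTPUT = 128_000        # tokens, standard (non-batch) output limit
WORST_CASE_PCT = 35         # "up to ~35%" more tokens for identical text


def worst_case_tokens(old_token_count: int, inflation_pct: int = WORST_CASE_PCT) -> int:
    """Upper-bound estimate of the new token count, rounded up (integer math)."""
    return -(-old_token_count * (100 + inflation_pct) // 100)


def fits(old_prompt_tokens: int, reserved_output: int = MAX_OUTPUT) -> bool:
    """True if the inflated prompt plus the reserved output stays inside the window.

    Assumption: input and output share the same context window.
    """
    return worst_case_tokens(old_prompt_tokens) + reserved_output <= CONTEXT_WINDOW


if __name__ == "__main__":
    for old in (600_000, 700_000):
        print(old, worst_case_tokens(old), fits(old))
```

Under these assumptions, a prompt that measured 700,000 tokens on the old tokenizer no longer leaves room for a full-length response, even though it looks comfortably inside a 1M window.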

Best use cases

  • Long-horizon coding agents (multi-file refactors, autonomous PR loops, end-to-end feature builds in Claude Code).
  • Computer-use and browser-control agents where each step is expensive and accuracy outweighs throughput.
  • Heavy-document workflows (legal review, financial analysis, scientific literature) where 1M context plus high-res vision are real differentiators.
  • Hard reasoning where extra thinking budget actually moves the answer (GPQA-Diamond-difficulty research synthesis).

Buyer questions

How much will the new tokenizer actually cost me?

Per-token price is unchanged at $5/$25, but identical text can tokenize up to ~35% larger. Re-run your cost model on representative prompts before committing volume.
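As a rough way to bracket the answer, the sketch below computes a best-case and worst-case cost per task from token counts measured on the old tokenizer. It assumes the inflation applies equally to input and output, which the source does not state; the function names are illustrative. Actual inflation depends on your text, so measured counts on representative prompts should replace the multiplier.

```python
INPUT_PRICE = 5.00    # USD per 1M input tokens
OUTPUT_PRICE = 25.00  # USD per 1M output tokens


def task_cost(input_tokens: float, output_tokens: float) -> float:
    """USD cost of one task at the standard rate card."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


def cost_range(old_input_tokens: int, old_output_tokens: int,
               max_inflation: float = 0.35) -> tuple[float, float]:
    """Return (best_case, worst_case) USD cost for one task.

    best_case assumes no token inflation; worst_case assumes the full
    "up to ~35%" inflation on both input and output (an assumption).
    """
    factor = 1 + max_inflation
    best = task_cost(old_input_tokens, old_output_tokens)
    worst = task_cost(old_input_tokens * factor, old_output_tokens * factor)
    return best, worst


if __name__ == "__main__":
    low, high = cost_range(200_000, 20_000)
    print(f"per task: ${low:.3f} to ${high:.3f}")
```

For a task that previously used 200,000 input and 20,000 output tokens, the bracket is roughly $1.50 to $2.03 per run under these assumptions.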

Is it worth migrating from Opus 4.6?

For coding and agents, yes — the SWE-bench Pro and OSWorld gains are large. For chat or copy, the lift is marginal and 4.6 may cost less per task.

Can I run it without latency pain?

Use Batch API for non-interactive work (latency irrelevant, 50% off) or Fast Mode (6x price) for interactive needs; otherwise expect deliberate response times.
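The trade-off between the three options can be compared with simple multipliers on the standard rate card. This sketch assumes Fast Mode's 6x applies uniformly to input and output, which the source does not specify; the batch factor follows from the listed $2.50/$12.50 batch prices. The names are illustrative.

```python
INPUT_PRICE = 5.00    # USD per 1M input tokens (standard)
OUTPUT_PRICE = 25.00  # USD per 1M output tokens (standard)

# Price multipliers relative to the standard rate card.
MODE_MULTIPLIER = {
    "standard": 1.0,
    "batch": 0.5,  # 50% off, non-interactive
    "fast": 6.0,   # assumption: 6x on both input and output
}


def mode_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """USD cost of one request under the given serving mode."""
    if mode not in MODE_MULTIPLIER:
        raise ValueError(f"unknown mode: {mode!r}")
    base = (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
    return base * MODE_MULTIPLIER[mode]


if __name__ == "__main__":
    for mode in MODE_MULTIPLIER:
        print(f"{mode:<9} ${mode_cost(100_000, 10_000, mode):.3f}")
```

Under these assumptions a Fast Mode request costs twelve times its batch equivalent, which is why reserving it for latency-critical calls matters.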

What about data security?

No training on API inputs by default; SOC 2 Type II, ISO 27001/42001, HIPAA BAA, and GDPR all covered. Zero-retention and US data residency are available.

Which clouds?

First-party Claude API plus Bedrock, Vertex AI, and Microsoft Foundry, with regional endpoints for data-residency needs.

How does it handle 1M-token prompts?

Served at standard per-token pricing with no long-context premium; caching and batch discounts apply across the full window.
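To illustrate how much caching matters at this scale, the sketch below estimates input cost for repeated calls that share a long prefix. It is a simplification with stated assumptions: the first call is billed at the normal input rate (the source gives no cache-write price), the cache never expires between calls, and output tokens are ignored. The function name is hypothetical.

```python
INPUT_PRICE = 5.00   # USD per 1M uncached input tokens
CACHED_PRICE = 0.50  # USD per 1M cached input tokens (cache reads)


def session_input_cost(prompt_tokens: int, cached_prefix_tokens: int, calls: int) -> float:
    """USD input cost for `calls` requests that share a cacheable prefix.

    Assumptions: the first call pays the standard input rate for the whole
    prompt; later calls pay the cached rate on the prefix and the standard
    rate on the remainder.
    """
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and the prompt size")
    if calls < 1:
        raise ValueError("calls must be at least 1")
    first = prompt_tokens * INPUT_PRICE
    fresh = prompt_tokens - cached_prefix_tokens
    repeat = cached_prefix_tokens * CACHED_PRICE + fresh * INPUT_PRICE
    return (first + (calls - 1) * repeat) / 1_000_000


if __name__ == "__main__":
    uncached = session_input_cost(1_000_000, 0, calls=10)
    cached = session_input_cost(1_000_000, 900_000, calls=10)
    print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

With a 900,000-token shared prefix, ten calls over a 1M-token prompt cost about $13.55 in input instead of $50.00 under these assumptions.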

Comparable models

GPT-5.5 · OpenAI

Leads the Artificial Analysis Intelligence Index (60 vs 57) and many reasoning evals; Opus 4.7 counters on SWE-bench and agentic coding leadership and on LMArena coding Elo.

Gemini 3.1 Pro · Google

Ties Opus 4.7 on the AA Index (57) and competes on long-context and multimodal; generally behind on SWE-bench Pro and agentic coding.

Claude Sonnet 4.6 · Anthropic

Same family; ~8 pts behind on SWE-bench Verified (79.6%) at 60% of the input cost and with lower latency. The right default for everything that is not frontier-hard.

Model specs

Input price: $5 / Mtok
Output price: $25 / Mtok
Cached input: $0.50 / Mtok
Batch (in/out): $2.50 / $12.50 per Mtok
Context window: 1M tokens
Max output: 128K tokens
Knowledge cutoff: 2026-01
Released: 2026-04-16
Modalities: text, image → text
Output speed: ~54.6 tok/s
License: Proprietary
Clouds: Bedrock, Vertex AI, Microsoft Foundry

Does not train on API inputs by default

Last verified 2026-05-27