by Anthropic · Claude 4 family · best for prior frontier, stable production target
Claude Opus 4.6 was Anthropic's flagship from February 5, 2026 until Opus 4.7 superseded it in April. It remains fully supported and widely deployed: a frontier all-rounder with SWE-bench Verified 80.8%, GPQA Diamond 91.3%, a 1M-token context at standard pricing, and a tokenizer that stayed stable where Opus 4.7 introduced a new one. For a buyer, the one-sentence summary is this: a still-excellent frontier model whose main remaining advantages over 4.7 are prompt and tokenizer stability and the fact that identical text can cost less than on 4.7.

- Provider: Anthropic
- Released: 2026-02-05
- Status: GA (listed as legacy in Anthropic docs; superseded by Opus 4.7 but fully supported)
- Context window: 1,000,000 tokens
- Max output: 128,000 tokens (300k on Batch API beta)
- Modalities: text, image
- Knowledge cutoff: May 2025 reliable (training cutoff August 2025)
- Headline price: $5 input / $25 output per 1M tokens

| Benchmark | Score | Source (date) |
|---|---|---|
| Humanity's Last Exam | 53.1% | vellum.ai, 2026-02-05 |
| MMMU | 77.3% | vellum.ai, 2026-02-05 |
| MMLU-Pro | 88.3% | vellum.ai, 2026-02-05 |
| AIME 2025 | 85% | datacamp.com, 2026-02-05 |
| HumanEval | 95% | morphllm.com, 2026-02-05 |
| TAU-bench | 91.9% | vellum.ai, 2026-02-05 |
| LMArena Elo | 1490 | openlm.ai, 2026-05-28 |
| GPQA Diamond | 91.3% | vellum.ai, 2026-02-05 |
| Terminal-Bench | 65.4% | vellum.ai, 2026-02-05 |
| MRCR Long Context | 76% | morphllm.com, 2026-02-05 |
| LMArena Coding Elo | 1535 | openlm.ai, 2026-05-28 |
| SWE-bench Verified | 80.8% | vellum.ai, 2026-02-05 |
| Artificial Analysis Index | 53 | artificialanalysis.ai, 2026-02-05 |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“Opus 4.6 is a solid production target, but the strategic question is now migration timing, not continued use.”
Opus 4.6 remains a strong production model, and Anthropic's own docs recommend migrating to 4.7. Pricing is identical, infra fit is unchanged, and the capability lift on 4.7 is real on agentic coding. The argument for staying is stability: Opus 4.7's new tokenizer and more-literal instruction following can shift behavior on tuned prompt suites, and 4.6's tokenizer keeps cost models intact. For most buyers the right call is a controlled migration over a quarter rather than treating 4.6 as a long-term destination.
“Opus 4.6 was the model that briefly gave Anthropic the #1 intelligence-index slot — its legacy is the price-parity ladder.”
Opus 4.6's market role was to take the top of the Artificial Analysis Index for the first time while holding the $5/$25 price Opus 4.5 had set, proving Anthropic could lead on raw intelligence without raising prices. That positioning compounded ecosystem trust. Today its differentiation is mostly historical: 4.7 owns the coding narrative and GPT-5.5 leads the index. Its remaining strategic value is as the stable rung on a price-parity ladder that lets teams adopt Opus capability without repeated cost renegotiation.
“Identical rate card to 4.7 — and because 4.6 keeps the old tokenizer, the same work can actually bill less here.”
On headline rates Opus 4.6 is a wash with 4.7 at $5/$25, with the same cache and batch discounts and the same 1M-context flat pricing. The wrinkle that favors 4.6 is the tokenizer: Opus 4.7's new tokenizer can use up to 35% more tokens for identical text, so for workloads where 4.6's output quality is sufficient, staying can be cheaper per equivalent task for a quarter or two while planning migration. There is no cost penalty to remaining on 4.6 short-term, and a real (if modest) cost reason to do so.
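The tokenizer argument above is simple arithmetic, and a rough sketch makes the size of the effect concrete. The prices and the 35% figure come from this review; the function names, the example token counts, and the choice to apply the same inflation factor to both input and output tokens are illustrative assumptions. The 35% is quoted as an upper bound, so real workloads will likely land somewhere below it.

```python
# List prices quoted in this review, in USD per 1M tokens (same for 4.6 and 4.7).
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00


def task_cost(input_tokens: int, output_tokens: int, token_inflation: float = 1.0) -> float:
    """Return the USD cost of one task at list price.

    token_inflation is a hypothetical multiplier on token counts:
    1.0 models the 4.6 tokenizer, 1.35 models the quoted worst case on 4.7.
    Applying it equally to input and output is an assumption of this sketch.
    """
    if token_inflation <= 0:
        raise ValueError("token_inflation must be positive")
    scaled_input = input_tokens * token_inflation
    scaled_output = output_tokens * token_inflation
    return (scaled_input * INPUT_PRICE_PER_MTOK
            + scaled_output * OUTPUT_PRICE_PER_MTOK) / 1_000_000


def savings_fraction(token_inflation: float) -> float:
    """Fraction of tokens saved by staying on the older tokenizer."""
    return 1.0 - 1.0 / token_inflation


if __name__ == "__main__":
    # Example task (made-up sizes): 200k input tokens, 8k output tokens.
    cost_old = task_cost(200_000, 8_000)
    cost_new_worst = task_cost(200_000, 8_000, token_inflation=1.35)
    print(f"4.6 tokenizer:       ${cost_old:.2f}")        # $1.20
    print(f"4.7 worst case:      ${cost_new_worst:.2f}")  # $1.62
    print(f"saved by staying:    {savings_fraction(1.35):.1%}")  # 25.9%
```

Note the asymmetry: "up to 35% more tokens" on 4.7 corresponds to at most about 26% fewer tokens on 4.6, since 1 - 1/1.35 is roughly 0.26.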
“Opus 4.6 was the first 'throw the whole repo at it' model and it still is — but new code I write on 4.7.”
For builders, Opus 4.6 was a great model and still is, but 4.7 has materially better coding numbers at the same price, so new code goes to 4.7. Where 4.6 stays useful is integrations tuned tightly to its instruction-following quirks, which do not always port cleanly. Adaptive thinking and four effort levels work well, tool use is identical to 4.7, and the 1M context made this generation the first practical whole-repo model. The honest framing: 4.6 is the previous best, 4.7 is the current best.
“On casual chat most users can't tell 4.6 from 4.7 — and some prefer 4.6's slightly warmer voice.”
For a consumer chat product, Opus 4.6 still delivers a polished experience: high conversation quality, calibrated refusals, working vision, and moderate latency with a lower time-to-first-token than 4.7's max-effort mode. The voice has a slight Anthropic warmth that some users prefer to 4.7's more direct, literal style. The May 2025 reliable knowledge cutoff is starting to feel dated for current events, but web search fills the gap. Most end users will not perceive a difference between 4.6 and 4.7 in casual use.
“A genuinely strong model now mostly notable for being cheaper-per-task than its own successor.”
Opus 4.6's benchmarks are real and were briefly best-in-class, so there is no hype problem with the model itself. The skeptical point is positioning: it has been superseded across nearly every eval by 4.7 at the same list price, and its main live advantage is the accident that its older tokenizer bills fewer tokens than 4.7's. The ARC-AGI-2 jump to 68.8% was impressive but is a single benchmark, and like all Opus models the latency is poor for interactive use. As a data point it is honest; as a purchase decision for new work it loses cleanly to 4.7.
- Production systems already integrated against Opus 4.6 where prompt and tokenizer stability beat the 4.7 capability lift.
- Long-context document workflows where the 1M-token window is the differentiator.
- Hard reasoning benchmarks (GPQA, AIME, ARC-AGI-2) where Opus-tier scores justify the cost.
- Agent teams and Claude Code workflows tuned to Opus 4.6 tooling.
For new builds, move to 4.7. For tuned production prompts, plan a controlled migration over a quarter — 4.6 stays fully supported meanwhile.
On rate card, identical; in practice, because Opus 4.7's new tokenizer can use up to 35% more tokens for the same text, 4.6 can bill up to roughly 26% fewer tokens.
Yes, at standard pricing with no premium, plus context compaction for long agents.
Yes — no training on inputs, SOC 2 Type II, ISO 27001/42001, HIPAA BAA, GDPR, data-residency options.
First-party Claude API plus Bedrock, Vertex AI, and Microsoft Foundry with regional endpoints.
1M context at the Opus tier, adaptive thinking with four effort levels, context compaction, and agent teams in Claude Code.
Does not train on API inputs by default
Last verified 2026-05-27