Grok 4.20

GA

by xAI · Grok 4 family · best for runtime reasoning toggle with long context and live X data

ReasoningLong-Context
6.9
AI Panel Score
Value 7.0/10

Grok 4.20 is xAI's reasoning model released to full GA on 2026-03-10 (beta 2026-02-17), and was the flagship between Grok 4 / Grok 4 Fast and the newer Grok 4.3. Its defining traits: a runtime reasoning toggle (exposed as separate `grok-4.20-0309-reasoning` and `grok-4.20-0309-non-reasoning` slugs plus a reasoning-effort parameter), a large context window, the best non-hallucination rate of any model at its release on Artificial Analysis's Omniscience benchmark, and xAI's live-X data access. The single sentence a buyer needs: it is a legacy-but-supported flagship whose niche today is a reasoning on/off switch plus long context with live data — most workloads are now better served by the cheaper, newer-trained Grok 4.3. Provider: xAI. Released: 2026-03-10. Status: GA. Context: 1M tokens (see note). Max output: undisclosed. Modalities: text + image in, text out. Knowledge cutoff: November 2024. Headline price: $1.25 / $2.50 per 1M tokens (repriced from launch $2 / $6).

What's new

  • This entry covers what Grok 4.20 introduced when it launched, and what has changed about it since:
  • **Runtime reasoning toggle** — reasoning effort became a switch rather than a separate model, shipped as `-reasoning` / `-non-reasoning` slugs and an effort parameter. One integration, two operating modes.
  • **Large context window** — headline 2M tokens at launch (largest xAI had shipped), used for full-codebase and multi-document analysis.
  • **Best-in-class non-hallucination at release** — 78% non-hallucination on Artificial Analysis Omniscience, the highest of any model tested at the time; Grok 4.20 still leads this single metric even after Grok 4.3.
  • **Strict prompt adherence** — improved instruction-following for production pipelines.
  • **Native agentic tool calling** with web + live X search built in.
  • **Changed since launch:** xAI repriced it from $2 / $6 down to $1.25 / $2.50 (matching Grok 4.3), and docs.x.ai now lists the reasoning/non-reasoning slugs at **1M** context rather than the launch's 2M (the 2M figure persists on Artificial Analysis and OpenRouter and on the multi-agent sibling). xAI now recommends Grok 4.3 as the default.

Benchmarks

BenchmarkScoreSource
IFEval81%Artificial Analysis (IFBench)2026-04-30T00:00:00.000Z
MATH-50087.3%xAI launch / secondary coverage2026-03-10T00:00:00.000Z
TAU-bench93%Artificial Analysis (tau-2-Bench Telecom; ~5pts below 4.3's 98)2026-04-30T00:00:00.000Z
LMArena Elo1491LMArena / LMSYS (grok-4.20-beta1, Mar-Apr 2026, top-4; +31 May mover)2026-04-30T00:00:00.000Z
GPQA Diamond78.5%xAI launch / secondary coverage2026-03-10T00:00:00.000Z
Artificial Analysis Index49artificialanalysis.ai 2026-05-28T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker7/10
A capable, supported legacy flagship — but if my workload fits 1M tokens, I migrate to the cheaper, newer Grok 4.3.

Strategically, Grok 4.20 is now a continuity choice rather than a new bet. It retains real value where a runtime reasoning toggle, long context, and best-in-class non-hallucination matter together — but xAI itself recommends 4.3 as the default, signaling where investment and roadmap attention go. Vendor risk is the same as the family: thin disclosure, no certs, no published safety framework. Lock-in is low (SDK compatibility). The decision is simple: keep 4.20 only for the specific niches; otherwise plan migration. Roadmap confidence in 4.20 specifically is declining as 4.3 absorbs its use cases.

Strategic Fit 6Vendor Risk 6Roadmap Confidence 6
Pros
  • Reasoning toggle
  • non-hallucination lead
  • live data
Cons
  • Superseded
  • older cutoff
  • pricing ambiguity
Right for: Teams needing the toggle or low-error summarization
Avoid if: Your workload fits 1M tokens and you can just use 4.3
Domain Strategist6.5/10
Its one durable edge is the lowest hallucination rate at launch — but the market has moved to 4.3, and that's where the moat now lives.

In positioning terms, Grok 4.20 carries the same structural moat as the family — live X data — but the differentiation has migrated to Grok 4.3, which is cheaper, fresher, and adds video. 4.20's distinct selling point is reliability: the best-in-class non-hallucination rate gives it a credible pitch to error-sensitive verticals. But market timing works against it: launched March, superseded by April's 4.3, repriced to match it. As a standalone competitive play it has limited runway; its role is to retain users until they migrate. Differentiation versus non-xAI rivals rests entirely on the X-data and reliability angles.

Competitive Positioning 6Differentiation 7Market Timing 5
Pros
  • Reliability niche
  • live-data moat
Cons
  • Out-positioned by its own successor
  • short runway
Right for: Reliability-first verticals
Avoid if: You want the model with momentum
Finance Lead6.5/10
Now repriced to match 4.3 — so there's no cost reason to choose 4.20 over the newer model unless you specifically need its niche.

At launch, 4.20's $2 / $6 made it a clear loser to 4.3's $1.25 / $2.50. xAI has since repriced 4.20 down to $1.25 / $2.50 / $0.20 cached on the docs card, erasing the cost penalty — but that also erases any financial reason to prefer it, since 4.3 is the same price with newer training and a higher AA Index. The lingering finance risk is the source conflict: AA still bills it mentally at $2 / $6 with $1.10 cached, so anyone modeling from aggregators will misprice. Rule for finance: confirm the live docs.x.ai rate, then ask whether 4.3 wouldn't simply be the better-value choice at identical pricing.

Cost Efficiency 7Pricing Transparency 5Value per Dollar 6
Pros
  • Now priced same as 4.3
Cons
  • No value edge over 4.3
  • cross-source pricing conflict
Right for: Teams already integrated who won't re-test
Avoid if: You're choosing fresh — pick 4.3
Domain Practitioner7.5/10
The reasoning toggle is the real win — ship one integration, flip a boolean between fast and deep, and parse cleaner output thanks to strict adherence.

For builders, 4.20's standout is the runtime reasoning toggle: one integration covers both a fast conversational mode and a deep reasoning mode, switched by a parameter or slug rather than a model swap. Strict prompt adherence means less defensive parsing code — when you say "JSON only," you get JSON. Tool calling, structured outputs, and the live-X search tool all work cleanly. The 1M context simplifies long-context plumbing. Friction: SDK/docs polish trails OpenAI and Anthropic, reasoning visibility is summary-only, and coding is better done elsewhere. Reliability is decent; rate limits are spend-tiered.

API Ergonomics 8Tool/Agent Support 8Reliability 7
Pros
  • Reasoning toggle
  • strict adherence
  • X-search tool
Cons
  • Thinner docs
  • weak coding
  • summary-only reasoning
Right for: Pipelines wanting one model, two modes
Avoid if: You need a coding backbone or low latency
Power User6.5/10
It was the standard Grok through April — looser, opinionated, live X data — but consumer surfaces have since moved everyone to 4.3.

For everyday users on grok.com or X Premium, Grok 4.20 was the default Grok experience through April 2026: the looser personality, ready opinions, and live-X integration that define the brand. Its 2M/1M context is invisible to typical chat use. Once 4.3 became the consumer default, users migrated automatically, so most direct 4.20 use today is via teams pinned to a specific API slug for stability. As a daily driver it was fine; it is simply no longer the one most people touch.

Output Quality 7Speed 7Everyday Usefulness 6
Pros
  • Personality
  • live data
  • faster first token than 4.3
Cons
  • Superseded on consumer surfaces
  • older knowledge
Right for: Users on a pinned slug
Avoid if: You just want the current default — that's 4.3
Skeptic6/10
The 2M-context headline is already contradicted by xAI's own current docs (1M), and the pricing it launched at quietly evaporated — read the live card, not the launch post.

Adversarially, Grok 4.20 is a case study in why xAI's thin disclosure matters. Its marquee launch claim — 2M context — now conflicts with xAI's own docs card showing 1M for the reasoning/non-reasoning slugs, while AA and OpenRouter still show 2M; nobody should cite a Grok context number without checking the live source. The launch pricing ($2 / $6) silently dropped to $1.25 / $2.50, so any cost analysis older than a few weeks is wrong. Architecture is undisclosed, no SWE-bench, no safety framework. The one claim that holds up well is the non-hallucination leadership (AA-Omniscience 78%) — that is independently sourced and genuinely strong.

Claim Accuracy 6Weakness Severity 6Hype vs Reality 6
Pros
  • Non-hallucination claim verifies
Cons
  • Context + price claims drift across sources
  • zero architecture transparency
Right for: Buyers who verify against live docs
Avoid if: You trust launch-post specs months later

Strengths

  • Runtime reasoning toggle — one integration, fast and reasoning modes.
  • Best non-hallucination rate at release (AA-Omniscience 78%), still the family leader on that metric.
  • Strict prompt adherence — production-friendly.
  • Long context (1M on docs; 2M on the multi-agent sibling and on AA/OpenRouter listings).
  • Native live X + web data access.

Limitations

  • Functionally superseded by Grok 4.3 for most workloads (cheaper to run, newer-trained, adds video, higher AA Index).
  • Older training cutoff (November 2024) — weaker on recent world events than Grok 4.3 (December 2025).
  • Mid-pack agentic Elo (1179) versus Grok 4.3's 1500.
  • Coding lags Claude Sonnet / Opus; no published SWE-bench Verified.
  • Live pricing/context conflict across sources creates cost-modeling ambiguity.
  • Thin benchmark transparency overall.

Best use cases

- **Production pipelines needing a runtime reasoning toggle** without maintaining two model integrations. - **Low-hallucination-critical summarization** (legal, medical) where the AA-Omniscience leadership matters more than the freshest knowledge. - **Long-context analysis** where 1M+ tokens with live-X access is the requirement. - **Continuity** for teams already pinned to `grok-4.20-0309` slugs who don't need video input.

Buyer questions

What does Grok 4.20 cost now?

xAI docs list $1.25 / $2.50 / $0.20 cached — repriced down from the launch $2 / $6. Artificial Analysis still shows the old $2 / $6 with $1.10 cached; trust docs.x.ai.

Is the context 1M or 2M?

xAI's docs card lists 1M for the reasoning/non-reasoning slugs; AA and OpenRouter still show 2M, and the multi-agent sibling is 2M. Verify against your account's live limits.

How is the reasoning toggle used?

Either pick the `-reasoning` vs `-non-reasoning` slug, or set the reasoning-effort parameter — one integration, two modes.

Should I use 4.20 or 4.3?

For almost everything, 4.3: same price, newer training, video, higher scores. Keep 4.20 only for its non-hallucination edge or if you're pinned for stability.

Does xAI train on my data?

API: only via irreversible opt-in data sharing. X consumer surface: by default, no opt-out.

Is it certified for enterprise compliance?

No SOC2/HIPAA/ISO certs are publicly verified on the direct API; route via a managed cloud if you need them.

Comparable models

**Grok 4.3** — xAI's own newer flagship: same price, newer cutoff (Dec 2025), adds video, higher AA Index (53 vs 49), much higher agentic Elo (1500 vs 1179); 4.20 only wins on non-hallucination rate and (per AA/OpenRouter) raw context size.
**Gemini 3.1 Pro** — Larger verified context with stronger multimodal breadth and higher AA Index; loses on live-X access.
**Claude Opus 4.7** — Better hard coding and reasoning ceiling and published safety; pricier and no real-time data.

Model specs

Input price
$1.25 / Mtok
Output price
$2.50 / Mtok
Cached input
$0.20 / Mtok
Batch (in/out)
Context window
1M tokens
Max output
— tokens
Knowledge cutoff
2024-11
Released
2026-03-09
Modalities
text, image → text
Output speed
~171.4 tok/s
License
Proprietary
Clouds
First-party API

Last verified 2026-05-27