by xAI · Grok 4 family · best for runtime reasoning toggle with long context and live X data
Grok 4.20 is xAI's reasoning model released to full GA on 2026-03-10 (beta 2026-02-17), and was the flagship between Grok 4 / Grok 4 Fast and the newer Grok 4.3. Its defining traits: a runtime reasoning toggle (exposed as separate `grok-4.20-0309-reasoning` and `grok-4.20-0309-non-reasoning` slugs plus a reasoning-effort parameter), a large context window, the best non-hallucination rate of any model at its release on Artificial Analysis's Omniscience benchmark, and xAI's live-X data access. The single sentence a buyer needs: it is a legacy-but-supported flagship whose niche today is a reasoning on/off switch plus long context with live data — most workloads are now better served by the cheaper, newer-trained Grok 4.3. Provider: xAI. Released: 2026-03-10. Status: GA. Context: 1M tokens (see note). Max output: undisclosed. Modalities: text + image in, text out. Knowledge cutoff: November 2024. Headline price: $1.25 / $2.50 per 1M tokens (repriced from launch $2 / $6).
| Benchmark | Score | Source |
|---|---|---|
| IFEval | 81% | Artificial Analysis (IFBench)2026-04-30T00:00:00.000Z |
| MATH-500 | 87.3% | xAI launch / secondary coverage2026-03-10T00:00:00.000Z |
| TAU-bench | 93% | Artificial Analysis (tau-2-Bench Telecom; ~5pts below 4.3's 98)2026-04-30T00:00:00.000Z |
| LMArena Elo | 1491 | LMArena / LMSYS (grok-4.20-beta1, Mar-Apr 2026, top-4; +31 May mover)2026-04-30T00:00:00.000Z |
| GPQA Diamond | 78.5% | xAI launch / secondary coverage2026-03-10T00:00:00.000Z |
| Artificial Analysis Index | 49 | artificialanalysis.ai 2026-05-28T00:00:00.000Z |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“A capable, supported legacy flagship — but if my workload fits 1M tokens, I migrate to the cheaper, newer Grok 4.3.”
Strategically, Grok 4.20 is now a continuity choice rather than a new bet. It retains real value where a runtime reasoning toggle, long context, and best-in-class non-hallucination matter together — but xAI itself recommends 4.3 as the default, signaling where investment and roadmap attention go. Vendor risk is the same as the family: thin disclosure, no certs, no published safety framework. Lock-in is low (SDK compatibility). The decision is simple: keep 4.20 only for the specific niches; otherwise plan migration. Roadmap confidence in 4.20 specifically is declining as 4.3 absorbs its use cases.
“Its one durable edge is the lowest hallucination rate at launch — but the market has moved to 4.3, and that's where the moat now lives.”
In positioning terms, Grok 4.20 carries the same structural moat as the family — live X data — but the differentiation has migrated to Grok 4.3, which is cheaper, fresher, and adds video. 4.20's distinct selling point is reliability: the best-in-class non-hallucination rate gives it a credible pitch to error-sensitive verticals. But market timing works against it: launched March, superseded by April's 4.3, repriced to match it. As a standalone competitive play it has limited runway; its role is to retain users until they migrate. Differentiation versus non-xAI rivals rests entirely on the X-data and reliability angles.
“Now repriced to match 4.3 — so there's no cost reason to choose 4.20 over the newer model unless you specifically need its niche.”
At launch, 4.20's $2 / $6 made it a clear loser to 4.3's $1.25 / $2.50. xAI has since repriced 4.20 down to $1.25 / $2.50 / $0.20 cached on the docs card, erasing the cost penalty — but that also erases any financial reason to prefer it, since 4.3 is the same price with newer training and a higher AA Index. The lingering finance risk is the source conflict: AA still bills it mentally at $2 / $6 with $1.10 cached, so anyone modeling from aggregators will misprice. Rule for finance: confirm the live docs.x.ai rate, then ask whether 4.3 wouldn't simply be the better-value choice at identical pricing.
“The reasoning toggle is the real win — ship one integration, flip a boolean between fast and deep, and parse cleaner output thanks to strict adherence.”
For builders, 4.20's standout is the runtime reasoning toggle: one integration covers both a fast conversational mode and a deep reasoning mode, switched by a parameter or slug rather than a model swap. Strict prompt adherence means less defensive parsing code — when you say "JSON only," you get JSON. Tool calling, structured outputs, and the live-X search tool all work cleanly. The 1M context simplifies long-context plumbing. Friction: SDK/docs polish trails OpenAI and Anthropic, reasoning visibility is summary-only, and coding is better done elsewhere. Reliability is decent; rate limits are spend-tiered.
“It was the standard Grok through April — looser, opinionated, live X data — but consumer surfaces have since moved everyone to 4.3.”
For everyday users on grok.com or X Premium, Grok 4.20 was the default Grok experience through April 2026: the looser personality, ready opinions, and live-X integration that define the brand. Its 2M/1M context is invisible to typical chat use. Once 4.3 became the consumer default, users migrated automatically, so most direct 4.20 use today is via teams pinned to a specific API slug for stability. As a daily driver it was fine; it is simply no longer the one most people touch.
“The 2M-context headline is already contradicted by xAI's own current docs (1M), and the pricing it launched at quietly evaporated — read the live card, not the launch post.”
Adversarially, Grok 4.20 is a case study in why xAI's thin disclosure matters. Its marquee launch claim — 2M context — now conflicts with xAI's own docs card showing 1M for the reasoning/non-reasoning slugs, while AA and OpenRouter still show 2M; nobody should cite a Grok context number without checking the live source. The launch pricing ($2 / $6) silently dropped to $1.25 / $2.50, so any cost analysis older than a few weeks is wrong. Architecture is undisclosed, no SWE-bench, no safety framework. The one claim that holds up well is the non-hallucination leadership (AA-Omniscience 78%) — that is independently sourced and genuinely strong.
- **Production pipelines needing a runtime reasoning toggle** without maintaining two model integrations. - **Low-hallucination-critical summarization** (legal, medical) where the AA-Omniscience leadership matters more than the freshest knowledge. - **Long-context analysis** where 1M+ tokens with live-X access is the requirement. - **Continuity** for teams already pinned to `grok-4.20-0309` slugs who don't need video input.
xAI docs list $1.25 / $2.50 / $0.20 cached — repriced down from the launch $2 / $6. Artificial Analysis still shows the old $2 / $6 with $1.10 cached; trust docs.x.ai.
xAI's docs card lists 1M for the reasoning/non-reasoning slugs; AA and OpenRouter still show 2M, and the multi-agent sibling is 2M. Verify against your account's live limits.
Either pick the `-reasoning` vs `-non-reasoning` slug, or set the reasoning-effort parameter — one integration, two modes.
For almost everything, 4.3: same price, newer training, video, higher scores. Keep 4.20 only for its non-hallucination edge or if you're pinned for stability.
API: only via irreversible opt-in data sharing. X consumer surface: by default, no opt-out.
No SOC2/HIPAA/ISO certs are publicly verified on the direct API; route via a managed cloud if you need them.
Last verified 2026-05-27