$1.25 input / $2.50 output per 1M tokens, multiplied by the number of active agents (4 at low/medium effort, 16 at high/extra-high). Budget for the worst case.

How many agents run, and what do they do?

Between 4 and 16 run in parallel, depending on reasoning effort. Secondary sources describe four roles: a coordinator, a researcher (live X search), a logic/math/code agent, and a creative-contrarian agent that challenges consensus before synthesis.

When should I use this over Grok 4.3?

For deep, multi-source research and long-form synthesis where breadth of parallel search matters. For chat, coding, or low-latency work, use a single-pass model.

How long does a run take?

Minutes, not seconds — it fans out, searches, reasons, then synthesizes. Treat it as an async job.

Is there vendor lock-in?

Yes, more than the rest of the Grok line: the multi-agent semantics have no SDK-compatible equivalent, so migrating off means rebuilding orchestration.

Can I get it on a managed cloud?

Not as a confirmed Azure AI Foundry SKU; it's a direct x.ai API product (also on OpenRouter).

Grok 4.20 Multi-Agent Review — Benchmarks, Pricing & AI Panel Verdict

Benchmark	Score	Source	Date
MATH-500	87.3%	inherits Grok 4.20 base (xAI launch / secondary)	2026-03-10
LMArena Elo	1505	NextBigFuture (estimated 1505-1535 range at launch; not an official LMArena posting)	2026-03-31
GPQA Diamond	78.5%	inherits Grok 4.20 base (xAI launch / secondary)	2026-03-10

Architecture

The base model is undisclosed, like the rest of the Grok line; what is distinctive is the orchestration layer on top. Rather than running a single inference pass, a request spawns multiple sub-agents that work concurrently, each with its own tool access (web, X search), before a coordinator agent merges their findings into one synthesized response. The agent count is controlled by reasoning effort (4 at low/medium, 16 at high/extra-high). Secondary sources describe a role-specialized team (coordinator, researcher, logic/code, creative-contrarian); if that account is accurate, the "contrarian" agent is a deliberate design choice to stress-test consensus before the final answer. Parameter counts, base architecture, and training details are undisclosed. This is the one Grok variant whose multi-agent semantics have no direct OpenAI/Anthropic-SDK equivalent, which raises migration cost.
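The fan-out-then-merge flow described above can be sketched in a few lines of Python. This is an illustrative toy, not xAI's implementation: `run_sub_agent`, `coordinate`, and `fan_out` are hypothetical names, the sub-agents return canned strings instead of calling tools, and cycling three worker roles to fill every slot is an assumption (the sources do not say how roles are distributed at 16 agents, or whether the coordinator counts toward the total).

```python
import asyncio

# Agent counts per reasoning effort, as stated in the text.
AGENTS_BY_EFFORT = {"low": 4, "medium": 4, "high": 16, "extra-high": 16}

# Worker roles named by secondary sources; cycling them is an assumption.
ROLES = ["researcher", "logic-code", "creative-contrarian"]


async def run_sub_agent(role: str, index: int, question: str) -> str:
    """Hypothetical stand-in for one sub-agent; a real one would call tools."""
    await asyncio.sleep(0)  # yield control, simulating concurrent I/O
    return f"[{role} #{index}] notes on: {question}"


def coordinate(findings: list[str]) -> str:
    """Hypothetical coordinator step: merge all findings into one response."""
    return "\n".join(findings)


async def fan_out(question: str, effort: str) -> str:
    """Spawn one task per agent, run them concurrently, then merge."""
    count = AGENTS_BY_EFFORT[effort]
    tasks = [
        run_sub_agent(ROLES[i % len(ROLES)], i, question) for i in range(count)
    ]
    findings = await asyncio.gather(*tasks)
    return coordinate(list(findings))


if __name__ == "__main__":
    print(asyncio.run(fan_out("What changed this week?", "low")))
```

Anyone migrating off the hosted product would need to rebuild roughly this shape themselves, plus the tool access and synthesis logic that the sketch leaves out.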

Capabilities

Agentic (9.0): The whole point. The Artificial Analysis (AA) agentic index of 68.7 is top-tier, and native fan-out replaces a custom orchestration framework for breadth-of-search tasks.
Long context (9.0): 2M input plus up to ~2M output — genuinely long-horizon, suitable for book-length synthesis.
Real-time data (9.5): X + web search inside every parallel agent makes multi-source freshness exceptional, arguably the best live-data setup in the lineup.
Reasoning / math (7.5 / 7.5): Inherits Grok 4.20's base profile (GPQA ~78.5%, MATH-500 ~87.3%); the gain over base is breadth via parallelism, not single-pass depth.
Instruction following (8.0) / function calling (8.0): Both operate on the final synthesis step; structured outputs apply to the merged result.
Coding (6.0): Inherits the base model's weaker coding performance; not a coding-agent substitute (use Grok Build 0.1).
Creative writing (7.0): The "contrarian" agent can surface non-obvious angles useful for research framing.
Safety calibration (6.0): Inherits base posture; the built-in contrarian agent is a mild internal check, not an external safety framework.

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
Artificial Analysis Agentic Index	68.7	Above single-pass 4.20	Among highest agentic indices available	Verdent guide
GPQA Diamond (inherited)	~78.5%	Same as base 4.20	Behind Opus 4.7 / GPT-5.5	Launch coverage
MATH-500 (inherited)	~87.3%	Same as base 4.20	Competitive	Launch coverage
LMArena Elo (estimated)	~1505-1535	Estimate at launch	Top-5 band if confirmed	NextBigFuture (estimate)

(Per-agent benchmark scores are not separately published, and standard single-turn evals understate a multi-agent design. The standard Artificial Analysis Intelligence Index is not posted for this variant; AA tracks it on an agentic index instead. The LMArena number is a launch-time estimate, not an official posting. Where no verifiable figure exists, none is given.)

Speed & latency

Latency is the defining trade-off: multi-agent passes take minutes, not seconds, because the system fans out, runs parallel searches and reasoning, then synthesizes. A single output-speed (tokens per second) figure is not meaningful for a fan-out architecture, so none is reported. Latency tier: slow. This is acceptable for asynchronous deep-research jobs and entirely wrong for chat-style or interactive use.
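One way to treat a run as an asynchronous job is to start it as a background task with a hard deadline. The sketch below is a minimal illustration under stated assumptions: `deep_research` is a hypothetical placeholder (a short sleep stands in for a minutes-long request), not a real SDK call, and the timeout values are arbitrary.

```python
import asyncio
from typing import Optional


async def deep_research(prompt: str) -> str:
    """Hypothetical placeholder for a multi-agent request."""
    await asyncio.sleep(0.05)  # stands in for a run that takes minutes
    return f"report for: {prompt}"


async def submit_with_deadline(prompt: str, timeout_s: float) -> Optional[str]:
    """Await the job, but give up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(deep_research(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Caller can retry, re-queue, or fall back to a single-pass model.
        return None


async def main() -> None:
    # Start the job in the background so other work is not blocked.
    job = asyncio.create_task(submit_with_deadline("market scan", timeout_s=600))
    # ... interactive work would continue here ...
    print(await job)


if __name__ == "__main__":
    asyncio.run(main())
```

The design point is that the caller never blocks an interactive path on the result; it either collects the report later or handles a `None` on timeout.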

Pricing analysis

Surface	Cost	Notes
API input	$1.25 / 1M tok	Multiplied by active agent count (4-16)
API output	$2.50 / 1M tok	Multiplied by active agent count
Cached input	$0.20 / 1M tok	84% discount on cache reads
Direct UI	n/a	Available only through SuperGrok Heavy "Deep Research" flows
SuperGrok Heavy	$300 / mo	Where most consumer multi-agent usage happens
Rate limits	tighter than single-agent slugs	Fan-out cost forces lower limits

Cost reality: the headline $1.25 / $2.50 is per-agent. A high-effort 16-agent run can burn roughly 10-20x what a single Grok 4.3 call costs on the same input. The right comparison is not "Grok 4.3 once" but "running a model 4-16 times and paying an engineer to merge the results" — on that basis the economics can win for genuine research; for chat they collapse. (Same docs-vs-aggregator pricing caveat as the rest of the 4.x line: docs.x.ai is canonical.)
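The per-agent multiplication can be made concrete with a small estimator. This is a rough upper-bound sketch, assuming every active agent is billed for the full input and output token counts; actual per-agent token usage is not published, cached-input pricing is not modeled, and `estimate_cost_usd` is a hypothetical helper name.

```python
# List prices from the pricing table, in USD per 1M tokens.
INPUT_USD_PER_M = 1.25
OUTPUT_USD_PER_M = 2.50

# Agent counts per reasoning effort, as stated in the text.
AGENTS_BY_EFFORT = {"low": 4, "medium": 4, "high": 16, "extra-high": 16}


def estimate_cost_usd(input_tokens: int, output_tokens: int, effort: str) -> float:
    """Worst-case estimate: per-agent cost times the active agent count."""
    per_agent = (
        input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000
    return AGENTS_BY_EFFORT[effort] * per_agent


if __name__ == "__main__":
    for effort in ("low", "high"):
        cost = estimate_cost_usd(100_000, 20_000, effort)
        print(f"{effort}: ${cost:.2f}")
    # prints "low: $0.70" then "high: $2.80"
```

For a 100k-token input and 20k-token output, the same request costs four times as much at high effort as at low effort, which is why budgeting against the 16-agent case is the safer default.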

Deployment & access

Proprietary, API-only, no open weights, not self-hostable. Resold on OpenRouter. Not a confirmed Azure AI Foundry SKU. The key deployment caveat: the multi-agent semantics have no drop-in OpenAI/Anthropic-SDK equivalent, so unlike the rest of the Grok line, migrating off xAI would mean rebuilding the orchestration yourself — a real, if moderate, lock-in. Rate limits are tighter than single-agent slugs because each request fans out.

Safety & privacy

Inherits the Grok 4.20 base posture: no published safety framework, governance via Acceptable Use Policy, moderation tightened in January 2026, and a best-in-class non-hallucination rate at the time of the base model's release. Training on inputs: opt-in for API users (irreversible); on by default for X consumer users, with no opt-out. A novel governance wrinkle is the built-in "contrarian" agent (Lucas), which deliberately challenges the emerging consensus before synthesis. It is a mild internal red-team, not a substitute for external safety controls. No verified compliance certs.

Ecosystem & tooling

Python/TypeScript SDKs; LangChain integration; resold on OpenRouter. Primary surface for consumers is SuperGrok Heavy's Deep Research. Popularity is niche — a specialist tool for research-heavy workloads rather than a general model.
