How much does Maverick cost?

There is no single Meta price; representative inference is ~$0.15–$0.59 input and ~$0.60–$2.19 output per 1M tokens depending on provider (DeepInfra cheapest, Together's output highest). Self-hosting trades per-token cost for GPU capex.

Yes. Download the FP8 weights from Hugging Face and serve on an 8xH100 node, or quantize to INT4 (~240GB VRAM) to fit smaller hardware.

Is it really multimodal?

Yes — image understanding is native (early fusion), not a bolted-on adapter, with strong DocVQA/ChartQA scores. It does not generate images.

Is the 1M context usable?

For retrieval, largely yes; for reasoning across the full window, no — comprehension degrades well before the ceiling, so chunk and test on your workload.

What about safety and compliance?

The weights have no built-in moderation; add Llama Guard 4 / Prompt Guard 2. Compliance certifications come from your host or your own infra, not the model.

Are there usage restrictions?

The Llama 4 Community License allows commercial use but requires a separate Meta license above 700M MAU and forbids training non-Llama models on its outputs.

Should I pick Maverick or Scout?

Maverick for higher quality and 128 experts on a node; Scout for single-GPU deploy and the 10M context. Both share the same 17B-active speed profile.

Llama 4 Maverick Review — Benchmarks, Pricing & AI Panel Verdict

Benchmark	Score	Source
MMLU	85.5%	Meta / llm-stats aggregator2025-04-05T00:00:00.000Z
MMMU	73.4%	Meta Llama 4 model card2025-04-05T00:00:00.000Z
MATH-500	61.2%	Meta (MATH-Hard)2025-04-05T00:00:00.000Z
MMLU-Pro	80.5%	Meta Llama 4 model card2025-04-05T00:00:00.000Z
HumanEval	85.8%	llm-stats aggregator2025-04-05T00:00:00.000Z
LMArena Elo	1271	LMArena (released Instruct; experimental chat ranked higher pre-release)2025-04-15T00:00:00.000Z
GPQA Diamond	69.8%	Meta Llama 4 model card2025-04-05T00:00:00.000Z
LiveCodeBench	43.4%	Meta Llama 4 model card2025-04-05T00:00:00.000Z
Aider Polyglot	15.6%	Aider leaderboard (community)2025-04-10T00:00:00.000Z
Artificial Analysis Index	18	Artificial Analysis2026-05

Architecture

Maverick is a sparse MoE transformer: 400B total parameters across 128 routed experts plus one shared expert, with only ~17B parameters active per token (the shared expert plus one routed expert). MoE layers alternate with dense layers. Attention uses the iRoPE scheme — three of every four decoder layers apply rotary embeddings with chunked local attention (8K-token chunks), and every fourth "NoPE" layer drops positional encoding entirely and attends over the full context with temperature-scaled softmax to fight long-range probability fade. Vision is early-fused from a dedicated encoder into the same token stream. Meta discloses total/active params, expert count, training-token scale (30T+), and the 200-language mixture; exact layer counts, training compute, and the full data recipe are not fully published. The released checkpoint is Instruct-tuned; the FP8 variant fits an 8xH100 node.

Capabilities

Maverick scores as a strong-but-not-frontier generalist. Coding (cap_coding 6.5) is good for open weights — HumanEval ~85.8, LiveCodeBench 43.4 — but the Aider polyglot score (~15.6) exposes weakness on real multi-file edit workflows, so the editorial score sits mid-pack. Reasoning (6.0) and math (5.5) are competent on classic benchmarks (MMLU 85.5, MMLU-Pro 80.5, MATH-Hard ~61) but there is no native reasoning mode, so it loses badly to o-series, Claude extended thinking, and DeepSeek R1 on hard proofs and AIME-class problems. GPQA Diamond 69.8 is genuinely strong for an open non-reasoning model. Multilingual (8.0) is a real strength across 200 languages. Vision and document OCR (7.0 / 7.5) handle charts (ChartQA 85.3), documents (DocVQA 91.6), and MMMU 73.4 well. Long-context (4.5) is the honest weak spot: needle-in-haystack retrieval looks near-perfect, but comprehension benchmarks (Fiction.LiveBench) rank Llama 4 near the bottom, meaning the 1M window degrades on reasoning-over-context well before the ceiling. Function-calling and instruction-following (6.5 each) are reliable but trail the best closed models. No real-time data (0.0).

Benchmark analysis

Benchmark	Score	vs Predecessor (3.1 405B)	vs Top Competitor	Source
MMLU	85.5	-3.1 (405B 88.6)	~ GPT-4o	Meta/llm-stats
MMLU-Pro	80.5	+7.2 (405B 73.3)	trails Claude/GPT-5	Meta
GPQA Diamond	69.8	+18.7 (405B 51.1)	beats GPT-4o (53.6)	Meta
MATH (Hard)	61.2	comparable	trails reasoning models	Meta
HumanEval	85.8	~ 405B (89.0)	~ GPT-4o	llm-stats
LiveCodeBench	43.4	new	beats GPT-4o (~32)	Meta
Aider Polyglot	15.6	new	trails Qwen2.5-Coder-32B	Aider community
MMMU (vision)	73.4	new (405B no vision)	competitive	Meta
LMArena Elo	1271	n/a	mid-pack; experimental chat ranked higher	LMArena
Artificial Analysis Index	18	+1 (405B 17)	below reasoning frontier	AA

Note the LMArena gap: Meta submitted an unreleased "experimental chat" checkpoint that ranked high; the actually-downloadable Instruct weights rank ~1271, materially lower. Treat any pre-release LMArena claim with suspicion.

Speed & latency

Median output speed is ~104 tokens/sec across providers, with time-to-first-token around 0.66s on DeepInfra FP8. The MoE design means it generates roughly as fast as a dense 17B model despite the 400B pool. Specialty silicon pushes it far faster: SambaNova benchmarks ~645 tokens/sec, and Groq/Cerebras deliver sub-second interactive feel. For batch and high-throughput agent loops the economics are excellent; latency tier is fast.

Pricing analysis

Surface	Cost	Notes
API input (representative)	~$0.20 / 1M tok	DeepInfra/Fireworks FP8 floor
API output (representative)	~$0.85 / 1M tok	spread $0.60–$2.19 across providers
DeepInfra	$0.15 in / $0.60 out	cheapest mainstream
Fireworks	$0.22 in / $0.88 out
Groq	$0.59 in / $0.79 out	fastest interactive
Together	$0.55 in / $2.19 out	runs notably higher on output
SambaNova	blended ~$0.75	645 tps
AWS Bedrock	~$0.50 in / ~$1.50 out	82–93% cheaper than 3.1 405B on Bedrock
Self-hosted	8xH100 node (FP8)	~240GB VRAM to hold 400B at INT4
Rate limits	provider-specific	no single Meta limit; managed tiers ~600 RPM / 200K TPM

Open weights mean there is no single Meta price; the spread above is the inference-provider market as of May 2026. Together's output rate is an outlier 2–3x the floor — provider choice materially changes unit cost.

Deployment & access

Open weights under the Llama 4 Community License. Download from Hugging Face (meta-llama/Llama-4-Maverick-17B-128E-Instruct, FP8 variant available). Self-hostable on a single 8xH100 node in FP8; INT4 quantization fits the full 400B parameter set in ~240GB VRAM (e.g. 4xH100 or an RTX 5090-class consumer box for aggressive quants). Managed availability spans AWS Bedrock, Google Vertex AI, Azure AI Foundry, OCI, and IBM watsonx. Inference providers include Together, Fireworks, Groq, DeepInfra, OpenRouter, Novita, SambaNova, and Hyperbolic — wide redundancy means you are never locked to one host. The Llama 4 Community License permits commercial use but requires a separate license from Meta if your products exceed 700 million monthly active users, and prohibits using outputs to train non-Llama models.

Safety & privacy

The weights ship without built-in content moderation; Meta provides Llama Guard 4 (a 12B natively-multimodal safety classifier for text + image, released April 2025) and Prompt Guard 2 (22M/86M jailbreak and prompt-injection detectors) as optional pre/post filters designed for the Llama 4 line. Because the model is self-hostable, "trains on inputs" is not applicable when you run it yourself; managed providers vary, but Meta's own terms do not train on your data. There are no model-level compliance certifications (SOC2/HIPAA/etc.) — those attach to whichever host or your own infrastructure. Refusal calibration is moderate and fully tunable via system prompt or fine-tune, which is precisely the point of open weights for regulated buyers. Governance falls under Meta's Frontier AI Framework.

Ecosystem & tooling

First-class support across Hugging Face Transformers, vLLM, llama.cpp, Ollama, SGLang, TensorRT-LLM, and Unsloth, plus LangChain and LlamaIndex integrations. Available on Bedrock, Vertex AI, Azure AI Foundry, OCI, and IBM watsonx, and on Together, Fireworks, Groq, DeepInfra, OpenRouter, Novita, SambaNova, and Hyperbolic. Powers Meta's own consumer AI across WhatsApp, Instagram, and Messenger. Popularity is mainstream — the default open-weights multimodal pick for self-hosting teams in 2026.

Llama 4 Maverick

What's new

Benchmarks

AI Panel Review

Strengths

Limitations

Best use cases

Deep dive

Architecture

Capabilities

Benchmark analysis

Speed & latency

Pricing analysis

Deployment & access

Safety & privacy

Ecosystem & tooling

Buyer questions

Comparable models

Sources

Model specs

Other Llama 4 versions