Llama 4 Maverick

GA · Latest Maverick

by Meta · Llama 4 family · best as a self-hosted multimodal workhorse with no vendor lock-in

Open-Weights · Multimodal · Cost-Optimized · Long-Context
AI Panel Score 7.7 · Value 9.0/10

Llama 4 Maverick is Meta's flagship open-weights model, released April 5, 2025 as the first Llama to ship as a Mixture-of-Experts. It pairs a 400-billion-parameter knowledge pool with only 17 billion active parameters per token, is natively multimodal (text + image from pre-training), and serves a 1M-token context window. The one-sentence buyer takeaway: it is not the smartest model on any leaderboard, but it is the strongest open-weights workhorse you can self-host and run on every major inference provider for roughly a tenth the price of a closed frontier API, which makes it the default hedge against vendor lock-in.

  • Provider: Meta
  • Release: 2025-04-05 (GA, open weights)
  • Status: GA, latest in its tier (no successor shipped as of May 2026)
  • Context: 1,000,000 tokens (256K native pre-training, extended via iRoPE)
  • Max output: 8,192 tokens (provider-dependent)
  • Modalities: text + image in, text out
  • Knowledge cutoff: August 2024
  • Headline price: ~$0.20 in / ~$0.85 out per 1M tokens (representative across providers)

What's new

  • First Llama family member with a Mixture-of-Experts architecture: 128 routed experts plus a shared expert, dropping per-token compute to ~17B active while keeping a 400B parameter pool.
  • Natively multimodal via early-fusion (text and vision in a single backbone) rather than a bolted-on vision adapter as in Llama 3.2 Vision.
  • New iRoPE attention: interleaved NoPE (no positional embedding, full-context) layers roughly every fourth layer, with chunked RoPE attention (8,192-token chunks) in the other three, enabling the 1M context.
  • Trained on more than 30 trillion tokens (over double Llama 3), covering 200 languages, codistilled from the unreleased Llama 4 Behemoth teacher model.
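
The iRoPE bullet above can be made concrete with a small sketch. It assumes the NoPE layer is the last of every group of four (the text only says "roughly every fourth") and that chunked attention means a token attends causally within its own 8,192-token chunk. The function names `layer_kinds` and `can_attend` are hypothetical and are not taken from Meta's code.

```python
from typing import List

CHUNK_SIZE = 8192  # chunk length quoted in the text


def layer_kinds(num_layers: int, nope_every: int = 4) -> List[str]:
    """Label each layer; assumes the NoPE layer closes every group of `nope_every`."""
    return [
        "nope" if (i + 1) % nope_every == 0 else "rope_chunked"
        for i in range(num_layers)
    ]


def can_attend(query_pos: int, key_pos: int, kind: str, chunk: int = CHUNK_SIZE) -> bool:
    """Return True if the token at query_pos may attend to the token at key_pos."""
    if key_pos > query_pos:
        return False  # causal masking applies to every layer
    if kind == "nope":
        return True  # NoPE layers see the full preceding context
    # Chunked RoPE layers only see keys inside the query's own chunk.
    return query_pos // chunk == key_pos // chunk


kinds = layer_kinds(8)
print(kinds)
print(can_attend(8192, 8191, "rope_chunked"), can_attend(8192, 8191, "nope"))
```

Under these assumptions, only the NoPE layers carry information across chunk boundaries, which is the mechanism the bullet credits for reaching the 1M window.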

Benchmarks

Benchmark | Score | Source | Date
MMLU | 85.5% | Meta / llm-stats aggregator | 2025-04-05
MMMU | 73.4% | Meta Llama 4 model card | 2025-04-05
MATH-500 | 61.2% | Meta (MATH-Hard) | 2025-04-05
MMLU-Pro | 80.5% | Meta Llama 4 model card | 2025-04-05
HumanEval | 85.8% | llm-stats aggregator | 2025-04-05
LMArena Elo | 1271 | LMArena (released Instruct; experimental chat ranked higher pre-release) | 2025-04-15
GPQA Diamond | 69.8% | Meta Llama 4 model card | 2025-04-05
LiveCodeBench | 43.4% | Meta Llama 4 model card | 2025-04-05
Aider Polyglot | 15.6% | Aider leaderboard (community) | 2025-04-10
Artificial Analysis Index | 18 | Artificial Analysis | 2026-05

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker · 8.5/10
The smartest hedge in the market: open weights, every cloud, one-node deploy. I trade a little capability for total freedom from price hikes and lock-in.

For a buyer weighing capability against vendor risk, Maverick is the strongest no-lock-in play in May 2026. Open weights plus availability on Bedrock, Vertex, Azure, and eight inference providers means you can never be held hostage on price or availability — the lesson every team that survived a closed-API price hike has internalized. Capability is good enough for the large majority of production workloads, and the sovereignty story satisfies EU and regulated-industry mandates. Roadmap confidence is the soft spot: Behemoth never shipped and Meta's Llama cadence is now uncertain, so treat Maverick as a durable present-day asset, not a guaranteed upgrade path.

Strategic Fit 9 · Vendor Risk 9 · Roadmap Confidence 6
Pros
  • No lock-in
  • multi-cloud + multi-provider
  • sovereign deploy
  • excellent TCO
Cons
  • Not frontier
  • uncertain successor roadmap
Right for: regulated/sovereign orgs, anyone burned by closed-API pricing
Avoid if: you need top-of-leaderboard reasoning or a vendor SLA on capability gains
Domain Strategist · 7.5/10
Maverick owns the 'open and multimodal on one node' square. Its moat is distribution and economics, not raw IQ — and that square is wide.

Positioning-wise, Maverick wins where the market values control and cost over peak intelligence. Against closed frontier models it loses on capability; against other open weights (DeepSeek V3.x, Qwen 3) it competes on multimodality and the breadth of its provider ecosystem rather than benchmark wins. Its differentiation is being natively multimodal AND single-node deployable AND on every major cloud — a combination few open models match. Market timing is good: sovereignty and cost pressure are tailwinds. The risk is that DeepSeek and Qwen iterate faster, so Maverick's open-weights leadership is contestable.

Competitive Positioning 8 · Differentiation 7 · Market Timing 8
Pros
  • Unique multimodal+single-node+multi-cloud combo
  • strong distribution
Cons
  • Faster-moving open rivals
  • not a benchmark leader
Right for: platform teams standardizing on open weights
Avoid if: you optimize purely for benchmark rank
Finance Lead · 9/10
This is where the math flips. Ten-x cheaper to serve than closed frontier, and self-hosting pays back the GPUs in a quarter at volume.

Maverick is the clearest TCO story in the lineup. On Bedrock it runs ~82–93% cheaper than Llama 3.1 405B for equal-or-better quality; DeepInfra floors at $0.15 input / $0.60 output per 1M tokens, and Groq pushes per-call cost under a cent at chat lengths. The decision is API vs self-host: at sub-100M tokens/month, managed providers win on simplicity; above ~500M tokens/month on steady load, self-hosting on reserved 8xH100 typically amortizes the hardware in 3–5 months and then dominates. MoE keeps inference cheap relative to a dense 400B model. Watch the provider spread: Together's $2.19 output price is roughly 3.7x DeepInfra's $0.60 floor (about 2.6x the representative $0.85), so naive provider choice can more than triple your output bill.
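
The API-versus-self-host decision reduces to one division, sketched below. The node rate (`NODE_HOURLY_USD`) and the input/output mix (`INPUT_SHARE`) are hypothetical placeholders, not figures from this review. Only the $0.20 / $0.85 prices come from the headline numbers. The sketch also ignores power, operations staff and idle capacity, all of which push the real break-even higher.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def blended_price_per_m(price_in: float, price_out: float, input_share: float) -> float:
    """Average API price per 1M tokens for a given input/output token mix."""
    return input_share * price_in + (1 - input_share) * price_out


def breakeven_tokens_m(node_hourly_usd: float, price_per_m: float) -> float:
    """Monthly volume (in millions of tokens) where API spend equals node cost."""
    return node_hourly_usd * HOURS_PER_MONTH / price_per_m


NODE_HOURLY_USD = 20.0  # hypothetical reserved 8xH100 rate; replace with a real quote
INPUT_SHARE = 0.75      # hypothetical: three input tokens for every output token

price = blended_price_per_m(0.20, 0.85, INPUT_SHARE)
breakeven = breakeven_tokens_m(NODE_HOURLY_USD, price)
print(f"blended API price: ${price:.4f} per 1M tokens")
print(f"break-even volume: {breakeven:,.0f}M tokens/month")
```

The result moves in direct proportion to the node rate and inversely with the API price being displaced, so the thresholds quoted above should be re-derived from real quotes rather than taken as given.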

Cost Efficiency 9 · Pricing Transparency 8 · Value per Dollar 9
Pros
  • ~10x cheaper than closed frontier
  • self-host payback in months
  • MoE efficiency
Cons
  • provider price spread is wide
  • self-host needs utilization to pay off
Right for: high-volume workloads, teams with GPU capacity
Avoid if: low volume where managed API simplicity outweighs unit cost
Domain Practitioner · 7.5/10
Portable weights, mature tooling, 1M context — but I still write evals because the chat template drifts between Groq, Bedrock, and Together.

For hands-on builders, Maverick is genuinely portable: documented Hugging Face checkpoints, mature fine-tuning recipes (Together, Fireworks, Unsloth), and native support in vLLM, llama.cpp, Ollama, SGLang, and TensorRT-LLM. LoRA adapters land in hours. Structured output and JSON mode work but you often bolt on grammars/outlines yourself rather than relying on a first-class API. The recurring friction is provider drift — subtle chat-template and tool-format differences across hosts break agent loops if you do not test per provider. Tool use is reliable for single-step calls, shakier on long multi-step chains.
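
Testing per provider can start as a small smoke check that sends one prompt to every host and compares the normalized tool call. The sketch below is illustrative only: the three `host_*` functions are stubs standing in for real provider clients, and the drift they simulate (arguments returned as a JSON-encoded string, or a stray wrapper token) is a hypothetical example, not observed behavior of any named provider.

```python
import json
from typing import Callable, Dict, Optional

REQUIRED_KEYS = {"name", "arguments"}


def parse_tool_call(raw: str) -> Optional[dict]:
    """Normalize one raw tool-call string, or return None if it is malformed."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    args = obj["arguments"]
    if isinstance(args, str):  # tolerate arguments sent as a JSON-encoded string
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return None
    if not isinstance(args, dict):
        return None
    return {"name": obj["name"], "arguments": args}


def check_providers(
    providers: Dict[str, Callable[[str], str]], prompt: str, expected: dict
) -> Dict[str, bool]:
    """Run the same prompt on every provider and flag which ones match."""
    return {name: parse_tool_call(call(prompt)) == expected
            for name, call in providers.items()}


EXPECTED = {"name": "get_weather", "arguments": {"city": "Paris"}}


def host_a(prompt: str) -> str:  # stub: arguments as a JSON object
    return json.dumps(EXPECTED)


def host_b(prompt: str) -> str:  # stub: arguments as a JSON-encoded string
    return json.dumps({"name": "get_weather",
                       "arguments": json.dumps({"city": "Paris"})})


def host_c(prompt: str) -> str:  # stub: stray wrapper token breaks parsing
    return "<tool>" + json.dumps(EXPECTED)


results = check_providers(
    {"host_a": host_a, "host_b": host_b, "host_c": host_c},
    "What is the weather in Paris?",
    EXPECTED,
)
print(results)
```

Swapping the stubs for real client calls turns this into a regression check that can run whenever a provider or chat template changes.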

API Ergonomics 7 · Tool/Agent Support 7 · Reliability 8
Pros
  • portable weights
  • deep framework support
  • fast fine-tuning
Cons
  • provider chat-template drift
  • structured output needs scaffolding
  • multi-step tool chains wobble
Right for: teams that own their inference stack
Avoid if: you want a single managed API with guaranteed-consistent behavior
Power User · 6.5/10
Fast and fluent, never preachy — but the 'wow' moments belong to Claude and GPT-5. It's a solid backstage model, not a star.

In daily use Maverick is competent and pleasant: fluent prose, sensible refusal rates, and sub-second first-token latency on Groq or Cerebras. Where it falls short is the top end — nuanced humor, emotional intelligence, and genuinely creative writing trail Claude and GPT-5 by a clear notch. For apps where the model sits behind a workflow, users never feel the gap. For a flagship chat product where the model IS the experience, the ceiling shows within an extended session. Long, context-heavy conversations also surface the comprehension degradation.

Output Quality 6 · Speed 8 · Everyday Usefulness 7
Pros
  • fast
  • fluent
  • non-preachy
  • multilingual
Cons
  • no creative "wow"
  • long-context comprehension fades
Right for: embedded/backstage assistants
Avoid if: personality and creativity are the product
Skeptic · 5.5/10
Behemoth never shipped, the LMArena number came from a checkpoint you can't download, and 1M context is marketing once you test comprehension.

Adversarially, Maverick has three claims to distrust. First, the splashy LMArena ranking used an unreleased "experimental chat" model; the actual Instruct weights rank materially lower — a textbook benchmark-presentation gap. Second, the 1M (and Scout's 10M) context is real for needle retrieval but Fiction.LiveBench ranks Llama 4 near the bottom on genuine long-context comprehension, so the headline number oversells usable capability. Third, the teacher model Behemoth (~2T params) was announced in April 2025 and still has not shipped amid reported capability concerns — the family's top end is vaporware. Real weaknesses: no reasoning mode, weak multi-file coding (Aider ~15.6), and a 2024 cutoff. The honest pitch is "good, cheap, open" — not "frontier."

Claim Accuracy 5 · Weakness Severity 6 · Hype vs Reality 5
Pros
  • open and cheap is genuinely true
  • retrieval works
Cons
  • LMArena overstated
  • long-context comprehension weak
  • Behemoth unshipped
  • no reasoning
Right for: skeptics who price it as a mid-tier open model
Avoid if: you believe the leaderboard or context headlines at face value

Strengths

  • Open weights with permissive commercial use, deployable on a single 8xH100 node.
  • Best GPQA Diamond among non-reasoning open-weights models at launch (69.8), plus a strong LiveCodeBench score (43.4%).
  • Native multimodality (image + text) with strong chart/document understanding.
  • 1M-token context and very wide provider availability (Bedrock, Vertex, Together, Fireworks, Groq, DeepInfra, SambaNova, OpenRouter).
  • Outstanding TCO: ~10x cheaper to serve than a closed frontier model at comparable general quality.

Limitations

  • Long-context comprehension degrades well before 1M tokens (Fiction.LiveBench ranks Llama 4 near the bottom despite near-perfect needle retrieval).
  • No native reasoning mode; loses decisively to o-series/Claude-thinking/DeepSeek R1 on hard math and proofs.
  • Weak on real multi-file coding edits (Aider polyglot ~15.6) despite a good LiveCodeBench.
  • The headline LMArena ranking came from an unreleased experimental checkpoint; the shipped Instruct ranks lower.
  • 8K output ceiling on most managed providers; provider chat-template drift can trip agent loops.

Best use cases

  • Self-hosted or sovereign-cloud agent and RAG stacks where data must not leave the customer's infrastructure; Maverick is the strongest open option that runs on one node.
  • Multilingual content pipelines across 100+ languages at high throughput.
  • Document- and chart-heavy vision workflows where DocVQA/ChartQA-class accuracy matters.
  • Cost-sensitive backends where a closed frontier API's per-token price or licensing is a non-starter and "good enough for 80% of production" is the bar.

Buyer questions

How much does Maverick cost?

There is no single Meta price; representative inference is ~$0.15–$0.59 input and ~$0.60–$2.19 output per 1M tokens depending on provider (DeepInfra cheapest, Together's output highest). Self-hosting trades per-token cost for GPU capex.
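
To see what the spread means per call, the sketch below prices one request at three tiers. The "floor" and "representative" pairs come from this page. The "ceiling" pair combines the highest input and output prices quoted above, which is an assumption, since they may not belong to the same provider. The 2,000-in / 500-out request size is also a hypothetical example.

```python
from typing import Dict, Tuple

# (input, output) prices in USD per 1M tokens
PRICE_TIERS: Dict[str, Tuple[float, float]] = {
    "floor": (0.15, 0.60),           # DeepInfra, as quoted
    "representative": (0.20, 0.85),  # headline price
    "ceiling": (0.59, 2.19),         # assumed pairing of the two quoted maxima
}


def request_cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Cost in USD of one request at per-1M-token prices."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


TOKENS_IN, TOKENS_OUT = 2_000, 500  # hypothetical chat-length request

costs = {tier: request_cost(TOKENS_IN, TOKENS_OUT, p_in, p_out)
         for tier, (p_in, p_out) in PRICE_TIERS.items()}
for tier, cost in costs.items():
    print(f"{tier:>14}: ${cost:.6f} per request")
```

Multiplying the per-request figure by expected monthly request volume gives a quick bill estimate for each tier.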

Can I run it myself?

Yes. Download the FP8 weights from Hugging Face and serve on an 8xH100 node, or quantize to INT4 (~240GB VRAM) to fit smaller hardware.
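
The sizing behind that answer is simple arithmetic on the 400B parameter count, sketched below. The calculation covers weights only. The gap between the weights-only INT4 figure and the ~240GB quoted above is presumably runtime overhead such as KV cache and activations, which is an assumption rather than something this page states. The 80GB-per-GPU figure assumes the 80GB H100 variant.

```python
def weight_gb(total_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 400e9   # total parameter pool quoted in the text
NODE_VRAM_GB = 8 * 80  # 8xH100, assuming 80GB per GPU

fp8_gb = weight_gb(TOTAL_PARAMS, 8)
int4_gb = weight_gb(TOTAL_PARAMS, 4)

print(f"FP8 weights:  {fp8_gb:.0f} GB of {NODE_VRAM_GB} GB on the node")
print(f"INT4 weights: {int4_gb:.0f} GB before runtime overhead")
```

All 400B parameters must be resident even though only 17B are active per token, so the MoE design lowers compute per token but not the memory footprint.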

Is it really multimodal?

Yes — image understanding is native (early fusion), not a bolted-on adapter, with strong DocVQA/ChartQA scores. It does not generate images.

Is the 1M context usable?

For retrieval, largely yes; for reasoning across the full window, no — comprehension degrades well before the ceiling, so chunk and test on your workload.
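
One minimal way to chunk is to split the token sequence into overlapping windows, as sketched below. The helper name `chunk_tokens` and the sizes in the usage line are hypothetical. The right chunk size and overlap depend on the workload and should be chosen by evaluation.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int, overlap: int) -> List[List[int]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
print(chunks)
```

Each chunk can then be sent as its own request and the answers merged, which keeps every call well inside the range where comprehension holds up.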

What about safety and compliance?

The weights have no built-in moderation; add Llama Guard 4 / Prompt Guard 2. Compliance certifications come from your host or your own infra, not the model.

Are there usage restrictions?

The Llama 4 Community License allows commercial use but requires a separate Meta license above 700M MAU and forbids training non-Llama models on its outputs.

Should I pick Maverick or Scout?

Maverick for higher quality and 128 experts on a node; Scout for single-GPU deploy and the 10M context. Both share the same 17B-active speed profile.

Comparable models

DeepSeek V3.x / R1 — closer to frontier on reasoning, also open weights, similar TCO; Maverick wins on native multimodality and provider breadth, loses on hard reasoning.
Qwen 3 (Max / 235B-A22B) — competitive on multilingual and coding, open weights, often cheaper; Qwen frequently edges Maverick on coding benchmarks while Maverick holds the wider cloud-managed footprint.
GPT-5 / Claude Opus 4.7 — far higher capability ceiling but closed weights at 10–50x the per-token cost; Maverick wins only on openness, sovereignty, and price, never on raw intelligence.

Model specs

Input price: $0.20 / Mtok
Output price: $0.85 / Mtok
Cached input: not listed
Batch (in/out): not listed
Context window: 1M tokens
Max output: 8K tokens
Knowledge cutoff: 2024-08
Released: 2025-04-05
Modalities: text, image → text
Output speed: ~104.3 tok/s
License: Open weights (Llama-4-Community)
Clouds: Bedrock, Vertex AI, Azure AI Foundry, GCP, OCI, IBM watsonx

Does not train on API inputs by default

Last verified 2026-05-27