by Meta · Llama 4 family · best for self-hosted multimodal workhorse with no vendor lock-in
Llama 4 Maverick is Meta's flagship open-weights model, released April 5, 2025 as the first Llama to ship as a Mixture-of-Experts. It pairs a 400-billion-parameter knowledge pool with only 17 billion active parameters per token, is natively multimodal (text + image from pre-training), and serves a 1M-token context window. The one-sentence buyer takeaway: it is not the smartest model on any leaderboard, but it is the strongest open-weights workhorse you can self-host and run on every major inference provider for roughly a tenth the price of a closed frontier API, which makes it the default hedge against vendor lock-in.

- Provider: Meta
- Release: 2025-04-05 (GA, open weights)
- Status: GA, latest in its tier (no successor shipped as of May 2026)
- Context: 1,000,000 tokens (256K native pre-training, extended via iRoPE)
- Max output: 8,192 tokens (provider-dependent)
- Modalities: text + image in, text out
- Knowledge cutoff: August 2024
- Headline price: ~$0.20 in / ~$0.85 out per 1M tokens (representative across providers)
| Benchmark | Score | Source | Date |
|---|---|---|---|
| MMLU | 85.5% | Meta / llm-stats aggregator | 2025-04-05 |
| MMMU | 73.4% | Meta Llama 4 model card | 2025-04-05 |
| MATH-500 | 61.2% | Meta (MATH-Hard) | 2025-04-05 |
| MMLU-Pro | 80.5% | Meta Llama 4 model card | 2025-04-05 |
| HumanEval | 85.8% | llm-stats aggregator | 2025-04-05 |
| LMArena Elo | 1271 | LMArena (released Instruct; experimental chat ranked higher pre-release) | 2025-04-15 |
| GPQA Diamond | 69.8% | Meta Llama 4 model card | 2025-04-05 |
| LiveCodeBench | 43.4% | Meta Llama 4 model card | 2025-04-05 |
| Aider Polyglot | 15.6% | Aider leaderboard (community) | 2025-04-10 |
| Artificial Analysis Index | 18 | Artificial Analysis | 2026-05 |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“The smartest hedge in the market: open weights, every cloud, one-node deploy. I trade a little capability for total freedom from price hikes and lock-in.”
For a buyer weighing capability against vendor risk, Maverick is the strongest no-lock-in play in May 2026. Open weights plus availability on Bedrock, Vertex, Azure, and eight inference providers means you can never be held hostage on price or availability — the lesson every team that survived a closed-API price hike has internalized. Capability is good enough for the large majority of production workloads, and the sovereignty story satisfies EU and regulated-industry mandates. Roadmap confidence is the soft spot: Behemoth never shipped and Meta's Llama cadence is now uncertain, so treat Maverick as a durable present-day asset, not a guaranteed upgrade path.
“Maverick owns the 'open and multimodal on one node' square. Its moat is distribution and economics, not raw IQ — and that square is wide.”
Positioning-wise, Maverick wins where the market values control and cost over peak intelligence. Against closed frontier models it loses on capability; against other open weights (DeepSeek V3.x, Qwen 3) it competes on multimodality and the breadth of its provider ecosystem rather than benchmark wins. Its differentiation is being natively multimodal AND single-node deployable AND on every major cloud — a combination few open models match. Market timing is good: sovereignty and cost pressure are tailwinds. The risk is that DeepSeek and Qwen iterate faster, so Maverick's open-weights leadership is contestable.
“This is where the math flips. Ten-x cheaper to serve than closed frontier, and self-hosting pays back the GPUs in a quarter at volume.”
Maverick is the clearest TCO story in the lineup. On Bedrock it runs ~82–93% cheaper than Llama 3.1 405B for equal-or-better quality; DeepInfra floors at $0.15/$0.60 per 1M tokens (input/output) and Groq pushes per-call cost under a cent at chat lengths. The decision is API vs self-host: below 100M tokens/month, managed providers win on simplicity; above ~500M tokens/month on steady load, self-hosting on reserved 8xH100 typically amortizes the hardware in 3–5 months and then dominates. MoE keeps inference cheap relative to a dense 400B model. Watch the provider spread: Together's $2.19 output price is about 3.6x DeepInfra's $0.60 floor, so naive provider choice can more than triple the output side of your bill.
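The API-versus-self-host decision comes down to a payback calculation. The sketch below is one minimal way to frame it, not a pricing model: `breakeven_months` is a hypothetical helper, and every dollar figure in the usage lines is a placeholder to replace with your own hardware quote, measured API bill, and hosting costs.

```python
from typing import Optional


def breakeven_months(
    hardware_cost: float,
    monthly_api_bill: float,
    monthly_selfhost_opex: float,
) -> Optional[float]:
    """Months until self-hosting has paid back the hardware.

    Returns None when self-hosting never pays back, i.e. when the monthly
    operating cost (power, hosting, staff) is at least the API bill.
    """
    monthly_savings = monthly_api_bill - monthly_selfhost_opex
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


# Placeholder inputs only (assumptions, not figures from the review).
months = breakeven_months(
    hardware_cost=300_000,         # hypothetical 8xH100 node price
    monthly_api_bill=100_000,      # hypothetical current managed-API spend
    monthly_selfhost_opex=25_000,  # hypothetical power, hosting, and ops
)
print(months)
```

The `None` branch matters in practice: if your measured API bill is smaller than the cost of running the node, no amount of time amortizes the purchase, so check that case before comparing payback periods.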
“Portable weights, mature tooling, 1M context — but I still write evals because the chat template drifts between Groq, Bedrock, and Together.”
For hands-on builders, Maverick is genuinely portable: documented Hugging Face checkpoints, mature fine-tuning recipes (Together, Fireworks, Unsloth), and native support in vLLM, llama.cpp, Ollama, SGLang, and TensorRT-LLM. LoRA adapters land in hours. Structured output and JSON mode work but you often bolt on grammars/outlines yourself rather than relying on a first-class API. The recurring friction is provider drift — subtle chat-template and tool-format differences across hosts break agent loops if you do not test per provider. Tool use is reliable for single-step calls, shakier on long multi-step chains.
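A per-provider check for the tool-format drift described above can be as small as a response validator run against each host. This is a sketch under a stated assumption: it expects the OpenAI-style chat-completion shape (`choices[0].message.tool_calls[*].function`), which many hosts emulate but which you should confirm per provider. `validate_tool_call` and the `get_weather` tool are hypothetical names used for illustration.

```python
import json


def validate_tool_call(response: dict, expected_tool: str, required_args: set) -> list:
    """Return a list of problems found in one chat-completion response.

    An empty list means the first tool call has the expected name and
    carries every required argument.
    """
    try:
        message = response["choices"][0]["message"]
    except (KeyError, IndexError, TypeError):
        return ["response has no choices[0].message"]
    if not isinstance(message, dict):
        return ["message is not an object"]

    calls = message.get("tool_calls") or []
    if not calls:
        return ["no tool_calls in message"]

    problems = []
    function = calls[0].get("function") or {}
    if function.get("name") != expected_tool:
        problems.append(f"unexpected tool name: {function.get('name')!r}")

    # Arguments usually arrive as a JSON string; accept an already-decoded
    # object too, since that is one way hosts can differ.
    raw_args = function.get("arguments")
    try:
        args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
    except json.JSONDecodeError:
        return problems + ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return problems + ["arguments did not decode to an object"]

    missing = required_args - set(args)
    if missing:
        problems.append(f"missing arguments: {sorted(missing)}")
    return problems


# Hand-written sample response (not captured from any provider).
sample = {
    "choices": [{"message": {"tool_calls": [
        {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
    ]}}]
}
print(validate_tool_call(sample, "get_weather", {"city"}))
```

Running the same prompt through each host and collecting the returned problem lists gives a quick drift report before an agent loop depends on any one provider.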
“Fast and fluent, never preachy — but the 'wow' moments belong to Claude and GPT-5. It's a solid backstage model, not a star.”
In daily use Maverick is competent and pleasant: fluent prose, sensible refusal rates, and sub-second first-token latency on Groq or Cerebras. Where it falls short is the top end: nuanced humor, emotional intelligence, and genuinely creative writing trail Claude and GPT-5 by a clear notch. For apps where the model sits behind a workflow, users never feel the gap. For a flagship chat product where the model IS the experience, the ceiling shows within an extended session. Long, context-heavy conversations also expose Maverick's weaker long-context comprehension.
“Behemoth never shipped, the LMArena number came from a checkpoint you can't download, and 1M context is marketing once you test comprehension.”
Adversarially, Maverick has three claims to distrust. First, the splashy LMArena ranking used an unreleased "experimental chat" model; the actual Instruct weights rank materially lower, a textbook benchmark-presentation gap. Second, the 1M (and Scout's 10M) context is real for needle retrieval but Fiction.LiveBench ranks Llama 4 near the bottom on genuine long-context comprehension, so the headline number oversells usable capability. Third, the teacher model Behemoth (~2T params) was announced in April 2025 and still has not shipped amid reported capability concerns, so the family's top end is vaporware. Real weaknesses: no reasoning mode, weak multi-file coding (Aider Polyglot 15.6%), and an August 2024 knowledge cutoff. The honest pitch is "good, cheap, open", not "frontier."
- Self-hosted or sovereign-cloud agent and RAG stacks where data must not leave the customer's infrastructure; Maverick is the strongest open option that runs on one node.
- Multilingual content pipelines across 100+ languages at high throughput.
- Document- and chart-heavy vision workflows where DocVQA/ChartQA-class accuracy matters.
- Cost-sensitive backends where a closed frontier API's per-token price or licensing is a non-starter and "good enough for 80% of production" is the bar.
There is no single Meta price; representative inference is ~$0.15–$0.59 input and ~$0.60–$2.19 output per 1M tokens depending on provider (DeepInfra cheapest, Together's output highest). Self-hosting trades per-token cost for GPU capex.
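To see what that price range means for a monthly bill, the arithmetic can be sketched as follows. The per-1M-token prices are the range endpoints quoted above; the workload (100M input and 20M output tokens per month) is an assumed example, and the high pair combines the top input and top output prices, which may not belong to a single provider.

```python
def monthly_cost(input_tokens_m: float, output_tokens_m: float,
                 price_in: float, price_out: float) -> float:
    """Monthly bill in dollars; token counts in millions, prices per 1M tokens."""
    return input_tokens_m * price_in + output_tokens_m * price_out


# Assumed workload: 100M input tokens and 20M output tokens per month.
low = monthly_cost(100, 20, price_in=0.15, price_out=0.60)
high = monthly_cost(100, 20, price_in=0.59, price_out=2.19)
print(f"low end: ${low:.2f}, high end: ${high:.2f}, spread: {high / low:.1f}x")
```

Because input and output are priced separately, the spread you actually see depends on your input-to-output ratio, so it is worth rerunning the numbers with your own token mix.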
Yes. Download the FP8 weights from Hugging Face and serve on an 8xH100 node, or quantize to INT4 (~240GB VRAM) to fit smaller hardware.
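A rough sizing check shows why those hardware choices work. The sketch below estimates weights-only memory from the 400B total parameter count and ignores KV cache, activations, and runtime overhead, so real requirements are higher (which may be why the INT4 figure above is quoted as ~240GB rather than the weights-only number). The 80 GB per H100 is an assumption about the GPU variant, and both helper names are hypothetical.

```python
def weight_memory_gb(total_params_billion: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_param / 8


def fits_on_node(required_gb: float, gpus: int, gb_per_gpu: float) -> bool:
    """True if the requirement is within the node's total GPU memory."""
    return required_gb <= gpus * gb_per_gpu


TOTAL_PARAMS_B = 400  # total parameters (all experts), per the review

fp8_gb = weight_memory_gb(TOTAL_PARAMS_B, bits_per_param=8)
int4_gb = weight_memory_gb(TOTAL_PARAMS_B, bits_per_param=4)

print(f"FP8 weights:  {fp8_gb:.0f} GB")
print(f"INT4 weights: {int4_gb:.0f} GB")
# Assumes 80 GB per H100; other variants have different capacities.
print("FP8 fits on 8xH100:", fits_on_node(fp8_gb, gpus=8, gb_per_gpu=80))
```

Note that sizing uses the full 400B parameters, not the 17B active per token: every expert must be resident in memory even though only a few run for any given token.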
Yes — image understanding is native (early fusion), not a bolted-on adapter, with strong DocVQA/ChartQA scores. It does not generate images.
For retrieval, largely yes; for reasoning across the full window, no — comprehension degrades well before the ceiling, so chunk and test on your workload.
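One simple way to chunk, sketched here under assumptions, is a sliding window with overlap so that facts straddling a boundary appear in two chunks. `chunk_words` is a hypothetical helper that splits on whitespace as a stand-in for real tokenization; the sizes are arbitrary starting points to tune against your own evals.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list:
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


document = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(document, chunk_size=4, overlap=1):
    print(chunk)
```

Testing on your workload then means asking the same questions at several chunk sizes and comparing answer quality, rather than assuming the full window is usable.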
The weights have no built-in moderation; add Llama Guard 4 / Prompt Guard 2. Compliance certifications come from your host or your own infra, not the model.
The Llama 4 Community License allows commercial use but requires a separate Meta license above 700M MAU and forbids training non-Llama models on its outputs.
Maverick for higher quality and 128 experts on a node; Scout for single-GPU deploy and the 10M context. Both share the same 17B-active speed profile.
Does not train on API inputs by default
Last verified 2026-05-27