by Meta · Llama 4 family · best for single-GPU long-context open-weights deploy
Llama 4 Scout is the small, deployable member of Meta's Llama 4 herd, released April 5, 2025. It is a 109B-total / 17B-active Mixture-of-Experts model (16 experts), natively multimodal, with a headline 10,000,000-token context window, the largest of any openly available model at release. The one-sentence buyer takeaway: it is the only model in 2026 that combines a 10M context, native vision, and single-GPU deployability, making it the obvious open-weights pick when context size and on-prem economics matter more than peak intelligence.

- Provider: Meta
- Release: 2025-04-05 (GA, open weights)
- Status: GA, latest in its tier (no successor shipped as of May 2026)
- Context: 10,000,000 tokens (256K native pre-training, extended via iRoPE)
- Max output: 8,192 tokens (provider-dependent)
- Modalities: text + image in, text out
- Knowledge cutoff: August 2024
- Headline price: ~$0.08–$0.11 in / ~$0.30–$0.34 out per 1M tokens
| Benchmark | Score | Source | Date |
|---|---|---|---|
| MMLU | 79.6% | Meta / llm-stats aggregator | 2025-04-05 |
| MMMU | 69.4% | Meta Llama 4 model card | 2025-04-05 |
| MATH-500 | 50.3% | Meta (MATH-Hard) | 2025-04-05 |
| MMLU-Pro | 74.3% | Meta Llama 4 model card | 2025-04-05 |
| HumanEval | 82% | llm-stats aggregator (approx) | 2025-04-05 |
| GPQA Diamond | 57.2% | Meta Llama 4 model card | 2025-04-05 |
| LiveCodeBench | 32.8% | community aggregator | 2025-04-10 |
| Artificial Analysis Index | 14 | Artificial Analysis | 2026-05 |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“The easiest 'go open weights' call I can make: one H100, 10M context, every cloud carries it. Just test the long-context cliff before you bet on it.”
Scout is the lowest-friction open-weights adoption in 2026. The deployment story is genuinely simple — pull from Hugging Face, quantize to INT4, run on a single GPU — and every major cloud and inference provider carries it, so vendor risk is minimal. For organizations that need on-prem data control without frontier capability, it is close to ideal. The strategic optionality is high and the risk surface small. The one decision-maker-level caveat is the long-context quality cliff: the 10M number is real for retrieval but not for deep reasoning, so do not architect around 10M of usable comprehension without testing your workload.
“Scout owns 'biggest context that fits on one GPU.' That's a defensible, specific square — even if rivals are closing in on quality.”
On positioning, Scout's wedge is the unique intersection of 10M context, native vision, and single-GPU deploy; no other open model in 2026 offers all three. Against closed long-context models (Gemini Flash) it is the open-weights answer; against other open small models (Qwen 3 30B-A3B, Mistral Small 3) it wins on context and multimodality but loses on some reasoning. Market timing rides the same sovereignty/cost tailwinds as Maverick. The durability risk is real (competitors are catching up on context, and the comprehension cliff undercuts the headline), but the deployability story keeps it relevant.
“Cheapest serious open-weights model on the market, and the 10M context can delete an entire RAG pipeline's cost. The math is unambiguous at scale.”
Scout is the strongest pure unit-economics story in the Meta lineup. DeepInfra runs it at $0.08/$0.30; self-hosted on a single rented H100 at $2–3/hour, it beats any closed API by 20–100x at volume. The hidden lever is the 10M context: workloads that previously required a vector DB, embedding compute, and a re-ranker can sometimes collapse into a single Scout call, removing whole line items from the bill — though the prefill cost of a truly huge context must be modeled (it is not free). Above ~100M tokens/month, self-hosted Scout dominates on $/Mtok.
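The self-hosting comparison comes down to a fixed monthly GPU bill divided by monthly token volume. The sketch below is one rough way to model it, not a pricing tool: the function names are illustrative, the GPU is assumed to be rented around the clock (730 hours per month), and throughput limits, prefill cost, and engineering time are all ignored.

```python
HOURS_PER_MONTH = 730  # assumption: the GPU is rented 24/7 for an average month


def self_host_monthly_usd(gpu_hourly_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Fixed monthly cost of one always-on rented GPU."""
    return gpu_hourly_usd * hours


def self_host_usd_per_mtok(gpu_hourly_usd: float, tokens_per_month: float) -> float:
    """Effective self-hosted price per 1M tokens at a given monthly volume."""
    return self_host_monthly_usd(gpu_hourly_usd) / (tokens_per_month / 1e6)


def breakeven_tokens_per_month(gpu_hourly_usd: float, api_usd_per_mtok: float) -> float:
    """Monthly token volume above which self-hosting undercuts an API price."""
    return self_host_monthly_usd(gpu_hourly_usd) / api_usd_per_mtok * 1e6


if __name__ == "__main__":
    # $2.50/hour sits inside the $2-3/hour H100 range quoted above.
    print(self_host_monthly_usd(2.5))
    print(self_host_usd_per_mtok(2.5, 100e6))
    print(breakeven_tokens_per_month(2.5, 0.30))
```

Under these assumptions the fixed bill is $1,825 per month, so the crossover volume moves a lot depending on which API price is being replaced. Run the numbers against the specific API and utilization you expect rather than relying on a single threshold.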
“The smallest Llama 4 with the biggest party trick. Fine-tune on one box, serve on one GPU — just keep evals for the long-context cliff.”
Builders love Scout because it is the most accessible Llama 4: fine-tuning fits on a single 8xA100 box, inference on one H100, and the 10M context lets you skip RAG indexing for small-to-mid corpora and just dump everything in. Native support across Transformers, vLLM, llama.cpp, Ollama, SGLang, and MLX makes local iteration fast. The catches are familiar: provider chat-template inconsistencies, the long-context quality cliff (write evals, do not trust the 10M number blindly), and tool-use that trails Maverick on multi-step loops. Genuinely fun and forgiving to build with.
“It quietly disappears into the background — fast, polite, multilingual. Not the model you pick when the chatbot itself is the show.”
For consumer-facing chat, Scout is the model that gets out of the way: good latency (sub-second on Groq), sensible refusal rates, clean casual conversation. It is not a flagship-personality model (Claude, GPT-5, or even Maverick feel smarter in extended use), but for embedded SaaS chat, support assistants, and in-app coaches where the model is one feature among many, users get a reliable, polite, multilingual helper. As with all Llama 4 models, long, context-heavy sessions surface the comprehension degradation.
“Ten million tokens is the number on the box. Fiction.LiveBench puts Llama 4 near the bottom on actually understanding long text — buy the GPU story, not the 10M.”
Adversarially, Scout's headline is its biggest overclaim. The 10M context is real for needle retrieval and demos, but independent comprehension evals (Fiction.LiveBench, RULER-style) rank Llama 4 near the bottom — so "10M context" oversells usable reasoning by a wide margin. Same family caveats apply: no reasoning mode, mid-pack coding and math, a 2024 cutoff, and a teacher model (Behemoth) that never shipped. The defensible, honest pitch is "the cheapest open multimodal model that fits on one GPU" — a genuinely good thing. The "10M context champion" framing should be treated as marketing until you have tested comprehension on your own data.
Long-document RAG over legal contracts, technical specs, and multi-PDF research where dumping the corpus into one prompt beats building a retrieval pipeline. Whole-codebase analysis where the full repo fits in context. Single-GPU self-hosted chatbots and assistants where Maverick's 8xH100 node is overkill. Edge and sovereign deployments — a quantized Scout runs on one modern workstation GPU, keeping data fully on-prem.
No single Meta price; representative inference is ~$0.08–$0.11 input and ~$0.30–$0.34 output per 1M tokens (DeepInfra/Groq cheapest). Self-hosting on one rented H100 runs $2–3/hour.
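For hosted inference, the per-call bill is simple arithmetic on the input and output token counts. A minimal sketch using the representative prices above; `call_cost_usd` is an illustrative helper, and it assumes the provider bills linearly per token and accepts a prompt of the size you send.

```python
def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    in_price_per_mtok: float,
    out_price_per_mtok: float,
) -> float:
    """Cost of one request given per-1M-token input and output prices."""
    return (
        input_tokens / 1e6 * in_price_per_mtok
        + output_tokens / 1e6 * out_price_per_mtok
    )


if __name__ == "__main__":
    # A 1M-token prompt with a 2,000-token answer, at both ends of the quoted range.
    low = call_cost_usd(1_000_000, 2_000, 0.08, 0.30)
    high = call_cost_usd(1_000_000, 2_000, 0.11, 0.34)
    print(f"${low:.4f} to ${high:.4f} per call")
```

Because input dominates at long context, a workload that resends a huge prompt on every turn pays the input price each time unless the provider offers prompt caching.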
Yes — INT4 fits the 109B params in ~55–60GB, so a single H100 80GB serves it; aggressive GGUF quants run on 24–48GB cards for light use.
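The memory figure follows from parameter count times bits per weight. The sketch below is a back-of-the-envelope check only: the helper names are illustrative, it counts weights alone (no KV cache, activations, or quantization metadata), and it assumes all 109B parameters stay resident in GPU memory even though only 17B are active per token, which is the usual case for MoE serving.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return total_params * bits_per_weight / 8 / 1e9


def fits_on_gpu(
    total_params: float,
    bits_per_weight: float,
    gpu_gb: float,
    headroom_gb: float = 0.0,
) -> bool:
    """True if the weights plus a chosen headroom fit in gpu_gb of memory."""
    return weight_memory_gb(total_params, bits_per_weight) + headroom_gb <= gpu_gb


if __name__ == "__main__":
    scout_params = 109e9  # total parameters, from the spec above
    print(weight_memory_gb(scout_params, 4))    # INT4 weights
    print(weight_memory_gb(scout_params, 16))   # 16-bit weights
    print(fits_on_gpu(scout_params, 4, gpu_gb=80, headroom_gb=10))
```

INT4 weights alone come to 54.5 GB, which lines up with the ~55–60GB quoted once overhead is added. The `headroom_gb` argument is where you would budget for KV cache, which grows with context length and can become the real constraint at long prompts.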
For retrieval, largely; for reasoning across the full window, no — comprehension degrades well before 10M. Chunk and test on your data.
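One way to "test on your data" is to ask the same questions at increasing context lengths and watch where accuracy drops. The harness below is a minimal sketch of that idea, not a substitute for benchmarks such as Fiction.LiveBench or RULER: `ask` is a hypothetical callable you supply that sends a prompt to your deployment and returns text, lengths are counted in whitespace-separated words rather than tokens, and scoring is a crude substring match.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (evidence sentences, question, expected answer substring)
Case = Tuple[List[str], str, str]


def build_prompt(evidence: Sequence[str], question: str, filler: str, target_words: int) -> str:
    """Spread evidence sentences evenly through about target_words of filler."""
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")
    repeats = target_words // len(filler_words) + 1
    body = (filler_words * repeats)[:target_words]
    step = len(body) // (len(evidence) + 1)
    # Insert from the last position backwards so earlier offsets stay valid.
    for i in range(len(evidence), 0, -1):
        body[i * step:i * step] = evidence[i - 1].split()
    return " ".join(body) + "\n\nQuestion: " + question


def accuracy_by_length(
    ask: Callable[[str], str],
    cases: Sequence[Case],
    lengths: Sequence[int],
    filler: str,
) -> Dict[int, float]:
    """Fraction of cases answered correctly at each padded context length."""
    results: Dict[int, float] = {}
    for n in lengths:
        hits = 0
        for evidence, question, expected in cases:
            answer = ask(build_prompt(evidence, question, filler, n))
            hits += expected.lower() in answer.lower()
        results[n] = hits / len(cases)
    return results
```

A single planted fact only exercises retrieval. To probe comprehension, use cases whose answer requires combining several evidence sentences, and replace the repeated filler with real documents from your own corpus.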
Yes, natively (early fusion) with strong DocVQA/ChartQA; it does not generate images.
Scout is smaller, cheaper, single-GPU, and has a bigger context; Maverick is smarter with 128 experts but needs a full node. Both run at ~17B-active speed.
No built-in moderation; add Llama Guard 4 / Prompt Guard 2. Certifications come from your host/infra, not the model.
Commercial use allowed; separate Meta license required above 700M MAU; cannot train non-Llama models on its outputs.
Does not train on API inputs by default
Last verified 2026-05-27