Llama 4 Scout

GA · Latest Scout

by Meta · Llama 4 family · best for single-GPU long-context open-weights deploy

Open-Weights · Multimodal · Cost-Optimized · Long-Context · Edge / On-Device
AI Panel Score: 7.5 · Value: 9.5/10

Llama 4 Scout is the small, deployable member of Meta's Llama 4 herd, released April 5, 2025. It is a 109B-total / 17B-active Mixture-of-Experts model (16 experts), natively multimodal, with a headline 10,000,000-token context window, the largest of any openly available model at release. The one-sentence buyer takeaway: it is the only model in 2026 that combines a 10M context, native vision, and single-GPU deployability, making it the obvious open-weights pick when context size and on-prem economics matter more than peak intelligence.

  • Provider: Meta
  • Release: 2025-04-05 (GA, open weights)
  • Status: GA, latest in its tier (no successor shipped as of May 2026)
  • Context: 10,000,000 tokens (256K native pre-training, extended via iRoPE)
  • Max output: 8,192 tokens (provider-dependent)
  • Modalities: text + image in, text out
  • Knowledge cutoff: August 2024
  • Headline price: ~$0.08–$0.11 in / ~$0.30–$0.34 out per 1M tokens

What's new

  • 10M-token context window — the largest of any openly available model at release.
  • Fits on a single H100-class GPU with INT4 quantization, making it the most deployable Llama 4 variant.
  • First Llama small/mid-tier to ship as MoE (16 experts) rather than dense, sharing the same 17B-active speed profile as Maverick.
  • Natively multimodal via the same early-fusion vision tower architecture as Maverick.

Benchmarks

Benchmark | Score | Source | Date
MMLU | 79.6% | Meta / llm-stats aggregator | 2025-04-05
MMMU | 69.4% | Meta Llama 4 model card | 2025-04-05
MATH-500 | 50.3% | Meta (MATH-Hard) | 2025-04-05
MMLU-Pro | 74.3% | Meta Llama 4 model card | 2025-04-05
HumanEval | 82% | llm-stats aggregator (approx) | 2025-04-05
GPQA Diamond | 57.2% | Meta Llama 4 model card | 2025-04-05
LiveCodeBench | 32.8% | community aggregator | 2025-04-10
Artificial Analysis Index | 14 | Artificial Analysis | 2026-05

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker · 8.5/10
The easiest 'go open weights' call I can make: one H100, 10M context, every cloud carries it. Just test the long-context cliff before you bet on it.

Scout is the lowest-friction open-weights adoption in 2026. The deployment story is genuinely simple — pull from Hugging Face, quantize to INT4, run on a single GPU — and every major cloud and inference provider carries it, so vendor risk is minimal. For organizations that need on-prem data control without frontier capability, it is close to ideal. The strategic optionality is high and the risk surface small. The one decision-maker-level caveat is the long-context quality cliff: the 10M number is real for retrieval but not for deep reasoning, so do not architect around 10M of usable comprehension without testing your workload.

Strategic Fit 9 · Vendor Risk 9 · Roadmap Confidence 6
Pros
  • trivially deployable
  • multi-cloud
  • sovereign
  • 10M context
Cons
  • long-context cliff
  • uncertain Llama roadmap
Right for: on-prem/sovereign teams needing big context cheaply
Avoid if: you need frontier reasoning or guaranteed comprehension at extreme context
Domain Strategist · 7.5/10
Scout owns 'biggest context that fits on one GPU.' That's a defensible, specific square — even if rivals are closing in on quality.

On positioning, Scout's wedge is the unique intersection of 10M context, native vision, and single-GPU deployment; no other open model in 2026 offers all three. Against closed long-context models (Gemini Flash) it is the open-weights answer; against other open small models (Qwen 3 30B-A3B, Mistral Small 3) it wins on context and multimodality but loses on some reasoning. Market timing rides the same sovereignty/cost tailwinds as Maverick. The durability risk is real (competitors are catching up on context, and the comprehension cliff undercuts the headline), but the deployability story keeps it relevant.

Competitive Positioning 8 · Differentiation 8 · Market Timing 7
Pros
  • unique context+vision+single-GPU combo
Cons
  • comprehension cliff dents the headline
  • rivals closing
Right for: teams who genuinely need huge context cheaply
Avoid if: you only need 128K — cheaper dense models suffice
Finance Lead · 9/10
Cheapest serious open-weights model on the market, and the 10M context can delete an entire RAG pipeline's cost. The math is unambiguous at scale.

Scout is the strongest pure unit-economics story in the Meta lineup. DeepInfra runs it at $0.08/$0.30; self-hosted on a single rented H100 at $2–3/hour, it beats any closed API by 20–100x at volume. The hidden lever is the 10M context: workloads that previously required a vector DB, embedding compute, and a re-ranker can sometimes collapse into a single Scout call, removing whole line items from the bill — though the prefill cost of a truly huge context must be modeled (it is not free). Above ~100M tokens/month, self-hosted Scout dominates on $/Mtok.
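The self-host versus API comparison above comes down to simple arithmetic, sketched below. This is a rough model, not a pricing tool: the function names are illustrative, the 730 hours/month figure assumes an always-on GPU, the $10/Mtok closed-API price is a hypothetical comparison point, and the 80/20 input/output split is an assumption. It also ignores whether one GPU can actually serve the break-even volume.

```python
HOURS_PER_MONTH = 730  # assumption: always-on GPU, 8760 hours / 12 months


def monthly_gpu_cost(hourly_rate_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Rental cost of one GPU for a month at a flat hourly rate."""
    return hourly_rate_usd * hours


def blended_price(input_price: float, output_price: float, input_share: float) -> float:
    """Average $/Mtok when `input_share` of all tokens are input tokens."""
    return input_price * input_share + output_price * (1.0 - input_share)


def break_even_mtok(hourly_rate_usd: float, api_price_per_mtok: float) -> float:
    """Monthly volume (millions of tokens) where GPU rental equals the API bill."""
    if api_price_per_mtok <= 0:
        raise ValueError("api_price_per_mtok must be positive")
    return monthly_gpu_cost(hourly_rate_usd) / api_price_per_mtok


# Hypothetical closed API at $10/Mtok vs. a $2.50/hour H100.
print(break_even_mtok(2.5, 10.0))

# Managed Scout at $0.08 in / $0.30 out, assuming 80% of tokens are input.
print(break_even_mtok(2.5, blended_price(0.08, 0.30, 0.8)))
```

The break-even volume moves by orders of magnitude with the comparison price, so it is worth plugging in the specific API you would otherwise use rather than relying on a single threshold.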

Cost Efficiency 10 · Pricing Transparency 8 · Value per Dollar 10
Pros
  • cheapest serious open multimodal
  • single-GPU
  • context can replace RAG infra
Cons
  • huge-context prefill cost is real
  • needs utilization to amortize self-host
Right for: high-volume, long-context, cost-sensitive workloads
Avoid if: low volume where managed simplicity wins
Domain Practitioner · 7.5/10
The smallest Llama 4 with the biggest party trick. Fine-tune on one box, serve on one GPU — just keep evals for the long-context cliff.

Builders love Scout because it is the most accessible Llama 4: fine-tuning fits on a single 8xA100 box, inference on one H100, and the 10M context lets you skip RAG indexing for small-to-mid corpora and just dump everything in. Native support across Transformers, vLLM, llama.cpp, Ollama, SGLang, and MLX makes local iteration fast. The catches are familiar: provider chat-template inconsistencies, the long-context quality cliff (write evals, do not trust the 10M number blindly), and tool-use that trails Maverick on multi-step loops. Genuinely fun and forgiving to build with.

API Ergonomics 8 · Tool/Agent Support 7 · Reliability 8
Pros
  • single-GPU serve
  • single-box fine-tune
  • huge context simplifies RAG
Cons
  • long-context cliff
  • provider template drift
  • weaker multi-step tools
Right for: builders who own their stack and want big context cheaply
Avoid if: you need rock-solid agent chains out of the box
Power User · 6.5/10
It quietly disappears into the background — fast, polite, multilingual. Not the model you pick when the chatbot itself is the show.

For consumer-facing chat Scout is the model that gets out of the way: good latency (sub-second on Groq), sensible refusal rates, clean casual conversation. It is not a flagship-personality model (Claude, GPT-5, or even Maverick feel smarter in extended use), but for embedded SaaS chat, support assistants, and in-app coaches where the model is one feature among many, users get a reliable, polite, multilingual helper. As with all Llama 4 models, long, context-heavy sessions surface the comprehension degradation.

Output Quality 6 · Speed 8 · Everyday Usefulness 7
Pros
  • fast
  • polite
  • multilingual
  • reliable backstage
Cons
  • no personality wow
  • long-context comprehension fades
Right for: embedded assistants
Avoid if: the chatbot is the product
Skeptic · 5.5/10
Ten million tokens is the number on the box. Fiction.LiveBench puts Llama 4 near the bottom on actually understanding long text — buy the GPU story, not the 10M.

Adversarially, Scout's headline is its biggest overclaim. The 10M context is real for needle retrieval and demos, but independent comprehension evals (Fiction.LiveBench, RULER-style) rank Llama 4 near the bottom — so "10M context" oversells usable reasoning by a wide margin. Same family caveats apply: no reasoning mode, mid-pack coding and math, a 2024 cutoff, and a teacher model (Behemoth) that never shipped. The defensible, honest pitch is "the cheapest open multimodal model that fits on one GPU" — a genuinely good thing. The "10M context champion" framing should be treated as marketing until you have tested comprehension on your own data.

Claim Accuracy 5 · Weakness Severity 6 · Hype vs Reality 5
Pros
  • single-GPU and cheap is genuinely true
  • retrieval works
Cons
  • 10M comprehension oversold
  • mid-pack reasoning
  • Behemoth unshipped
Right for: skeptics who value it as a cheap single-GPU model
Avoid if: you believe the 10M headline at face value

Strengths

  • 10M-token context window — unmatched in open weights at release.
  • Single-GPU deployable (one H100 at INT4) — runs on a ~$30K box or a rented GPU.
  • Native vision; strong DocVQA (91.6) and ChartQA (85.3) for the size.
  • Cheapest serious open-weights multimodal model on most providers (~$0.08 in).
  • Day-zero availability on Groq, Together, Fireworks, DeepInfra, Bedrock, Vertex.

Limitations

  • Long-context comprehension degrades well before 10M tokens (Fiction.LiveBench / RULER-style evals); the headline is retrieval capacity, not usable reasoning depth.
  • Trails Maverick by ~6 points on MMLU-Pro and ~12 on GPQA Diamond — not a frontier reasoner.
  • No native reasoning mode; loses to DeepSeek R1, o-series, Claude extended thinking on hard math.
  • 8K output ceiling on most managed providers; filling the 10M context spikes prefill cost and latency.

Best use cases

  • Long-document question answering over legal contracts, technical specs, and multi-PDF research, where dumping the corpus into one prompt beats building a retrieval pipeline.
  • Whole-codebase analysis where the full repo fits in context.
  • Single-GPU self-hosted chatbots and assistants where Maverick's 8xH100 node is overkill.
  • Edge and sovereign deployments: a quantized Scout runs on one modern workstation GPU, keeping data fully on-prem.

Buyer questions

What does Scout cost?

No single Meta price; representative inference is ~$0.08–$0.11 input and ~$0.30–$0.34 output per 1M tokens (DeepInfra/Groq cheapest). Self-hosting on one rented H100 runs $2–3/hour.
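To turn the per-1M-token prices into a per-request figure, a minimal sketch follows. The function name is illustrative and the defaults use the low end of the range quoted above ($0.08 in / $0.30 out); it covers token billing only and says nothing about the latency of prefilling a very large prompt.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 0.08,   # low end of the quoted input range
    output_price_per_mtok: float = 0.30,  # low end of the quoted output range
) -> float:
    """Token cost of one request, given prices per 1M tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# A request that fills the whole 10M window and returns the 8,192-token maximum.
full_window = request_cost_usd(10_000_000, 8_192)

# A short prompt of 20,000 tokens with a 1,000-token answer, for comparison.
short_prompt = request_cost_usd(20_000, 1_000)

print(f"full window: ${full_window:.4f}, short prompt: ${short_prompt:.4f}")
```

At these prices a full-window request costs roughly $0.80 in input tokens alone, which is why the prefill cost of huge contexts needs to be modeled per call rather than per month.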

Can it really run on one GPU?

Yes — INT4 fits the 109B params in ~55–60GB, so a single H100 80GB serves it; aggressive GGUF quants run on 24–48GB cards for light use.
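The single-GPU claim can be sanity-checked with weight-only arithmetic, sketched below. The helper names and the 10 GB overhead allowance (KV cache, activations, runtime buffers) are assumptions for illustration; real footprints also include quantization scales and depend on the serving stack and context length.

```python
def weight_memory_gb(total_params_billions: float, bits_per_param: float) -> float:
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


def fits_on_gpu(
    total_params_billions: float,
    bits_per_param: float,
    gpu_memory_gb: float,
    overhead_gb: float = 10.0,  # assumed allowance, not a measured value
) -> bool:
    """True if weights plus the assumed overhead fit in GPU memory."""
    needed = weight_memory_gb(total_params_billions, bits_per_param) + overhead_gb
    return needed <= gpu_memory_gb


print(weight_memory_gb(109, 4))    # INT4 weights for 109B parameters
print(weight_memory_gb(109, 16))   # BF16 weights for comparison
print(fits_on_gpu(109, 4, 80))     # one 80 GB H100 at INT4
print(fits_on_gpu(109, 16, 80))    # one 80 GB H100 at BF16
```

All 109B parameters must be resident even though only 17B are active per token, so the total count, not the active count, drives the memory requirement.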

Is the 10M context usable?

For retrieval, largely; for reasoning across the full window, no — comprehension degrades well before 10M. Chunk and test on your data.
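One way to test on your own data is to hold a relevant passage fixed, pad it with unrelated text to increasing sizes, and watch where answer accuracy drops. The sketch below assumes a hypothetical `ask_model` callable that takes a prompt string and returns the model's answer; it uses word counts as a crude stand-in for tokens and substring matching as a crude scorer. To probe comprehension rather than lookup, the questions should require reasoning over the passage.

```python
from typing import Callable, Dict, Sequence, Tuple


def pad_to_size(passage: str, filler: str, total_words: int, depth: float) -> str:
    """Embed `passage` in repeated `filler` to reach roughly `total_words` words.

    `depth` is the relative position of the passage: 0.0 = start, 1.0 = end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")
    passage_words = passage.split()
    pad_count = max(total_words - len(passage_words), 0)
    repeats = pad_count // len(filler_words) + 1
    padding = (filler_words * repeats)[:pad_count]
    cut = int(len(padding) * depth)
    return " ".join(padding[:cut] + passage_words + padding[cut:])


def accuracy_by_size(
    ask_model: Callable[[str], str],          # hypothetical: prompt in, answer out
    passage: str,
    filler: str,
    qa_pairs: Sequence[Tuple[str, str]],      # (question, expected substring)
    sizes: Sequence[int],
    depths: Sequence[float] = (0.0, 0.5, 1.0),
) -> Dict[int, float]:
    """Fraction of correct answers at each context size, averaged over depths."""
    if not qa_pairs or not depths:
        raise ValueError("qa_pairs and depths must be non-empty")
    results: Dict[int, float] = {}
    for size in sizes:
        hits = 0
        trials = 0
        for depth in depths:
            context = pad_to_size(passage, filler, size, depth)
            for question, expected in qa_pairs:
                answer = ask_model(f"{context}\n\nQuestion: {question}")
                hits += expected.lower() in answer.lower()
                trials += 1
        results[size] = hits / trials
    return results
```

Running this across a few sizes gives a rough accuracy curve for your workload; the size where it starts to fall is a more useful planning number than the advertised window.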

Does it do vision?

Yes, natively (early fusion) with strong DocVQA/ChartQA; it does not generate images.

How does it compare to Maverick?

Scout is smaller, cheaper, single-GPU, and has a bigger context; Maverick is smarter with 128 experts but needs a full node. Both run at ~17B-active speed.

What about safety/compliance?

No built-in moderation; add Llama Guard 4 / Prompt Guard 2. Certifications come from your host/infra, not the model.

Any license limits?

Commercial use allowed; separate Meta license required above 700M MAU; cannot train non-Llama models on its outputs.

Comparable models

Llama 4 Maverick — same family and 17B-active speed; smarter (more experts) but needs an 8xH100 node and offers a smaller 1M context. Scout wins on single-GPU deploy and context length.
Llama 3.3 70B — dense, text-only, 128K context; similar text quality but no vision and a far smaller window. Scout is usually the upgrade for new builds.
Qwen 3 30B-A3B / Mistral Small 3 — open small models competitive on text at a similar price; Scout wins on context length and native vision, may lose on specific reasoning benchmarks.

Model specs

Input price: $0.10 / Mtok
Output price: $0.34 / Mtok
Cached input: not listed
Batch (in/out): not listed
Context window: 10M tokens
Max output: 8K tokens
Knowledge cutoff: 2024-08
Released: 2025-04-05
Modalities: text, image → text
Output speed: ~106.1 tok/s
License: Open weights (Llama-4-Community)
Clouds: Bedrock, Vertex AI, Azure AI Foundry, GCP, OCI, IBM watsonx

Does not train on API inputs by default

Last verified 2026-05-27