What does Scout cost?

No single Meta price; representative inference is ~$0.08–$0.11 input and ~$0.30–$0.34 output per 1M tokens (DeepInfra/Groq cheapest). Self-hosting on one rented H100 runs $2–3/hour.

Can it really run on one GPU?

Yes — INT4 fits the 109B params in ~55–60GB, so a single H100 80GB serves it; aggressive GGUF quants run on 24–48GB cards for light use.

Is the 10M context usable?

For retrieval, largely; for reasoning across the full window, no — comprehension degrades well before 10M. Chunk and test on your data.
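One way to act on the "chunk and test" advice is to split a long tokenized document into overlapping windows and evaluate each window size on your own data. The sketch below is illustrative only: `chunk_tokens`, the chunk size, and the overlap are hypothetical names and values, not anything Meta or a provider prescribes.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Toy example: 10 "tokens", windows of 4 with 1 token of overlap.
for window in chunk_tokens(list(range(10)), chunk_size=4, overlap=1):
    print(window)
```

Running the same questions against several chunk sizes is a cheap way to find where answer quality starts to drop for your documents.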

Is it multimodal?

Yes, natively (early fusion) with strong DocVQA/ChartQA; it does not generate images.

How does it compare to Maverick?

Scout is smaller, cheaper, single-GPU, and has a bigger context; Maverick is smarter with 128 experts but needs a full node. Both run at ~17B-active speed.

What about safety/compliance?

No built-in moderation; add Llama Guard 4 / Prompt Guard 2. Certifications come from your host/infra, not the model.

What does the license allow?

Commercial use allowed; separate Meta license required above 700M MAU; cannot train non-Llama models on its outputs.

Llama 4 Scout Review — Benchmarks, Pricing & AI Panel Verdict

Benchmark	Score	Source	Date
MMLU	79.6%	Meta / llm-stats aggregator	2025-04-05
MMMU	69.4%	Meta Llama 4 model card	2025-04-05
MATH (Hard)	50.3%	Meta	2025-04-05
MMLU-Pro	74.3%	Meta Llama 4 model card	2025-04-05
HumanEval	82%	llm-stats aggregator (approx)	2025-04-05
GPQA Diamond	57.2%	Meta Llama 4 model card	2025-04-05
LiveCodeBench	32.8%	community aggregator	2025-04-10
Artificial Analysis Index	14	Artificial Analysis	2026-05

Architecture

Scout is a sparse MoE transformer: 109B total parameters across 16 routed experts plus a shared expert, ~17B active per token. It uses the iRoPE attention scheme — chunked RoPE local attention (8K chunks) in three of four layers and a full-context NoPE (no positional embedding) layer every fourth, which is what makes the 10M window mechanically possible. Scout adds QK-normalization (RMS norm of query/key states, no learnable params) and temperature-scaled softmax in NoPE layers to preserve attention over very long sequences. Vision is early-fused. Meta discloses total/active params, expert count, the iRoPE design, and a training-token scale of up to ~40T; exact layer counts, compute, and full data recipe are not published. The released checkpoint is Instruct-tuned and is the only Llama 4 variant designed to fit a single server-grade GPU.
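The attention layout described above can be sketched in a few lines. This is a toy model of the idea, not Meta's implementation: the layer count of 8, the choice of the fourth layer in each group as the NoPE layer, the `eps` value, and all function names are assumptions made for illustration.

```python
import math


def layer_kinds(num_layers, nope_every=4):
    """Label layers: every nope_every-th layer is global NoPE, the rest chunked RoPE."""
    return ["nope_global" if (i + 1) % nope_every == 0 else "rope_chunked"
            for i in range(num_layers)]


def can_attend(query_pos, key_pos, kind, chunk_size=8192):
    """Causal visibility rule for one query/key position pair."""
    if key_pos > query_pos:
        return False  # causal: never look ahead
    if kind == "nope_global":
        return True  # full-context layer sees every earlier token
    # Chunked local layer: only keys in the same chunk as the query are visible.
    return query_pos // chunk_size == key_pos // chunk_size


def rms_normalize(vec, eps=1e-6):
    """RMS-normalize a query or key vector with no learnable scale."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    return [x / math.sqrt(mean_sq + eps) for x in vec]


print(layer_kinds(8))
# Position 8192 starts a new 8K chunk, so it cannot see 8191 in a local layer.
print(can_attend(8192, 8191, "rope_chunked"), can_attend(8192, 8191, "nope_global"))
```

The point of the sketch is that local layers keep per-token attention cost bounded by the chunk size, while the periodic global layers are the only path by which information crosses chunk boundaries.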

Capabilities

Scout is a competent generalist with one standout trick: context length. Text reasoning (5.5) is roughly on par with Llama 3.1 70B: MMLU 79.6, MMLU-Pro 74.3, GPQA Diamond 57.2, solid for its tier but not frontier. Math (5.0) and coding (5.5) are mid-pack among open-weights models, with HumanEval ~82 and LiveCodeBench ~33. Multilingual (7.5) is a genuine strength across 200 languages. Vision and document OCR (6.5 / 7.0) are strong for the size, with ChartQA 85.3, DocVQA 91.6, and MMMU 69.4. The long-context score (4.5) is the honest tension at the heart of this model: needle-in-haystack retrieval over the 10M window looks near-perfect, but independent comprehension benchmarks (Fiction.LiveBench, RULER-style) show meaningful quality degradation long before the ceiling. The 10M number is a retrieval capacity, not a guarantee of reasoning over 10M tokens. Function-calling and instruction-following (6.0 each) are reliable for single steps and weaker on long agent chains. There is no reasoning mode and no real-time data access (0.0).

Benchmark analysis

Benchmark	Score	vs Predecessor (3.3 70B)	vs Top Competitor	Source
MMLU	79.6	-6.4 (3.3 70B 86.0)	trails Maverick (85.5)	llm-stats
MMLU-Pro	74.3	+5.4 (3.3 70B 68.9)	trails Maverick (80.5)	Meta
GPQA Diamond	57.2	+6.7 (3.3 70B 50.5)	competitive at tier	Meta
MATH (Hard)	50.3	comparable	trails Maverick (61.2)	Meta
HumanEval	~82	about -6 (3.3 70B 88.4)	competitive	llm-stats
MMMU (vision)	69.4	new (3.3 70B no vision)	strong for size	Meta
ChartQA	85.3	new	competitive	Meta
DocVQA	91.6	new	strong	Meta
Artificial Analysis Index	14	= (3.3 70B 14)	above open non-reasoning median (13)	AA

Speed & latency

Median output speed is ~106 tokens/sec across providers, with time-to-first-token around 0.56s on DeepInfra and 0.72s on Google Vertex. Because only ~17B params are active, it generates at small-model speed despite the 109B pool. Groq is the throughput leader at ~449 tokens/sec, giving a snappy sub-second interactive feel. Latency tier is fast; the practical caveat is that filling the 10M context dramatically raises prefill latency and cost, so the long-context superpower is best used selectively.
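A rough way to turn these figures into a per-response wait time is time-to-first-token plus output length divided by generation speed. The sketch below assumes that simple additive model; pairing the 0.56s TTFT with the 106 tokens/sec cross-provider median is an illustrative assumption, and the function name is hypothetical.

```python
def response_time_s(ttft_s, output_tokens, tokens_per_s):
    """Estimate wall-clock time for one response: first-token wait plus generation."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


# 500 output tokens at the median speed, using the DeepInfra TTFT as a stand-in.
median_case = response_time_s(ttft_s=0.56, output_tokens=500, tokens_per_s=106)
# Generation time alone on Groq (no TTFT figure is quoted for Groq above).
groq_generation = response_time_s(ttft_s=0.0, output_tokens=500, tokens_per_s=449)

print(f"median provider: {median_case:.2f} s")
print(f"Groq, generation only: {groq_generation:.2f} s")
```

This estimate holds for short prompts only. TTFT grows with prompt length, so a request that fills a large share of the context window will wait far longer for its first token than the figures above suggest.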

Pricing analysis

Surface	Cost	Notes
API input (representative)	~$0.08–$0.11 / 1M tok	DeepInfra / Groq floor
API output (representative)	~$0.30–$0.34 / 1M tok
DeepInfra	$0.08 in / $0.30 out	cheapest mainstream
Groq	$0.11 in / $0.34 out	fastest (449 tps)
Fireworks	$0.15 in / $0.60 out
Together	~$0.18 in / ~$0.59 out
Amazon Bedrock	~$0.22 blended	on-demand
Google Vertex AI	available	0.72s TTFT
Self-hosted	1x H100 80GB (INT4)	~55–60GB VRAM to hold 109B at INT4
Rate limits	provider-specific	often 1000+ RPM on managed tiers

Open weights mean no single Meta price; the figures above are the May 2026 inference market. Scout is the cheapest serious open-weights multimodal model on most providers.
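To see what the table means per request, the arithmetic can be sketched as follows. The prices are the DeepInfra figures quoted above; the token counts, the $2/hour GPU rate taken from the low end of the quoted range, and the function names are illustrative assumptions. The break-even figure ignores input tokens and assumes the GPU can actually sustain that volume.

```python
def request_cost_usd(input_tokens, output_tokens, in_usd_per_m, out_usd_per_m):
    """Cost of one request given per-1M-token input and output prices."""
    return (input_tokens * in_usd_per_m + output_tokens * out_usd_per_m) / 1_000_000


def breakeven_output_tokens_per_day(gpu_usd_per_hour, out_usd_per_m):
    """Daily output tokens at which API output spend equals a GPU rented for 24 hours."""
    return gpu_usd_per_hour * 24 / out_usd_per_m * 1_000_000


# DeepInfra prices from the table: $0.08 in / $0.30 out per 1M tokens.
small = request_cost_usd(100_000, 2_000, 0.08, 0.30)
full_window = request_cost_usd(10_000_000, 2_000, 0.08, 0.30)
breakeven = breakeven_output_tokens_per_day(2.0, 0.30)

print(f"100K-token prompt: ${small:.4f}")
print(f"10M-token prompt:  ${full_window:.4f}")
print(f"Break-even vs a $2/h GPU: {breakeven:,.0f} output tokens/day")
```

Input cost scales linearly with prompt size, so a request that fills the whole 10M window costs about $0.80 in input tokens alone at these prices, which is why the long context is best used selectively.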

Deployment & access

Open weights under the Llama 4 Community License. Download from Hugging Face (meta-llama/Llama-4-Scout-17B-16E-Instruct). The headline deployment property: it fits a single server-grade GPU. At INT4 the full 109B parameter set needs roughly 55–60GB, so one H100 80GB serves it comfortably. All 109B params must be stored, not just the ~17B active ones, and that total sets the memory floor. Running on 24–48GB consumer cards for hobbyist/edge use therefore takes GGUF quants well below 4 bits per weight; at ~2 bits the weights alone are still ~27GB, so the 24GB end of that range generally implies offloading part of the model to system RAM. Managed availability spans AWS Bedrock, Google Vertex AI, Azure AI Foundry, OCI, and IBM watsonx. Inference providers include Together, Fireworks, Groq, DeepInfra, OpenRouter, and Novita. Self-host economics are the strongest selling point: a single rented H100 at $2–3/hour serves millions of tokens/day. The Llama 4 Community License permits commercial use but requires a separate Meta license above 700M MAU and forbids training non-Llama models on outputs.
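The memory figures follow from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. The sketch below is a lower bound on weight storage only. It ignores KV cache, activations, and quantization metadata, which is why real deployments land above it; the function names are hypothetical.

```python
def weight_memory_gb(total_params_billion, bits_per_weight):
    """Lower bound on weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def fits(vram_gb, total_params_billion, bits_per_weight, headroom_gb=0.0):
    """True if the weights plus a chosen headroom fit in the given VRAM."""
    return weight_memory_gb(total_params_billion, bits_per_weight) + headroom_gb <= vram_gb


# All 109B params count toward memory, not just the ~17B active per token.
for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(109, bits):6.1f} GB")

print("INT4 on an 80GB card:", fits(80, 109, 4))
print("INT4 on a 48GB card:", fits(48, 109, 4))
```

At 4 bits the weights alone come to 54.5GB, consistent with the ~55–60GB figure once overhead is included, and the remaining space on an 80GB card is what is left for KV cache.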

Safety & privacy

Identical posture to Maverick: the weights carry no built-in moderation, and Meta offers Llama Guard 4 (12B multimodal) plus Prompt Guard 2 (22M/86M) as optional pre/post filters for the Llama 4 line. "Trains on inputs" is not applicable when self-hosted; managed-provider terms vary; Meta's own terms do not train on your data. No model-level compliance certifications — those attach to your host or infrastructure. Refusal calibration is moderate and tunable, which is the intended advantage for regulated and sovereign deployments. Governance under Meta's Frontier AI Framework.
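Because moderation is optional and external, a deployment has to wire the pre and post filters in itself. The sketch below shows only the control flow, under loud assumptions: `generate`, `flag_input`, and `flag_output` are hypothetical callables that you would back with your own model endpoint and classifiers (for example Llama Guard 4 or Prompt Guard 2). It does not reproduce any real Llama Guard or Prompt Guard API.

```python
def guarded_generate(prompt, generate, flag_input, flag_output,
                     refusal="Request declined by policy filter."):
    """Run a pre-filter, the model, then a post-filter; refuse if either filter flags."""
    if flag_input(prompt):
        return refusal  # blocked before the model is ever called
    reply = generate(prompt)
    if flag_output(prompt, reply):
        return refusal  # model answered, but the answer was flagged
    return reply


# Stand-in callables so the sketch runs on its own; replace with real services.
def fake_model(prompt):
    return f"echo: {prompt}"


def blocks_keyword(prompt):
    return "forbidden" in prompt.lower()


def never_flags(prompt, reply):
    return False


print(guarded_generate("hello", fake_model, blocks_keyword, never_flags))
print(guarded_generate("a FORBIDDEN request", fake_model, blocks_keyword, never_flags))
```

Keeping the filters as separate callables lets a regulated deployment tune or swap them without touching the model itself.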

Ecosystem & tooling

Native support across Hugging Face Transformers, vLLM, llama.cpp, Ollama, SGLang, MLX, plus LangChain and LlamaIndex. Available on Bedrock, Vertex AI, Azure AI Foundry, OCI, and IBM watsonx, and on Together, Fireworks, Groq, DeepInfra, OpenRouter, and Novita. Used inside Meta's consumer AI surfaces. Popularity is mainstream — the go-to open-weights pick when single-GPU deploy or very large context is the requirement.
