Mistral Small 4

GALatest Small

by Mistral AI · Mistral Small family · best for best price-to-capability open multimodal model

Cost-OptimizedOpen-WeightsReasoningMultimodalLong-Context

8.5

AI Panel Score

Value 9.5/10

Mistral Small 4 (release 26.03, shipped 16 March 2026) is the cost-disruption release of the 2026 lineup: a 119B-parameter Mixture-of-Experts with only 6.5B active per token (128 experts, 4 active), priced at $0.15/$0.60 per 1M tokens, under genuine Apache 2.0. Despite the "Small" label it scores 78.0% on MMLU-Pro and 71.2% on GPQA Diamond — within reach of Medium 3.1 and Large 3 — while running on a single GPU. It unifies chat, reasoning, and coding in one efficient model. The buyer's sentence: the best price-to-capability open multimodal model in 2026, with Apache 2.0 as a structural moat.

Compare this model All Mistral Small versions

What's new

First Mistral Small to use Mixture-of-Experts: 119B total but only 6.5B active per token, versus the dense 24B of the Small 3.x line. Despite 5x the total parameters, only 6.5B are active (vs 24B for Small 3), so a workflow handling 100 req/s on Small 3 handles ~300 on Small 4 on the same hardware.
Unified hybrid: chat, reasoning, and coding in one model with configurable reasoning depth.
Context expanded from 128K (Small 3.2) to 256K.
Output throughput ~180 tps — well above the price-tier median.
Native multimodal vision input via the MoE's image path.

Benchmarks

Benchmark	Score	Source
MMLU-Pro	78%	venturebeat.com 2026-03-16T00:00:00.000Z
GPQA Diamond	71.2%	huggingface.co 2026-03-16T00:00:00.000Z
Artificial Analysis Index	28	artificialanalysis.ai 2026-05-28T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker8.5/10

“It re-anchors the 'cheap and good' tier: single-GPU production, Apache 2.0, and one endpoint for chat and reasoning. My new EU default unless workload justifies more.”

Small 4 changes the cost curve. A 119B MoE with 6.5B active runs in production on a single A100 and absorbs meaningful traffic without flagship spend. Apache 2.0 means I fine-tune and self-host with zero license friction — a genuinely cleaner story than Medium 3.5's modified-MIT. The unified reasoning toggle collapses architecture: one endpoint covers chat and lightweight reasoning. For any EU SaaS shipping multilingual features, Small 4 is now my default; I only step up to Medium 3.5 or Large 3 when a workload specifically justifies it. The vision encoder is the one soft spot.

Strategic Fit 9Vendor Risk 9Roadmap Confidence 8

Pros

single-GPU, clean Apache 2.0, reasoning toggle, low price

Cons

modest vision
young tooling

Right for: EU multilingual SaaS defaults

Avoid if: you need top vision or peak reasoning

Domain Strategist8.5/10

“Small 4 is Mistral's most disruptive product on a per-dollar basis — it makes 'good enough' frontier capability nearly free to self-host.”

Strategically this is the model that pressures the entire cost-optimised tier. By delivering MMLU-Pro 78 and GPQA 71 at $0.15/$0.60 and single-GPU self-host under Apache 2.0, Mistral undercuts both closed small models (Haiku, GPT-5 mini) and forces open competitors (Qwen, Llama) to answer on price-per-intelligence. The AA Index 28 exceeding the flagship's 23 is a clean talking point. The wedge is the combination — cheap + open + reasoning + vision + EU residency — which no single competitor matches on all axes. Market timing aligns with the 2026 push toward agentic, cost-controlled deployments.

Competitive Positioning 9Differentiation 9Market Timing 8

Pros

unmatched price-per-intelligence combo
clean license

Cons

moat is on cost, not peak quality

Right for: cost-led, open-weight strategies

Avoid if: your buyers pay for frontier ceiling

Finance Lead9.5/10

“At $0.15/$0.60 with reasoning, vision, and 256K context, the value-per-dollar crushes anything closed at this capability — and self-host makes it nearly free at scale.”

This is the cost story of the lineup. $0.15/$0.60 with a ~50% batch discount delivers reasoning-capable multimodal at a price closed competitors can't match at the same capability. The deeper lever is Apache 2.0 self-host on a single GPU, which converts opex to capex for steady-state workloads with no license fee — unlike Medium 3.5. Annual forecasts become trivially predictable. For any high-volume chat or agent product chasing a unit-economics target, Small 4 is the obvious starting point and frequently the finish line.

Cost Efficiency 10Pricing Transparency 9Value per Dollar 10

Pros

lowest cost for the capability
free-and-clear self-host

Cons

none material on cost

Right for: unit-economics-driven products

Avoid if: you need ceiling capability regardless of price

Domain Practitioner9/10

“Same API as the rest of the family, a genuinely useful reasoning dial, real 256K context, and Apache 2.0 so I ship private fine-tunes without a second thought.”

Ergonomically Small 4 is identical to the rest of Mistral, so swapping with Medium 3.5 is a model-name change. The reasoning-effort dial is the standout: lean fast for simple tasks, crank it up for hard ones, no routing to a different model. 256K context is real. Apache 2.0 lets me ship private fine-tunes with confidence and the MLA attention keeps long-context serving affordable. Negatives: vision is "fine" not "great," and structured-output stability under heavy reasoning is still being shaken out. Best price-to-capability open-weight multimodal small model in the market.

API Ergonomics 9Tool/Agent Support 8Reliability 8

Pros

reasoning dial, real long context, clean fine-tune license

Cons

soft vision
reasoning-mode output stability

Right for: cost-sensitive builders

Avoid if: vision is central to your product

Power User7.5/10

“180 tps makes it feel snappier than Medium, and for everyday tasks I rarely notice it isn't a flagship.”

In Le Chat the small tier feels fast — ~180 tps is noticeably snappier than Medium. For summarising, drafting, simple coding, and multilingual help the quality is good enough that the gap to a flagship rarely shows. Vision works on screenshots and receipts well enough. With reasoning toggled on, responses take longer but become markedly more careful. Conversational warmth is below Claude or GPT-5 — efficient rather than friendly. Strong everyday utility at a price low enough that product teams can give it away in a free tier.

Output Quality 7.5Speed 8.5Everyday Usefulness 8

Pros

fast, capable enough, strong EU languages

Cons

not the warmest
soft vision

Right for: free-tier and everyday product features

Avoid if: you want flagship polish

Skeptic7.5/10

“The price and the Apache license are real wins — but 'matches GPT-OSS 120B on AIME' is a curated comparison, and the vision encoder is genuinely weak.”

Small 4 is the rare Mistral launch where the value claim mostly survives scrutiny: the price is real, the license is genuinely Apache 2.0, and AA Index 28 is independently measured. Where I push back is the framing. Mistral benchmarks against GPT-OSS 120B (a deliberately chosen open peer) rather than the strongest closed small models, and headlines AIME at high reasoning effort, which masks the latency and token-count cost of that mode. The vision encoder is small and the vision quality shows it. The honest claim is "best open price-to-capability multimodal," not "matches the frontier" — but for once the gap between marketing and reality is narrow.

Claim Accuracy 8Weakness Severity 7Hype vs Reality 8

Pros

claims largely hold
clean license

Cons

curated peer comparison
weak vision

Right for: buyers who value honest price-per-intelligence

Avoid if: you need strong vision or take "matches frontier" literally

Strengths

5-7x cheaper than peer multimodal reasoning models, with frontier-adjacent MMLU-Pro/GPQA.
Genuine Apache 2.0 open weights — self-hostable on a single GPU, no revenue carve-out.
256K context at the small tier is rare.
Configurable reasoning effort — one SKU covers chat through hard math.
Native vision at this price is unusual.
~180 tps throughput; AA Index 28 beats the flagship Large 3's 23.

Limitations

Mistral published an unconventional benchmark mix (MMLU-Pro, GPQA, AIME) and skipped plain MMLU and a standardized LiveCodeBench number, complicating head-to-head.
High-reasoning mode dramatically increases latency and output token count.
Small vision encoder — vision is acceptable, not Pixtral-class.
English conversational polish trails Claude Haiku and GPT-5 mini.
New enough that community fine-tunes and production tooling are still catching up.

Best use cases

High-volume agent and tool-use loops where per-call cost dominates.
Edge / self-hosted deployment on single-GPU servers or strong workstations.
Multilingual chat in European languages at consumer-grade pricing.
Long-context document Q&A on a budget.
Workloads needing both chat and lightweight reasoning without paying for two SKUs.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture Capabilities Benchmark analysis Speed & latency Pricing analysis Deployment & access Safety & privacy Ecosystem & tooling

Architecture

Small 4 is a sparse Mixture-of-Experts: 119B total parameters, 128 experts of which 4 activate per token for ~6.5B active. The Hugging Face card specifies an MLA (multi-head latent attention) backend (FLASH_ATTN_MLA in the vLLM serving command), which is part of what keeps the active footprint and KV-cache cost low at 256K context. It ships in FP8 with NVFP4/BF16/GGUF variants. Tokenizer is mistral_common. The sparse design is the whole point: it delivers capability associated with a much larger model at the inference cost of a ~7B dense model, which is what unlocks the price and the single-GPU self-host story. Layer count, training-token count, and vocab size are undisclosed.

Capabilities

Small 4 punches far above its size class. On knowledge and reasoning it scores MMLU-Pro 78.0% and GPQA Diamond 71.2% — close to Medium 3.1 and Large 3 (cap_reasoning 7.5, cap_math 7.5). Its configurable reasoning effort lets one SKU cover fast chat through AIME-class math (it matches or surpasses GPT-OSS 120B on AIME 2025 at high reasoning). Coding is decent (cap_coding 7.0) and on LiveCodeBench it reportedly beats GPT-OSS 120B while emitting ~20% less output. Native vision is present but the encoder is small, so vision quality is acceptable rather than excellent (cap_vision 6.5). The 256K context is rare at this tier (cap_long_context 8.0). Multilingual European quality carries through (cap_multilingual 8.5). High reasoning effort sharply increases latency and output tokens. No native real-time retrieval (cap_realtime_data 0.0).

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
MMLU-Pro	78.0%	well above Small 3.2	near Medium 3.1 / Large 3	VentureBeat
GPQA Diamond	71.2%	new	strong for the tier	HF card
AIME 2025 (high reasoning)	matches/surpasses GPT-OSS 120B	new	leader at price tier	VentureBeat
Artificial Analysis Index	28	+~13 vs Small 3.2	well above price-tier median (~15)	Artificial Analysis

Notably, Small 4's AA Index of 28 exceeds Large 3's 23 — the small MoE with reasoning effort scores higher on the aggregate intelligence index than the flagship generalist. Mistral published GPQA Diamond and MMLU-Pro but left some standard slots (MMLU, standardized LiveCodeBench number) unpublished; coverage is partial.

Speed & latency

Output throughput is ~180 tokens/sec — well above the price-tier median and noticeably snappier than Medium or Large. The sparse 6.5B-active design is what enables it. Time-to-first-token is not separately published (null). Fast latency tier. With reasoning effort high, latency and output token count rise sharply, so the fast feel applies to the chat/low-effort path.

Pricing analysis

Surface	Cost	Notes
API input	$0.15 / 1M tok	La Plateforme
API output	$0.60 / 1M tok	La Plateforme
Batch (in/out)	$0.075 / $0.30	~50% async discount
Direct UI	free	Le Chat free tier
Free tier	~25 msg/day (Le Chat); La Plateforme quota	no card
Self-host	Apache 2.0	weights on Hugging Face, single-GPU viable
Cloud	Bedrock, Azure AI Foundry, Vertex AI	managed

Deployment & access

Apache 2.0 weights on Hugging Face (FP8/NVFP4/BF16/GGUF) — a clean, unrestricted license, unlike Medium 3.5. The sparse architecture means it self-hosts on a single GPU (~80GB at FP8, less at NVFP4; viable on one A100/H100 or strong consumer pairs). Managed on Bedrock, Azure AI Foundry, and Vertex AI; La Plateforme EU-hosted by default. For EU SaaS shipping multilingual features on commodity hardware, this is the default open option — frontier-adjacent capability at single-GPU cost with no license friction.

Safety & privacy

Standard Mistral posture: GDPR-native, SOC 2 Type II, ISO 27001/27701, EU AI Act aligned, EU residency by default, 30-day abuse retention, no training on inputs unless opt-in, ZDR available. No built-in moderation; separate Mistral Moderation API. Moderate, consistent refusals.

Ecosystem & tooling

SDKs in Python and TypeScript/JavaScript; integrations with LangChain, LlamaIndex, Vercel AI SDK, vLLM, and Ollama. Powers Le Chat and Mistral AI Studio. Apache 2.0 weights drive a fast-growing self-host and fine-tune community (FP8/NVFP4/BF16/GGUF derivatives). Popularity is growing quickly given the price-per-intelligence story.

Buyer questions

Is the license really clean?

Yes — genuine Apache 2.0 on the Hugging Face card, no revenue threshold (unlike Medium 3.5). Fine-tune, self-host, and redistribute freely.

Can it really run on one GPU?

Yes — sparse 6.5B-active means ~80GB at FP8 (one A100/H100), less at NVFP4, despite 119B total parameters.

How does it compare to Medium 3.1?

Close on MMLU-Pro and reasoning at a fraction of the cost; Medium 3.1 has a more polished overall profile but no reasoning toggle and no self-host.

Does the reasoning mode cost more?

Not per token, but high reasoning effort produces far more output tokens and higher latency — budget for that on hard queries.

Is the vision good?

Acceptable for screenshots/receipts/simple charts; not Pixtral-class for complex documents.

Where does my data live?

EU by default on La Plateforme; or fully on your hardware via self-host.

What clouds host it?

Bedrock, Azure AI Foundry, and Vertex AI, plus self-host and La Plateforme.

Comparable models

Claude Haiku 4.5:

Closed weights, similar everyday capability, ~2-3x the price, no self-host.

GPT-5 mini:

Closed weights, broader ecosystem, several times the price; no open-weight option.

Qwen 3 30B-A3B:

Comparable MoE small tier and price; weaker EU-language quality.

Llama 4 Scout:

Open weights but dense; weaker multilingual and a less clean small-MoE efficiency story.

Sources

Primary references used to verify this review.

Model specs

Input price: $0.15 / Mtok
Output price: $0.60 / Mtok
Cached input: —
Batch (in/out): $0.07 / $0.30
Context window: 256K tokens
Max output: 16K tokens
Knowledge cutoff: 2026-01
Released: 2026-03-15
Modalities: text, image → text
Output speed: ~180 tok/s
License: Open weights (Apache-2.0)
Clouds: Bedrock, Azure AI Foundry, Vertex AI

Does not train on API inputs by default

Last verified 2026-05-27