by Mistral AI · Mistral Small family · best for best price-to-capability open multimodal model
Mistral Small 4 (release 26.03, shipped 16 March 2026) is the cost-disruption release of the 2026 lineup: a 119B-parameter Mixture-of-Experts with only 6.5B active per token (128 experts, 4 active), priced at $0.15/$0.60 per 1M tokens, under genuine Apache 2.0. Despite the "Small" label it scores 78.0% on MMLU-Pro and 71.2% on GPQA Diamond — within reach of Medium 3.1 and Large 3 — while running on a single GPU. It unifies chat, reasoning, and coding in one efficient model. The buyer's sentence: the best price-to-capability open multimodal model in 2026, with Apache 2.0 as a structural moat. - Provider: Mistral AI (Paris, France) - Release: 2026-03-16, status GA - Context: 256,000 tokens; max output 16,384 - Modalities: text + image in, text out (native multimodal) - Knowledge cutoff: ~January 2026 - Headline price: $0.15 input / $0.60 output per 1M tokens - Architecture: MoE, 119B total / 6.5B active, 128 experts (4 active), MLA attention
| Benchmark | Score | Source |
|---|---|---|
| MMLU-Pro | 78% | venturebeat.com 2026-03-16T00:00:00.000Z |
| GPQA Diamond | 71.2% | huggingface.co 2026-03-16T00:00:00.000Z |
| Artificial Analysis Index | 28 | artificialanalysis.ai 2026-05-28T00:00:00.000Z |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“It re-anchors the 'cheap and good' tier: single-GPU production, Apache 2.0, and one endpoint for chat and reasoning. My new EU default unless workload justifies more.”
Small 4 changes the cost curve. A 119B MoE with 6.5B active runs in production on a single A100 and absorbs meaningful traffic without flagship spend. Apache 2.0 means I fine-tune and self-host with zero license friction — a genuinely cleaner story than Medium 3.5's modified-MIT. The unified reasoning toggle collapses architecture: one endpoint covers chat and lightweight reasoning. For any EU SaaS shipping multilingual features, Small 4 is now my default; I only step up to Medium 3.5 or Large 3 when a workload specifically justifies it. The vision encoder is the one soft spot.
“Small 4 is Mistral's most disruptive product on a per-dollar basis — it makes 'good enough' frontier capability nearly free to self-host.”
Strategically this is the model that pressures the entire cost-optimised tier. By delivering MMLU-Pro 78 and GPQA 71 at $0.15/$0.60 and single-GPU self-host under Apache 2.0, Mistral undercuts both closed small models (Haiku, GPT-5 mini) and forces open competitors (Qwen, Llama) to answer on price-per-intelligence. The AA Index 28 exceeding the flagship's 23 is a clean talking point. The wedge is the combination — cheap + open + reasoning + vision + EU residency — which no single competitor matches on all axes. Market timing aligns with the 2026 push toward agentic, cost-controlled deployments.
“At $0.15/$0.60 with reasoning, vision, and 256K context, the value-per-dollar crushes anything closed at this capability — and self-host makes it nearly free at scale.”
This is the cost story of the lineup. $0.15/$0.60 with a ~50% batch discount delivers reasoning-capable multimodal at a price closed competitors can't match at the same capability. The deeper lever is Apache 2.0 self-host on a single GPU, which converts opex to capex for steady-state workloads with no license fee — unlike Medium 3.5. Annual forecasts become trivially predictable. For any high-volume chat or agent product chasing a unit-economics target, Small 4 is the obvious starting point and frequently the finish line.
“Same API as the rest of the family, a genuinely useful reasoning dial, real 256K context, and Apache 2.0 so I ship private fine-tunes without a second thought.”
Ergonomically Small 4 is identical to the rest of Mistral, so swapping with Medium 3.5 is a model-name change. The reasoning-effort dial is the standout: lean fast for simple tasks, crank it up for hard ones, no routing to a different model. 256K context is real. Apache 2.0 lets me ship private fine-tunes with confidence and the MLA attention keeps long-context serving affordable. Negatives: vision is "fine" not "great," and structured-output stability under heavy reasoning is still being shaken out. Best price-to-capability open-weight multimodal small model in the market.
“180 tps makes it feel snappier than Medium, and for everyday tasks I rarely notice it isn't a flagship.”
In Le Chat the small tier feels fast — ~180 tps is noticeably snappier than Medium. For summarising, drafting, simple coding, and multilingual help the quality is good enough that the gap to a flagship rarely shows. Vision works on screenshots and receipts well enough. With reasoning toggled on, responses take longer but become markedly more careful. Conversational warmth is below Claude or GPT-5 — efficient rather than friendly. Strong everyday utility at a price low enough that product teams can give it away in a free tier.
“The price and the Apache license are real wins — but 'matches GPT-OSS 120B on AIME' is a curated comparison, and the vision encoder is genuinely weak.”
Small 4 is the rare Mistral launch where the value claim mostly survives scrutiny: the price is real, the license is genuinely Apache 2.0, and AA Index 28 is independently measured. Where I push back is the framing. Mistral benchmarks against GPT-OSS 120B (a deliberately chosen open peer) rather than the strongest closed small models, and headlines AIME at high reasoning effort, which masks the latency and token-count cost of that mode. The vision encoder is small and the vision quality shows it. The honest claim is "best open price-to-capability multimodal," not "matches the frontier" — but for once the gap between marketing and reality is narrow.
- High-volume agent and tool-use loops where per-call cost dominates. - Edge / self-hosted deployment on single-GPU servers or strong workstations. - Multilingual chat in European languages at consumer-grade pricing. - Long-context document Q&A on a budget. - Workloads needing both chat and lightweight reasoning without paying for two SKUs.
Yes — genuine Apache 2.0 on the Hugging Face card, no revenue threshold (unlike Medium 3.5). Fine-tune, self-host, and redistribute freely.
Yes — sparse 6.5B-active means ~80GB at FP8 (one A100/H100), less at NVFP4, despite 119B total parameters.
Close on MMLU-Pro and reasoning at a fraction of the cost; Medium 3.1 has a more polished overall profile but no reasoning toggle and no self-host.
Not per token, but high reasoning effort produces far more output tokens and higher latency — budget for that on hard queries.
Acceptable for screenshots/receipts/simple charts; not Pixtral-class for complex documents.
EU by default on La Plateforme; or fully on your hardware via self-host.
Bedrock, Azure AI Foundry, and Vertex AI, plus self-host and La Plateforme.
Does not train on API inputs by default
Last verified 2026-05-27