Mistral Medium 3.1 Review — Benchmarks, Pricing & AI Panel Verdict

What's new

Incremental update over Medium 3 (May 2025): better instruction following, sharper tool use, improved vision.

Retains Medium 3's pricing — one of the cheapest frontier-adjacent multimodal models available.

Remains Premier tier (proprietary, closed weights); open weights only arrived in this family with Medium 3.5 in April 2026.

No configurable reasoning effort — for extended thinking, callers escalate to Magistral or Medium 3.5.

Benchmark	Score	Source
HumanEval	92%	artificialanalysis.ai (2026-05-28)
GPQA Diamond	57%	artificialanalysis.ai (2026-05-28)
Artificial Analysis Index	21	artificialanalysis.ai (2026-05-28)

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker: 7.5/10

“The model I've actually been deploying since late 2025 — cheap, multilingual, multimodal, stable — even if Medium 3.5 now outclasses it for agents.”

Medium 3.1 has been the dependable production workhorse: inexpensive, multilingual, multimodal, and operationally reliable on La Plateforme. The strategic weakness is closed weights — no on-prem, so the sovereignty story is limited to EU-hosted API. With Medium 3.5 now offering open weights and a reasoning knob, I would only choose 3.1 when budget is the binding constraint and the workload is chat/Q&A rather than agentic coding. For mid-sized European SaaS where EU residency matters and budget is real, it remains a sound default for the chat surface.

Strategic Fit 7, Vendor Risk 7, Roadmap Confidence 8

Pros

cheap, stable, multimodal, multilingual

Cons

closed weights
superseded for agents

Right for: cost-bound chat/Q&A

Avoid if: you need self-host or agentic coding

Domain Strategist: 7/10

“It owns the 'cheap capable multimodal' slot in Europe, but its successor is already eating its strategic mindshare.”

Medium 3.1's positioning was sharp at launch: frontier-adjacent multimodal capability at a price that unlocked high-volume European-language use cases US flagships couldn't touch economically. That slot is still valuable, but the strategic narrative has moved to Medium 3.5 (open weights, merged architecture). Medium 3.1 now reads as the mature, boring, cost-tier option rather than the headline. Against GPT-5 mini and Claude Haiku it competes on EU-language quality and price; against its own successor it competes only on cost. It is a solid incumbent in a slot it no longer leads.

Competitive Positioning 7, Differentiation 6, Market Timing 6

Pros

proven cost-tier slot, EU-language edge

Cons

outshone by its own successor

Right for: cost-led European deployments

Avoid if: you want the current flagship narrative

Finance Lead: 9/10

“$0.40 in, $2.00 out — at high volume the monthly bill is a fraction of GPT-5 or Sonnet, and forecasting is boring in the best way.”

Pricing is the entire story. $0.40/$2.00 makes Medium 3.1 a standout value among multimodal models with frontier-adjacent quality. For high-volume support tickets, summarisation, classification, and content variants, the monthly bill is a fraction of US flagships. The $0.04 cached-input rate and ~50% batch discount sharpen it further, and predictable throughput makes forecasts simple. There is no self-host lever (closed weights), so this is a pure API-economics play — but at this price, for the right workload, the unit economics are excellent.

Cost Efficiency 9, Pricing Transparency 9, Value per Dollar 9

Pros

very low price, cache/batch discounts, predictable

Cons

no self-host capex option

Right for: high-volume API chat

Avoid if: you wanted to amortise GPUs

Domain Practitioner: 7.5/10

“The most boring-in-a-good-way Mistral model I've shipped — clean API, predictable JSON, reliable vision parsing.”

After extended production use, Medium 3.1 is the dependable one: clean OpenAI-compatible API, predictable JSON tool use, reliable vision-input document parsing. No reasoning toggle, so for hard problems I route to Magistral or 3.5. Lower output throughput is occasionally annoying for streaming UX but fine for backend pipelines. It is the "good enough at minimum cost" default for builders who don't need agentic firepower. Docs are thinner than Anthropic's but adequate.

API Ergonomics 8, Tool/Agent Support 7, Reliability 8

Pros

stable, predictable, reliable vision

Cons

no reasoning toggle
slow streaming

Right for: backend chat/extraction pipelines

Avoid if: you need agentic tool depth

Power User: 7/10

“Responsive and helpful for everyday questions; for routine help I can't tell it isn't a flagship — except the stream is a touch slower.”

In Le Chat at the Medium tier, Medium 3.1 is responsive and helpful for everyday questions, with excellent European-language quality. It feels slightly less polished than Large 3 or top US models on nuanced creative tasks, but for routine help the difference is hard to notice. Throughput is the main felt downside — replies stream a touch slower than competitors. Refusal rate is moderate and reasonable. A capable, unremarkable daily driver.

Output Quality 7, Speed 6.5, Everyday Usefulness 7.5

Pros

helpful, strong EU languages

Cons

slower streaming
not the most polished

Right for: routine everyday help

Avoid if: you want flagship polish or speed

Skeptic: 7/10

“Cheap and competent, but it's a closed model with thin published benchmarks and a successor that already beats it — buy it for price, nothing else.”

There is little to over-claim here, which makes the skeptic's job easy. Medium 3.1 is honestly positioned as a cost-tier model and delivers on that. The caveats: it is closed-weight, so the "EU sovereignty" halo around Mistral doesn't fully apply (no on-prem); Mistral published almost no numeric benchmarks, so the quality picture leans on third-party aggregation; and its own successor outperforms it on agentic/coding. None of this is deceptive — just a reminder that the value is price-per-token and not capability leadership. Use it where the bill dominates the decision.

Claim Accuracy 8, Weakness Severity 6, Hype vs Reality 7

Pros

honestly cheap, no inflated claims

Cons

closed
thin benchmarks
superseded

Right for: cost-driven workloads

Avoid if: you need sovereignty-by-on-prem or top capability

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture

Medium 3.1 is a dense transformer, but because it is a Premier, closed-weight model, Mistral does not disclose parameter count, layer count, attention mechanism, or training scale. That contrasts with the open Mistral models, where these details are published. Context is 131K tokens. The tokenizer is mistral_common. We record what is verifiable (dense, 131K, multimodal) and leave everything Mistral withholds blank rather than estimate.

Capabilities

Medium 3.1 is a generalist cost-tier model: strong general chat, solid coding (cap_coding 7.0), native vision input, and reliable JSON tool use (cap_function_calling 7.5). The 131K context covers most enterprise documents without paying for 256K (cap_long_context 7.0). Multilingual quality is excellent across European languages (cap_multilingual 8.5). It has no configurable reasoning (cap_reasoning 6.5, cap_math 6.0) — for extended thinking, callers route to Magistral or Medium 3.5. Throughput is on the slow side of its price tier (~45 tps), but the per-token economics more than compensate for batch and asynchronous workloads. No native real-time retrieval (cap_realtime_data 0.0).
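As a rough illustration of the context figure, the sketch below checks whether a prompt fits once room for the reply is reserved. It assumes that "131K" and the 16K max output listed in the model specs mean 131,072 and 16,384 tokens, and that prompt and completion share one window; the review confirms neither. The function names are hypothetical, and token counts should come from a real tokenizer rather than a guess.

```python
# Assumption: "131K" context is read as 2**17 tokens.
CONTEXT_WINDOW = 131_072
# Assumption: "16K" max output is read as 2**14 tokens.
MAX_OUTPUT = 16_384


def remaining_prompt_budget(reserved_output: int = MAX_OUTPUT) -> int:
    """Tokens left for the prompt after reserving space for the reply."""
    if not 0 <= reserved_output <= MAX_OUTPUT:
        raise ValueError("reserved_output must be between 0 and MAX_OUTPUT")
    return CONTEXT_WINDOW - reserved_output


def fits_in_context(prompt_tokens: int, reserved_output: int = MAX_OUTPUT) -> bool:
    """True if the prompt plus the reserved reply fits in the window."""
    return 0 <= prompt_tokens <= remaining_prompt_budget(reserved_output)


if __name__ == "__main__":
    print(remaining_prompt_budget())       # budget with the full reply reserved
    print(fits_in_context(120_000))        # long document, full reply reserved
    print(fits_in_context(120_000, 4_096)) # same document, shorter reply
```

Reserving less output frees prompt space, which matters mainly for documents near the top of the window.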

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
Artificial Analysis Index	21	+~2 vs Medium 3	above price-tier median (~16)	Artificial Analysis
HumanEval	~92%	+~1pp	parity with Llama 4 Maverick	Artificial Analysis
GPQA Diamond	~57%	flat	trails reasoning models	Artificial Analysis

Mistral published little numeric benchmark detail for Medium 3.1; the values above lean on Artificial Analysis aggregation. MMLU-Pro and standardized coding benchmarks are unpublished. Benchmark coverage is partial; confidence in the aggregate scores is medium-to-high but they are secondary-sourced.

Speed & latency

Output throughput is roughly 45 tokens/sec, on the slow side of the price tier. That can be felt in streaming chat UX but is a non-issue for backend pipelines and batch jobs. Time-to-first-token is not separately published. Overall the model sits in a medium latency tier; its appeal is economics, not speed.
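To make the throughput figure concrete, here is a minimal sketch that converts a reply length into wall-clock streaming time at about 45 tokens/sec. The function name is hypothetical, and because time-to-first-token is unpublished, it is a caller-supplied guess that defaults to zero.

```python
# Approximate output throughput quoted in this review.
OUTPUT_TOKENS_PER_SEC = 45.0


def generation_seconds(output_tokens: int, ttft_seconds: float = 0.0) -> float:
    """Estimate seconds to stream a reply of the given length.

    ttft_seconds is an assumption supplied by the caller; no
    time-to-first-token figure is published for this model.
    """
    if output_tokens < 0 or ttft_seconds < 0:
        raise ValueError("inputs must be non-negative")
    return ttft_seconds + output_tokens / OUTPUT_TOKENS_PER_SEC


if __name__ == "__main__":
    # A 450-token answer takes about ten seconds of pure generation.
    print(generation_seconds(450))
```

At this rate a short chat answer streams comfortably, while multi-thousand-token outputs are better suited to backend or batch use.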

Pricing analysis

Surface	Cost	Notes
API input	$0.40 / 1M tok	La Plateforme
API output	$2.00 / 1M tok	La Plateforme
Cached input	$0.04 / 1M tok	cache read
Batch (in/out)	$0.20 / $1.00 per 1M tok	~50% async discount
Direct UI	EUR 14.99/mo (~$15)	Le Chat Pro
Free tier	~25 msg/day (Le Chat)	no card
Cloud	Bedrock, Azure AI Foundry, Vertex AI	managed
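The API rates in the table can be turned into a quick monthly estimate. The sketch below is illustrative only: the function name and the example volumes are hypothetical, and it assumes cache and batch discounts do not stack, which the review does not state either way.

```python
# USD per 1M tokens, copied from the pricing table.
INPUT_PER_M = 0.40
OUTPUT_PER_M = 2.00
CACHED_INPUT_PER_M = 0.04
BATCH_INPUT_PER_M = 0.20
BATCH_OUTPUT_PER_M = 1.00


def monthly_cost(input_tokens: int, output_tokens: int,
                 cached_fraction: float = 0.0, batch: bool = False) -> float:
    """Estimate a monthly API bill in USD, rounded to cents.

    cached_fraction is the share of input tokens billed as cache reads.
    Assumption: in batch mode all input is billed at the batch rate and
    the cache discount is ignored.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if batch:
        in_cost = input_tokens / 1e6 * BATCH_INPUT_PER_M
        out_cost = output_tokens / 1e6 * BATCH_OUTPUT_PER_M
    else:
        cached = input_tokens * cached_fraction
        fresh = input_tokens - cached
        in_cost = fresh / 1e6 * INPUT_PER_M + cached / 1e6 * CACHED_INPUT_PER_M
        out_cost = output_tokens / 1e6 * OUTPUT_PER_M
    return round(in_cost + out_cost, 2)


if __name__ == "__main__":
    # Hypothetical workload: 500M input and 100M output tokens per month.
    print(monthly_cost(500_000_000, 100_000_000))
    print(monthly_cost(500_000_000, 100_000_000, cached_fraction=0.5))
    print(monthly_cost(500_000_000, 100_000_000, batch=True))
```

For that hypothetical workload the standard, half-cached, and batch bills come to $400, $310, and $200 respectively, which shows how much of the total is driven by output tokens.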

Deployment & access

Medium 3.1 is API-only and closed-weight — no self-host, no Hugging Face weights, in contrast to Medium 3.5's open release. It is available on La Plateforme (EU-hosted by default, US options) and managed on Amazon Bedrock, Azure AI Foundry, and Google Vertex AI. For buyers who need self-hosting or sovereignty-by-on-prem, this model does not deliver it; the EU-residency story here is limited to La Plateforme's EU hosting, not on-prem control.

Safety & privacy

Standard Mistral posture: GDPR-native, SOC 2 Type II, ISO 27001/27701, EU AI Act aligned, EU data residency by default, 30-day abuse retention, no training on inputs unless opt-in, Zero Data Retention available. No built-in moderation; separate Mistral Moderation API available. Moderate, consistent refusal calibration.

Ecosystem & tooling

SDKs in Python and TypeScript/JavaScript; integrations with LangChain, LlamaIndex, and Vercel AI SDK. Powers Le Chat and Mistral AI Studio. As a mature, widely deployed cost-tier model, it remains popular within Mistral's user base, though its successor is drawing new projects away.

Buyer questions

Can I self-host it?

No — Medium 3.1 is Premier/closed-weight, API-only. For open weights at this tier, use Medium 3.5 (modified-MIT) or step down to Small 4 (Apache 2.0).

Does it reason?

No reasoning toggle. Route hard analytical work to Magistral or Medium 3.5 with high reasoning effort.
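The routing advice in this review can be sketched as a small decision function. Everything here is illustrative: the `Task` fields and `pick_models` are hypothetical names, and the returned strings are plain labels, not API model identifiers.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a workload."""
    needs_extended_reasoning: bool = False
    agentic_coding: bool = False


def pick_models(task: Task) -> tuple[str, ...]:
    """Return candidate model labels following the review's guidance."""
    if task.agentic_coding:
        # The review rates the successor as far stronger for agents.
        return ("Medium 3.5",)
    if task.needs_extended_reasoning:
        # Medium 3.1 has no reasoning toggle, so escalate.
        return ("Magistral", "Medium 3.5")
    # Default: cost-bound chat, Q&A, and extraction.
    return ("Medium 3.1",)


if __name__ == "__main__":
    print(pick_models(Task()))
    print(pick_models(Task(needs_extended_reasoning=True)))
```

A real router would also weigh budget and latency; this sketch encodes only the capability split described above.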

Why pick it over Medium 3.5?

Pure cost: $2.00 vs $7.50 per 1M output tokens. For high-volume chat/Q&A that doesn't need agentic strength, 3.1 is far cheaper.
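A short worked example of that gap, comparing output cost only, since this review does not list an input price for Medium 3.5. The function name and the 100M-token volume are hypothetical.

```python
# USD per 1M output tokens, as quoted in this review.
MEDIUM_31_OUTPUT_PER_M = 2.00
MEDIUM_35_OUTPUT_PER_M = 7.50


def output_savings(output_tokens: int) -> float:
    """USD saved on output tokens by using Medium 3.1 instead of 3.5."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    per_m_gap = MEDIUM_35_OUTPUT_PER_M - MEDIUM_31_OUTPUT_PER_M
    return output_tokens / 1e6 * per_m_gap


if __name__ == "__main__":
    # Price ratio between the two models' output rates.
    print(MEDIUM_35_OUTPUT_PER_M / MEDIUM_31_OUTPUT_PER_M)
    # Hypothetical volume: 100M output tokens per month.
    print(output_savings(100_000_000))
```

Input-side costs would shift the total, so treat the result as a lower-detail estimate rather than a full bill comparison.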

Is the data EU-resident?

Yes on La Plateforme (EU default), with 30-day abuse retention, no training on inputs unless opt-in, ZDR available — but only as an API, not on-prem.

How fast is it?

~45 tps — adequate for backend, a touch slow for snappy streaming chat.

Which clouds?

Bedrock, Azure AI Foundry, and Vertex AI, plus La Plateforme.

Comparable models

Medium 3.5 (Mistral)

Its successor — open weights, reasoning knob, far stronger on agentic coding, but ~3.75x the output price ($7.50 vs $2.00).

GPT-5 mini (OpenAI)

Similar price tier, weaker EU-language quality, broader ecosystem and tooling.

Claude Haiku 4.5 (Anthropic)

Comparable cost tier; weaker multilingual and a different (Anthropic) safety/ecosystem trade.

V3.2 (DeepSeek)

Cheaper still, weaker European-language quality, Chinese-origin residency.

Model specs

Spec	Value
Input price	$0.40 / Mtok
Output price	$2.00 / Mtok
Cached input	$0.04 / Mtok
Batch (in/out)	$0.20 / $1.00 per Mtok
Context window	131K tokens
Max output	16K tokens
Knowledge cutoff	2025-04
Released	2025-08-12
Modalities	text, image → text
Output speed	~45 tok/s
License	Proprietary
Clouds	Bedrock, Azure AI Foundry, Vertex AI

Does not train on API inputs by default

Last verified 2026-05-27