Mistral Medium 3.5

Q: Do I still need Magistral or Devstral 2?

Usually not — reasoning_effort=high covers reasoning and Medium 3.5 beats Devstral 2 on SWE-bench. Devstral 2 remains relevant only as a cheaper output-token option.

GALatest Medium

by Mistral AI · Mistral Medium family · best for open-weight agentic coding at sub-frontier price

FrontierCodingOpen-WeightsMultimodalLong-Context

8.4

AI Panel Score

Value 8.5/10

Mistral Medium 3.5 (released 29 April 2026) is Mistral's most strategically important model of the year: a single 128B dense set of weights that merges what used to be three separate models — chat (Medium 3.1), reasoning (Magistral), and the agentic coder (Devstral 2) — into one SKU with a configurable reasoning effort knob. It scores 77.6% on SWE-bench Verified, within two points of Claude Sonnet 4.5, at $1.50/$7.50 per 1M tokens, and ships under a modified-MIT license that is open for nearly everyone except large-revenue enterprises. The buyer's sentence: the best open-weight agentic coding model at the Medium tier, if you can live with a license carve-out and a high output price.

Compare this model All Mistral Medium versions

What's new

First Mistral "merged" model: one set of weights replaces Medium 3.1 (chat), Magistral (reasoning), and Devstral 2 (coding agent). This is the architectural story.
reasoning_effort parameter selects between fast chat (none) and extended thinking (high) per request — one endpoint, two behaviours.
Open weights under a modified-MIT license — a shift from Medium 3.1, which was Premier-only (closed). NOTE: the modified-MIT carve-out requires large-revenue companies to negotiate a separate commercial arrangement; it is not a clean Apache 2.0.
128B dense (not MoE), self-hostable on roughly 4 GPUs in quantised form — far lighter than Large 3's 675B MoE.
SWE-bench Verified 77.6% (up from Devstral 2's 72.2%) and tau3-Telecom 91.4%.
Ships paired with the Mistral Vibe CLI, a terminal-native coding agent that can open PRs; Mistral migrated Vibe from Devstral 2 to Medium 3.5.

Benchmarks

Benchmark	Score	Source
TAU-bench	91.4%	huggingface.co 2026-04-29T00:00:00.000Z
SWE-bench Verified	77.6%	huggingface.co 2026-04-29T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker8.5/10

“One open-weight endpoint that retires my chat, reasoning, and coding models at once — but I had to read the license twice before I trusted it.”

Medium 3.5 is the most consequential consolidation play of 2026: a single SKU lands within two points of Claude Sonnet 4.5 on SWE-bench at under half the cost and lets me retire a multi-model routing layer. For an EU-sovereign agent platform it is the new default. The catch is governance, not capability: the modified-MIT license carves out large-revenue companies, so an enterprise buyer must verify whether they fall above the threshold and budget for a commercial deal if so. Mistral also withheld the hard reasoning benchmarks, signalling deliberate optimisation for coding/agentic over pure analysis. For coding agents, ideal; for analytical workloads, benchmark against Magistral first.

Strategic Fit 9Vendor Risk 8Roadmap Confidence 8

Pros

SKU consolidation, near-Sonnet coding, EU residency

Cons

modified-MIT carve-out
withheld reasoning benchmarks

Right for: EU agent platforms below the revenue threshold

Avoid if: you assumed Apache 2.0 and you're a large enterprise

Domain Strategist8.5/10

“Mistral's bet is that 'merged model + open weights + EU residency' beats a zoo of specialist endpoints — and on developer mindshare it's working.”

The merged-model strategy is a sharp competitive move: it reframes Mistral's lineup from a confusing family of specialists into one coherent flagship, which is easier to sell and adopt. Positioned against Claude Sonnet and GPT-5 mini, the wedge is open weights plus EU residency at near-parity coding capability. The modified-MIT license is a calculated middle path — open enough to win indie and mid-market mindshare, monetised at the enterprise tier where Mistral needs revenue. The risk is messaging: "open weights" with a revenue asterisk invites the same criticism Meta's Llama license drew. Market timing is strong, riding the agentic-coding wave and the EU AI Act compliance tailwind.

Competitive Positioning 9Differentiation 8Market Timing 9

Pros

coherent flagship narrative
agentic-coding timing

Cons

"open-with-asterisk" messaging risk

Right for: products selling EU-sovereign coding

Avoid if: your buyers demand truly unrestricted licensing

Finance Lead8.5/10

“Coding capability at a third of Sonnet's price — but the $7.50 output rate and a possible enterprise license fee both need to be in the model.”

The API headline is strong: $1.50/$7.50 puts near-Sonnet agentic coding at roughly a third of Claude's cost, and for a team running thousands of agent calls daily the monthly delta is tens of thousands of dollars at scale. Two cautions temper it. First, the $7.50 output rate is high for the Medium tier and agentic loops are output-heavy, so cache and structured-output discipline matter. Second, the modified-MIT license may carry a separate commercial fee for large enterprises — a line item that does not exist for Apache models like Large 3 or Small 4. Self-host on ~4 GPUs converts opex to capex for steady-state teams below the threshold. Strong unit economics with two asterisks.

Cost Efficiency 8Pricing Transparency 8Value per Dollar 9

Pros

a third of Sonnet's price
~4-GPU self-host

Cons

high output rate
possible enterprise license fee

Right for: high-volume coding below the revenue threshold

Avoid if: enterprise-scale where the license fee erodes the saving

Domain Practitioner9/10

“The first Mistral that genuinely competes with Sonnet for agentic coding — and Vibe CLI is a real Claude Code rival with first-party integration.”

For a builder this is the standout. SWE-bench 77.6% is real-world useful: it plans multi-file edits and opens credible PRs. Vibe CLI is a legitimate Cursor/Claude Code competitor with native integration. The reasoning_effort dial means I don't fire a separate Magistral call when I want deeper thinking — one model, one schema. Function calling and JSON output are the best in the family, and open weights give a self-host escape hatch. Negatives are familiar: docs are thinner than Anthropic's, function-calling occasionally drifts on complex schemas, and 256K context can spike latency. Best price-to-capability ratio in the open-weight coding tier.

API Ergonomics 9Tool/Agent Support 9Reliability 8

Pros

near-Sonnet coding, Vibe CLI, one-SKU reasoning

Cons

thinner docs, occasional FC drift

Right for: agentic coding builders

Avoid if: you need the deepest tooling ecosystem and SLAs

Power User7.5/10

“I rarely talk to it directly — it lives behind my IDE and agents — but with high effort it's noticeably more careful.”

End users mostly meet Medium 3.5 through IDEs and agents rather than chat. In Le Chat with reasoning_effort=high, responses are slower than the chat default but markedly more careful on multi-step questions; at none it is responsive. European-language quality remains excellent and refusals are moderate. It feels engineered for tasks, not conversation — less warm and less expressive than Claude or GPT-5 for casual use. As a daily driver it is a strong work tool and a merely-adequate chat companion.

Output Quality 7.5Speed 7Everyday Usefulness 8

Pros

careful with high effort, strong EU languages

Cons

task-engineered, not warm
high-effort latency

Right for: developer daily use

Avoid if: you want a friendly general chat partner

Skeptic7/10

“'Open-weight frontier coder' — except the license has a revenue cliff and the hard reasoning benchmarks went unpublished. Two asterisks on one launch.”

The coding numbers are credible and Vibe CLI is genuinely good, so this isn't a hollow launch. But two claims deserve scrutiny. First, "open weights": the modified-MIT license carves out large-revenue companies, so the unqualified "open" framing misleads exactly the enterprise buyers who most need to know. Second, "frontier": Mistral published SWE-bench and tau3-Telecom — where it leads — and withheld GPQA Diamond, MMLU-Pro, and LiveCodeBench, the reasoning benchmarks where it would be measured against Claude and GPT-5. The honest claim is "best open-weight agentic coder at this price, for non-enterprise users." Buy it for coding economics, read the license, and don't assume reasoning parity with US frontier models.

Claim Accuracy 6Weakness Severity 6Hype vs Reality 7

Pros

real coding capability, real CLI

Cons

license asterisk
selective benchmarks

Right for: non-enterprise coding teams who read the fine print

Avoid if: you took "open frontier" at face value

Strengths

One of very few open-weight models above 77% SWE-bench Verified at sub-$10 output pricing.
Single SKU replaces a three-model stack (chat + reasoning + coding) — major architecture simplification.
reasoning_effort dial removes the need for a separate reasoning model call.
256K context handles full-repo agentic tasks; best-in-class function calling and JSON.
Self-hostable on ~4 GPUs — far lighter than Large 3.
Ships with first-party Vibe CLI.

Limitations

License is modified-MIT, NOT Apache 2.0 — large-revenue enterprises must pay for a commercial arrangement. This is a real procurement gotcha.
Output price of $7.50 is high for the Medium tier — nearly 4x Medium 3.1's $2.00.
Mistral withheld GPQA Diamond, MMLU-Pro, and LiveCodeBench at launch — hard to benchmark on pure reasoning.
128B dense raises the self-host memory floor versus a sparse model of similar nominal size.
Still trails Claude Sonnet 4.5 and GPT-5 on the hardest reasoning evals.
Younger in production than Large 3; less third-party tooling so far.

Best use cases

Agentic coding pipelines: PR generation, repo refactoring, multi-file edits via Vibe CLI.
Replacing a multi-model stack (chat + reasoning + coding) with one endpoint.
EU enterprises (below the revenue threshold) needing sovereign, self-hostable agent infrastructure.
Long-context code review across 256K tokens of repo context.
Tool-heavy agent workflows where tau-style benchmarks predict real-world performance.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture Capabilities Benchmark analysis Speed & latency Pricing analysis Deployment & access Safety & privacy Ecosystem & tooling

Architecture

Medium 3.5 is a dense 128B-parameter transformer — notably not a Mixture-of-Experts, unlike Large 3 and Small 4. The vision encoder was trained from scratch to handle variable image sizes and aspect ratios. The defining design choice is the merge: rather than maintaining separate chat, reasoning, and coding checkpoints, Mistral folded all three into one model whose behaviour is steered at inference via reasoning_effort (none for fast responses, high for chain-of-thought on complex/agentic prompts). 256K context. Tokenizer is mistral_common. Layer count, attention type, training-token count, and vocab size are undisclosed. Being dense rather than sparse, all 128B parameters are active per token, which raises the self-host memory floor relative to a sparse model of similar nominal size but simplifies serving.

Capabilities

Medium 3.5 is engineered for agentic and coding workloads (cap_agentic 9.0, cap_coding 8.5, cap_function_calling 9.0). SWE-bench Verified 77.6% lands within two points of Claude Sonnet 4.5 at under half the price, and tau3-Telecom 91.4% leads its price tier on agentic tool use. With reasoning_effort=high it produces extended reasoning before answering (cap_reasoning 7.5); at none it behaves as a fast chat model. Native function calling and JSON output are best-in-class for the family. The 256K context handles full-repo agentic tasks (cap_long_context 8.0). Multilingual quality carries through the Mistral family across 24 languages (cap_multilingual 8.5). Native vision (cap_vision 7.0) handles charts and screenshots. It is less of a creative-writing or pure-chat model than Large 3 — it is tuned for tasks (cap_creative_writing 7.0). No native real-time retrieval (cap_realtime_data 0.0).

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
SWE-bench Verified	77.6%	+5.4pp vs Devstral 2 (72.2%)	-2pp vs Claude Sonnet 4.5 (~77-79%)	HF card
tau3-Telecom (agentic)	91.4%	new	category leader at price tier	HF card

Mistral describes "strong results" on instruction-following, reasoning, and coding but withheld numeric GPQA Diamond, MMLU-Pro, LiveCodeBench, and AIME scores at launch. Benchmark coverage is partial; only the two sourced agentic/coding scores are recorded, the rest null. This is a recurring Mistral pattern: publish the metrics where the model leads (agentic/coding), withhold the standard reasoning suite.

Speed & latency

Mistral has not published official throughput figures and Artificial Analysis had not posted a stable measurement at time of verification, so output_speed_tps and time_to_first_token_s are null. Behaviourally, reasoning_effort=none is fast chat-tier; reasoning_effort=high is meaningfully slower and produces more output tokens. As a 128B dense model it sits in the medium latency tier — heavier than Small 4, lighter to serve interactively than Large 3's MoE in many configurations.

Pricing analysis

Surface	Cost	Notes
API input	$1.50 / 1M tok	La Plateforme
API output	$7.50 / 1M tok	La Plateforme; high for the Medium tier
Batch (in/out)	$0.75 / $3.75	~50% async discount
Direct UI	EUR 14.99/mo (~$15)	Le Chat Pro
Vibe CLI	included	first-party terminal agent
Free tier	~25 msg/day (Le Chat)	no card
Self-host	modified-MIT	weights on Hugging Face, ~4 GPUs quantised
Cloud	Bedrock, Azure AI Foundry	managed

Deployment & access

Weights are published on Hugging Face (FP8/NVFP4/BF16/GGUF) under a modified-MIT license. CRITICAL LICENSE NOTE: this is not Apache 2.0. The modified-MIT terms are open for research and commercial use by most parties but require companies above a revenue threshold to negotiate a separate commercial agreement with Mistral. For startups and mid-market this is effectively open; for large enterprises it is a paid arrangement — a two-tier model that buyers must check against their own revenue. Dense 128B self-hosts on roughly 4 GPUs quantised (~80GB+ VRAM viable at FP8/NVFP4). Managed on Bedrock and Azure AI Foundry; La Plateforme EU-hosted by default. Pairs with the open-source Vibe CLI for terminal-native agentic coding.

Safety & privacy

Same posture as the Mistral lineup: GDPR-native French company, SOC 2 Type II, ISO 27001/27701, EU AI Act aligned, EU data residency by default, 30-day abuse-monitoring retention, no training on inputs unless opt-in, Zero Data Retention available. No built-in moderation; the separate Mistral Moderation API (mistral-moderation-2603) is the guardrail layer. Refusal calibration is moderate and task-oriented.

Ecosystem & tooling

SDKs in Python and TypeScript/JavaScript; integrations with LangChain, LlamaIndex, Vercel AI SDK, and the first-party Mistral Vibe CLI. Powers Le Chat, Mistral Code, and Vibe CLI. Open weights (modified-MIT) drive a growing self-host community with 300k+ monthly Hugging Face downloads. Popularity is growing, strongest among EU developers building agentic coding tools.

Buyer questions

Is it actually open weights?

Yes, but under a modified-MIT license, not Apache 2.0. It is open for research and commercial use by most parties; companies above a revenue threshold must negotiate a separate commercial agreement with Mistral. Verify your revenue against the threshold.

How good is it at coding?

SWE-bench Verified 77.6%, within ~2pp of Claude Sonnet 4.5, and tau3-Telecom 91.4% for agentic tool use. Strong, real-world useful.

Why is output so expensive?

$7.50/1M output is high for the Medium tier; agentic loops are output-heavy, so use caching, structured output, and batch where possible.

What's the self-host footprint?

Dense 128B runs on roughly 4 GPUs quantised (~80GB+ VRAM at FP8/NVFP4).

Do I still need Magistral or Devstral 2?

Usually not — reasoning_effort=high covers reasoning and Medium 3.5 beats Devstral 2 on SWE-bench. Devstral 2 remains relevant only as a cheaper output-token option.

Where does my data live?

EU by default on La Plateforme; 30-day abuse retention, no training on inputs unless opt-in, ZDR available.

What is Vibe CLI?

A first-party open-source terminal coding agent (Cursor/Claude Code style) that now defaults to Medium 3.5 and can open PRs.

Comparable models

Claude Sonnet 4.5:

Leads SWE-bench by ~2pp, costs ~2x, closed weights — the capability-vs-price-and-openness trade.

GPT-5 mini:

Similar price tier, weaker SWE-bench, no open weights, broader ecosystem.

Devstral 2 (Mistral):

Its predecessor in the coding role; 5.4pp lower SWE-bench but 3.75x cheaper on output ($0.90 vs $7.50) — the budget alternative inside Mistral's own lineup.

Large 3: — Mistral

Broader generalist with a cleaner Apache 2.0 license, but weaker on coding/agentic specifically.

Sources

Primary references used to verify this review.

Model specs

Input price: $1.50 / Mtok
Output price: $7.50 / Mtok
Cached input: —
Batch (in/out): $0.75 / $3.75
Context window: 256K tokens
Max output: 33K tokens
Knowledge cutoff: 2026-02
Released: 2026-04-28
Modalities: text, image → text
Output speed: Not profiled
License: Open weights (custom-modified-mit)
Clouds: Bedrock, Azure AI Foundry

Does not train on API inputs by default

Other Mistral Medium versions

Mistral Medium 3.17.7

Last verified 2026-05-27