by Mistral AI · Magistral family · best for auditable multilingual reasoning
Magistral Medium 1.2 (alias magistral-medium-2509, shipped 18 September 2025) is Mistral's flagship reasoning model: a Premier/closed-weight, multimodal model that produces visible chain-of-thought before answering, with a 40,960-token output ceiling for long traces. It posts AIME24 91.82% / AIME25 83.48% and GPQA Diamond 76.26%, and its reasoning traces stay in the user's language, a genuine multilingual-reasoning differentiator. It is priced at $2.00 input / $5.00 output per 1M tokens. The buyer's sentence: the right pick for auditable, multilingual hard-reasoning routing, though Medium 3.5's reasoning knob has narrowed its exclusive niche.

- Provider: Mistral AI (Paris, France)
- Release: 2025-09-18, status GA
- Context: 131,072 tokens; max output 40,960 (long thinking traces)
- Modalities: text + image in, text out (native multimodal)
- Knowledge cutoff: ~June 2025
- Headline price: $2.00 input / $5.00 output per 1M tokens
- Architecture: dense (parameter count undisclosed; Premier/closed)
| Benchmark | Score | Source |
|---|---|---|
| MMMU | 70% | arxiv.org (2025-09-18) |
| AIME 2025 | 83.48% | apidog.com (2025-09-18) |
| GPQA Diamond | 76.26% | apidog.com (2025-09-18) |
| Artificial Analysis Index | 27 | artificialanalysis.ai (2026-05-28) |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“It made strategic sense in late 2025 as our reasoning answer to o-series and R1; by mid-2026 Medium 3.5's reasoning knob has weakened the case for a separate endpoint.”
At launch Magistral Medium 1.2 was the obvious dedicated-reasoning bet: strong math, visible traces, multilingual reasoning. The strategic picture shifted when Mistral's own Medium 3.5 added a `reasoning_effort` knob in one open-weight SKU. I would still route the hardest math and PhD-class science to Magistral, the most reasoning-pure model in the lineup, but for general agent loops I now default to Medium 3.5. Closed weights at the Medium tier (versus the open Magistral Small) are a strike against it. The auditable-trace angle keeps it relevant for compliance-driven reasoning where you must show the work.
“A credible reasoning model whose distinct niche Mistral itself collapsed — its clearest remaining moat is auditable, in-language reasoning.”
Magistral's positioning as the European reasoning specialist was sound, but the strategic problem is internal cannibalisation: Medium 3.5 absorbed the reasoning use case into a cheaper-to-reason, open-weight generalist. Against DeepSeek R1 and Claude/GPT thinking it competes on multilingual traces and price-per-thought, not on top-end reasoning (AA Index 27). The durable differentiators are the visible, in-user-language reasoning trace (valuable for regulated, multilingual auditability) and the open Magistral Small baseline. As a standalone product its market timing has passed; as a compliance-reasoning niche it holds.
“$2/$5 is reasonable per token, but long traces inflate output 3-5x — a single hard question can cost 20K tokens, so predictability is poor.”
The per-token rate ($2.00/$5.00) is cheaper than Claude Opus thinking, but reasoning traces inflate output counts so real bills run 3-5x what an equivalent non-reasoning call would cost — and a single hard math question can emit 20K+ tokens. The ~50% batch discount helps for async work. The honest financial read: poor predictability makes this the wrong tool for high-volume use and the right tool for selective hard-problem routing where the extra cost buys a correct answer. Cache discipline matters. For volume reasoning, Small 4's reasoning mode or Medium 3.5 is cheaper.
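The arithmetic behind that bill can be sketched in a few lines. This is a rough estimator, not an official pricing calculator: it uses the $2.00/$5.00 per-1M-token rates quoted above, treats the "~50%" batch discount as exactly 50% applied to the whole call (an assumption), and the function name and example token counts are made up for illustration.

```python
# Prices from the review: $2.00 input / $5.00 output per 1M tokens.
INPUT_USD_PER_M = 2.00
OUTPUT_USD_PER_M = 5.00
BATCH_DISCOUNT = 0.50  # assumption: "~50%" treated as exactly 50% of the whole call


def estimate_cost_usd(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate the cost of one call; reasoning-trace tokens are billed as output."""
    cost = (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000
    if batch:
        cost *= 1 - BATCH_DISCOUNT
    return cost


if __name__ == "__main__":
    # Hypothetical token counts: a short answer versus a hard question with a long trace.
    print(f"4K-token answer:        ${estimate_cost_usd(1_000, 4_000):.4f}")
    print(f"20K-token trace:        ${estimate_cost_usd(1_000, 20_000):.4f}")
    print(f"20K-token trace, batch: ${estimate_cost_usd(1_000, 20_000, batch=True):.4f}")
```

Because output tokens cost 2.5 times as much as input tokens, the trace length dominates the estimate, which is why the per-call cost is hard to predict in advance.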
“The visible reasoning trace is debugger-friendly — I can see what it's thinking and tune prompts — and in-language traces are a real win for EU teams.”
For a builder the traces are the feature: I can watch the model reason and adjust prompts accordingly, and vision-input reasoning genuinely helps with chart/diagram debugging. The long output ceiling lets it fully work a problem. Multilingual traces are useful when collaborating with EU-language teammates. Negatives: latency is meaningfully higher than non-reasoning models, output bills add up fast, and it occasionally over-thinks simple questions. For most coding work I now reach for Medium 3.5 at high reasoning effort, which feels more pragmatic — but for pure reasoning observability, Magistral is the better instrument.
“Noticeably smarter on hard questions, and the visible thinking trace is confidence-building — worth the 5-30 second wait when I want a careful answer.”
Magistral feels markedly smarter on hard questions — math, science, multi-step reasoning — than the chat-tier Mistrals. The visible thinking trace is novel and confidence-building for users who want to see the work. Latency is the felt cost: questions take 5-30 seconds before the final answer streams. For everyday chat it is overkill; for the moments I really want a careful answer, the wait is worth it. Multilingual reasoning (traces in my own language) is a unique strength I haven't found elsewhere.
“AIME 91.82% leads the press release; the AA Index of 27 against a 36 median is the line they didn't print. Math-strong, broadly mid-tier.”
Magistral published real numbers, which is creditable, and the AIME math results are genuinely strong. But the framing cherry-picks: AIME and GPQA headline, while the broad reasoning aggregate (Artificial Analysis Index 27, flagged "below average") and the slow 39 tps go unmentioned. "Frontier-class reasoning" is generous — it is frontier on competition math and mid-tier on the wider reasoning surface, behind DeepSeek R1 and Claude/GPT thinking. Add the trace-inflated, unpredictable cost and Medium 3.5's encroachment, and the honest claim is "strong, auditable math-reasoning specialist," not a frontier reasoner. Route hard math here; don't expect across-the-board reasoning leadership.
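To put the 39 tps figure in context, here is a back-of-envelope sketch of how long a trace takes to stream. It assumes a constant generation rate and ignores time to first token, queueing, and network overhead, so treat the results as rough lower bounds rather than measured latencies. The function name is illustrative.

```python
THROUGHPUT_TPS = 39.0  # tokens per second, the figure quoted in the review


def generation_seconds(output_tokens: int, tps: float = THROUGHPUT_TPS) -> float:
    """Time to stream output_tokens at a constant rate (ignores time to first token)."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return output_tokens / tps


if __name__ == "__main__":
    # 40,960 is the model's maximum output; the other sizes are hypothetical.
    for tokens in (1_000, 5_000, 20_000, 40_960):
        secs = generation_seconds(tokens)
        print(f"{tokens:>6} tokens -> {secs:7.0f} s ({secs / 60:.1f} min)")
```

Under these assumptions a roughly 1,000-token trace fits inside a half-minute wait, while a 20K-token trace takes several minutes.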
- Math-heavy applications (tutoring, scientific computing, quantitative analysis).
- Complex coding problems where reasoning beats pattern-matching.
- Auditable reasoning (compliance, legal analysis, due diligence) where the visible trace is the point.
- Multilingual reasoning where traces should stay in a non-English language.
- Agent loops needing a planning brain at an affordable price-per-thought.
No — Magistral Medium 1.2 is Premier/closed. The open-weight reasoning option is Magistral Small 1.2 (Apache 2.0), which fits on a MacBook.
Reasoning traces count as output tokens; a hard question can emit 20K+ tokens. Budget for 3-5x the output of an equivalent non-reasoning call.
For most agent/coding work, Medium 3.5 at high effort suffices and is open-weight. Keep Magistral for the hardest math and for auditable, in-language reasoning traces.
Yes — a key differentiator: reasoning traces stay in the user's language rather than defaulting to English.
Strong on competition math (AIME24 91.82%); mid-tier on the broad reasoning aggregate (AA Index 27).
No — ~39 tps and high per-answer latency. Use it for deliberate hard questions, not interactive chat.
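The routing advice in the answers above (hardest math and science plus auditable traces go to Magistral, most agent and coding work goes to Medium 3.5) could be expressed as a small dispatch rule. This is only a sketch: the `Task` fields and the `choose_model` function are hypothetical, and `mistral-medium-3.5` is a placeholder identifier, not a confirmed API model name.

```python
from dataclasses import dataclass

MAGISTRAL = "magistral-medium-2509"  # alias given in the review
GENERALIST = "mistral-medium-3.5"    # hypothetical placeholder id for Medium 3.5


@dataclass
class Task:
    kind: str                        # e.g. "math", "science", "coding", "agent", "chat"
    hard: bool = False               # caller's judgment that the problem is difficult
    needs_audit_trail: bool = False  # the visible reasoning trace must be kept and shown


def choose_model(task: Task) -> str:
    """Pick a model id following the review's routing advice."""
    if task.needs_audit_trail:
        return MAGISTRAL             # compliance-style work where the trace is the point
    if task.hard and task.kind in {"math", "science"}:
        return MAGISTRAL             # hardest math and PhD-class science
    return GENERALIST                # everything else, including interactive chat


if __name__ == "__main__":
    print(choose_model(Task("math", hard=True)))
    print(choose_model(Task("coding", hard=True)))
    print(choose_model(Task("chat", needs_audit_trail=True)))
```

Keeping the rule this narrow reflects the cost and latency caveats: only the cases where a long trace pays for itself reach the slower, pricier endpoint.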
Does not train on API inputs by default
Last verified 2026-05-27