Magistral Medium 1.2

GA · Latest Reasoning

by Mistral AI · Magistral family · best for auditable multilingual reasoning

Reasoning · Multimodal
7.2
AI Panel Score
Value 7.0/10

Magistral Medium 1.2 (alias magistral-medium-2509, shipped 18 September 2025) is Mistral's flagship reasoning model: a Premier (closed-weight), multimodal model that produces visible chain-of-thought before answering, with a 40,960-token output ceiling for long traces. It posts AIME24 91.82%, AIME25 83.48%, and GPQA Diamond 76.26%, and its reasoning traces stay in the user's language, a genuine multilingual-reasoning differentiator. It is priced at $2.00 input / $5.00 output per 1M tokens. The buyer's sentence: the right pick for auditable, multilingual hard-reasoning routing, though Medium 3.5's reasoning knob has narrowed its exclusive niche.

  • Provider: Mistral AI (Paris, France)
  • Release: 2025-09-18, status GA
  • Context: 131,072 tokens; max output 40,960 (long thinking traces)
  • Modalities: text + image in, text out (native multimodal)
  • Knowledge cutoff: ~June 2025
  • Headline price: $2.00 input / $5.00 output per 1M tokens
  • Architecture: dense (parameter count undisclosed; Premier/closed)
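The context and output limits quoted above imply a simple budget check before sending a long prompt. The sketch below is illustrative only: it assumes the prompt and the generated tokens share the 131,072-token window, which this review does not confirm, and `output_budget` is a hypothetical helper name rather than part of any Mistral SDK.

```python
# Figures quoted in this review.
CONTEXT_WINDOW = 131_072  # total context, in tokens
MAX_OUTPUT = 40_960       # output ceiling, in tokens


def output_budget(prompt_tokens: int) -> int:
    """Return the largest output allowance that still fits.

    Assumption (not confirmed by the review): prompt and output
    tokens are drawn from the same context window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    return max(0, min(MAX_OUTPUT, remaining))


if __name__ == "__main__":
    for prompt in (10_000, 100_000, 131_072):
        print(prompt, "->", output_budget(prompt))
```

Under that assumption, the full 40,960-token ceiling is only available while the prompt stays at or below 90,112 tokens.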

What's new

  • Second iteration of Mistral's reasoning line (1.0 launched June 2025 alongside the open-weight Magistral Small).
  • 1.2 adds vision input — the original Magistral was text-only. Both Medium and Small 1.2 gained a vision encoder.
  • Large jumps over prior versions: AIME24 91.82% (from 73.59% on 1.0), AIME25 83.48% (from 60.99% on 1.1), GPQA Diamond 76.26% (from 70.83%).
  • Improved chain-of-thought stability and tool use during reasoning.
  • Positioned as Mistral's answer to OpenAI o-series, DeepSeek R1, and Claude/GPT thinking tiers.

Benchmarks

| Benchmark | Score | Source |
|---|---|---|
| MMMU | 70% | arxiv.org, 2025-09-18 |
| AIME 2025 | 83.48% | apidog.com, 2025-09-18 |
| GPQA Diamond | 76.26% | apidog.com, 2025-09-18 |
| Artificial Analysis Index | 27 | artificialanalysis.ai, 2026-05-28 |

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker · 7/10
It made strategic sense in late 2025 as our reasoning answer to o-series and R1; by mid-2026 Medium 3.5's reasoning knob has weakened the case for a separate endpoint.

At launch Magistral Medium 1.2 was the obvious dedicated-reasoning bet — strong math, visible traces, multilingual reasoning. The strategic picture shifted when Mistral's own Medium 3.5 added a `reasoning_effort` knob in one open-weight SKU. I would still route the hardest math and PhD-class science to Magistral, the most reasoning-pure model in the lineup, but for general agent loops I now default to Medium 3.5. Closed weights at the Medium tier (vs the open Magistral Small) is a strike. The auditable-trace angle keeps it relevant for compliance-driven reasoning where you must show the work.

Strategic Fit 7 · Vendor Risk 7 · Roadmap Confidence 7
Pros
  • reasoning-pure, auditable, multilingual traces
Cons
  • closed weights
  • cannibalised by Medium 3.5
Right for: hard-reasoning and compliance routing
Avoid if: a reasoning knob on a generalist suffices
Domain Strategist · 6.5/10
A credible reasoning model whose distinct niche Mistral itself collapsed — its clearest remaining moat is auditable, in-language reasoning.

Magistral's positioning as the European reasoning specialist was sound, but the strategic problem is internal cannibalisation: Medium 3.5 absorbed the reasoning use case into a cheaper-to-reason, open-weight generalist. Against DeepSeek R1 and Claude/GPT thinking it competes on multilingual traces and price-per-thought, not on top-end reasoning (AA Index 27). The durable differentiators are the visible, in-user-language reasoning trace (valuable for regulated, multilingual auditability) and the open Magistral Small baseline. As a standalone product its market timing has passed; as a compliance-reasoning niche it holds.

Competitive Positioning 6 · Differentiation 7 · Market Timing 6
Pros
  • auditable in-language reasoning niche
Cons
  • cannibalised
  • sub-frontier aggregate
Right for: regulated multilingual reasoning
Avoid if: you want the strongest reasoning regardless
Finance Lead · 7/10
$2/$5 is reasonable per token, but long traces inflate output 3-5x — a single hard question can cost 20K tokens, so predictability is poor.

The per-token rate ($2.00/$5.00) is cheaper than Claude Opus thinking, but reasoning traces inflate output counts so real bills run 3-5x what an equivalent non-reasoning call would cost — and a single hard math question can emit 20K+ tokens. The ~50% batch discount helps for async work. The honest financial read: poor predictability makes this the wrong tool for high-volume use and the right tool for selective hard-problem routing where the extra cost buys a correct answer. Cache discipline matters. For volume reasoning, Small 4's reasoning mode or Medium 3.5 is cheaper.

Cost Efficiency 6 · Pricing Transparency 7 · Value per Dollar 7
Pros
  • cheaper per token than Opus thinking
  • batch discount
Cons
  • trace-inflated, unpredictable output
Right for: selective hard-problem routing
Avoid if: high-volume reasoning on a fixed budget
Domain Practitioner · 7.5/10
The visible reasoning trace is debugger-friendly — I can see what it's thinking and tune prompts — and in-language traces are a real win for EU teams.

For a builder the traces are the feature: I can watch the model reason and adjust prompts accordingly, and vision-input reasoning genuinely helps with chart/diagram debugging. The long output ceiling lets it fully work a problem. Multilingual traces are useful when collaborating with EU-language teammates. Negatives: latency is meaningfully higher than non-reasoning models, output bills add up fast, and it occasionally over-thinks simple questions. For most coding work I now reach for Medium 3.5 at high reasoning effort, which feels more pragmatic — but for pure reasoning observability, Magistral is the better instrument.

API Ergonomics 8 · Tool/Agent Support 7 · Reliability 8
Pros
  • auditable traces, vision reasoning, long output
Cons
  • latency, cost, over-thinks easy tasks
Right for: reasoning observability and debugging
Avoid if: you want low-latency throughput
Power User · 7.5/10
Noticeably smarter on hard questions, and the visible thinking trace is confidence-building — worth the 5-30 second wait when I want a careful answer.

Magistral feels markedly smarter on hard questions — math, science, multi-step reasoning — than the chat-tier Mistrals. The visible thinking trace is novel and confidence-building for users who want to see the work. Latency is the felt cost: questions take 5-30 seconds before the final answer streams. For everyday chat it is overkill; for the moments I really want a careful answer, the wait is worth it. Multilingual reasoning (traces in my own language) is a unique strength I haven't found elsewhere.

Output Quality 8 · Speed 5.5 · Everyday Usefulness 7
Pros
  • smart on hard problems, visible traces, in-language
Cons
  • slow
  • overkill for chat
Right for: deliberate hard-question sessions
Avoid if: you want fast everyday answers
Skeptic · 6.5/10
AIME 91.82% leads the press release; the AA Index of 27 against a 36 median is the line they didn't print. Math-strong, broadly mid-tier.

Magistral published real numbers, which is creditable, and the AIME math results are genuinely strong. But the framing cherry-picks: AIME and GPQA headline, while the broad reasoning aggregate (Artificial Analysis Index 27, flagged "below average") and the slow 39 tps go unmentioned. "Frontier-class reasoning" is generous — it is frontier on competition math and mid-tier on the wider reasoning surface, behind DeepSeek R1 and Claude/GPT thinking. Add the trace-inflated, unpredictable cost and Medium 3.5's encroachment, and the honest claim is "strong, auditable math-reasoning specialist," not a frontier reasoner. Route hard math here; don't expect across-the-board reasoning leadership.

Claim Accuracy 6 · Weakness Severity 6 · Hype vs Reality 7
Pros
  • real math numbers, auditable traces
Cons
  • selective framing
  • mid-tier aggregate
  • slow
Right for: math-reasoning routing
Avoid if: you took "frontier reasoning" literally

Strengths

  • Dedicated reasoning specialist with visible traces — useful for auditability and debugging.
  • Strong competition math: AIME24 91.82% edges DeepSeek R1.
  • Multilingual reasoning — traces stay in the user's language.
  • Vision input for diagram/chart reasoning.
  • Long 40K output ceiling for complex multi-step problems.
  • Apache 2.0 Magistral Small 1.2 available as a self-host baseline.

Limitations

  • Output cost ($5.00/M) plus long thinking traces means real-world bills run high and unpredictable.
  • AA Intelligence Index 27 (vs ~36 median) — strong on math, mid-tier on the broad reasoning aggregate; trails DeepSeek R1 and Claude/GPT thinking on the hardest evals.
  • Medium 3.5 now offers configurable reasoning in one SKU, eroding Magistral's positioning.
  • Closed weights at the Medium tier (only Magistral Small is open).
  • Slow: 38.9 tps, high per-answer latency.

Best use cases

  • Math-heavy applications (tutoring, scientific computing, quantitative analysis).
  • Complex coding problems where reasoning beats pattern-matching.
  • Auditable reasoning (compliance, legal analysis, due diligence) where the visible trace is the point.
  • Multilingual reasoning where traces should stay in a non-English language.
  • Agent loops needing a planning brain at an affordable price-per-thought.

Buyer questions

Can I self-host it?

No — Magistral Medium 1.2 is Premier/closed. The open-weight reasoning option is Magistral Small 1.2 (Apache 2.0), which fits on a MacBook.

Why are my bills unpredictable?

Reasoning traces count as output tokens; a hard question can emit 20K+ tokens. Budget for 3-5x the output of an equivalent non-reasoning call.
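To make the budgeting advice concrete, here is a rough cost sketch using the list prices quoted in this review ($2.00 input / $5.00 output per 1M tokens, batch at roughly half). The function names and the 4,000-token baseline are hypothetical illustrations, not measured figures, and the 3-5x multiplier is the review's rule of thumb rather than a guarantee.

```python
INPUT_USD_PER_MTOK = 2.00   # list price quoted in this review
OUTPUT_USD_PER_MTOK = 5.00  # list price quoted in this review
BATCH_FACTOR = 0.5          # batch pricing is listed at about half


def call_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimated USD cost of one call; trace tokens count as output."""
    cost = (input_tokens / 1_000_000) * INPUT_USD_PER_MTOK
    cost += (output_tokens / 1_000_000) * OUTPUT_USD_PER_MTOK
    return cost * BATCH_FACTOR if batch else cost


def reasoning_cost_range(input_tokens: int, baseline_output_tokens: int) -> tuple[float, float]:
    """Low/high estimate using the 3-5x output inflation rule of thumb."""
    low = call_cost(input_tokens, 3 * baseline_output_tokens)
    high = call_cost(input_tokens, 5 * baseline_output_tokens)
    return low, high


if __name__ == "__main__":
    # Hypothetical call: 1,000 prompt tokens, 4,000-token non-reasoning baseline.
    low, high = reasoning_cost_range(1_000, 4_000)
    print(f"per call: ${low:.3f} to ${high:.3f}")
    # A single hard question emitting 20,000 output tokens.
    print(f"20K-token answer: ${call_cost(1_000, 20_000):.3f}")
```

At these prices a 20,000-token answer costs about ten cents, nearly all of it output, which is why per-call spend tracks trace length rather than prompt size.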

Do I still need it now that Medium 3.5 reasons?

For most agent/coding work, Medium 3.5 at high effort suffices and is open-weight. Keep Magistral for the hardest math and for auditable, in-language reasoning traces.
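The routing advice above can be sketched as a small selection function. This is a minimal illustration, not a recommended production router: the task labels are invented for the example, and `mistral-medium-3.5` is a placeholder identifier (the review does not give Medium 3.5's API name). Only `magistral-medium-2509` comes from the review.

```python
MAGISTRAL = "magistral-medium-2509"  # alias given in this review
GENERALIST = "mistral-medium-3.5"    # hypothetical placeholder ID

# Hypothetical task labels for the cases the review reserves for Magistral.
HARD_REASONING_TASKS = {"hard_math", "phd_science"}


def pick_model(task: str, needs_audit_trail: bool = False) -> str:
    """Choose a model following the review's routing advice.

    Magistral handles the hardest math and science and any request
    that must show an auditable reasoning trace; everything else
    goes to the generalist.
    """
    if needs_audit_trail or task in HARD_REASONING_TASKS:
        return MAGISTRAL
    return GENERALIST


if __name__ == "__main__":
    print(pick_model("hard_math"))
    print(pick_model("agentic_coding"))
    print(pick_model("contract_review", needs_audit_trail=True))
```

Keeping the rule in one function makes it easy to revisit as the two models' pricing and capabilities change.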

Does it reason in my language?

Yes — a key differentiator: reasoning traces stay in the user's language rather than defaulting to English.

How does it compare on math vs reasoning broadly?

Strong on competition math (AIME24 91.82%); mid-tier on the broad reasoning aggregate (AA Index 27).

Is it fast?

No — ~39 tps and high per-answer latency. Use it for deliberate hard questions, not interactive chat.
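For a rough sense of what that throughput means per answer, generation time can be approximated as output tokens divided by tokens per second. The sketch below uses the ~38.9 tok/s figure from this review and assumes trace tokens are generated at the same steady rate; it ignores queueing, network time, and time to first token, so treat the results as approximations.

```python
TOKENS_PER_SECOND = 38.9  # output speed quoted in this review


def generation_seconds(output_tokens: int, tps: float = TOKENS_PER_SECOND) -> float:
    """Approximate generation time, assuming a constant token rate."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return output_tokens / tps


if __name__ == "__main__":
    # 1,000 tokens: a short trace; 20,000: a hard question per this
    # review; 40,960: the full output ceiling.
    for tokens in (1_000, 20_000, 40_960):
        secs = generation_seconds(tokens)
        print(f"{tokens:>6} tokens: about {secs:.0f} s ({secs / 60:.1f} min)")
```

On these assumptions a 20,000-token trace takes several minutes to generate, which is worth factoring into timeouts for synchronous callers.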

Comparable models

**DeepSeek R1:** Cheaper output, stronger on the broad reasoning aggregate, weaker multilingual traces.
**Claude Opus 4.7 (thinking) / GPT-5 (thinking):** More expensive, stronger frontier reasoning, closed weights, no in-user-language visible trace by default.
**Mistral Medium 3.5 (high reasoning effort):** Now a viable substitute at a similar price tier, open weights, beats Magistral on agentic coding.
**Magistral Small 1.2 (Mistral):** The open-weight (Apache 2.0) sibling — self-hostable reasoning baseline at lower capability.

Model specs

Input price: $2 / Mtok
Output price: $5 / Mtok
Cached input: not listed
Batch (in/out): $1 / $2.50
Context window: 131K tokens
Max output: 41K tokens
Knowledge cutoff: 2025-06
Released: 2025-09-17
Modalities: text, image → text
Output speed: ~38.9 tok/s
License: Proprietary
Clouds: Azure AI Foundry

Does not train on API inputs by default

Last verified 2026-05-27