Magistral Medium 1.2

GA · Latest · Reasoning

by Mistral AI · Magistral family · best for auditable multilingual reasoning

Reasoning · Multimodal

AI Panel Score: 7.2

Value 7.0/10

Magistral Medium 1.2 (alias magistral-medium-2509, shipped 18 September 2025) is Mistral's flagship reasoning model: a Premier/closed-weight, multimodal model that produces visible chain-of-thought before answering, with a 40,960-token output ceiling for long traces. It posts AIME24 91.82% / AIME25 83.48% and GPQA Diamond 76.26%, and its reasoning traces stay in the user's language, a genuine multilingual-reasoning differentiator. It is priced at $2.00 per million input tokens and $5.00 per million output tokens. The buyer's sentence: the right pick for auditable, multilingual hard-reasoning routing, though Medium 3.5's reasoning knob has narrowed its exclusive niche.

What's new

Second iteration of Mistral's reasoning line (1.0 launched June 2025 alongside the open-weight Magistral Small).
1.2 adds vision input — the original Magistral was text-only. Both Medium and Small 1.2 gained a vision encoder.
Large jumps over prior versions: AIME24 91.82% (from 73.59% on 1.0), AIME25 83.48% (from 60.99% on 1.1), GPQA Diamond 76.26% (from 70.83%).
Improved chain-of-thought stability and tool use during reasoning.
Positioned as Mistral's answer to OpenAI o-series, DeepSeek R1, and Claude/GPT thinking tiers.

Benchmarks

Benchmark	Score	Source	Date
MMMU	70%	arxiv.org	2025-09-18
AIME 2025	83.48%	apidog.com	2025-09-18
GPQA Diamond	76.26%	apidog.com	2025-09-18
Artificial Analysis Index	27	artificialanalysis.ai	2026-05-28

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker · 7/10

“It made strategic sense in late 2025 as our reasoning answer to o-series and R1; by mid-2026 Medium 3.5's reasoning knob has weakened the case for a separate endpoint.”

At launch Magistral Medium 1.2 was the obvious dedicated-reasoning bet: strong math, visible traces, multilingual reasoning. The strategic picture shifted when Mistral's own Medium 3.5 added a reasoning_effort knob in one open-weight SKU. I would still route the hardest math and PhD-class science to Magistral, the most reasoning-pure model in the lineup, but for general agent loops I now default to Medium 3.5. Closed weights at the Medium tier (vs the open Magistral Small) are a strike. The auditable-trace angle keeps it relevant for compliance-driven reasoning where you must show the work.

Strategic Fit 7 · Vendor Risk 7 · Roadmap Confidence 7

Pros

reasoning-pure, auditable, multilingual traces

Cons

closed weights
cannibalised by Medium 3.5

Right for: hard-reasoning and compliance routing

Avoid if: a reasoning knob on a generalist suffices

Domain Strategist · 6.5/10

“A credible reasoning model whose distinct niche Mistral itself collapsed — its clearest remaining moat is auditable, in-language reasoning.”

Magistral's positioning as the European reasoning specialist was sound, but the strategic problem is internal cannibalisation: Medium 3.5 absorbed the reasoning use case into a cheaper-to-reason, open-weight generalist. Against DeepSeek R1 and Claude/GPT thinking it competes on multilingual traces and price-per-thought, not on top-end reasoning (AA Index 27). The durable differentiators are the visible, in-user-language reasoning trace (valuable for regulated, multilingual auditability) and the open Magistral Small baseline. As a standalone product its market timing has passed; as a compliance-reasoning niche it holds.

Competitive Positioning 6 · Differentiation 7 · Market Timing 6

Pros

auditable in-language reasoning niche

Cons

cannibalised
sub-frontier aggregate

Right for: regulated multilingual reasoning

Avoid if: you want the strongest reasoning regardless

Finance Lead · 7/10

“$2/$5 is reasonable per token, but long traces inflate output 3-5x — a single hard question can cost 20K tokens, so predictability is poor.”

The per-token rate ($2.00/$5.00) is cheaper than Claude Opus thinking, but reasoning traces inflate output counts so real bills run 3-5x what an equivalent non-reasoning call would cost — and a single hard math question can emit 20K+ tokens. The ~50% batch discount helps for async work. The honest financial read: poor predictability makes this the wrong tool for high-volume use and the right tool for selective hard-problem routing where the extra cost buys a correct answer. Cache discipline matters. For volume reasoning, Small 4's reasoning mode or Medium 3.5 is cheaper.

Cost Efficiency 6 · Pricing Transparency 7 · Value per Dollar 7

Pros

cheaper per token than Opus thinking
batch discount

Cons

trace-inflated, unpredictable output

Right for: selective hard-problem routing

Avoid if: high-volume reasoning on a fixed budget

Domain Practitioner · 7.5/10

“The visible reasoning trace is debugger-friendly — I can see what it's thinking and tune prompts — and in-language traces are a real win for EU teams.”

For a builder the traces are the feature: I can watch the model reason and adjust prompts accordingly, and vision-input reasoning genuinely helps with chart/diagram debugging. The long output ceiling lets it fully work a problem. Multilingual traces are useful when collaborating with EU-language teammates. Negatives: latency is meaningfully higher than non-reasoning models, output bills add up fast, and it occasionally over-thinks simple questions. For most coding work I now reach for Medium 3.5 at high reasoning effort, which feels more pragmatic — but for pure reasoning observability, Magistral is the better instrument.

API Ergonomics 8 · Tool/Agent Support 7 · Reliability 8

Pros

auditable traces, vision reasoning, long output

Cons

latency, cost, over-thinks easy tasks

Right for: reasoning observability and debugging

Avoid if: you want low-latency throughput

Power User · 7.5/10

“Noticeably smarter on hard questions, and the visible thinking trace is confidence-building — worth the 5-30 second wait when I want a careful answer.”

Magistral feels markedly smarter on hard questions — math, science, multi-step reasoning — than the chat-tier Mistrals. The visible thinking trace is novel and confidence-building for users who want to see the work. Latency is the felt cost: questions take 5-30 seconds before the final answer streams. For everyday chat it is overkill; for the moments I really want a careful answer, the wait is worth it. Multilingual reasoning (traces in my own language) is a unique strength I haven't found elsewhere.

Output Quality 8 · Speed 5.5 · Everyday Usefulness 7

Pros

smart on hard problems, visible traces, in-language

Cons

slow
overkill for chat

Right for: deliberate hard-question sessions

Avoid if: you want fast everyday answers

Skeptic · 6.5/10

“AIME 91.82% leads the press release; the AA Index of 27 against a 36 median is the line they didn't print. Math-strong, broadly mid-tier.”

Magistral published real numbers, which is creditable, and the AIME math results are genuinely strong. But the framing cherry-picks: AIME and GPQA headline, while the broad reasoning aggregate (Artificial Analysis Index 27, flagged "below average") and the slow 39 tps go unmentioned. "Frontier-class reasoning" is generous — it is frontier on competition math and mid-tier on the wider reasoning surface, behind DeepSeek R1 and Claude/GPT thinking. Add the trace-inflated, unpredictable cost and Medium 3.5's encroachment, and the honest claim is "strong, auditable math-reasoning specialist," not a frontier reasoner. Route hard math here; don't expect across-the-board reasoning leadership.

Claim Accuracy 6 · Weakness Severity 6 · Hype vs Reality 7

Pros

real math numbers, auditable traces

Cons

selective framing
mid-tier aggregate
slow

Right for: math-reasoning routing

Avoid if: you took "frontier reasoning" literally

Strengths

Dedicated reasoning specialist with visible traces — useful for auditability and debugging.
Strong competition math: AIME24 91.82% edges DeepSeek R1.
Multilingual reasoning — traces stay in the user's language.
Vision input for diagram/chart reasoning.
Long 40K output ceiling for complex multi-step problems.
Apache 2.0 Magistral Small 1.2 available as a self-host baseline.

Limitations

Output cost ($5.00/M) plus long thinking traces means real-world bills run high and unpredictable.
AA Intelligence Index 27 (vs ~36 median) — strong on math, mid-tier on the broad reasoning aggregate; trails DeepSeek R1 and Claude/GPT thinking on the hardest evals.
Medium 3.5 now offers configurable reasoning in one SKU, eroding Magistral's positioning.
Closed weights at the Medium tier (only Magistral Small is open).
Slow: 38.9 tps, high per-answer latency.

Best use cases

Math-heavy applications (tutoring, scientific computing, quantitative analysis).
Complex coding problems where reasoning beats pattern-matching.
Auditable reasoning (compliance, legal analysis, due diligence) where the visible trace is the point.
Multilingual reasoning where traces should stay in a non-English language.
Agent loops needing a planning brain at an affordable price-per-thought.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture

Magistral Medium 1.2 is a dense, reasoning-tuned model. Because it is a Premier/closed-weight model, Mistral does not disclose its parameter count, layers, attention type, or training scale; only the open-weight Magistral Small variant exposes its internals. What is verifiable: it is multimodal (a vision encoder was added in 1.2), has a 131K context and a 40,960-token output ceiling for extended reasoning traces, and was trained with reinforcement learning on top of a Mistral base (per the Magistral paper). The defining behaviour is always-on visible reasoning: it emits a full chain-of-thought before the final answer.
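
Because the chain-of-thought is emitted before the final answer, client code usually needs to separate the two. The sketch below is a minimal illustration, not Mistral's documented API. It assumes the message content arrives as a list of chunks, where reasoning chunks have type "thinking" (holding nested text parts) and answer chunks have type "text". That shape, the function name split_trace, and the sample payload are all assumptions made for illustration; check the Mistral API reference for the actual response schema.

```python
from typing import Union


def split_trace(content: Union[str, list]) -> tuple[str, str]:
    """Return (reasoning_text, answer_text) from an ASSUMED chunk-list content.

    A plain string is treated as an answer with no visible trace.
    """
    if isinstance(content, str):
        return "", content

    thinking_parts: list[str] = []
    answer_parts: list[str] = []
    for chunk in content:
        kind = chunk.get("type")
        if kind == "thinking":
            # Assumed: reasoning chunks nest their text parts under "thinking".
            for part in chunk.get("thinking", []):
                thinking_parts.append(part.get("text", ""))
        elif kind == "text":
            answer_parts.append(chunk.get("text", ""))
    return "".join(thinking_parts), "".join(answer_parts)


# Hypothetical payload, hand-written for this example (not real model output).
sample = [
    {"type": "thinking", "thinking": [{"type": "text", "text": "2 + 2 is 4."}]},
    {"type": "text", "text": "The answer is 4."},
]
trace, answer = split_trace(sample)
print(trace)
print(answer)
```

Keeping the trace and the answer apart makes it straightforward to log the trace for audit while showing only the answer to end users.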

Capabilities

Magistral Medium 1.2 is purpose-built for extended chain-of-thought: math, science, complex coding, multi-step planning, agent reasoning (capability scores: reasoning 8.5, math 8.5). It produces visible reasoning traces up to ~40K output tokens, useful for auditability and debugging. AIME24 91.82% slightly edges DeepSeek R1, and GPQA Diamond 76.26% is competitive science reasoning. Vision input (vision 7.0) lets it reason over diagrams, charts, and screenshots; the Magistral paper reports MMMU ~70%. A standout: reasoning traces stay in the user's language, a real improvement for multilingual users versus models that internally "think" in English (multilingual 8.5). Coding is solid but not its headline (coding 7.0). With Medium 3.5 now offering adjustable reasoning, Magistral's exclusive niche has narrowed but it remains the dedicated specialist. It has no native real-time retrieval (real-time data 0.0).

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
AIME 2024	91.82%	+18pp vs 1.0 (73.59%)	edges DeepSeek R1 (91.40%)	Apidog
AIME 2025 (pass@1)	83.48%	+22pp vs 1.1 (60.99%)	strong	Apidog
GPQA Diamond	76.26%	+5.4pp vs 1.0 (70.83%)	trails frontier thinking models	Apidog
MMMU (vision)	~70.0%	+5pp	mid-tier multimodal	Magistral paper
Artificial Analysis Index	27	up	below median (~36); "below average" per AA	Artificial Analysis

Magistral published more numeric benchmarks than most Mistral launches (AIME, GPQA), so coverage here is better than the family norm. The AA Index of 27 (vs ~36 median) is the honest counterweight: strong on math specifically, mid-tier on the broad reasoning aggregate.
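
The "vs Predecessor" column can be rechecked from the scores quoted in the table. The short sketch below recomputes each gain in percentage points; the dictionary layout and the helper name delta_pp are illustrative choices, and the numbers are copied from the table above.

```python
# (current score, predecessor score) in percent, copied from the table above.
SCORES = {
    "AIME 2024": (91.82, 73.59),     # predecessor: 1.0
    "AIME 2025": (83.48, 60.99),     # predecessor: 1.1
    "GPQA Diamond": (76.26, 70.83),  # predecessor: 1.0
}


def delta_pp(current: float, previous: float) -> float:
    """Gain in percentage points, rounded to two decimals."""
    return round(current - previous, 2)


for name, (current, previous) in SCORES.items():
    print(f"{name}: +{delta_pp(current, previous):.2f}pp")
```

The results (about 18.2, 22.5, and 5.4 points) match the rounded figures in the table.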

Speed & latency

Artificial Analysis measures 38.9 output tokens/sec with a 1.70s time-to-first-token — placing it in the slow tier, and AA notes it as notably slow versus a ~72 tps peer average. Combined with verbose reasoning traces (43M output tokens generated during the AA eval), latency-per-answer is high: a hard question may take 5-30 seconds before the final answer streams. This is the cost of always-on visible reasoning; it is the wrong tool for interactive chat and the right tool for selective hard-problem routing.
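
As a rough way to reason about these figures, per-answer latency can be approximated as time-to-first-token plus output tokens divided by output speed. This is a simplified model, assuming the whole trace streams at the measured average rate, which real traffic may not match. The function names are illustrative.

```python
TTFT_S = 1.70      # time to first token (s), Artificial Analysis figure cited above
OUTPUT_TPS = 38.9  # output tokens per second, same source


def seconds_to_finish(output_tokens: int) -> float:
    """Approximate wall-clock seconds to stream `output_tokens` tokens."""
    return TTFT_S + output_tokens / OUTPUT_TPS


def tokens_within(seconds: float) -> int:
    """Approximate number of tokens that fit into a `seconds` budget."""
    return max(0, int((seconds - TTFT_S) * OUTPUT_TPS))


print(tokens_within(5), tokens_within(30))       # trace sizes for a 5 s and 30 s wait
print(round(seconds_to_finish(20_000) / 60, 1))  # minutes for a 20K-token trace
```

Under this model a 5 to 30 second wait corresponds to roughly 130 to 1,100 output tokens, while a 20K-token trace would take on the order of eight to nine minutes, so the longest traces are better suited to asynchronous or batch use.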

Pricing analysis

Surface	Cost	Notes
API input	$2.00 / 1M tok	La Plateforme
API output	$5.00 / 1M tok	reasoning traces count as output
Batch (in/out)	$1.00 / $2.50	~50% async discount
Direct UI	EUR 14.99/mo (~$15)	Le Chat Pro (Think mode)
Free tier	Free	Think mode via Le Chat; La Plateforme quota
Magistral Small 1.2	Apache 2.0 weights	self-host reasoning baseline
Cloud	Azure AI Foundry	managed
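
To make the table concrete, the sketch below estimates the cost of a single call from the listed rates. The token counts in the example are hypothetical, and the function and tier names are illustrative; only the per-million rates come from the table.

```python
# USD per 1M tokens, from the pricing table above.
RATES = {
    "standard": {"input": 2.00, "output": 5.00},
    "batch": {"input": 1.00, "output": 2.50},
}


def call_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """USD cost of one call; reasoning-trace tokens are billed as output."""
    rate = RATES[tier]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# Hypothetical hard question: 1,000 prompt tokens, 20,000 trace + answer tokens.
standard = call_cost(1_000, 20_000)
batch = call_cost(1_000, 20_000, tier="batch")
print(f"standard ${standard:.3f}, batch ${batch:.3f}")  # standard $0.102, batch $0.051
```

In this example almost all of the cost comes from output tokens, which is why trace length dominates the bill.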

Deployment & access

Magistral Medium 1.2 is API-only and closed-weight (Premier), with no self-host option. The companion Magistral Small 1.2 is open-weight under Apache 2.0 and fits on a MacBook, giving a self-host reasoning baseline at lower capability. La Plateforme is EU-hosted by default, and managed access is available on Azure AI Foundry. For on-premises sovereignty, the path is Magistral Small (Apache 2.0), not Medium.

Safety & privacy

Standard Mistral posture: GDPR-native, SOC 2 Type II, ISO 27001/27701, EU AI Act aligned, EU residency by default, 30-day abuse retention, no training on inputs unless opt-in, ZDR available. There is no built-in moderation; Mistral offers a separate Moderation API. The visible reasoning trace is itself a governance asset: auditable thought for compliance and due-diligence use cases. Refusals are moderate and consistent.

Ecosystem & tooling

SDKs in Python and TypeScript/JavaScript; integrations with LangChain, LlamaIndex, and Vercel AI SDK. Powers Le Chat's "Think" mode and is available via Mistral AI Studio and Azure AI Foundry. The open Magistral Small 1.2 has a growing self-host community. Popularity is growing but narrower than the generalist Mistrals, given the reasoning-specialist scope.

Buyer questions

Can I self-host it?

No — Magistral Medium 1.2 is Premier/closed. The open-weight reasoning option is Magistral Small 1.2 (Apache 2.0), which fits on a MacBook.

Why are my bills unpredictable?

Reasoning traces count as output tokens; a hard question can emit 20K+ tokens. Budget for 3-5x the output of an equivalent non-reasoning call.
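
One way to turn the 3-5x guidance into a planning number is to scale a known non-reasoning output size and cap it at the model's output ceiling. The sketch below does that; the baseline figures are hypothetical and the function name is illustrative.

```python
MAX_OUTPUT_TOKENS = 40_960  # output ceiling stated in this review


def reasoning_output_budget(
    baseline_tokens: int, low: float = 3.0, high: float = 5.0
) -> tuple[int, int]:
    """Return a (low, high) output-token budget, capped at the output ceiling."""
    low_budget = min(int(baseline_tokens * low), MAX_OUTPUT_TOKENS)
    high_budget = min(int(baseline_tokens * high), MAX_OUTPUT_TOKENS)
    return low_budget, high_budget


print(reasoning_output_budget(2_000))   # (6000, 10000)
print(reasoning_output_budget(10_000))  # (30000, 40960), high end hits the ceiling
```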

Do I still need it now that Medium 3.5 reasons?

For most agent/coding work, Medium 3.5 at high effort suffices and is open-weight. Keep Magistral for the hardest math and for auditable, in-language reasoning traces.
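
That split can be expressed as a small routing function. This is a sketch under stated assumptions: the Task fields and the GENERALIST identifier are hypothetical (this review does not give an API identifier for Medium 3.5), and only magistral-medium-2509 comes from the review itself.

```python
from dataclasses import dataclass

MAGISTRAL = "magistral-medium-2509"     # alias given in this review
GENERALIST = "medium-3.5-placeholder"   # hypothetical id, replace with the real one


@dataclass
class Task:
    kind: str                        # e.g. "math", "science", "coding", "agent", "chat"
    needs_audit_trail: bool = False  # the visible trace must be kept for compliance
    non_english_trace: bool = False  # the trace should stay in the user's language


def pick_model(task: Task) -> str:
    """Route auditable, in-language, or hard math/science work to Magistral."""
    if task.needs_audit_trail or task.non_english_trace:
        return MAGISTRAL
    if task.kind in {"math", "science"}:
        return MAGISTRAL
    return GENERALIST


print(pick_model(Task(kind="math")))
print(pick_model(Task(kind="coding")))
print(pick_model(Task(kind="chat", needs_audit_trail=True)))
```

The rules are deliberately coarse; a production router would likely add a difficulty estimate and a cost cap before sending work to the reasoning model.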

Does it reason in my language?

Yes — a key differentiator: reasoning traces stay in the user's language rather than defaulting to English.

How does it compare on math vs reasoning broadly?

Strong on competition math (AIME24 91.82%); mid-tier on the broad reasoning aggregate (AA Index 27).

Is it fast?

No — ~39 tps and high per-answer latency. Use it for deliberate hard questions, not interactive chat.

Comparable models

R1 (DeepSeek):

Cheaper output, stronger on the broad reasoning aggregate, weaker multilingual traces.

Claude Opus 4.7 (thinking) / GPT-5 (thinking):

More expensive, stronger frontier reasoning, closed weights, no in-user-language visible trace by default.

Medium 3.5, high reasoning effort (Mistral):

Now a viable substitute at a similar price tier, open weights, beats Magistral on agentic coding.

Magistral Small 1.2 (Mistral):

The open-weight (Apache 2.0) sibling — self-hostable reasoning baseline at lower capability.

Model specs

Input price: $2 / Mtok
Output price: $5 / Mtok
Cached input: —
Batch (in/out): $1 / $2.50
Context window: 131K tokens
Max output: 40,960 tokens
Knowledge cutoff: 2025-06
Released: 2025-09-17
Modalities: text, image → text
Output speed: ~38.9 tok/s
License: Proprietary
Clouds: Azure AI Foundry

Does not train on API inputs by default

Last verified 2026-05-27