Devstral 2

GALatest Coder

by Mistral AI · Devstral family · best for budget open-weight agentic coding

CodingOpen-WeightsCost-OptimizedLong-Context

7.4

AI Panel Score

Value 9.0/10

Devstral 2 (model ID devstral-2-2512, shipped 9 December 2025) is Mistral's open-weight agentic-coding model, launched alongside the Mistral Vibe CLI. It is a 125B-parameter dense transformer with 256K context, built for multi-file edits, repo planning, PR generation, and test running, scoring 72.2% on SWE-bench Verified. Priced at $0.40/$0.90 — meaningfully cheaper on output than Medium 3.5 ($7.50). The license is modified-MIT on the 123B (open with a large-revenue carve-out), and a companion Devstral Small 2 (24B) is clean Apache 2.0. The buyer's sentence: the budget open-weight agentic coder in Mistral's lineup, still in the 70%-SWE-bench tier after Medium 3.5 took the crown.

Compare this model All Devstral versions

What's new

Mistral's frontier agentic-coding model at launch, released alongside the open-source Mistral Vibe CLI.
125B dense with 256K context; companion Devstral Small 2 (24B, Apache 2.0) runs locally.
Positioned as ~7x more cost-efficient than Claude Sonnet on real-world coding tasks (Mistral's claim).
SWE-bench Verified 72.2%, SWE-bench Multilingual 61.3%, Terminal-Bench 2 32.6%.
Output price corrected to $0.90/1M (an earlier draft listed $2.00 — the verified La Plateforme rate is $0.40/$0.90).
Subsequently surpassed by Medium 3.5 (April 2026) on SWE-bench Verified (77.6% vs 72.2%); Mistral migrated the Vibe CLI default to Medium 3.5, but Devstral 2 remains GA and cheaper on output.

Benchmarks

Benchmark	Score	Source
Terminal-Bench	32.6%	huggingface.co 2025-12-09T00:00:00.000Z
SWE-bench Verified	72.2%	huggingface.co 2025-12-09T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker7/10

“At launch it was our open-weight agentic coder; now it's the budget option under Medium 3.5 — cheaper output keeps it on the roster.”

In December 2025 Devstral 2 was strategically significant: a credible open-weight agentic coder with a first-party CLI. By mid-2026 Mistral's own Medium 3.5 beat it on the headline benchmark and took over the Vibe CLI default. Devstral 2 stays relevant because it is materially cheaper on output ($0.90 vs $7.50), which matters for high-volume batch coding. For new architecture decisions I would default to Medium 3.5 unless output cost forces otherwise. The modified-MIT license on the 125B carries the same enterprise carve-out as Medium 3.5; for clean licensing I'd reach for Devstral Small 2.

Strategic Fit 7Vendor Risk 7Roadmap Confidence 6

Pros

cheap output, open weights, first-party CLI

Cons

superseded
modified-MIT carve-out
update cadence unclear

Right for: budget agentic coding

Avoid if: you want the current best or a clean license at 125B

Domain Strategist7/10

“Mistral's open agentic-coder play — now repositioned as the value tier beneath its own merged flagship.”

Devstral 2 established Mistral's open agentic-coding credibility and seeded the Vibe CLI ecosystem. Strategically it has been repositioned from "frontier coder" to "value coder" beneath Medium 3.5, which is a coherent ladder but blunts its standalone narrative. Against Qwen 3 Coder and DeepSeek Coder it competes on tool-use quality and the Vibe CLI; against closed coders it competes on price and openness. The durable asset is the open-weight + EU-residency + cheap-output combination for budget agentic workloads. The 24B Apache-2.0 sibling broadens the strategic surface to laptops and unrestricted fine-tuning.

Competitive Positioning 7Differentiation 7Market Timing 6

Pros

open agentic-coder slot, Vibe ecosystem, cheap output

Cons

outshone by Medium 3.5
carve-out license

Right for: value-tier coding products

Avoid if: you want the flagship story

Finance Lead8.5/10

“$0.90 output vs Medium 3.5's $7.50 for a model within ~5pp on SWE-bench — for high-volume batch coding, that delta is the whole decision.”

The financial case is sharp. Devstral 2's $0.40/$0.90 is 3.75x cheaper on output than Medium 3.5 for a model within ~5pp on SWE-bench. For high-volume batch agentic workloads — automated refactoring, test generation across thousands of files — the cost delta dominates and Devstral 2 is the rational choice when peak quality isn't required. Devstral Small 2 self-hosted on a laptop has effectively zero marginal cost. The caveat is the modified-MIT carve-out at the 125B tier for large enterprises; the clean-license, fixed-cost route is Devstral Small 2. Excellent unit economics for budget-sensitive coding.

Cost Efficiency 9Pricing Transparency 8Value per Dollar 9

Pros

3.75x cheaper output than Medium 3.5
near-zero-cost 24B self-host

Cons

125B license carve-out

Right for: high-volume batch coding

Avoid if: enterprise scale where the license fee applies

Domain Practitioner7.5/10

“The SWE-bench numbers translate to real productivity via Vibe CLI — multi-file edits and sensible PR scaffolding — and Devstral Small 2 on a laptop is genuinely useful.”

Through Q1 2026 Devstral 2 was my agentic-coding workhorse via Vibe CLI: 72.2% SWE-bench translates to real multi-file edits, sensible PR scaffolding, and decent test generation, and 256K context handles realistic repos. Tool-calling (git/terminal/Python) is reliable. After Medium 3.5 I migrated most new work, but Devstral 2 still runs my batch refactoring jobs because output tokens are far cheaper. Devstral Small 2 (Apache 2.0) on a laptop is a genuinely useful local agent. The lack of vision is the main felt gap when I want to paste a stack-trace screenshot.

API Ergonomics 8Tool/Agent Support 8Reliability 7

Pros

real productivity, reliable tools, laptop-viable 24B

Cons

no vision
superseded on quality

Right for: agentic coding on a budget

Avoid if: you need vision or peak SWE-bench

Power User7/10

“I see Devstral 2's output — PRs, refactors, generated tests — more than the model itself, and the results are useful more often than not.”

End users meet Devstral 2 through the artifacts its agents produce: PRs, refactors, generated tests via Vibe CLI. Indirectly the experience is good — useful PRs more often than not, with reasonable latency for agent loops. There is no chat or vision dimension to evaluate; it is a coding engine, not a conversational partner. The absence of vision is a felt limitation when a stack-trace screenshot would help. For developers living in an agentic CLI, a dependable engine; for anyone expecting a chatbot, the wrong tool.

Output Quality 7Speed 7Everyday Usefulness 7

Pros

useful agent output, reasonable loop latency

Cons

no vision
not conversational

Right for: CLI-driven coding

Avoid if: you want chat or image input

Skeptic7/10

“A solid coder whose own maker beat it four months later — and the '7x cheaper than Sonnet' line needs the SWE-bench gap printed next to it.”

Devstral 2 is genuinely capable, so the skepticism is about positioning and license. The "7x more cost-efficient than Claude Sonnet" claim is true on price but elides the ~5-7pp SWE-bench gap — it's cheaper and slightly weaker, not cheaper and equal. The "open weights" framing again carries an asterisk: the 125B is modified-MIT with a large-revenue carve-out, only the 24B is clean Apache 2.0. And Mistral itself superseded it with Medium 3.5 within four months, migrating the CLI default away. The honest claim is "good budget open-weight agentic coder with a license caveat at the large size." Buy the 24B for clean self-host, the 125B for cheap output if you read the license.

Claim Accuracy 7Weakness Severity 6Hype vs Reality 7

Pros

real coding, real price advantage

Cons

cost claim elides quality gap
125B license asterisk
quickly superseded

Right for: budget coders who check the license

Avoid if: you assumed parity-with-Sonnet or clean Apache 2.0 at 125B

Strengths

Frontier-adjacent agentic coding at $0.40/$0.90 — 3.75x cheaper on output than Medium 3.5.
The ~7x cost-efficiency-vs-Sonnet claim holds for many real coding workloads.
256K context fits real repos; strong tool-calling and multi-file editing.
Open-weight (modified-MIT) at 125B is rare for an agentic coder.
Devstral Small 2 (Apache 2.0, 24B) gives a clean, laptop-viable self-host baseline.

Limitations

Surpassed by Medium 3.5 on SWE-bench Verified (77.6% vs 72.2%); Mistral migrated Vibe CLI's default to Medium 3.5.
125B license is modified-MIT, NOT Apache 2.0 — large-revenue enterprises need a commercial deal (the clean-license option is Devstral Small 2).
No vision — no screenshot-to-code or stack-trace-image workflows.
Trails Claude Sonnet 4.5 on the hardest multi-step engineering benchmarks.
Strategic uncertainty: with Medium 3.5 absorbing the role, the cadence of future Devstral updates is unclear.

Best use cases

Coding agents and CLIs (especially Vibe CLI) where output cost matters.
Self-hosted in-product code agents where the modified-MIT terms are acceptable (or Devstral Small 2 for clean Apache 2.0).
Long-context repo refactoring.
Budget-constrained agentic coding where Medium 3.5's $7.50 output is too high.
Devstral Small 2 specifically: local laptop-based agent loops.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture Capabilities Benchmark analysis Speed & latency Pricing analysis Deployment & access Safety & privacy Ecosystem & tooling

Architecture

Devstral 2 is a 125B-parameter dense transformer (the HF card states 125B; informally "123B"), purpose-built and fine-tuned for software-engineering agent loops. It is text-only with no vision. Context is 256K. The companion Devstral Small 2 is a 24B dense model under Apache 2.0, designed for local/laptop agentic work. Devstral 2 ships in FP8 with BF16/GGUF variants. Tokenizer is mistral_common. Layer count, attention type, and training scale are undisclosed. The model emphasises tool-calling (git, terminal, Python execution) and multi-file editing over reasoning depth.

Capabilities

Devstral 2 is built for agentic software engineering: multi-file edits, repo planning, PR generation, test running (cap_agentic 8.5, cap_coding 8.0, cap_function_calling 8.5). SWE-bench Verified 72.2% makes it competitive with closed-weight leaders at a fraction of the price, and SWE-bench Multilingual 61.3% shows broad cross-language coding. It pairs naturally with the Mistral Vibe CLI, a terminal-native agent that reads repos, plans edits, and proposes changes. 256K context fits full small-repo loading (cap_long_context 8.0). It is text-only (cap_vision 0.0) and not tuned for chat or creative work (cap_creative_writing 4.0); reasoning is solid but not its headline (cap_reasoning 6.5). As of mid-2026 it is partially superseded by Medium 3.5 but remains a strong dedicated coding agent at a lower output price. No native real-time retrieval (cap_realtime_data 0.0).

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
SWE-bench Verified	72.2%	substantial gain vs Devstral 1	trails Claude Sonnet 4.5 (~77-79%) and Medium 3.5 (77.6%) by ~5pp	HF card
SWE-bench Multilingual	61.3%	new	strong cross-language coding	HF card
Terminal-Bench 2	32.6%	new	mid-tier agentic terminal	HF card
Devstral Small 2 SWE-bench Verified	~68%	n/a	best in 24B class	HF card

Devstral 2 published its core coding benchmarks (SWE-bench Verified/Multilingual, Terminal-Bench 2), so coverage is good for the metrics that matter to a coding agent. General reasoning/math benchmarks are not the focus and are null.

Speed & latency

Mistral has not published official tps/TTFT for Devstral 2 (null). As a 125B dense agentic coder it sits in the medium latency tier — latency-per-step is reasonable for agent loops, and the model is typically run in a multi-step harness (Vibe CLI) where total task time, not single-token latency, is what matters. Devstral Small 2 (24B) is faster and laptop-viable for local loops.

Pricing analysis

Surface	Cost	Notes
API input	$0.40 / 1M tok	La Plateforme
API output	$0.90 / 1M tok	La Plateforme (corrected from a prior $2.00 figure)
Cached input	$0.04 / 1M tok	cache read
Batch (in/out)	$0.20 / $0.45	~50% async discount
Devstral Small 2 (in/out)	~$0.07 / $0.28	smaller, cheaper sibling
Self-host (125B)	modified-MIT	weights on Hugging Face
Self-host (24B Small 2)	Apache 2.0	laptop-viable
Vibe CLI	open source	now defaults to Medium 3.5, configurable to Devstral 2
Cloud	Bedrock, Azure AI Foundry	managed

Deployment & access

Weights are on Hugging Face. LICENSE NOTE: the 125B Devstral 2 is under a modified-MIT license (HF tags it "other") — open for most parties but with a large-revenue commercial carve-out, the same structure as Medium 3.5. The companion Devstral Small 2 (24B) is clean Apache 2.0 with no carve-out, making it the right choice for unrestricted self-host/fine-tune at the small tier. The 125B self-hosts on ~80GB+ VRAM (FP8); Small 2 runs on a laptop. Managed on Bedrock and Azure AI Foundry; La Plateforme EU-hosted by default. Pairs with the open-source Vibe CLI (configurable back to Devstral 2 even though it now defaults to Medium 3.5).

Safety & privacy

Standard Mistral posture: GDPR-native, SOC 2 Type II, ISO 27001/27701, EU AI Act aligned, EU residency by default, 30-day abuse retention, no training on inputs unless opt-in, ZDR available. No built-in moderation. As a coding agent it rarely encounters refusal-sensitive content; practical refusal rate is low.

Ecosystem & tooling

SDKs in Python and TypeScript/JavaScript; integrates with the Mistral Vibe CLI, Cline, OpenHands, and Continue.dev. Open weights (modified-MIT 125B, Apache-2.0 24B) drive a growing self-host community with FP8/BF16/GGUF derivatives. Popularity is growing, strongest among developers building open-weight agentic coding tools and those running local agents on the 24B.

Buyer questions

Is the output price $0.90 or $2.00?

$0.90 — the verified La Plateforme rate is $0.40 input / $0.90 output (an earlier figure of $2.00 was incorrect).

Is it open weights?

The 125B is modified-MIT (open with a large-revenue carve-out, same as Medium 3.5). For clean Apache 2.0, use Devstral Small 2 (24B).

How good is it at coding?

SWE-bench Verified 72.2% and Multilingual 61.3% — competitive agentic coding, ~5pp behind Medium 3.5 and Claude Sonnet 4.5.

Should I use it or Medium 3.5?

Use Devstral 2 when output cost matters (8.3x cheaper output) and peak quality isn't required; use Medium 3.5 for the best SWE-bench and a reasoning knob.

Can it run locally?

The 125B needs ~80GB+ VRAM; Devstral Small 2 (24B, Apache 2.0) runs on a laptop.

Does it do vision?

No — text-only. No screenshot-to-code.

What is Vibe CLI?

An open-source terminal coding agent; it now defaults to Medium 3.5 but is configurable back to Devstral 2.

Comparable models

Medium 3.5: — Mistral

Its successor in the role — +5.4pp SWE-bench but 8.3x the output price ($7.50 vs $0.90), same modified-MIT license; the quality-vs-cost choice inside Mistral.

Claude Sonnet 4.5:

+5-7pp SWE-bench, several times the price, closed weights — the premium agentic coder.

Devstral Small 2 (Mistral):

The clean Apache-2.0 24B sibling for laptop/local agents and unrestricted fine-tuning.

Qwen 3 Coder:

Competitive open-weight coder; weaker tool-use and no first-party CLI.

Sources

Primary references used to verify this review.

Model specs

Input price: $0.40 / Mtok
Output price: $0.90 / Mtok
Cached input: $0.04 / Mtok
Batch (in/out): $0.20 / $0.45
Context window: 256K tokens
Max output: 33K tokens
Knowledge cutoff: 2025-09
Released: 2025-12-08
Modalities: text → text
Output speed: Not profiled
License: Open weights (custom-modified-mit)
Clouds: Bedrock, Azure AI Foundry

Does not train on API inputs by default

Last verified 2026-05-27