DeepSeek V3.2

by DeepSeek · DeepSeek V3 family · best for open-weights math/reasoning at GA stability

ReasoningCost-OptimizedOpen-Weights

8.2

AI Panel Score

Value 9.3/10

DeepSeek V3.2 is the family's last 128K-context generation before the V4 long-context jump, and the most polished open-weights model in the V3 line — a 671B-parameter Mixture-of-Experts model (37B active) that introduced DeepSeek Sparse Attention (DSA) and posts frontier-class competition-math scores. The V3.2-Exp preview shipped 2025-09-29; the GA line (including the math-tuned V3.2-Speciale) landed 2025-12-01. The single sentence a buyer needs: when you need GA stability rather than V4 preview risk, strong math/reasoning, and rock-bottom token cost inside a 128K window, V3.2 is still the production-default DeepSeek pick in mid-2026.

Compare this model All DeepSeek V3 versions

What's new

Introduces DeepSeek Sparse Attention (DSA) — a fine-grained sparse-attention mechanism built atop the V3 MLA latent-attention design that cuts training and long-context inference cost while preserving quality.
AIME 2025 climbs to 93.1% (from V3.1's ~88%), HMMT 2025 to 92.5%, and HLE to 30.6.
V3.2-Speciale, a high-compute reasoning/math variant (API-only at launch, weights later), achieved gold-medal-level results across IMO 2025, IOI 2025, and ICPC World Finals 2025.
API pricing cut to under 3 cents per 1M input tokens on cache hits at the Exp launch — roughly half of V3.1's effective rates.
Output ceiling raised to 64K (from V3.1's 8K), removing most long-form chunking pain.

Benchmarks

Benchmark	Score	Source
Humanity's Last Exam	30.6%	api-docs.deepseek.com 2025-12-01T00:00:00.000Z
MMLU-Pro	85%	openrouter.ai 2025-12-01T00:00:00.000Z
AIME 2025	93.1%	api-docs.deepseek.com 2025-12-01T00:00:00.000Z
LiveCodeBench	74.1%	macaron.im 2026-04-24T00:00:00.000Z
SWE-bench Verified	67.8%	macaron.im 2026-04-24T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker8/10

“When I need a stable, board-defensible DeepSeek today, it's V3.2 — GA, open weights, and proven in production.”

V3.2 was the production-stable bet for cost-conscious teams from late 2025 through Q1 2026, and even with V4 in preview it remains the rational GA choice for anyone who prioritizes guaranteed stability over preview-tier risk. DSA made long-context inference economical on V3-class hardware, and the math/reasoning gains made it competitive with US frontier on a meaningful slice of work. The sovereignty calculus is unchanged — fine for non-regulated workloads, off the table for many enterprise buyers without a neutral inference partner — but the mature open-weights status and broad third-party hosting make an in-boundary deployment the most turnkey in the DeepSeek lineup.

Strategic Fit 8Vendor Risk 7Roadmap Confidence 8.5

Pros

GA stability
proven open weights
broad hosting

Cons

128K ceiling
coding trails frontier
PRC API residency

Right for: Teams needing a stable, cheap, math-strong default

Avoid if: You need 1M context or top-tier coding

Domain Strategist8/10

“V3.2 is the model that crystallized the DeepSeek price thesis — the sub-3-cent input cut put the whole industry on notice.”

Strategically, V3.2 is the consolidation release: it took V3.1's hybrid design, added DSA efficiency and gold-medal math, and cut price hard enough that the under-3-cents-per-million input headline became a category event. Its positioning is "the stable, cheap, reasoning-strong open-weights default," and through early 2026 it largely owned that slot against Qwen and Llama. The competitive risk now is internal cannibalization — V4-Flash undercuts it on context and price — so its strategic role is shifting from frontier wedge to dependable GA backbone. The math/Speciale angle remains a genuine differentiator in STEM-heavy markets.

Competitive Positioning 8Differentiation 8Market Timing 7.5

Pros

Defined the price thesis
standout math
stable open weights

Cons

Out-positioned by V4 on context/price
aging window

Right for: STEM and reasoning-heavy open-weights bets

Avoid if: You need the latest context/cost frontier

Finance Lead9/10

“Sub-3-cent cache-hit input was the move that made the unit economics undeniable — V3.2 sits ~10x below the US frontier.”

V3.2's price drop crystallized the DeepSeek thesis going into 2026. At $0.252 input / $0.378 output, it sits roughly 10x below the US frontier for comparable answer quality on most non-coding tasks, and cache-hit input at $0.025/M is transformative for RAG-style repeated-context workloads. As a GA model the pricing is stable and predictable — no preview volatility — and the open-weights option caps provider-price exposure. The unhedged line item remains geopolitical, but the mature third-party hosting ecosystem makes a compliant deployment cheap to stand up. On dollars-per-quality for math and reasoning, it is among the best values available.

Cost Efficiency 9.2Pricing Transparency 9Value per Dollar 9.2

Pros

~10x below frontier
transformative cache pricing
GA stability
open-weights price cap

Cons

Reasoning-token volume
geopolitical contingency

Right for: Cost-sensitive math/RAG programs

Avoid if: Compliance forces a certified premium vendor

Domain Practitioner8.5/10

“The mature one in the lineup — tool calls, JSON mode, exposed traces all behave, and the 64K output finally fits long generations.”

For a builder, V3.2 is the most settled DeepSeek model. Tool calling, JSON mode, and structured output all behave well; exposed reasoning content via the reasoner path is easy to debug. Open weights make self-hosted realistic at 671B/37B, and the OpenAI-compatible endpoint makes migration trivial. The 64K output ceiling (up from V3.1's 8K) removes most long-form chunking friction. Documentation is solid if still Chinese-first in spots. There is no batch API. For teams that standardized on V3.2 in Q4 2025, there is no urgent reason to jump to V4 preview unless context window or coding ceiling forces the issue.

API Ergonomics 8.5Tool/Agent Support 8.5Reliability 9

Pros

Mature, stable behavior
good tool use
bigger output
broad hosting

Cons

Chinese-first docs
no batch API
128K window

Right for: Builders wanting a settled, cheap default

Avoid if: You need 1M context or first-party SDK depth

Power User7.5/10

“On math and STEM it genuinely impresses; for everyday chat it's a competent, free, slightly-formal workhorse.”

End users on a V3.2-backed product won't notice a gap versus free ChatGPT or Claude on everyday tasks, and on math and STEM the model is actually a strong choice — the reasoning mode solves hard problems that free Western tiers struggle with. Latency is competitive in non-thinking mode and adds a few seconds for reasoning queries. Refusal rate is lower than Western models on most topics, with PRC-aligned guardrails on a narrow set. Helpfulness is high; tone is competent if slightly more formal than Claude. As a free option via the DeepSeek UI, the everyday value is strong.

Output Quality 7.5Speed 7.5Everyday Usefulness 7.5

Pros

Strong on math/STEM
permissive
free in UI

Cons

Slightly formal tone
PRC guardrails

Right for: STEM-leaning everyday use

Avoid if: You want the sharpest creative voice

Skeptic7.5/10

“The math scores are real and impressive — but 'Speciale gold medals' is a separate high-compute variant, not the model you call by default.”

V3.2's reasoning and math gains are well-documented and the price is verifiable, so the core story holds. The honest caveats: the gold-medal headlines belong to V3.2-Speciale, a high-compute API-only variant, not the base V3.2 most users hit — conflating them overstates the default experience. Coding is a real weak spot (SWE-bench ~67.8), well behind the frontier. The 128K window is now a generation old. And the family-wide governance issues — PRC storage, trains-on-input, no opt-out — apply. None of this undermines V3.2 as an excellent-value, stable, math-strong model; it just means matching the variant and benchmark to the actual deployment.

Claim Accuracy 7.5Weakness Severity 6.5Hype vs Reality 8

Pros

Verifiable math gains
stable open weights
transparent reports

Cons

Speciale ≠ base V3.2
weak coding
aging window
trains-on-input

Right for: Buyers who match variant to workload

Avoid if: You read Speciale scores as the default model

Strengths

Frontier-class competition math and reasoning at open-weights (AIME 93.1; Speciale at gold-medal level).
DSA sparse attention delivers genuine long-context inference cost reductions on V3-class hardware.
Mature, GA-stable API with a strong tool-use story — the production-default of early-to-mid 2026.
MIT open weights with broad inference-provider support and community quantizations.
64K output ceiling removes most long-form chunking friction.

Limitations

128K context is a generation behind V4's 1M window.
Coding (SWE-bench ~67.8) trails frontier coding agents by 10+ points.
Superseded by V4-Pro/Flash on context, output, and SWE-bench.
Same China data-residency and trains-on-input exposure as the family.
Not natively multimodal; no real-time retrieval.

Best use cases

Math, competitive programming, and STEM-heavy reasoning agents where the Speciale-line capability shines.
Cost-sensitive RAG and document-analysis pipelines that fit inside 128K, leveraging DSA economics.
GA-stable production workloads that cannot take on V4 preview risk.
Self-hosted open-weights deployments where V4's 1.6T footprint is unaffordable but 671B/37B is feasible.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture Capabilities Benchmark analysis Speed & latency Pricing analysis Deployment & access Safety & privacy Ecosystem & tooling

Architecture

V3.2 is a sparse Mixture-of-Experts model: 671B total parameters, ~37B activated per token, carried over from the V3 base architecture (DeepSeekMoE with 256 routed experts and Multi-head Latent Attention). The headline change is DeepSeek Sparse Attention (DSA), a fine-grained sparse-attention mechanism layered on MLA that selects a sparse subset of key-value positions per query, making long-context training and inference materially cheaper on V3-class hardware without quality loss. The hybrid Thinking/Non-Thinking design is inherited from V3.1. Open weights are on Hugging Face under MIT; training-token count for V3.2 specifically is not separately disclosed. Text-only.

Capabilities

V3.2's standout dimension is math (9.5): AIME 2025 93.1%, HMMT 92.5%, and the Speciale variant reaching gold-medal-level competition performance — genuinely frontier-class. Reasoning (8.5) is strong and HLE 30.6 is competitive with frontier reasoning models. Coding (7.5) is solid but trails frontier agents — SWE-bench Verified ~67.8 (third-party survey) sits ~13 points behind Claude Opus-tier. Long-context (7.0) is good within 128K and DSA keeps it cheap, but the window is now a generation behind V4's 1M. Multilingual (8.5) is strong in English and Chinese. Agentic (7.5), function-calling (8.0), and instruction-following (8.0) are mature — this is the version most production deployments stabilized on. Vision, OCR, and real-time data are zero. Creative writing (7.0) is competent but neutral. Safety calibration (6.5) follows the family pattern.

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
AIME 2025	93.1%	+5 pts vs V3.1 (~88)	parity with GPT-5 / Opus tier	DeepSeek
HLE	30.6	n/a	competitive with frontier reasoning	DeepSeek
MMLU-Pro	85.0	up from V3.1 (~83)	within ~2 pts of frontier	OpenRouter
LiveCodeBench	74.1	up from V3.1 (~80 older harness)	trails V4 (93.5)	Macaron survey
SWE-bench Verified	67.8	+~2 pts vs V3.1 (66.0)	trails Claude Opus tier (~80)	Macaron survey

V3.2-Speciale (math-focused variant) reports AIME ~96 and HMMT ~99 per DeepSeek; treated as a separate variant, not the base V3.2 line. GPQA Diamond is left null (no clean primary-source figure for the GA model).

Speed & latency

DeepSeek does not publish official tokens/sec. In practice, non-thinking mode is responsive and DSA keeps long-context passes economical; thinking mode adds a few seconds on reasoning queries. As a GA model it has higher, steadier rate limits than the V4 preview. Latency tier: medium.

Pricing analysis

Surface	Cost	Notes
API input (cache miss)	$0.252 / 1M tok	matches OpenRouter list
API input (cache hit)	$0.025 / 1M tok	~90% discount (reduced to 1/10 of launch price 2026-04-26)
API output	$0.378 / 1M tok
Direct UI	Free	chat.deepseek.com web/app
Open weights	$0	HF download; 8x H200-class node to self-host
Rate limits	Standard (GA) tier	higher than V4 preview

Deployment & access

First-party OpenAI-compatible API at api.deepseek.com (PRC-hosted). Open weights on Hugging Face under MIT license: self-hostable at ~671B/37B MoE, realistically an 8x H200-class node at FP8 (~400GB+ VRAM), with INT4 and GGUF community quants available for smaller setups. Widely served by neutral inference providers — OpenRouter, DeepInfra, Novita, Fireworks, Together — which is the standard path for teams wanting managed access without the PRC API. No first-party managed-cloud offering.

Safety & privacy

Same posture as the family: PRC data storage under Chinese law, trains-on-input by default (de-identified), no documented API opt-out, and no SOC2/HIPAA/GDPR/ISO27001 on the first-party service. Content moderation follows PRC norms. As a mature open-weights model with broad third-party hosting, V3.2 is the easiest DeepSeek model to deploy in a compliant, in-boundary configuration via a Western inference provider or self-host.

Ecosystem & tooling

OpenAI-compatible API with Python/TypeScript SDKs, LangChain / LlamaIndex / Vercel AI SDK integrations, and broad serving across OpenRouter, DeepInfra, Novita, Fireworks, Together, and SiliconFlow. Wired into coding tools (Cline, Kilo Code). As the GA workhorse of the V3 line, it has mainstream adoption across the open-weights ecosystem.

Buyer questions

Why pick V3.2 over V4 today?

GA stability. V4 is preview with shifting rate limits; V3.2 is production-proven with steady limits and broad third-party hosting. Choose V3.2 when reliability outweighs the 1M context and higher SWE-bench of V4.

What is DSA and why does it matter?

DeepSeek Sparse Attention selects a sparse subset of key-value positions per query, cutting long-context training and inference cost on V3-class hardware without quality loss — it is why V3.2 stays cheap on long inputs.

Is the math claim real?

Yes — AIME 2025 93.1% is strong, and the Speciale variant reached gold-medal level on IMO/IOI/ICPC. Note Speciale is a separate high-compute variant.

Can I self-host it?

Yes — 671B/37B MoE under MIT, realistically an 8x H200-class node at FP8, with INT4/GGUF community quants for smaller rigs.

How cheap is long-context RAG?

Very — $0.252/M input with cache hits at $0.025/M, plus DSA efficiency, make repeated 128K passes economical.

Is it multimodal?

No — text-only. For document/OCR pair it with DeepSeek-VL2 or a dedicated VL model.

Comparable models

GPT-5 mini

broader ecosystem, native multimodal; ~3-5x more expensive, weaker open-weights story.

Claude Sonnet 4.6

stronger coding and tool ergonomics, US procurement story; materially more expensive and closed.

Qwen 3 235B

closest China-origin open-weights peer at GA stability; comparable cost, V3.2 leads on competition math.

Sources

Primary references used to verify this review.

Model specs

Input price: $0.25 / Mtok
Output price: $0.38 / Mtok
Cached input: $0.03 / Mtok
Batch (in/out): —
Context window: 128K tokens
Max output: 64K tokens
Knowledge cutoff: 2025-07
Released: 2025-11-30
Modalities: text → text
Output speed: Not profiled
License: Open weights (MIT)
Clouds: First-party API

Other DeepSeek V3 versions

Last verified 2026-05-27