by DeepSeek · DeepSeek V4 family · best for frontier-grade agentic coding at open-weights cost
DeepSeek V4-Pro is the flagship of DeepSeek's April 2026 V4 family — a 1.6-trillion-parameter Mixture-of-Experts model that activates only 49B parameters per token, pairs a 1M-token context with a 384K output ceiling, and lands frontier-class coding and reasoning scores at roughly a tenth of US-frontier token prices. It shipped as a preview on 2026-04-24, with open weights on Hugging Face under an MIT license. The single sentence a buyer needs: if your workload is heavy coding or reasoning and your threat model tolerates a Chinese-origin provider (or you can self-host the weights), V4-Pro is the most cost-effective frontier-grade model available today. - **Provider:** DeepSeek (Hangzhou DeepSeek Artificial Intelligence Co.) - **Released:** 2026-04-24 (preview) - **Status:** Preview (no GA announced) - **Context window:** 1,000,000 tokens - **Max output:** 384,000 tokens - **Modalities:** Text in / text out - **Knowledge cutoff:** 2026-02 - **Headline price:** $0.435 in / $0.87 out per 1M tokens (75% cut now permanent)
| Benchmark | Score | Source |
|---|---|---|
| Humanity's Last Exam | 37.7% | huggingface.co 2026-04-24T00:00:00.000Z |
| MMLU-Pro | 87.5% | huggingface.co 2026-04-24T00:00:00.000Z |
| SimpleQA | 57.9% | huggingface.co 2026-04-24T00:00:00.000Z |
| HumanEval | 76.8% | huggingface.co 2026-04-24T00:00:00.000Z |
| GPQA Diamond | 90.1% | huggingface.co 2026-04-24T00:00:00.000Z |
| LiveCodeBench | 93.5% | huggingface.co 2026-04-24T00:00:00.000Z |
| MRCR Long Context | 83.5% | huggingface.co 2026-04-24T00:00:00.000Z |
| LMArena Coding Elo | 1287 | artificialanalysis.ai 2026-05-15T00:00:00.000Z |
| SWE-bench Verified | 80.6% | huggingface.co 2026-04-24T00:00:00.000Z |
| Artificial Analysis Index | 52 | artificialanalysis.ai 2026-05-15T00:00:00.000Z |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“Frontier coding at a tenth of the price rewrites my build-vs-buy math — the only real question is sovereignty, not capability.”
V4-Pro is the most disruptive procurement event of 2026 so far. SWE-bench in the low 80s at roughly an order of magnitude below Claude Opus or GPT-5.x reorders the economics of any heavy reasoning or coding program. The binding constraint is not the model, it is vendor risk: routing US enterprise traffic through a PRC-hosted API is a non-starter for many buyers, and the trains-on-input default compounds that. The mitigating fact is the MIT open weights, which let a buyer keep inference entirely in-boundary — at the cost of a serious multi-node GPU footprint. Preview status adds roadmap uncertainty. For non-regulated teams or those with self-host capacity, this is the rational frontier pick on dollars-per-quality-point.
“DeepSeek owns the open-weights frontier — V4-Pro is the model every Western lab now has to answer on price-per-point.”
Positioned as the open-weights SOTA, V4-Pro is the spearhead of DeepSeek's "frontier quality at commodity price" strategy. Its moat is not a single benchmark but the combination: MoE efficiency, a permissive MIT license, and pricing that the closed US frontier structurally cannot match without margin destruction. That forces a wedge into the market — buyers who would never have considered a Chinese model now evaluate it because the value gap is too large to ignore. The competitive risk is reputational and geopolitical rather than technical; export-control or procurement-policy shifts could reset the board overnight. Against Qwen, GLM, and Kimi, V4-Pro currently leads the open-weights intelligence rankings.
“This is an order-of-magnitude cost wedge — V4-Pro delivers frontier output at roughly a tenth the token price, and cache hits drop input to a rounding error.”
This is where V4-Pro embarrasses the US frontier. At $0.435/M input and $0.87/M output — and with the 75% cut now permanent rather than promotional — V4-Pro lands roughly 10-20x cheaper per token than Claude Opus 4.5 or GPT-5.x for comparable benchmark output. Cache-hit input at $0.003625/M makes repeated-context agent loops essentially free on the input side. Bills are predictable and the open-weights option caps exposure to provider price hikes entirely. The one non-unit-economics line item is geopolitical: budget a contingency for a forced provider switch if export-control or procurement rules change. On pure intelligence-per-dollar, nothing at this capability tier comes close.
“OpenAI-compatible endpoint, exposed reasoning traces, and a 384K output budget — migration is hours, and long code-gen finally finishes in one call.”
For a hands-on builder, V4-Pro is pleasant. The API is OpenAI-compatible, so swapping in is a base-URL and key change. The Thinking/Non-Thinking toggle plus High/Max effort tiers keep routing simple, and exposed reasoning content is invaluable for debugging agent loops. Tool calling, JSON mode, and structured output all work; the 384K output ceiling finally lets long-form code generation complete without chunking. The rough edges: SDK ergonomics and English-language docs trail OpenAI/Anthropic, error messages can be terse, and preview rate limits are tighter than steady-state. Open weights make local experimentation cheap if you have the hardware. No batch API yet.
“On hard problems it matches the big closed models; the wait in Max mode and the PRC content guardrails are the only tells.”
For a heavy daily user, V4-Pro is essentially indistinguishable from GPT-5.x or Claude on the bulk of real tasks, and noticeably better than free tiers on long-document and coding work. Output quality is high; the model is more willing to engage with technical content than Claude and refuses less on everyday topics. The trade-offs are latency in Max-effort thinking mode (several seconds to longer on hard queries) and PRC-aligned guardrails that return aligned responses on a narrow set of politically sensitive subjects, which users outside China may find limiting. Tone is competent but a shade more neutral and formal than Claude's. As a free option via chat.deepseek.com, the everyday value is hard to beat.
“The price is real and historic — but 'frontier' rests on self-reported, single-mode Max scores that nobody has independently reproduced yet.”
The cost story is genuine and the open weights are verifiable, so the headline is not hype. The caveats are specific. First, the marquee numbers (SWE-bench 80.6, LiveCodeBench 93.5, GPQA 90.1) are Pro-Max — the highest-effort reasoning configuration — and non-thinking-mode performance is materially lower; comparisons should match modes. Second, DeepSeek's AIME claims are flagged "internal only" by aggregators pending reproduction, so treat unreproduced figures as provisional. Third, the hosted API trains on input by default and stores data in the PRC, which is a real, not theoretical, governance exposure. Fourth, "1M context" benchmarks well on MRCR but real-world degradation at extreme lengths is under-tested. The model is excellent value; the asterisks are about mode-matching, reproduction, and governance — not capability.
- **Long-context coding agents** working across entire repositories where 80%+ SWE-bench at a tenth of the cost reorders build-vs-buy math. - **High-volume reasoning pipelines** where per-token economics dominate ROI and the workload is not residency-restricted. - **Self-hosted frontier deployments** for teams with GPU capacity that want frontier quality with zero vendor lock-in. - **Research and document analysis** over 500K-1M token windows.
Roughly 10-20x cheaper per token than US frontier flagships for comparable benchmark output, and the 75% discount is now permanent list pricing, not a promo.
Yes — the weights are MIT-licensed and self-hostable on your own AWS/Azure/GCP/on-prem GPUs, which keeps all data in your boundary. The first-party API, by contrast, stores data on PRC servers.
A multi-node GPU cluster — the 1.6T MoE at FP4/FP8-mixed needs roughly 900GB+ of VRAM (8x H200 floor, realistically 16+ GPUs). Most teams use a neutral inference provider instead.
Its top-mode coding and reasoning scores are frontier-class, but they are the highest-effort (Max) configuration and some math claims are not yet third-party reproduced. Run your own evals in the mode you will deploy.
No. V4-Pro is text-only despite some secondary coverage calling V4 multimodal.
It is preview, not GA. For guaranteed stability today, V3.2 is the GA fallback in the family.
Last verified 2026-05-27