How much cheaper is it, really?

Roughly 10-20x cheaper per token than US frontier flagships for comparable benchmark output, and the 75% discount is now permanent list pricing, not a promo.

Can I avoid sending data to China?

Yes — the weights are MIT-licensed and self-hostable on your own AWS/Azure/GCP/on-prem GPUs, which keeps all data in your boundary. The first-party API, by contrast, stores data on PRC servers.

What hardware do I need to self-host?

A multi-node GPU cluster — the 1.6T MoE at FP4/FP8-mixed needs roughly 900GB+ of VRAM (8x H200 floor, realistically 16+ GPUs). Most teams use a neutral inference provider instead.

Is it really frontier, or just cheap?

Its top-mode coding and reasoning scores are frontier-class, but they are the highest-effort (Max) configuration and some math claims are not yet third-party reproduced. Run your own evals in the mode you will deploy.

No. V4-Pro is text-only despite some secondary coverage calling V4 multimodal.

Is it production-stable?

It is preview, not GA. For guaranteed stability today, V3.2 is the GA fallback in the family.

DeepSeek V4-Pro Review — Benchmarks, Pricing & AI Panel Verdict

Benchmark	Score	Source
Humanity's Last Exam	37.7%	huggingface.co 2026-04-24T00:00:00.000Z
MMLU-Pro	87.5%	huggingface.co 2026-04-24T00:00:00.000Z
SimpleQA	57.9%	huggingface.co 2026-04-24T00:00:00.000Z
HumanEval	76.8%	huggingface.co 2026-04-24T00:00:00.000Z
GPQA Diamond	90.1%	huggingface.co 2026-04-24T00:00:00.000Z
LiveCodeBench	93.5%	huggingface.co 2026-04-24T00:00:00.000Z
MRCR Long Context	83.5%	huggingface.co 2026-04-24T00:00:00.000Z
LMArena Coding Elo	1287	artificialanalysis.ai 2026-05-15T00:00:00.000Z
SWE-bench Verified	80.6%	huggingface.co 2026-04-24T00:00:00.000Z
Artificial Analysis Index	52	artificialanalysis.ai 2026-05-15T00:00:00.000Z

Architecture

V4-Pro is a sparse Mixture-of-Experts model: 1.6T total parameters, ~49B activated per token (Hugging Face model card). The headline engineering is the hybrid attention design — Compressed Sparse Attention compresses KV entries along the sequence dimension and uses a lightweight indexer to select top-k blocks per query, while a Heavily Compressed Attention head applies aggressive compression for cheap long-context prefill. Training used more than 32T tokens; DeepSeek reports FP4 quantization-aware training on MoE expert weights with FP8 elsewhere, and a switch to the Muon optimizer for faster convergence at trillion-parameter scale. Expert count, layer count, tokenizer, and vocab size are not disclosed. The model is text-only — despite some secondary coverage describing V4 as "multimodal," DeepSeek's own announcement and model card describe text input/output only.

Capabilities

V4-Pro routes coding, reasoning, and tool use through one endpoint with a Thinking/Non-Thinking toggle and High/Max effort tiers, justifying high coding (9.0), reasoning (9.0), and math (9.0) scores: SWE-bench Verified 80.6, LiveCodeBench 93.5, GPQA Diamond 90.1, and a Codeforces rating of 3206 in Pro-Max mode (HF card). Agentic capability (8.5) is strong — Artificial Analysis ranks V4-Pro Max near the top of its real-world agentic-work Elo. Long-context (8.5) is genuine: MRCR-1M scores 83.5, and the sparse attention does not collapse on long-haystack retrieval. Multilingual (8.5) is strong in English and Chinese (C-Eval 93.1 base). Vision, OCR, and real-time data are zero — V4-Pro has no image input and no native web retrieval. Creative writing (7.5) and instruction-following (8.5) are competent but a touch more neutral than Claude or GPT-5.x. Function calling (8.0) works through the OpenAI-compatible interface but the SDK polish trails US labs. Safety calibration (6.5) reflects permissive defaults with PRC-aligned guardrails.

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
MMLU-Pro	87.5 (Pro-Max)	up from V3.2 ~85	within 1-2 pts of frontier	HF card
GPQA Diamond	90.1 (Pro-Max)	up from V3.2	parity with GPT-5.x / Opus 4.x tier	HF card
SWE-bench Verified	80.6 (Pro-Max)	+14.6 vs V3.1 (66.0)	within ~0.3 pts of Claude Opus 4.5 (~80.9)	HF card
LiveCodeBench	93.5 (Pro-Max)	up from V3.2 (83.3)	frontier-class	HF card
HLE	37.7 (Pro-Max)	up from V3.2 (30.6)	competitive with frontier reasoning	HF card
SimpleQA-Verified	57.9 (Pro-Max)	n/a	mid-pack	HF card
MRCR 1M	83.5	n/a (new 1M context)	strong long-context	HF card
HumanEval (base)	76.8	n/a	base-model figure	HF card
Artificial Analysis Index	52 (Reasoning Max)	up from R1 path	#1 open-weights tier	AA
LMArena Coding Elo	1287	n/a	top-3 coding	AA

Pro-Max scores are the High-compute reasoning mode; non-thinking-mode scores are materially lower. DeepSeek's published AIME figures are labelled "internal claim only" by aggregators pending third-party reproduction, so AIME is left null here.

Speed & latency

DeepSeek does not publish official tokens/sec for V4-Pro, and preview-tier serving throughput is below steady-state. In practice non-thinking mode is responsive; Max-effort thinking mode burns large reasoning-token budgets and adds noticeable latency on hard problems. Self-hosted throughput depends entirely on the (substantial) GPU footprint. Latency tier: medium.

Pricing analysis

Surface	Cost	Notes
API input (cache miss)	$0.435 / 1M tok	75% cut made permanent 2026-05-22
API input (cache hit)	$0.003625 / 1M tok	~99% discount on repeated context
API output	$0.87 / 1M tok	reasoning tokens billed at output rate
Direct UI	Free	chat.deepseek.com web/app
Open weights	$0	HF download; large GPU footprint to self-host
Rate limits	Preview-tier	undisclosed; expect tightening at GA

Deployment & access

Two paths. First-party API is OpenAI-compatible at api.deepseek.com, hosted on PRC infrastructure — a non-starter for many regulated US buyers but trivial to integrate for everyone else. Second, open weights on Hugging Face under MIT license: fully self-hostable, including inside EU/US data boundaries (AWS/Azure/GCP/on-prem GPU), which eliminates the data-residency problem. Self-hosting 1.6T MoE parameters is non-trivial — even at FP4/FP8-mixed it is a multi-node deployment (8x H200 floor, realistically 16+ GPUs), so most teams without that scale will use a neutral inference provider (OpenRouter, DeepInfra, Novita). No first-party Bedrock/Vertex/Azure managed offering.

Safety & privacy

DeepSeek's privacy policy stores data on servers in the People's Republic of China under Chinese law, and DeepSeek states it may use a portion of user input to construct training data (de-identified). There is no documented API data-opt-out and no SOC2/HIPAA/GDPR/ISO27001 certification on the first-party service. Content moderation follows PRC norms: permissive on technical and everyday topics, aligned refusals on a narrow set of politically sensitive subjects. For enterprise buyers, the honest read is that the hosted API carries real data-sovereignty exposure; the MIT-licensed open weights are the compliance escape hatch because inference can be kept entirely inside the buyer's own boundary.

Ecosystem & tooling

DeepSeek ships an OpenAI-compatible API with Python and TypeScript SDKs; the OpenAI SDK works directly. It integrates with LangChain, LlamaIndex, and the Vercel AI SDK, and is served by OpenRouter, DeepInfra, and Novita. It is wired into coding tools like Cursor, Cline, and Kilo Code. Popularity is growing fast on the back of the cost story and open-weights leadership.

DeepSeek V4-Pro

What's new

Benchmarks

AI Panel Review

Strengths

Limitations

Best use cases

Deep dive

Architecture

Capabilities

Benchmark analysis

Speed & latency

Pricing analysis

Deployment & access

Safety & privacy

Ecosystem & tooling

Buyer questions

Comparable models

Sources

Model specs

Other DeepSeek V4 versions