DeepSeek V4-Flash

Q: Do my existing deepseek-chat calls break?

No — deepseek-chat and deepseek-reasoner now resolve to V4-Flash automatically. Migrate to the explicit deepseek-v4-flash ID before the legacy aliases retire on 2026-07-24.

previewLatest Flash

by DeepSeek · DeepSeek V4 family · best for frontier-adjacent quality at the lowest cost in market

ReasoningCodingCost-OptimizedOpen-WeightsLong-Context

8.7

AI Panel Score

Value 9.9/10

DeepSeek V4-Flash is the volume tier of the V4 family — a 284B-parameter Mixture-of-Experts model that activates just 13B parameters per token, inherits V4-Pro's 1M-token context and CSA/HCA sparse attention, and serves it at $0.14 in / $0.28 out per 1M tokens. It shipped as a preview on 2026-04-24 with MIT-licensed open weights, and it is the default model behind the legacy deepseek-chat (non-thinking) and deepseek-reasoner (thinking) aliases, which retire 2026-07-24. The single sentence a buyer needs: for high-throughput chat, agent inner loops, and batch processing where per-token cost is the binding constraint, V4-Flash is the cheapest frontier-adjacent model in the market by a wide margin.

Compare this model All DeepSeek V4 versions

What's new

284B total / 13B active MoE — the workhorse tier of the V4 lineup, built on the same sparse-attention stack as V4-Pro.
Real 1M-token context and 384K output budget at a workhorse price point — a combination no rival matches today.
Becomes the default behind the legacy deepseek-chat and deepseek-reasoner aliases (non-thinking and thinking respectively); those aliases retire 2026-07-24, 15:59 UTC.
Trails V4-Pro by only ~1.6 SWE-bench points (79.0 vs 80.6) while activating roughly a quarter of the parameters.
Open weights on Hugging Face under MIT; ~158B in safetensors (FP4/FP8-mixed) make self-hosting realistic on a modest multi-GPU node.

Benchmarks

Benchmark	Score	Source
Humanity's Last Exam	34.8%	huggingface.co 2026-04-24T00:00:00.000Z
MMLU-Pro	86.2%	huggingface.co 2026-04-24T00:00:00.000Z
SimpleQA	34.1%	huggingface.co 2026-04-24T00:00:00.000Z
GPQA Diamond	88.1%	huggingface.co 2026-04-24T00:00:00.000Z
LiveCodeBench	91.6%	huggingface.co 2026-04-24T00:00:00.000Z
MRCR Long Context	78.7%	huggingface.co 2026-04-24T00:00:00.000Z
SWE-bench Verified	79%	huggingface.co 2026-04-24T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker8.5/10

“Flash makes quality-tiering a routing decision, not a vendor decision — default to it, escalate to Pro only on hard tasks.”

V4-Flash is the strategic pick for any platform pushing high token volume through an LLM. At $0.14/$0.28 with a 1M context and SWE-bench within ~1.6 points of the flagship, the unit economics are unlike anything else in the market. Because Pro and Flash share the same sparse-attention stack and API, complexity-based routing is an internal decision rather than a multi-vendor integration. The sovereignty calculus is identical to the rest of the family — the hosted API is a non-starter for most regulated US workloads — but Flash's smaller footprint makes in-boundary self-hosting genuinely affordable, which materially lowers the vendor-risk ceiling for teams with even modest GPU capacity.

Strategic Fit 9Vendor Risk 6.5Roadmap Confidence 7

Pros

Market-defining unit economics
same-family routing to Pro
affordable self-host

Cons

PRC residency on the API
preview status
non-thinking quality gap

Right for: High-volume platforms optimizing cost-per-request

Avoid if: You need a certified managed vendor with data guarantees

Domain Strategist8.5/10

“Flash is the pricing benchmark the rest of the industry now has to answer — it reframes 'cheap' for the entire mid-tier.”

Positioned as the volume workhorse, V4-Flash is DeepSeek's wedge into the high-throughput mid-tier dominated by GPT-5.x-mini, Gemini Flash, and Qwen Plus. Its differentiation is the same combination as Pro — MoE efficiency, MIT license, structural price advantage — but aimed at the segment where token cost matters most, so the pressure on rivals is sharpest here. The strategic risk is that mid-tier is the most contested band, with Qwen, GLM, and Kimi all shipping cheap capable models; Flash leads on the context-plus-price combination but the moat is thinner than at the frontier. Geopolitical fragility applies as always.

Competitive Positioning 8.5Differentiation 8Market Timing 8.5

Pros

Sets the mid-tier price floor
1M context at workhorse price
open weights

Cons

Crowded segment
thinner moat than the frontier

Right for: Teams standardizing a cheap default model

Avoid if: Buyers disqualify Chinese-origin models

Finance Lead9.7/10

“This is where the order-of-magnitude wedge hits hardest — a five-figure monthly GPT-5-mini workload lands in the low three figures here.”

Flash is the sharpest expression of DeepSeek's cost thesis. At $0.14/$0.28 — and cache hits at $0.0028 input — a high-volume workload that costs five figures monthly on a US mid-tier model lands in the low three figures here for comparable quality on most tasks. Pricing is transparent, the Pro/Flash split makes cost-tiering by request complexity straightforward, and open weights cap exposure to price hikes. The cautions: reasoning-token volume in High/Max mode inflates output bills, so budget on actual mode mix, and the geopolitical contingency line applies. On intelligence-per-dollar at the volume tier, nothing is close.

Cost Efficiency 9.9Pricing Transparency 9Value per Dollar 9.9

Pros

Lowest cost in market
aggressive cache discount
affordable self-host
transparent pricing

Cons

Reasoning-mode token inflation
geopolitical contingency

Right for: Any high-volume, cost-sensitive program

Avoid if: Compliance forces a certified premium vendor

Domain Practitioner8.5/10

“Same OpenAI-compatible API as Pro, same exposed traces, 384K output — wire it as the default and route up only when you must.”

For a builder, Flash is the obvious default. Same OpenAI-compatible endpoint and Thinking/Non-Thinking toggle as Pro, exposed reasoning content for debugging, and a 384K output budget that lets even long-form code-gen finish in one call — historically a Pro-tier-only luxury. Tool calls, JSON mode, and structured output all work. Open weights at ~158B make local dev and self-hosted prod realistic on modest hardware. The friction: docs feel Chinese-first in places, English examples for niche features are sparse, preview rate limits are tight, and there is no batch API. The right pattern is Flash-as-default with a Pro escalation path for hard tasks.

API Ergonomics 8.5Tool/Agent Support 8Reliability 8

Pros

Drop-in OpenAI compatibility
exposed CoT
big output budget
cheap self-host

Cons

Chinese-first docs
tight preview limits
no batch API

Right for: Builders wanting a cheap capable default

Avoid if: You need mature first-party SDKs and managed SLAs

Power User8/10

“On everyday tasks you can't tell it from a tier-one model; it only falls behind on the hardest reasoning and coding.”

For a heavy daily user, Flash is quick, helpful, and competent — on the bulk of everyday queries it is indistinguishable from a tier-one model, and as a free option via chat.deepseek.com the value is excellent. It lags Claude or GPT-5.x noticeably on the hardest reasoning and specialized coding, but those are the minority of consumer queries, and dialing thinking mode up closes much of the gap at the cost of latency. Content policy follows DeepSeek norms — generally permissive with PRC-aligned guardrails on a narrow set of topics. Tone is competent if slightly neutral.

Output Quality 8Speed 8Everyday Usefulness 8

Pros

Fast
competent on everyday tasks
free in the UI

Cons

Falls behind on hardest tasks
PRC content guardrails

Right for: Everyday use and high-QPS chat products

Avoid if: Your workload is dominated by the hardest reasoning

Skeptic7.5/10

“The 'within 1.6 points of the flagship' line only holds in Max mode — in plain chat mode Flash is a normal cheap model.”

The price is real and the open weights are verifiable, so the value story stands. The asterisk is mode: every marquee Flash number (SWE-bench 79.0, LiveCodeBench 91.6, GPQA 88.1) is the High/Max reasoning configuration. In non-thinking mode, LiveCodeBench falls to 55.2 and GPQA to 71.2 — i.e., a 13B-active model behaves like one until you pay the reasoning-token tax. Buyers comparing Flash's "frontier-adjacent" headline against a rival's default mode are not comparing like for like. Add the family-wide governance caveats — trains-on-input, PRC storage, no opt-out — and preview volatility. Excellent value; the headline just needs the mode footnote.

Claim Accuracy 7Weakness Severity 6.5Hype vs Reality 7.5

Pros

Verifiable open weights
genuine price advantage
transparent multi-mode card

Cons

Headline scores are Max-only
trains-on-input
preview volatility

Right for: Buyers who benchmark in their deployment mode

Avoid if: You assume the marquee numbers apply in plain chat mode

Strengths

Lowest-cost frontier-adjacent model in the market by a wide margin ($0.14/$0.28).
1M-token context and 384K output at a workhorse price — unique combination.
Default behind legacy deepseek-chat/deepseek-reasoner, so migration is automatic for existing users.
MIT open weights with a modest enough footprint (~158B) for affordable in-boundary self-hosting.
Cache-hit input at $0.0028/M makes retrieval-heavy loops nearly free on the input side.

Limitations

"Frontier-adjacent" only in High/Max thinking mode; non-thinking-mode quality is a clear step down.
~1.6 SWE-bench points behind V4-Pro and further behind on the hardest reasoning — not the top-of-envelope pick.
Preview status: APIs and rate limits may shift before GA.
Text-only; same China data-residency and trains-on-input exposure as the family.
Thinner SDK/docs polish than OpenAI/Anthropic; no batch API.

Best use cases

High-volume agentic workflows — classification, extraction, summarization, routing — where per-token cost is the binding constraint.
Batch processing of long documents — RAG ingest, contract review at scale — leveraging the 1M context and cache discount.
Drop-in replacement for legacy deepseek-chat workloads, which migrate automatically.
Cost-sensitive consumer chat products at scale, with V4-Pro reserved for the hardest requests via routing.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture Capabilities Benchmark analysis Speed & latency Pricing analysis Deployment & access Safety & privacy Ecosystem & tooling

Architecture

V4-Flash is a sparse Mixture-of-Experts model: 284B total parameters, ~13B activated per token (Hugging Face model card). It uses the same hybrid attention architecture as V4-Pro — Compressed Sparse Attention plus a Heavily Compressed Attention head — to deliver 1M-token context economically. Training used more than 32T tokens with FP4 quantization-aware training on MoE expert weights and FP8 elsewhere. Expert count, layer count, tokenizer, and vocab size are undisclosed. The packaged model is ~158B in safetensors format under mixed FP4/FP8 precision. It is text-only.

Capabilities

V4-Flash punches far above its 13B active-parameter weight, justifying coding (8.5), reasoning (8.5), and math (8.5) scores that sit just under V4-Pro: SWE-bench Verified 79.0, LiveCodeBench 91.6, GPQA Diamond 88.1, MMLU-Pro 86.2 in Flash-Max mode (HF card). Critically, those are the High/Max thinking-mode scores; non-thinking-mode figures are much lower (e.g., LiveCodeBench drops to 55.2, GPQA to 71.2), so the model is "frontier-adjacent" only when reasoning effort is dialed up. Agentic (8.0) and long-context (8.0) are strong — MRCR-1M 78.7. Multilingual (8.0) is solid in English and Chinese. Vision, OCR, and real-time data are zero. Creative writing (7.0) errs neutral and works better as a draft engine than a final-copy engine. Function calling (8.0) and instruction-following (8.0) are competent via the OpenAI-compatible interface. Safety calibration (6.5) mirrors the family: permissive defaults, PRC-aligned guardrails on a narrow topic set.

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
MMLU-Pro	86.2 (Flash-Max)	up from V3.2 ~85	within ~1 pt of V4-Pro	HF card
GPQA Diamond	88.1 (Flash-Max)	up from V3.x	~2 pts behind V4-Pro	HF card
SWE-bench Verified	79.0 (Flash-Max)	new tier	~1.6 pts behind V4-Pro (80.6)	HF card
LiveCodeBench	91.6 (Flash-Max)	new tier	~2 pts behind V4-Pro (93.5)	HF card
HLE	34.8 (Flash-Max)	up from V3.2 (30.6)	~3 pts behind V4-Pro	HF card
SimpleQA-Verified	34.1 (Flash-Max)	n/a	trails V4-Pro (57.9)	HF card
MRCR 1M	78.7 (Flash-Max)	new 1M context	~5 pts behind V4-Pro	HF card
Codeforces (rating)	3052 (Flash-Max)	n/a	near V4-Pro (3206)	HF card

All scores are Flash-Max (High-effort reasoning); non-thinking-mode scores are materially lower (LiveCodeBench 55.2, GPQA 71.2, MMLU-Pro 83.0). AIME is left null — DeepSeek's AIME figures are flagged provisional pending third-party reproduction.

Speed & latency

With only 13B active parameters, V4-Flash is the fast tier of the family: non-thinking responses are snappy and well-suited to interactive chat and high-QPS agent loops. Thinking mode (High/Max) adds latency proportional to the reasoning-token budget. DeepSeek does not publish official tokens/sec, and preview-tier serving throughput is below steady-state. Latency tier: fast.

Pricing analysis

Surface	Cost	Notes
API input (cache miss)	$0.14 / 1M tok
API input (cache hit)	$0.0028 / 1M tok	~98% discount on repeated context
API output	$0.28 / 1M tok	reasoning tokens billed at output rate
Direct UI	Free	chat.deepseek.com web/app
Open weights	$0	HF download (~158B safetensors); modest multi-GPU node to self-host
Rate limits	Preview-tier	undisclosed; expect tightening at GA

Deployment & access

First-party OpenAI-compatible API at api.deepseek.com, hosted on PRC infrastructure. The legacy deepseek-chat/deepseek-reasoner aliases now resolve to V4-Flash, so existing integrations migrate automatically (those aliases retire 2026-07-24). Open weights on Hugging Face under MIT license: at ~158B in FP4/FP8-mixed, self-hosting is realistic on a modest multi-GPU node (roughly 2-4 H200-class GPUs, ~180GB+ VRAM), which makes Flash a far easier self-host than the 1.6T Pro. Neutral inference providers (OpenRouter, DeepInfra, Novita) host it for teams that want managed access without the PRC API. No first-party Bedrock/Vertex/Azure offering.

Safety & privacy

Identical posture to the rest of the family: data stored on PRC servers under Chinese law, trains-on-input by default (de-identified per DeepSeek's policy), no documented API opt-out, and no SOC2/HIPAA/GDPR/ISO27001 certification on the first-party service. Content moderation follows PRC norms. For regulated buyers the MIT open weights are the compliance path — Flash is especially attractive here because its smaller footprint makes in-boundary self-hosting genuinely affordable.

Ecosystem & tooling

OpenAI-compatible API with Python and TypeScript SDKs (the OpenAI SDK works directly), LangChain / LlamaIndex / Vercel AI SDK integrations, and serving via OpenRouter, DeepInfra, and Novita. Wired into coding tools like Cline and Kilo Code. As the default behind the legacy aliases, it inherits the existing DeepSeek developer base; popularity is growing fast.

Buyer questions

Why is it so cheap?

A 13B-active MoE is far cheaper to serve than dense models of equivalent quality, and DeepSeek prices aggressively to win the volume tier. Cache hits drop input to $0.0028/M.

Do my existing deepseek-chat calls break?

No — deepseek-chat and deepseek-reasoner now resolve to V4-Flash automatically. Migrate to the explicit deepseek-v4-flash ID before the legacy aliases retire on 2026-07-24.

Is it actually frontier quality?

Close, but only in High/Max thinking mode. Non-thinking mode is a clear step down. Benchmark in the mode you will deploy.

Can I self-host it affordably?

Yes — at ~158B in FP4/FP8-mixed it runs on roughly 2-4 H200-class GPUs (~180GB+ VRAM), far cheaper than the 1.6T Pro. That also solves the data-residency problem.

How does it handle long documents?

Well — 1M context with MRCR-1M 78.7 in Max mode, plus the cache discount makes repeated long-context passes economical.

Is it production-stable?

It is preview. For guaranteed GA stability, V3.2 remains the family fallback.

Comparable models

GPT-5.4 mini

broader ecosystem, native multimodality, managed-cloud availability; roughly 3-7x more expensive per token and closed.

Gemini 2.5 Flash

comparable price band, weaker on coding/agentic tasks, native multimodal and 1M context.

Qwen 3 (Plus tier) / GLM-5.1 Flash

closest China-origin open-weights peers in the volume tier; Flash leads on the context-plus-price combination.

Sources

Primary references used to verify this review.

Model specs

Input price: $0.14 / Mtok
Output price: $0.28 / Mtok
Cached input: $0.00 / Mtok
Batch (in/out): —
Context window: 1M tokens
Max output: 384K tokens
Knowledge cutoff: 2026-02
Released: 2026-04-23
Modalities: text → text
Output speed: Not profiled
License: Open weights (MIT)
Clouds: First-party API

Other DeepSeek V4 versions

Last verified 2026-05-27