by DeepSeek · DeepSeek V4 family · best for frontier-adjacent quality at the lowest cost in market
DeepSeek V4-Flash is the volume tier of the V4 family — a 284B-parameter Mixture-of-Experts model that activates just 13B parameters per token, inherits V4-Pro's 1M-token context and CSA/HCA sparse attention, and serves it at $0.14 in / $0.28 out per 1M tokens. It shipped as a preview on 2026-04-24 with MIT-licensed open weights, and it is the default model behind the legacy `deepseek-chat` (non-thinking) and `deepseek-reasoner` (thinking) aliases, which retire 2026-07-24. The single sentence a buyer needs: for high-throughput chat, agent inner loops, and batch processing where per-token cost is the binding constraint, V4-Flash is the cheapest frontier-adjacent model in the market by a wide margin. - **Provider:** DeepSeek - **Released:** 2026-04-24 (preview) - **Status:** Preview (no GA announced) - **Context window:** 1,000,000 tokens - **Max output:** 384,000 tokens - **Modalities:** Text in / text out - **Knowledge cutoff:** 2026-02 - **Headline price:** $0.14 in / $0.28 out per 1M tokens
| Benchmark | Score | Source |
|---|---|---|
| Humanity's Last Exam | 34.8% | huggingface.co 2026-04-24T00:00:00.000Z |
| MMLU-Pro | 86.2% | huggingface.co 2026-04-24T00:00:00.000Z |
| SimpleQA | 34.1% | huggingface.co 2026-04-24T00:00:00.000Z |
| GPQA Diamond | 88.1% | huggingface.co 2026-04-24T00:00:00.000Z |
| LiveCodeBench | 91.6% | huggingface.co 2026-04-24T00:00:00.000Z |
| MRCR Long Context | 78.7% | huggingface.co 2026-04-24T00:00:00.000Z |
| SWE-bench Verified | 79% | huggingface.co 2026-04-24T00:00:00.000Z |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“Flash makes quality-tiering a routing decision, not a vendor decision — default to it, escalate to Pro only on hard tasks.”
V4-Flash is the strategic pick for any platform pushing high token volume through an LLM. At $0.14/$0.28 with a 1M context and SWE-bench within ~1.6 points of the flagship, the unit economics are unlike anything else in the market. Because Pro and Flash share the same sparse-attention stack and API, complexity-based routing is an internal decision rather than a multi-vendor integration. The sovereignty calculus is identical to the rest of the family — the hosted API is a non-starter for most regulated US workloads — but Flash's smaller footprint makes in-boundary self-hosting genuinely affordable, which materially lowers the vendor-risk ceiling for teams with even modest GPU capacity.
“Flash is the pricing benchmark the rest of the industry now has to answer — it reframes 'cheap' for the entire mid-tier.”
Positioned as the volume workhorse, V4-Flash is DeepSeek's wedge into the high-throughput mid-tier dominated by GPT-5.x-mini, Gemini Flash, and Qwen Plus. Its differentiation is the same combination as Pro — MoE efficiency, MIT license, structural price advantage — but aimed at the segment where token cost matters most, so the pressure on rivals is sharpest here. The strategic risk is that mid-tier is the most contested band, with Qwen, GLM, and Kimi all shipping cheap capable models; Flash leads on the context-plus-price combination but the moat is thinner than at the frontier. Geopolitical fragility applies as always.
“This is where the order-of-magnitude wedge hits hardest — a five-figure monthly GPT-5-mini workload lands in the low three figures here.”
Flash is the sharpest expression of DeepSeek's cost thesis. At $0.14/$0.28 — and cache hits at $0.0028 input — a high-volume workload that costs five figures monthly on a US mid-tier model lands in the low three figures here for comparable quality on most tasks. Pricing is transparent, the Pro/Flash split makes cost-tiering by request complexity straightforward, and open weights cap exposure to price hikes. The cautions: reasoning-token volume in High/Max mode inflates output bills, so budget on actual mode mix, and the geopolitical contingency line applies. On intelligence-per-dollar at the volume tier, nothing is close.
“Same OpenAI-compatible API as Pro, same exposed traces, 384K output — wire it as the default and route up only when you must.”
For a builder, Flash is the obvious default. Same OpenAI-compatible endpoint and Thinking/Non-Thinking toggle as Pro, exposed reasoning content for debugging, and a 384K output budget that lets even long-form code-gen finish in one call — historically a Pro-tier-only luxury. Tool calls, JSON mode, and structured output all work. Open weights at ~158B make local dev and self-hosted prod realistic on modest hardware. The friction: docs feel Chinese-first in places, English examples for niche features are sparse, preview rate limits are tight, and there is no batch API. The right pattern is Flash-as-default with a Pro escalation path for hard tasks.
“On everyday tasks you can't tell it from a tier-one model; it only falls behind on the hardest reasoning and coding.”
For a heavy daily user, Flash is quick, helpful, and competent — on the bulk of everyday queries it is indistinguishable from a tier-one model, and as a free option via chat.deepseek.com the value is excellent. It lags Claude or GPT-5.x noticeably on the hardest reasoning and specialized coding, but those are the minority of consumer queries, and dialing thinking mode up closes much of the gap at the cost of latency. Content policy follows DeepSeek norms — generally permissive with PRC-aligned guardrails on a narrow set of topics. Tone is competent if slightly neutral.
“The 'within 1.6 points of the flagship' line only holds in Max mode — in plain chat mode Flash is a normal cheap model.”
The price is real and the open weights are verifiable, so the value story stands. The asterisk is mode: every marquee Flash number (SWE-bench 79.0, LiveCodeBench 91.6, GPQA 88.1) is the High/Max reasoning configuration. In non-thinking mode, LiveCodeBench falls to 55.2 and GPQA to 71.2 — i.e., a 13B-active model behaves like one until you pay the reasoning-token tax. Buyers comparing Flash's "frontier-adjacent" headline against a rival's default mode are not comparing like for like. Add the family-wide governance caveats — trains-on-input, PRC storage, no opt-out — and preview volatility. Excellent value; the headline just needs the mode footnote.
- **High-volume agentic workflows** — classification, extraction, summarization, routing — where per-token cost is the binding constraint. - **Batch processing of long documents** — RAG ingest, contract review at scale — leveraging the 1M context and cache discount. - **Drop-in replacement for legacy deepseek-chat workloads**, which migrate automatically. - **Cost-sensitive consumer chat products** at scale, with V4-Pro reserved for the hardest requests via routing.
A 13B-active MoE is far cheaper to serve than dense models of equivalent quality, and DeepSeek prices aggressively to win the volume tier. Cache hits drop input to $0.0028/M.
No — `deepseek-chat` and `deepseek-reasoner` now resolve to V4-Flash automatically. Migrate to the explicit `deepseek-v4-flash` ID before the legacy aliases retire on 2026-07-24.
Close, but only in High/Max thinking mode. Non-thinking mode is a clear step down. Benchmark in the mode you will deploy.
Yes — at ~158B in FP4/FP8-mixed it runs on roughly 2-4 H200-class GPUs (~180GB+ VRAM), far cheaper than the 1.6T Pro. That also solves the data-residency problem.
Well — 1M context with MRCR-1M 78.7 in Max mode, plus the cache discount makes repeated long-context passes economical.
It is preview. For guaranteed GA stability, V3.2 remains the family fallback.
Last verified 2026-05-27