DeepSeek V4-Flash

previewLatest Flash

by DeepSeek · DeepSeek V4 family · best for frontier-adjacent quality at the lowest cost in market

ReasoningCodingCost-OptimizedOpen-WeightsLong-Context
8.7
AI Panel Score
Value 9.9/10

DeepSeek V4-Flash is the volume tier of the V4 family — a 284B-parameter Mixture-of-Experts model that activates just 13B parameters per token, inherits V4-Pro's 1M-token context and CSA/HCA sparse attention, and serves it at $0.14 in / $0.28 out per 1M tokens. It shipped as a preview on 2026-04-24 with MIT-licensed open weights, and it is the default model behind the legacy `deepseek-chat` (non-thinking) and `deepseek-reasoner` (thinking) aliases, which retire 2026-07-24. The single sentence a buyer needs: for high-throughput chat, agent inner loops, and batch processing where per-token cost is the binding constraint, V4-Flash is the cheapest frontier-adjacent model in the market by a wide margin. - **Provider:** DeepSeek - **Released:** 2026-04-24 (preview) - **Status:** Preview (no GA announced) - **Context window:** 1,000,000 tokens - **Max output:** 384,000 tokens - **Modalities:** Text in / text out - **Knowledge cutoff:** 2026-02 - **Headline price:** $0.14 in / $0.28 out per 1M tokens

What's new

  • 284B total / 13B active MoE — the workhorse tier of the V4 lineup, built on the same sparse-attention stack as V4-Pro.
  • Real 1M-token context and 384K output budget at a workhorse price point — a combination no rival matches today.
  • Becomes the default behind the legacy `deepseek-chat` and `deepseek-reasoner` aliases (non-thinking and thinking respectively); those aliases retire 2026-07-24, 15:59 UTC.
  • Trails V4-Pro by only ~1.6 SWE-bench points (79.0 vs 80.6) while activating roughly a quarter of the parameters.
  • Open weights on Hugging Face under MIT; ~158B in safetensors (FP4/FP8-mixed) make self-hosting realistic on a modest multi-GPU node.

Benchmarks

BenchmarkScoreSource
Humanity's Last Exam34.8%huggingface.co 2026-04-24T00:00:00.000Z
MMLU-Pro86.2%huggingface.co 2026-04-24T00:00:00.000Z
SimpleQA34.1%huggingface.co 2026-04-24T00:00:00.000Z
GPQA Diamond88.1%huggingface.co 2026-04-24T00:00:00.000Z
LiveCodeBench91.6%huggingface.co 2026-04-24T00:00:00.000Z
MRCR Long Context78.7%huggingface.co 2026-04-24T00:00:00.000Z
SWE-bench Verified79%huggingface.co 2026-04-24T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker8.5/10
Flash makes quality-tiering a routing decision, not a vendor decision — default to it, escalate to Pro only on hard tasks.

V4-Flash is the strategic pick for any platform pushing high token volume through an LLM. At $0.14/$0.28 with a 1M context and SWE-bench within ~1.6 points of the flagship, the unit economics are unlike anything else in the market. Because Pro and Flash share the same sparse-attention stack and API, complexity-based routing is an internal decision rather than a multi-vendor integration. The sovereignty calculus is identical to the rest of the family — the hosted API is a non-starter for most regulated US workloads — but Flash's smaller footprint makes in-boundary self-hosting genuinely affordable, which materially lowers the vendor-risk ceiling for teams with even modest GPU capacity.

Strategic Fit 9Vendor Risk 6.5Roadmap Confidence 7
Pros
  • Market-defining unit economics
  • same-family routing to Pro
  • affordable self-host
Cons
  • PRC residency on the API
  • preview status
  • non-thinking quality gap
Right for: High-volume platforms optimizing cost-per-request
Avoid if: You need a certified managed vendor with data guarantees
Domain Strategist8.5/10
Flash is the pricing benchmark the rest of the industry now has to answer — it reframes 'cheap' for the entire mid-tier.

Positioned as the volume workhorse, V4-Flash is DeepSeek's wedge into the high-throughput mid-tier dominated by GPT-5.x-mini, Gemini Flash, and Qwen Plus. Its differentiation is the same combination as Pro — MoE efficiency, MIT license, structural price advantage — but aimed at the segment where token cost matters most, so the pressure on rivals is sharpest here. The strategic risk is that mid-tier is the most contested band, with Qwen, GLM, and Kimi all shipping cheap capable models; Flash leads on the context-plus-price combination but the moat is thinner than at the frontier. Geopolitical fragility applies as always.

Competitive Positioning 8.5Differentiation 8Market Timing 8.5
Pros
  • Sets the mid-tier price floor
  • 1M context at workhorse price
  • open weights
Cons
  • Crowded segment
  • thinner moat than the frontier
Right for: Teams standardizing a cheap default model
Avoid if: Buyers disqualify Chinese-origin models
Finance Lead9.7/10
This is where the order-of-magnitude wedge hits hardest — a five-figure monthly GPT-5-mini workload lands in the low three figures here.

Flash is the sharpest expression of DeepSeek's cost thesis. At $0.14/$0.28 — and cache hits at $0.0028 input — a high-volume workload that costs five figures monthly on a US mid-tier model lands in the low three figures here for comparable quality on most tasks. Pricing is transparent, the Pro/Flash split makes cost-tiering by request complexity straightforward, and open weights cap exposure to price hikes. The cautions: reasoning-token volume in High/Max mode inflates output bills, so budget on actual mode mix, and the geopolitical contingency line applies. On intelligence-per-dollar at the volume tier, nothing is close.

Cost Efficiency 9.9Pricing Transparency 9Value per Dollar 9.9
Pros
  • Lowest cost in market
  • aggressive cache discount
  • affordable self-host
  • transparent pricing
Cons
  • Reasoning-mode token inflation
  • geopolitical contingency
Right for: Any high-volume, cost-sensitive program
Avoid if: Compliance forces a certified premium vendor
Domain Practitioner8.5/10
Same OpenAI-compatible API as Pro, same exposed traces, 384K output — wire it as the default and route up only when you must.

For a builder, Flash is the obvious default. Same OpenAI-compatible endpoint and Thinking/Non-Thinking toggle as Pro, exposed reasoning content for debugging, and a 384K output budget that lets even long-form code-gen finish in one call — historically a Pro-tier-only luxury. Tool calls, JSON mode, and structured output all work. Open weights at ~158B make local dev and self-hosted prod realistic on modest hardware. The friction: docs feel Chinese-first in places, English examples for niche features are sparse, preview rate limits are tight, and there is no batch API. The right pattern is Flash-as-default with a Pro escalation path for hard tasks.

API Ergonomics 8.5Tool/Agent Support 8Reliability 8
Pros
  • Drop-in OpenAI compatibility
  • exposed CoT
  • big output budget
  • cheap self-host
Cons
  • Chinese-first docs
  • tight preview limits
  • no batch API
Right for: Builders wanting a cheap capable default
Avoid if: You need mature first-party SDKs and managed SLAs
Power User8/10
On everyday tasks you can't tell it from a tier-one model; it only falls behind on the hardest reasoning and coding.

For a heavy daily user, Flash is quick, helpful, and competent — on the bulk of everyday queries it is indistinguishable from a tier-one model, and as a free option via chat.deepseek.com the value is excellent. It lags Claude or GPT-5.x noticeably on the hardest reasoning and specialized coding, but those are the minority of consumer queries, and dialing thinking mode up closes much of the gap at the cost of latency. Content policy follows DeepSeek norms — generally permissive with PRC-aligned guardrails on a narrow set of topics. Tone is competent if slightly neutral.

Output Quality 8Speed 8Everyday Usefulness 8
Pros
  • Fast
  • competent on everyday tasks
  • free in the UI
Cons
  • Falls behind on hardest tasks
  • PRC content guardrails
Right for: Everyday use and high-QPS chat products
Avoid if: Your workload is dominated by the hardest reasoning
Skeptic7.5/10
The 'within 1.6 points of the flagship' line only holds in Max mode — in plain chat mode Flash is a normal cheap model.

The price is real and the open weights are verifiable, so the value story stands. The asterisk is mode: every marquee Flash number (SWE-bench 79.0, LiveCodeBench 91.6, GPQA 88.1) is the High/Max reasoning configuration. In non-thinking mode, LiveCodeBench falls to 55.2 and GPQA to 71.2 — i.e., a 13B-active model behaves like one until you pay the reasoning-token tax. Buyers comparing Flash's "frontier-adjacent" headline against a rival's default mode are not comparing like for like. Add the family-wide governance caveats — trains-on-input, PRC storage, no opt-out — and preview volatility. Excellent value; the headline just needs the mode footnote.

Claim Accuracy 7Weakness Severity 6.5Hype vs Reality 7.5
Pros
  • Verifiable open weights
  • genuine price advantage
  • transparent multi-mode card
Cons
  • Headline scores are Max-only
  • trains-on-input
  • preview volatility
Right for: Buyers who benchmark in their deployment mode
Avoid if: You assume the marquee numbers apply in plain chat mode

Strengths

  • Lowest-cost frontier-adjacent model in the market by a wide margin ($0.14/$0.28).
  • 1M-token context and 384K output at a workhorse price — unique combination.
  • Default behind legacy `deepseek-chat`/`deepseek-reasoner`, so migration is automatic for existing users.
  • MIT open weights with a modest enough footprint (~158B) for affordable in-boundary self-hosting.
  • Cache-hit input at $0.0028/M makes retrieval-heavy loops nearly free on the input side.

Limitations

  • "Frontier-adjacent" only in High/Max thinking mode; non-thinking-mode quality is a clear step down.
  • ~1.6 SWE-bench points behind V4-Pro and further behind on the hardest reasoning — not the top-of-envelope pick.
  • Preview status: APIs and rate limits may shift before GA.
  • Text-only; same China data-residency and trains-on-input exposure as the family.
  • Thinner SDK/docs polish than OpenAI/Anthropic; no batch API.

Best use cases

- **High-volume agentic workflows** — classification, extraction, summarization, routing — where per-token cost is the binding constraint. - **Batch processing of long documents** — RAG ingest, contract review at scale — leveraging the 1M context and cache discount. - **Drop-in replacement for legacy deepseek-chat workloads**, which migrate automatically. - **Cost-sensitive consumer chat products** at scale, with V4-Pro reserved for the hardest requests via routing.

Buyer questions

Why is it so cheap?

A 13B-active MoE is far cheaper to serve than dense models of equivalent quality, and DeepSeek prices aggressively to win the volume tier. Cache hits drop input to $0.0028/M.

Do my existing deepseek-chat calls break?

No — `deepseek-chat` and `deepseek-reasoner` now resolve to V4-Flash automatically. Migrate to the explicit `deepseek-v4-flash` ID before the legacy aliases retire on 2026-07-24.

Is it actually frontier quality?

Close, but only in High/Max thinking mode. Non-thinking mode is a clear step down. Benchmark in the mode you will deploy.

Can I self-host it affordably?

Yes — at ~158B in FP4/FP8-mixed it runs on roughly 2-4 H200-class GPUs (~180GB+ VRAM), far cheaper than the 1.6T Pro. That also solves the data-residency problem.

How does it handle long documents?

Well — 1M context with MRCR-1M 78.7 in Max mode, plus the cache discount makes repeated long-context passes economical.

Is it production-stable?

It is preview. For guaranteed GA stability, V3.2 remains the family fallback.

Comparable models

**GPT-5.4 mini** — broader ecosystem, native multimodality, managed-cloud availability; roughly 3-7x more expensive per token and closed.
**Gemini 2.5 Flash** — comparable price band, weaker on coding/agentic tasks, native multimodal and 1M context.
**Qwen 3 (Plus tier) / GLM-5.1 Flash** — closest China-origin open-weights peers in the volume tier; Flash leads on the context-plus-price combination.

Model specs

Input price
$0.14 / Mtok
Output price
$0.28 / Mtok
Cached input
$0.00 / Mtok
Batch (in/out)
Context window
1M tokens
Max output
384K tokens
Knowledge cutoff
2026-02
Released
2026-04-23
Modalities
text → text
Output speed
Not profiled
License
Open weights (MIT)
Clouds
First-party API

Last verified 2026-05-27