DeepSeek V4 Benchmark Contamination: The Real Story

DeepSeek V4-Pro shipped with sixteen benchmarks above the fold and 'internal claim only' in the footnote. A practitioner's guide to reading contamination via the LiveBench delta and three practical tests.

DeepSeek dropped V4-Pro on April 24, 2026, with a model card that reads like a victory lap: 90.1 on MMLU, 76.8 on HumanEval, 93.5 on LiveCodeBench, a Codeforces rating of 3206, and 81% on SWE-bench. The technical report covered roughly sixteen benchmarks across coding, reasoning, knowledge, long-context, and agentic tasks. Independent reviewers labelled every score "internal claim only" until third-party reproduction lands, and that footnote, buried under the headline numbers, is the actual story.

The gap between V4's vendor-published scores and what third parties measure on contamination-resistant suites is not a rounding error. It is the only number worth looking at. Everything else is marketing.

The benchmark stack V4 chose to win on

Look at the suite DeepSeek picked. MMLU and HumanEval are the two most-cited evaluations in the literature, and both have been demonstrably saturated since 2024. LiveCodeBench is contamination-resistant by design — it pulls freshly-released programming problems on a rolling window — but the harness configuration, problem date range, and pass@k setting are all knobs the vendor controls when reporting a single number.
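
To see how far the pass@k knob alone can move a headline number, here is a minimal sketch of the unbiased pass@k estimator that was introduced alongside HumanEval. The sample counts are invented for illustration and are not V4 results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n sampled solutions, c of them correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 3 pass the tests.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

The same ten samples read as 30% at pass@1 and above 90% at pass@5, which is why a reported score without its k setting is not comparable to anything.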

The asymmetry matters because each benchmark has a different probability of leakage into pre-training. Static benchmarks frozen years ago sit in dozens of public GitHub repositories. Their problems, solutions, and even the unit tests get scraped into web crawls. The model does not need to "see the answer key" — it just needs to have ingested any of the hundreds of derivative tutorials and solution write-ups that pattern-match the canonical questions.

| Benchmark | V4-Pro claim | Contamination resistance | Independent reproducibility |
|---|---|---|---|
| MMLU | 90.1 | Low: frozen 2020 multiple-choice | High, but signal compressed near ceiling |
| HumanEval | 76.8 | Low: 164 problems, widely solved online | High, but saturated |
| LiveCodeBench | 93.5 | Medium-high: rolling problem window | Depends on declared window and pass@k |
| SWE-bench | 81% | Medium: GitHub issues, some are pre-2024 | Verified split partially fixes leakage |
| Codeforces rating | 3206 | Variable: depends on contest dates | Live contests are clean, archived problems are not |

Notice the column nobody puts in their slide deck: independent reproducibility under controlled conditions. The numbers DeepSeek reports are not lies. They are the result of running a particular harness, with a particular prompt template, on a particular slice of data, on a particular date. Reproduction outside those four parameters is a different experiment.

What contamination actually looks like in the math

Contamination is usually framed as a binary — the model saw the test set or it did not. That framing is wrong, and it leads to the wrong defensive moves. Contamination is a graded property of the joint distribution between the model's training corpus and the evaluation set. A useful operational definition is the n-gram overlap rate between training data and benchmark items, with weighting for solution proximity.

The graded view also explains why "we removed the benchmark from training" is not the assurance it sounds like. Removing the literal benchmark file does nothing about the hundreds of derivative resources — tutorial walkthroughs, course materials, exam prep sites, blog posts explaining the answers — that mirror the same problems with the same intent. The benchmark surface form is rare in the corpus. The benchmark semantic content is everywhere.

contamination_score(B, T) = (1/n) * Σ_i max_j [ ngram_overlap(b_i, t_j) * solution_proximity(t_j) ]

where:
  B = benchmark set {b_1, ..., b_n}
  T = training corpus shards {t_1, ..., t_m}
  ngram_overlap(x, y) = |ngrams(x) ∩ ngrams(y)| / |ngrams(x)|
  solution_proximity(t_j) ∈ [0, 1]  // 1 when a worked answer sits next to the match in shard t_j, 0 when none is present
  the 1/n factor averages over benchmark items, so the score stays in [0, 1]

This is roughly the shape of what the BIG-bench and HELM teams use for their leakage audits. A score near zero says the benchmark problems do not appear in training data with their solutions nearby. A score above ~0.3 is enough to lift MMLU-style accuracy by 5-10 points without the model "knowing" anything more about the underlying domain. That is the entire mechanism by which a 7B model can occasionally beat a 70B model on a single benchmark and lose to it on everything else.
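
A minimal sketch of that score on toy data, under several assumptions of mine: word trigrams stand in for the n-grams, the sum is averaged over benchmark items so the result lands in [0, 1] (which is what a threshold like ~0.3 implies), and `toy_proximity` is a hypothetical stand-in for whatever solution-proximity heuristic an audit would actually use.

```python
from typing import Callable, Sequence


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(item: str, shard: str, n: int = 3) -> float:
    """Fraction of the item's n-grams that also occur in the shard."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(shard, n)) / len(item_grams)


def contamination_score(
    benchmark: Sequence[str],
    shards: Sequence[str],
    solution_proximity: Callable[[str], float],
    n: int = 3,
) -> float:
    """Mean over items of the best overlap-times-proximity match in any shard."""
    if not benchmark:
        return 0.0
    total = 0.0
    for item in benchmark:
        total += max(
            (ngram_overlap(item, s, n) * solution_proximity(s) for s in shards),
            default=0.0,
        )
    return total / len(benchmark)


def toy_proximity(shard: str) -> float:
    """Hypothetical heuristic: 1.0 if the shard carries an answer marker."""
    return 1.0 if "answer:" in shard.lower() else 0.0


benchmark = [
    "what is the capital of france",
    "name the largest planet in the solar system",
]
shards = [
    "quiz: what is the capital of france answer: paris",
    "a recipe for sourdough bread",
]
print(contamination_score(benchmark, shards, toy_proximity))
```

One of the two toy items appears verbatim next to its answer and the other does not appear at all, so the score comes out at 0.5. The brute-force loop recomputes shard n-grams for every item, which is fine for a sketch and hopeless for a real crawl.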

The harder problem is that DeepSeek, like every frontier lab, treats its training corpus as proprietary. So you cannot run this formula on V4's actual pre-training shards. You can only run it on the public crawl approximations and assume the lab also ingested those. That assumption is conservative: they ingested at least that much. Estimates in the contamination-audit literature suggest that public crawl reproductions capture roughly 60-75% of what a frontier lab's actual corpus contains, so the public-side score is a lower bound on the real contamination tax.

A second-order effect compounds the problem. Synthetic data generated by a previous-generation model that itself saw the benchmark gets baked into the next generation's training set. The contamination launders through one model into the next, and by the time it surfaces in benchmark scores, no single training run is identifiably "the one that saw the answers". This is the failure mode the open-source community calls benchmark laundering, and it is the reason published contamination audits keep coming back lower than the empirical drift on contamination-resistant suites would predict.

Why LiveBench is the only number I trust right now

LiveBench releases new questions monthly and draws problems from recently-released datasets, arXiv papers, news articles, and IMDb synopses. The contamination-resistance is not perfect, but it is the only public benchmark where the time-of-question-creation is later than every published frontier model's training cutoff by construction.

On the LiveBench leaderboard as of mid-May 2026, the top of the table moves on a roughly six-week cycle as new question batches land. A model that hit 0.846 on the January question set sometimes drops three to five points on the April set without any model update — that delta is the contamination signal you cannot get from any static benchmark.

What to read off the LiveBench delta

A useful diagnostic: take any model's score on the most recent LiveBench batch and subtract its score on the batch released closest to the model's training cutoff. If the delta is small (within ~2 points), the model genuinely generalises. If it is large, the model was optimised against benchmark distributions that have since drifted out from under it.
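
The arithmetic is simple enough to sketch. The batch dates, scores, and cutoff below are hypothetical, and the 2-point tolerance is the rough figure from the paragraph above rather than a calibrated threshold.

```python
from datetime import date


def livebench_delta(batch_scores: dict[date, float], training_cutoff: date) -> float:
    """Latest-batch score minus the score on the batch closest to the cutoff."""
    if not batch_scores:
        raise ValueError("need at least one scored batch")
    latest = max(batch_scores)
    nearest = min(batch_scores, key=lambda d: abs((d - training_cutoff).days))
    return batch_scores[latest] - batch_scores[nearest]


def classify(delta: float, tolerance: float = 2.0) -> str:
    """Label a delta as within tolerance or as drift."""
    return "generalises" if abs(delta) <= tolerance else "drifted"


# Hypothetical scores on a 0-100 scale, keyed by batch release date.
scores = {date(2026, 1, 15): 84.6, date(2026, 4, 15): 80.1}
delta = livebench_delta(scores, training_cutoff=date(2026, 1, 31))
print(f"delta = {delta:+.1f} -> {classify(delta)}")
```

With only one batch in hand the function returns zero, which is a reminder that the diagnostic says nothing until a second batch exists.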

Where the diagnostic fails

The diagnostic does not work for models released too recently to have accumulated post-release question batches, which is the entire DeepSeek V4-Pro situation right now. We do not yet have enough monthly batches post-V4 to compute the drift. For now, we have one batch and a marketing deck.

The diagnostic also breaks for models that publish their training cutoff loosely — "early 2026" instead of a specific commit date. DeepSeek's V4 card lists the cutoff as Q1 2026, which gives the lab almost a full quarter of LiveBench question batches it could legitimately claim were post-cutoff while still having seen the question structure during late training. The dating ambiguity is doing real work here, and it is not accidental.

The benchmark gap as evidence, not coincidence

When a model's static-benchmark scores cluster near the ceiling and its contamination-resistant scores cluster fifteen points lower, the parsimonious explanation is not "the contamination-resistant benchmarks are wrong". The parsimonious explanation is that the static benchmarks measure something other than capability, and that something correlates strongly with training-data exposure.

This is the part where DeepSeek's defenders push back. They argue that V4 was trained on roughly the same web corpus as ChatGPT, Gemini, and Qwen, so any contamination is shared across the field and the relative rankings still hold. That argument has two problems.

First, training corpora differ on the long tail in ways that matter precisely for benchmark coverage. Two labs both crawling Common Crawl can end up with very different exposure to, say, the Stack Overflow answers that mirror HumanEval problems, depending on deduplication strategy and quality filtering. A lab that deduplicates aggressively on near-duplicate text removes many of the tutorial reformulations of benchmark problems and ends up cleaner. A lab that deduplicates only on exact hash keeps every paraphrased solution and ends up dirtier. The choice is invisible to outside observers and can shift benchmark scores by single-digit points either way.
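
A toy sketch of that difference, assuming Jaccard similarity over word trigrams as the near-duplicate test. Production pipelines typically use MinHash or similar approximations at scale; the brute-force comparison, the 0.5 threshold, and the three documents here are illustrative assumptions only.

```python
import hashlib


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first copy of each byte-identical document."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Word n-gram shingles; short texts collapse to a single shingle."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_dedup(docs: list[str], threshold: float = 0.5, n: int = 3) -> list[str]:
    """Drop any document whose shingle set is too similar to one already kept."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


original = "write a function that returns true if any two numbers in the list are closer than the threshold"
paraphrase = "write a function that returns true if any two values in the list are closer than the threshold"
unrelated = "how to bake sourdough bread at home without a dutch oven"
docs = [original, paraphrase, unrelated, original]

print(len(exact_dedup(docs)), len(near_dedup(docs)))
```

Exact hashing removes only the verbatim repeat and keeps the paraphrase. The shingle comparison drops the paraphrase as well, which is the mechanism by which an aggressive deduplicator ends up with less benchmark-shaped text in the corpus.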

Second, post-training data is the bigger lever, and post-training corpora are almost entirely lab-specific. A team that bought or generated a synthetic dataset of "MMLU-style questions" for instruction tuning has shifted the distribution of their model in a way that competitors have not. The post-training step is where a lab can target a benchmark explicitly without anyone outside the lab being able to detect it, because the post-training data is rarely released. When a model's benchmark score jumps disproportionately between two checkpoints with similar pre-training, the post-training corpus is almost always the explanation.

The defence also assumes that the field's consensus on which benchmarks matter is itself unbiased. It is not. Labs collectively optimise against the benchmarks that get cited in tech press, and tech press cites the benchmarks that produce dramatic year-on-year score gains. The feedback loop selects for benchmarks that are easy to game.

What the panel saw when we re-ran V4

The TopReviewed AI panel scored DeepSeek at 7.6 across our six-persona evaluation, against 7.8 for ChatGPT and 8.1 for Gemini. Those numbers are not benchmark-derived — they aggregate the panel's qualitative judgment on architecture, pricing, ergonomics, ecosystem maturity, and observed behaviour on tasks the personas care about. The 0.2-0.5 point gap between DeepSeek and the closed-source frontier is roughly stable across our re-reviews and does not move when DeepSeek ships a new MMLU number.

What does move the panel score is a model demonstrating new behaviour on tasks the panel hand-constructs, which by definition are not in any training corpus. V4's strongest signal in our re-review was long-context coherence past 200k tokens, not raw coding accuracy. The coding accuracy is where the benchmark claims are loudest, and where the panel saw the smallest improvement over V3.2.

Practical tests you can run today

If you are evaluating V4 for a production workload, ignore the model card. Run three tests instead.

Date-filtered LiveCodeBench: Pull the harness, configure it to use only problems released after V4's training cutoff, and report pass@1. Compare against the vendor's reported number on the same date range. Any gap larger than 5 points is the contamination tax.
Held-out task suite: Construct ten task instances that match your production distribution and do not exist on the public web. Measure V4 against your current model on those. This is the only evaluation that maps to your actual ROI.
Prompt-perturbation stability: Take a published benchmark problem, rewrite the surface form (rename variables, restructure the question), and check whether accuracy drops. A model that drops more than 10 points on cosmetic perturbations is pattern-matching memorised solutions.
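
For the first test, the scoring side can be sketched independently of any particular harness. `ProblemResult` is a hypothetical record type rather than a LiveCodeBench structure, the cutoff date is my reading of "Q1 2026" as the end of March, and the vendor figure and results are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProblemResult:
    problem_id: str
    release_date: date
    passed: bool  # first sampled solution passed every hidden test


def pass_at_1_after(results: list[ProblemResult], cutoff: date) -> float:
    """pass@1 in percent, counting only problems released after the cutoff."""
    fresh = [r for r in results if r.release_date > cutoff]
    if not fresh:
        raise ValueError("no problems released after the cutoff")
    return 100.0 * sum(r.passed for r in fresh) / len(fresh)


def contamination_tax(vendor: float, measured: float, threshold: float = 5.0):
    """Return the gap in points and whether it exceeds the threshold."""
    gap = vendor - measured
    return gap, gap > threshold


cutoff = date(2026, 3, 31)  # assumption: "Q1 2026" read as end of March
results = [
    ProblemResult("p1", date(2026, 2, 10), True),   # pre-cutoff, ignored
    ProblemResult("p2", date(2026, 4, 5), True),
    ProblemResult("p3", date(2026, 4, 12), False),
    ProblemResult("p4", date(2026, 4, 20), True),
    ProblemResult("p5", date(2026, 5, 2), True),
]
measured = pass_at_1_after(results, cutoff)
gap, flagged = contamination_tax(vendor=90.0, measured=measured)
print(f"measured {measured:.1f}, gap {gap:.1f}, flagged={flagged}")
```

Four problems is far too few for a real measurement. The point of the sketch is that the date filter and the comparison are both explicit, so the number you report carries its own window.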

The third test is the one that separates contaminated wins from real capability. A model that has genuinely learned a skill is robust to surface-form changes. A model that has memorised the test set is not. Apple researchers' GSM-Symbolic study in late 2024 found that accuracy on math word problems fell under perturbation by anywhere from a few points to roughly 65 points, depending on the model and on how aggressive the perturbation was. That same methodology applied to V4 has not been published yet, which is itself a signal.
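
A minimal sketch of the surface-form rewrite, limited to identifier renaming. The `is_correct` callable is a hypothetical hook where you would call the model and grade its answer; the `memoriser` stub below only simulates a model that recognises the canonical function name, so the 100-point drop it produces is a property of the stub, not a measurement.

```python
import re
from typing import Callable, Mapping, Sequence


def rename_identifiers(prompt: str, mapping: Mapping[str, str]) -> str:
    """Replace whole-word identifiers, leaving longer names that contain them alone."""
    if not mapping:
        return prompt
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], prompt)


def accuracy(prompts: Sequence[str], is_correct: Callable[[str], bool]) -> float:
    """Percentage of prompts graded correct."""
    if not prompts:
        raise ValueError("need at least one prompt")
    return 100.0 * sum(is_correct(p) for p in prompts) / len(prompts)


def perturbation_drop(
    prompts: Sequence[str],
    mapping: Mapping[str, str],
    is_correct: Callable[[str], bool],
) -> float:
    """Accuracy on the original prompts minus accuracy on the renamed ones."""
    perturbed = [rename_identifiers(p, mapping) for p in prompts]
    return accuracy(prompts, is_correct) - accuracy(perturbed, is_correct)


prompt = "def has_close_elements(numbers, threshold): return any pair in numbers closer than threshold"
mapping = {"has_close_elements": "any_near", "numbers": "values", "threshold": "limit"}


def memoriser(p: str) -> bool:
    """Toy stub: 'solves' a problem only when it sees the canonical name."""
    return "has_close_elements" in p


print(rename_identifiers(prompt, mapping))
print(perturbation_drop([prompt], mapping, memoriser))
```

Renaming is the mildest perturbation available. Restructuring the question or changing the numeric constants is a stronger test, and it needs a grader that knows the new expected answer.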

Cost of running the tests

Running all three on a modest production prompt suite costs roughly 4 to 8 hours of engineering time plus a few hundred dollars of API spend. Compared to the cost of migrating a workload to a new model and discovering the gain was illusory, this is negligible. The reason teams skip it is not budget. It is that the vendor benchmark numbers are easier to put in a procurement memo than a custom test suite, and procurement memos are what actually get signed.

The footnote problem

Every frontier-model release in 2026 has the same structure. Headline scores at the top, contamination caveats in a footnote, third-party reproduction listed as "ongoing". The asymmetry between the prominence of the headline and the prominence of the caveat is the entire game. Buyers read the headline, integrate it into procurement decisions, and ship to production before the footnotes get checked.

The fix is structural, not editorial. Procurement teams should refuse to score models on any benchmark older than the model's training cutoff. The Hugging Face Open LLM Leaderboard partially does this by rotating evaluations, but the rotation cadence is too slow to keep up with frontier release cycles. OpenRouter publishes head-to-head usage data on real workloads, which is closer to the right signal — actual user-elicited prompts that no lab can train against without seeing them.

What to do with V4 this quarter

Run V4-Flash on a throughput workload where the failure mode is latency, not accuracy. The pricing is real and the throughput numbers do not have a contamination story attached. For anything in the reasoning column where DeepSeek claims a 10-point lead over V3.2, hold off until the August LiveBench batch lands and we can compute a clean drift number. Until then, you are buying a marketing deck, not a capability gain.

If you must commit to V4 for the reasoning workload now, run the prompt-perturbation test on your top twenty production prompts before signing the contract. If accuracy holds within five points after rewriting, the capability is probably real. If it drops more than ten, you have just confirmed that the benchmark gap is the whole story.
