DeepSeek R1 (0528)

GALatest Reasoning

by DeepSeek · DeepSeek R1 family · best for exposed-CoT reasoning at a fraction of o-series cost

ReasoningOpen-WeightsCost-Optimized
8.0
AI Panel Score
Value 9.2/10

DeepSeek R1 is the category-defining open-weights reasoning model — the release that broke the "frontier reasoning is expensive" assumption in early 2025 and forced US labs to respond on price. It is reasoning-first: every response includes a full, visible chain-of-thought before the final answer, which is a genuine differentiator for interpretability, distillation, and audit pipelines. The R1-0528 refresh (2025-05-28) added function calling and JSON output and pushed AIME 2025 to 87.5% and GPQA Diamond to 81.0. Built on the V3 671B/37B MoE backbone with RL post-training, open-weights under MIT. The single sentence a buyer needs: when reasoning is the whole job and you want the chain-of-thought exposed, R1 delivers frontier-class math and science at a fraction of o-series cost. - **Provider:** DeepSeek - **Released:** 2025-01-20 (R1); 2025-05-28 (R1-0528 upgrade) - **Status:** GA - **Context window:** 128,000 tokens - **Max output:** 64,000 tokens - **Modalities:** Text in / text out (with full exposed chain-of-thought) - **Knowledge cutoff:** 2025-04 - **Headline price:** $0.55 in / $2.19 out per 1M tokens

What's new

  • R1-0528 (May 2025) added function calling and JSON output — both missing from the January launch — making R1 deployable in agent loops rather than chat-only.
  • AIME 2025 jumped from 70% (initial R1) to 87.5% (Pass@1); GPQA Diamond reached 81.0; LiveCodeBench rose to 73.3 and Codeforces-Div1 from 1530 to 1930.
  • Reduced hallucination rate and improved chain-of-thought legibility; the model no longer needs a forced "<think>" prefix.
  • Reasoning depth increased — R1-0528 averages ~23K reasoning tokens per AIME question (up from ~12K), trading latency and output cost for accuracy.
  • One official distilled model released with the 0528 update — DeepSeek-R1-0528-Qwen3-8B — for single-GPU reasoning (the original January R1 shipped a fuller 1.5B-70B distill family).

Benchmarks

BenchmarkScoreSource
Humanity's Last Exam17.7%huggingface.co 2025-05-28T00:00:00.000Z
MMLU93.4%huggingface.co 2025-05-28T00:00:00.000Z
MMLU-Pro85%huggingface.co 2025-05-28T00:00:00.000Z
SimpleQA27.8%huggingface.co 2025-05-28T00:00:00.000Z
AIME 202587.5%huggingface.co 2025-05-28T00:00:00.000Z
TAU-bench63.9%huggingface.co 2025-05-28T00:00:00.000Z
LMArena Elo1382artificialanalysis.ai 2025-05-28T00:00:00.000Z
GPQA Diamond81%huggingface.co 2025-05-28T00:00:00.000Z
LiveCodeBench73.3%huggingface.co 2025-05-28T00:00:00.000Z
Aider Polyglot71.6%huggingface.co 2025-05-28T00:00:00.000Z
SWE-bench Verified57.6%huggingface.co 2025-05-28T00:00:00.000Z
Artificial Analysis Index68artificialanalysis.ai 2025-05-28T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker8/10
R1 broke the frontier-reasoning price ceiling and gave me an exposed-CoT artifact I can audit — but hybrid models have absorbed most general use.

R1 was the model that broke the "frontier reasoning is expensive" assumption in early 2025 and forced US labs to respond on price. Strategically, the exposed chain-of-thought is more than a quality story — it lets enterprises build pipelines that consume the reasoning separately from the answer, valuable for audit and distillation. By mid-2026 the hybrid-mode models (V3.1/V3.2/V4) have largely absorbed R1's general-agent use case; R1 remains the right pick when reasoning is the whole job and you want the trace exposed. Sovereignty considerations are identical to the family, and the NIST CAISI safety flag is worth weighing for sensitive deployments.

Strategic Fit 8Vendor Risk 6.5Roadmap Confidence 8
Pros
  • Category-defining value
  • exposed CoT for audit/distillation
  • open weights
Cons
  • Narrowed by hybrid models
  • CAISI safety flag
  • PRC residency
Right for: Reasoning-as-the-product, audit-driven pipelines
Avoid if: You need a generalist or a certified-safe vendor
Domain Strategist8.5/10
R1 is the single most disruptive reasoning release of the cycle — it reset market expectations for what frontier reasoning should cost.

R1's strategic significance is hard to overstate: it was the open-weights reasoning model that, at roughly 1/30th of contemporaneous o-series pricing, reset the entire market's expectations and triggered a global re-rating of DeepSeek as a lab. Artificial Analysis tied DeepSeek as the #2 lab and undisputed open-weights leader on the back of the 0528 update. Its differentiation — full exposed CoT plus frontier math at commodity price — created a distinct category position no closed o-series model matched. The strategic decay by mid-2026 is that DeepSeek's own hybrid models cannibalize the generalist use case, narrowing R1 to the reasoning-specialist and CoT-artifact niche.

Competitive Positioning 8.5Differentiation 9Market Timing 9
Pros
  • Category-defining
  • exposed-CoT moat
  • reset pricing norms
Cons
  • Cannibalized by hybrids
  • niche by mid-2026
Right for: Reasoning-specialist positioning
Avoid if: You need a current generalist leader
Finance Lead9/10
R1's launch landed at ~1/30th of o-series pricing for comparable AIME/GPQA — the most disruptive pricing event in the reasoning category in years.

R1's launch pricing ($0.55 in / $2.19 out) landed at roughly 1/30th of the contemporaneous OpenAI o1 rate for comparable AIME/GPQA scores — the single most disruptive pricing event in the reasoning-model category. Cache-hit input at $0.14/M makes repeated-reasoning workloads (tutoring, agent loops) shockingly cheap. The critical line-item caution is reasoning-token volume: R1 burns output tokens at much higher rates than chat models because the chain-of-thought counts, and 0528 nearly doubled per-question reasoning to ~23K tokens — budget on roughly 3-5x the output volume of a comparable non-reasoning workload. Even so, the intelligence-per-dollar on hard reasoning is exceptional.

Cost Efficiency 9.2Pricing Transparency 9Value per Dollar 9.2
Pros
  • ~1/30th of o-series at launch
  • cheap cache hits
  • open-weights price cap
Cons
  • High reasoning-token output cost
  • budget carefully on volume
Right for: Cost-sensitive hard-reasoning workloads
Avoid if: Output-token volume is your dominant cost and answers are simple
Domain Practitioner8.5/10
The exposed reasoning trace is a developer delight — log it, parse it, display it — and 0528 finally made function calling first-class.

For a builder, the 0528 refresh fixed the practical complaints: function calling and JSON mode are first-class. The exposed reasoning trace is a genuine delight — you can log it, store it, parse it, or stream it to a "thinking" UI. Open weights and the distilled Qwen3-8B make local development and edge deployment realistic, and the OpenAI-compatible endpoint keeps integration simple. The main friction is latency — reasoning queries take real time, so UX must be designed around streaming the thought process. Tool-call reliability is good but not Claude-grade (Tau-Bench Retail 63.9). No parallel tool calls, no batch API.

API Ergonomics 8.5Tool/Agent Support 7.5Reliability 8.5
Pros
  • Exposed CoT
  • first-class function calling (0528)
  • open weights + distill
Cons
  • Latency-driven UX
  • no parallel tools
  • no batch API
Right for: Builders who want to consume the reasoning artifact
Avoid if: You need fast, parallelized tool agents
Power User8/10
On hard problems R1's visible thinking is genuinely impressive and builds trust — but it's overkill, and slow, for everyday questions.

For end users on free or low-cost chat, R1's reasoning mode is genuinely impressive on hard problems — it solves AIME problems and GPQA-style science questions that free GPT/Claude tiers struggle with, and the visible thinking process is novel and increases trust in the answer. The downsides are latency (10-30 seconds on hard queries) and that R1 is overkill for everyday questions, where its reasoning-first style is slower and stiffer than a generalist. Best deployed as a "deep think" toggle rather than the default. Content policy follows DeepSeek norms. As a free option via the DeepSeek UI's DeepThink toggle, the value on hard problems is excellent.

Output Quality 8Speed 6.5Everyday Usefulness 7.5
Pros
  • Impressive on hard problems
  • visible reasoning builds trust
  • free in UI
Cons
  • Slow
  • overkill for everyday queries
  • stiff prose
Right for: A deep-think toggle on hard questions
Avoid if: You want a fast everyday default
Skeptic7.5/10
Frontier-class reasoning at commodity price is real — but it's narrow, slow, burns output tokens, and NIST flagged it on safety.

R1's reasoning scores are well-documented on its own model card and independently tracked by Artificial Analysis, so the headline holds — this is not benchmark theater. The honest caveats are about scope and cost shape. R1 is a specialist: coding (SWE-bench 57.6) and general agent tool use are middling, and creative/instruction work is weak. The chain-of-thought that is its differentiator is also a cost multiplier — ~23K reasoning tokens per hard question means output bills can dwarf a chat model's. And the NIST CAISI evaluation flagged DeepSeek models, R1 included, as more susceptible on certain safety/hijacking evals than US frontier peers — a real consideration for security-sensitive use. The value is genuine; the asterisks are scope, latency, token economics, and safety posture.

Claim Accuracy 8Weakness Severity 6.5Hype vs Reality 8
Pros
  • Independently verified reasoning
  • open weights
  • transparent CoT
Cons
  • Narrow specialist
  • high output-token cost
  • CAISI safety flag
  • slow
Right for: Buyers who verify scope and budget token volume
Avoid if: You need a fast, broad, security-certified model

Strengths

  • Open, fully exposed chain-of-thought — uniquely useful for debugging, distillation, and audit/analytics pipelines.
  • Frontier-class math and scientific reasoning (AIME 87.5, GPQA 81.0) at a fraction of o-series pricing.
  • MIT open weights plus an official single-GPU distilled variant (Qwen3-8B) for edge reasoning.
  • Function calling and JSON mode added in the 0528 refresh.
  • Reduced hallucination and cleaner CoT legibility vs the January launch.

Limitations

  • Reasoning-only emphasis: general chat, creative writing, and instruction-following are not its sharpest modes.
  • Slow — thinking responses commonly take 10-30 seconds, and output bills scale with reasoning-token volume (~23K/question on AIME).
  • Coding (SWE-bench 57.6) trails dedicated coding models by 20+ points.
  • Largely superseded for general agents by the hybrid-mode V3.1/V3.2/V4 (which include thinking without a separate model).
  • Same China data-residency and trains-on-input exposure; NIST CAISI flagged elevated safety susceptibility.

Best use cases

- **Math, science, and competition-style reasoning** pipelines where accuracy on hard problems is the whole point. - **Workloads that consume the chain-of-thought** as an artifact — distillation training data, audit trails, AI tutoring. - **Cost-sensitive reasoning agents** where o-series pricing is prohibitive. - **Edge/self-hosted reasoning** via the distilled Qwen3-8B variant.

Buyer questions

Why pick R1 over a hybrid model like V3.2?

R1 always reasons and exposes the full chain-of-thought as a separate artifact, which is uniquely valuable for audit, distillation, and tutoring. For general agents where you only sometimes need thinking, a hybrid model is usually a better fit.

How does the exposed chain-of-thought help?

You can log, parse, store, or display the reasoning independently of the answer — useful for debugging, building distillation datasets, AI-tutoring step-by-step views, and audit trails.

Why is the output bill higher than I expect?

Reasoning tokens count as output, and R1-0528 averages ~23K reasoning tokens per hard question. Budget on roughly 3-5x the output volume of a comparable non-reasoning workload.

Can I run it cheaply?

The full model needs an 8x H200-class node, but the official distilled DeepSeek-R1-0528-Qwen3-8B runs reasoning on a single ~16GB GPU.

Is it good at coding?

Decent but not specialized — SWE-bench 57.6. Use V3.1+/V4 for pure coding.

Any security concerns?

The NIST CAISI evaluation flagged DeepSeek models, including R1, as more susceptible on certain safety/hijacking evals than US frontier models. Weigh this for security-sensitive deployments and prefer in-boundary self-host.

Comparable models

**OpenAI o3 / o4-mini** — stronger on the hardest reasoning, no exposed chain-of-thought, no open weights; materially more expensive (o3) or closer in price but closed (o4-mini).
**Qwen 3 235B (thinking) / QwQ-32B** — direct China-origin reasoning peers with open weights and comparable benchmarks; Qwen is more generalist, R1 exposes its CoT more cleanly.
**Claude Opus 4.5 (extended thinking)** — best generalist reasoning with thinking, summary-only CoT; ~10x more expensive and closed.

Model specs

Input price
$0.55 / Mtok
Output price
$2.19 / Mtok
Cached input
$0.14 / Mtok
Batch (in/out)
Context window
128K tokens
Max output
64K tokens
Knowledge cutoff
2025-04
Released
2025-05-27
Modalities
text → text
Output speed
Not profiled
License
Open weights (MIT)
Clouds
First-party API

Last verified 2026-05-27