by DeepSeek · DeepSeek R1 family · best for exposed-CoT reasoning at a fraction of o-series cost
DeepSeek R1 is the category-defining open-weights reasoning model, the release that broke the "frontier reasoning is expensive" assumption in early 2025 and forced US labs to respond on price. It is reasoning-first: every response includes a full, visible chain-of-thought before the final answer, which is a genuine differentiator for interpretability, distillation, and audit pipelines. The R1-0528 refresh (2025-05-28) added function calling and JSON output and pushed AIME 2025 to 87.5% and GPQA Diamond to 81.0%. It is built on the V3 MoE backbone (671B total parameters, 37B active) with RL post-training, and the weights are released under the MIT license. The single sentence a buyer needs: when reasoning is the whole job and you want the chain-of-thought exposed, R1 delivers frontier-class math and science at a fraction of o-series cost.

- **Provider:** DeepSeek
- **Released:** 2025-01-20 (R1); 2025-05-28 (R1-0528 upgrade)
- **Status:** GA
- **Context window:** 128,000 tokens
- **Max output:** 64,000 tokens
- **Modalities:** Text in / text out (with full exposed chain-of-thought)
- **Knowledge cutoff:** 2025-04
- **Headline price:** $0.55 in / $2.19 out per 1M tokens

| Benchmark | Score | Source |
|---|---|---|
| Humanity's Last Exam | 17.7% | huggingface.co, 2025-05-28 |
| MMLU | 93.4% | huggingface.co, 2025-05-28 |
| MMLU-Pro | 85% | huggingface.co, 2025-05-28 |
| SimpleQA | 27.8% | huggingface.co, 2025-05-28 |
| AIME 2025 | 87.5% | huggingface.co, 2025-05-28 |
| TAU-bench (Retail) | 63.9% | huggingface.co, 2025-05-28 |
| LMArena Elo | 1382 | artificialanalysis.ai, 2025-05-28 |
| GPQA Diamond | 81% | huggingface.co, 2025-05-28 |
| LiveCodeBench | 73.3% | huggingface.co, 2025-05-28 |
| Aider Polyglot | 71.6% | huggingface.co, 2025-05-28 |
| SWE-bench Verified | 57.6% | huggingface.co, 2025-05-28 |
| Artificial Analysis Index | 68 | artificialanalysis.ai, 2025-05-28 |

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“R1 broke the frontier-reasoning price ceiling and gave me an exposed-CoT artifact I can audit — but hybrid models have absorbed most general use.”
R1 was the model that broke the "frontier reasoning is expensive" assumption in early 2025 and forced US labs to respond on price. Strategically, the exposed chain-of-thought is more than a quality story: it lets enterprises build pipelines that consume the reasoning separately from the answer, which is valuable for audit and distillation. By mid-2026 the hybrid-mode models (V3.1/V3.2/V4) have largely absorbed R1's general-agent use case; R1 remains the right pick when reasoning is the whole job and you want the trace exposed. Sovereignty considerations are the same as for the rest of the DeepSeek family, and the NIST CAISI safety flag is worth weighing for sensitive deployments.
“R1 is the single most disruptive reasoning release of the cycle — it reset market expectations for what frontier reasoning should cost.”
R1's strategic significance is hard to overstate: it was the open-weights reasoning model that, at roughly 1/30th of contemporaneous o-series pricing, reset the entire market's expectations and triggered a global re-rating of DeepSeek as a lab. On the back of the 0528 update, Artificial Analysis placed DeepSeek in a tie for the #2 lab and named it the undisputed open-weights leader. Its differentiation (full exposed CoT plus frontier math at commodity price) created a distinct category position no closed o-series model matched. The strategic decay by mid-2026 is that DeepSeek's own hybrid models cannibalize the generalist use case, narrowing R1 to the reasoning-specialist and CoT-artifact niche.
“R1's launch landed at ~1/30th of o-series pricing for comparable AIME/GPQA — the most disruptive pricing event in the reasoning category in years.”
R1's launch pricing ($0.55 in / $2.19 out per 1M tokens) landed at roughly 1/30th of the contemporaneous OpenAI o1 rate for comparable AIME/GPQA scores, the single most disruptive pricing event in the reasoning-model category. Cache-hit input at $0.14/M makes repeated-reasoning workloads (tutoring, agent loops) shockingly cheap. The critical line-item caution is reasoning-token volume: R1 burns output tokens at much higher rates than chat models because the chain-of-thought is billed as output, and 0528 nearly doubled per-question reasoning to ~23K tokens. Budget for roughly 3-5x the output volume of a comparable non-reasoning workload. Even so, the intelligence-per-dollar on hard reasoning is exceptional.
“The exposed reasoning trace is a developer delight — log it, parse it, display it — and 0528 finally made function calling first-class.”
For a builder, the 0528 refresh fixed the practical complaints: function calling and JSON mode are first-class. The exposed reasoning trace is a genuine delight: you can log it, store it, parse it, or stream it to a "thinking" UI. Open weights and the distilled Qwen3-8B make local development and edge deployment realistic, and the OpenAI-compatible endpoint keeps integration simple. The main friction is latency. Reasoning queries take real time, so UX must be designed around streaming the thought process. Tool-call reliability is good but not Claude-grade (TAU-bench Retail 63.9). No parallel tool calls, no batch API.
“On hard problems R1's visible thinking is genuinely impressive and builds trust — but it's overkill, and slow, for everyday questions.”
For end users on free or low-cost chat, R1's reasoning mode is genuinely impressive on hard problems — it solves AIME problems and GPQA-style science questions that free GPT/Claude tiers struggle with, and the visible thinking process is novel and increases trust in the answer. The downsides are latency (10-30 seconds on hard queries) and that R1 is overkill for everyday questions, where its reasoning-first style is slower and stiffer than a generalist. Best deployed as a "deep think" toggle rather than the default. Content policy follows DeepSeek norms. As a free option via the DeepSeek UI's DeepThink toggle, the value on hard problems is excellent.
“Frontier-class reasoning at commodity price is real — but it's narrow, slow, burns output tokens, and NIST flagged it on safety.”
R1's reasoning scores are well-documented on its own model card and independently tracked by Artificial Analysis, so the headline holds — this is not benchmark theater. The honest caveats are about scope and cost shape. R1 is a specialist: coding (SWE-bench 57.6) and general agent tool use are middling, and creative/instruction work is weak. The chain-of-thought that is its differentiator is also a cost multiplier — ~23K reasoning tokens per hard question means output bills can dwarf a chat model's. And the NIST CAISI evaluation flagged DeepSeek models, R1 included, as more susceptible on certain safety/hijacking evals than US frontier peers — a real consideration for security-sensitive use. The value is genuine; the asterisks are scope, latency, token economics, and safety posture.
- **Math, science, and competition-style reasoning** pipelines where accuracy on hard problems is the whole point.
- **Workloads that consume the chain-of-thought** as an artifact: distillation training data, audit trails, AI tutoring.
- **Cost-sensitive reasoning agents** where o-series pricing is prohibitive.
- **Edge/self-hosted reasoning** via the distilled Qwen3-8B variant.

R1 always reasons and exposes the full chain-of-thought as a separate artifact, which is uniquely valuable for audit, distillation, and tutoring. For general agents where you only sometimes need thinking, a hybrid model is usually a better fit.
You can log, parse, store, or display the reasoning independently of the answer — useful for debugging, building distillation datasets, AI-tutoring step-by-step views, and audit trails.
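
As a minimal sketch of handling the trace separately, the Python below splits a raw completion into its reasoning and its final answer. It assumes the reasoning arrives inline, wrapped in `<think>...</think>` tags, which is how self-hosted open-weights deployments commonly emit it. Some hosted endpoints return the trace in a separate response field instead, so check your serving stack before relying on this format. The names `SplitResponse` and `split_reasoning` are hypothetical helpers, not part of any DeepSeek SDK.

```python
import re
from typing import NamedTuple


class SplitResponse(NamedTuple):
    reasoning: str  # chain-of-thought text, empty if no trace was found
    answer: str     # final answer text with the trace removed


# Assumption: the raw completion wraps its reasoning in <think>...</think>.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> SplitResponse:
    """Separate the reasoning trace from the final answer in raw model text."""
    match = _THINK_BLOCK.search(raw)
    if match is None:
        # No trace present: treat the whole text as the answer.
        return SplitResponse(reasoning="", answer=raw.strip())
    reasoning = match.group(1).strip()
    # Keep everything outside the first <think> block as the answer.
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return SplitResponse(reasoning=reasoning, answer=answer)


if __name__ == "__main__":
    sample = "<think>2 + 2 is 4, so double is 8.</think>\nThe answer is 8."
    result = split_reasoning(sample)
    print("reasoning:", result.reasoning)
    print("answer:", result.answer)
```

With the two parts separated, the reasoning can go to an audit log or a distillation dataset while only the answer is shown to the end user.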
Reasoning tokens are billed as output, and R1-0528 averages ~23K reasoning tokens per hard question. Budget for roughly 3-5x the output volume of a comparable non-reasoning workload.
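
To make the budgeting concrete, here is a rough per-question cost sketch in Python. It uses the headline prices listed on this page ($0.55 in / $2.19 out per 1M tokens) and the ~23K reasoning-token figure. The prompt and answer sizes (1,000 tokens each) are illustrative assumptions, and `Pricing` and `cost_per_question` are hypothetical names. Cache-hit discounts are ignored.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pricing:
    input_per_m: float = 0.55   # USD per 1M input tokens (headline price)
    output_per_m: float = 2.19  # USD per 1M output tokens (headline price)


def cost_per_question(
    input_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    pricing: Pricing = Pricing(),
) -> float:
    """Estimate USD cost of one call; reasoning tokens are billed as output."""
    output_tokens = reasoning_tokens + answer_tokens
    total = (
        input_tokens * pricing.input_per_m
        + output_tokens * pricing.output_per_m
    )
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Assumed sizes: 1,000-token prompt, 1,000-token answer.
    with_cot = cost_per_question(1_000, 23_000, 1_000)
    without_cot = cost_per_question(1_000, 0, 1_000)
    print(f"hard question with reasoning: ${with_cot:.4f}")
    print(f"same call without reasoning:  ${without_cot:.4f}")
```

Under these assumptions the output side dominates the bill, which is why reasoning-token volume, not the input price, is the number to watch when forecasting spend.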
The full model needs an 8x H200-class node, but the official distilled DeepSeek-R1-0528-Qwen3-8B runs reasoning on a single ~16GB GPU.
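
A back-of-the-envelope check of those hardware figures, as a sketch: weight memory is roughly parameter count times bytes per parameter. The precisions below (1 byte per parameter for the full model, 2 bytes for the distilled 8B) and the 141 GB per-GPU capacity are assumptions for illustration. The estimate covers weights only and leaves out KV cache, activations, and runtime overhead, so real deployments need headroom beyond these numbers.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def headroom_gb(weights_gb: float, gpu_gb: float, n_gpus: int = 1) -> float:
    """Memory left for KV cache and activations after loading the weights."""
    return gpu_gb * n_gpus - weights_gb


if __name__ == "__main__":
    # Assumption: full 671B model stored at 1 byte per parameter (8-bit).
    full = weight_memory_gb(671, 1)
    # Assumption: distilled 8B model stored at 2 bytes per parameter (16-bit).
    distilled = weight_memory_gb(8, 2)

    print(f"full model weights:      ~{full:.0f} GB")
    print(f"distilled model weights: ~{distilled:.0f} GB")
    # Assumption: 141 GB of memory per H200-class GPU, 8 GPUs per node.
    print(f"headroom on 8 GPUs:      ~{headroom_gb(full, 141, 8):.0f} GB")
```

By this estimate the distilled model's 16-bit weights alone come to about 16 GB, so a card of that size leaves little room for long contexts unless the weights are quantized further.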
Decent but not specialized — SWE-bench 57.6. Use V3.1+/V4 for pure coding.
The NIST CAISI evaluation flagged DeepSeek models, including R1, as more susceptible on certain safety/hijacking evals than US frontier models. Weigh this for security-sensitive deployments and prefer in-boundary self-host.
Last verified 2026-05-27