Why pick R1 over a hybrid model like V3.2?

R1 always reasons and exposes the full chain-of-thought as a separate artifact, which is uniquely valuable for audit, distillation, and tutoring. For general agents where you only sometimes need thinking, a hybrid model is usually a better fit.

How does the exposed chain-of-thought help?

You can log, parse, store, or display the reasoning independently of the answer — useful for debugging, building distillation datasets, AI-tutoring step-by-step views, and audit trails.

Why is the output bill higher than I expect?

Reasoning tokens count as output, and R1-0528 averages ~23K reasoning tokens per hard question. Budget on roughly 3-5x the output volume of a comparable non-reasoning workload.

Can I run it cheaply?

The full model needs an 8x H200-class node, but the official distilled DeepSeek-R1-0528-Qwen3-8B runs reasoning on a single ~16GB GPU.

Is it good at coding?

Decent but not specialized — SWE-bench 57.6. Use V3.1+/V4 for pure coding.

Any security concerns?

The NIST CAISI evaluation flagged DeepSeek models, including R1, as more susceptible on certain safety/hijacking evals than US frontier models. Weigh this for security-sensitive deployments and prefer in-boundary self-host.

DeepSeek R1 (0528) Review — Benchmarks, Pricing & AI Panel Verdict

Benchmark	Score	Source
Humanity's Last Exam	17.7%	huggingface.co 2025-05-28T00:00:00.000Z
MMLU	93.4%	huggingface.co 2025-05-28T00:00:00.000Z
MMLU-Pro	85%	huggingface.co 2025-05-28T00:00:00.000Z
SimpleQA	27.8%	huggingface.co 2025-05-28T00:00:00.000Z
AIME 2025	87.5%	huggingface.co 2025-05-28T00:00:00.000Z
TAU-bench	63.9%	huggingface.co 2025-05-28T00:00:00.000Z
LMArena Elo	1382	artificialanalysis.ai 2025-05-28T00:00:00.000Z
GPQA Diamond	81%	huggingface.co 2025-05-28T00:00:00.000Z
LiveCodeBench	73.3%	huggingface.co 2025-05-28T00:00:00.000Z
Aider Polyglot	71.6%	huggingface.co 2025-05-28T00:00:00.000Z
SWE-bench Verified	57.6%	huggingface.co 2025-05-28T00:00:00.000Z
Artificial Analysis Index	68	artificialanalysis.ai 2025-05-28T00:00:00.000Z

Architecture

R1-0528 is built on the DeepSeek-V3 base architecture: a 671B-parameter DeepSeekMoE model (685B total counting the Multi-Token-Prediction module), ~37B activated per token, 256 routed experts, 61 layers, using Multi-head Latent Attention (MLA). Artificial Analysis confirms R1-0528 is a post-training update with no change to the V3/R1 architecture — the gains come from a reinforcement-learning post-training pipeline focused on deliberate reasoning. The defining behavior is always-on reasoning with a fully exposed chain-of-thought (reasoning_content), which downstream pipelines can consume as a separate artifact. Open weights are on Hugging Face under MIT; vocab size is 129,280. Text-only.

Capabilities

R1's strongholds justify its reasoning (9.5) and math (9.5) scores: AIME 2025 87.5%, GPQA Diamond 81.0, MMLU-Redux 93.4, and frontier-class competition-math performance. The exposed chain-of-thought is itself a capability — usable for distillation training data, audit trails, and AI tutoring. Coding (7.5) is strong but not specialized — SWE-bench Verified 57.6, LiveCodeBench 73.3, Aider-Polyglot 71.6 — so for pure coding the V3.1+/V4 models fit better. Agentic (7.0) and function-calling (7.0) work post-0528 but Tau-Bench (Retail 63.9 / Airline 53.5) and BFCL multi-turn (37.0) show tool use is competent rather than class-leading. Multilingual (8.0) is strong in English and Chinese. Long-context (7.0) is fine within 128K. Vision, OCR, and real-time data are zero. Creative writing (6.5) and instruction-following (7.5) are not its sharpest modes — R1 is an analyst, not a stylist. Safety calibration (6.5) reflects family norms; NIST CAISI flagged higher susceptibility on some safety evals.

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
AIME 2025 (Pass@1)	87.5%	+17.5 vs R1-initial (70)	within ~4 pts of o-series	HF card
GPQA Diamond (Pass@1)	81.0	up significantly	competitive with o3/o4-mini	HF card
MMLU-Redux (EM)	93.4	n/a	frontier-class	HF card
MMLU-Pro (EM)	85.0	n/a	within ~2 pts of frontier	HF card
LiveCodeBench (Pass@1)	73.3	up from 63.5	strong, not specialized	HF card
Aider-Polyglot	71.6	n/a	competitive	HF card
SWE-bench Verified	57.6	n/a	trails frontier coders (~80)	HF card
HLE (Pass@1)	17.7	n/a	mid-pack reasoning	HF card
Tau-Bench (Retail)	63.9	n/a	competent tool use	HF card
LMArena Elo	1382	top-tier at launch	top-10 globally	AA
Artificial Analysis Index	68	up from 60 (R1-initial)	tied #2 lab at release	AA

MATH-500 is not in DeepSeek's 0528 table and is left null. The AA Intelligence Index value (68) is on the index version current at R1-0528's release; AA periodically reindexes.

Speed & latency

R1 is the slow tier by design. Reasoning-first generation means hard queries take real time — 10-30 seconds is common — and R1-0528 averages ~23K reasoning tokens per AIME question, so both latency and output billing scale with problem difficulty. UX should stream the thought process or show a "thinking" indicator. Latency tier: slow.

Pricing analysis

Surface	Cost	Notes
API input (cache miss)	$0.55 / 1M tok
API input (cache hit)	$0.14 / 1M tok	~75% discount
API output	$2.19 / 1M tok	includes chain-of-thought tokens
Direct UI	Free	chat.deepseek.com (R1 / DeepThink toggle)
Open weights	$0	HF download; 8x H200-class node. Distilled Qwen3-8B runs on a single 16GB GPU.
Rate limits	Standard (GA) tier

Deployment & access

First-party OpenAI-compatible API at api.deepseek.com (PRC-hosted), with the deepseek-reasoner alias historically pointing at R1. Open weights on Hugging Face under MIT: self-hostable at ~685B/37B MoE (8x H200-class node at FP8, ~400GB+ VRAM), with INT4/GGUF community quants. The official distilled DeepSeek-R1-0528-Qwen3-8B runs reasoning on a single ~16GB GPU for edge/cost-sensitive deployments. Broadly served by neutral inference providers — OpenRouter, DeepInfra, Novita, Fireworks, Together, Hyperbolic, SambaNova. No first-party managed-cloud offering.

Safety & privacy

Same posture as the family: PRC data storage under Chinese law, trains-on-input by default (de-identified), no documented API opt-out, no SOC2/HIPAA/GDPR/ISO27001 on the first-party service. Content moderation follows PRC norms. Notably, the NIST Center for AI Standards and Innovation (CAISI) evaluation flagged DeepSeek models, including R1, as more susceptible than US frontier models on certain safety/agent-hijacking evals — a relevant data point for security-sensitive buyers. The MIT open weights remain the path to an in-boundary, compliance-controlled deployment.

Ecosystem & tooling

OpenAI-compatible API with Python/TypeScript SDKs, LangChain / LlamaIndex / Vercel AI SDK integrations, and very broad serving across OpenRouter, DeepInfra, Novita, Fireworks, Together, Hyperbolic, and SambaNova. Used by Perplexity and coding tools (Kilo Code). As the model that reset reasoning-price expectations, R1 has mainstream adoption and one of the most-downloaded open-weights footprints on Hugging Face.

DeepSeek R1 (0528)

What's new

Benchmarks

AI Panel Review

Strengths

Limitations

Best use cases

Deep dive

Architecture

Capabilities

Benchmark analysis

Speed & latency

Pricing analysis

Deployment & access

Safety & privacy

Ecosystem & tooling

Buyer questions

Comparable models

Sources

Model specs

Other DeepSeek R1 versions