Llama 3.1 70B

by Meta · Llama 3 family · best for 70B base checkpoint for continued pretraining

Open-WeightsCost-Optimized

6.7

AI Panel Score

Value 7.5/10

Llama 3.1 70B is the July 2024 dense 70B that was the default open-weights workhorse from mid-2024 until Llama 3.3 70B replaced it that December. It ships both base and Instruct checkpoints, has 128K context, and remains one of the most exhaustively documented open models. The one-sentence buyer takeaway in 2026: it is competent and superseded — Llama 3.3 70B is a drop-in upgrade for Instruct workloads, so the only reason to start fresh on 3.1 70B is needing the base checkpoint that 3.3 never shipped.

Compare this model All Llama 3 versions

What's new

Jumped from 8K (Llama 3) to 128K context window.
Added function-calling and improved tool use.
Multilingual across eight languages.
Released both base and Instruct checkpoints, supporting the full fine-tuning ecosystem — the asset 3.3 70B later lacked.

Benchmarks

Benchmark	Score	Source
BBH	73%	Meta Llama 3.1 eval details2024-07-23T00:00:00.000Z
MMLU	86%	Meta Llama 3.1 eval details2024-07-23T00:00:00.000Z
IFEval	87.5%	Meta Llama 3.1 eval details2024-07-23T00:00:00.000Z
MATH-500	68%	Meta Llama 3.1 eval details2024-07-23T00:00:00.000Z
MMLU-Pro	66.4%	Meta Llama 3.1 eval details2024-07-23T00:00:00.000Z
HumanEval	80.5%	Meta Llama 3.1 eval details2024-07-23T00:00:00.000Z
LMArena Elo	1247	LMArena2024
GPQA Diamond	46.7%	Meta Llama 3.1 eval details2024-07-23T00:00:00.000Z
Artificial Analysis Index	13	Artificial Analysis2026-05

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker6.5/10

“The strategic question is just 'why not 3.3?' Usually there's no answer — unless you've got a base-model fine-tune that 3.3 can't replicate.”

For a buyer in May 2026, 3.1 70B is functionally superseded. Llama 3.3 70B is a drop-in upgrade with better instruction-following at the same compute footprint, so for most Instruct workloads there is no reason to choose 3.1. The one defensible reason to stay is having a base-model fine-tune in production — 3.3 cannot replicate it because Meta only released the Instruct checkpoint for 3.3. Operational maturity is excellent and the model is rock-solid, but the strategic verdict is "migrate unless the base checkpoint is load-bearing."

Strategic Fit 6Vendor Risk 8Roadmap Confidence 6

Pros

proven
base checkpoint available
broad support

Cons

superseded by 3.3 at same footprint

Right for: teams with base-model fine-tunes

Avoid if: greenfield — pick 3.3 70B or Scout

Domain Strategist6/10

“Its remaining moat is one tile — the 70B base checkpoint. As an application model, 3.3 took its square.”

Strategically, 3.1 70B has been displaced by its own successor. As an application model it has no defensible position versus 3.3 70B (same footprint, better everything) or Scout (MoE, vision, bigger context). Its sole differentiation is the base checkpoint for continued pre-training. Market timing has passed; the open conversation has moved to MoE and long context. It retains a large installed base on momentum, but its forward strategic relevance is narrow and shrinking.

Competitive Positioning 6Differentiation 6Market Timing 5

Pros

base checkpoint
large installed base

Cons

no application moat
superseded

Right for: continued-pretraining use

Avoid if: you want a competitive deployed model

Finance Lead7/10

“Good economics, not best-in-class — Scout matches the footprint with vision and bigger context, often cheaper. Stay only if your fleet's already sized for 70B.”

TCO is good but no longer leading. Groq runs it at $0.59/$0.79; Fireworks and Bedrock around $0.90; DeepInfra at $0.40. Self-hosting on 4xH100 with FP8 is straightforward and competitive at high volume. The benchmark, though, is now Scout: same single-node footprint, vision included, 10M context, often cheaper per token. Migrating off a closed API today, you would target Scout or Maverick. With an existing 3.1 70B pipeline at scale, the upgrade math is not urgent but the direction is clear.

Cost Efficiency 7Pricing Transparency 8Value per Dollar 7

Pros

cheap
predictable
mature

Cons

dense cost vs MoE
Scout undercuts it

Right for: existing 70B fleets

Avoid if: sizing fresh — MoE wins

Domain Practitioner7/10

“Twenty-two months of tooling is the whole pitch. Every quant stack, every LoRA tutorial uses it as the reference — if my 3.1 system works, I'm not rushing.”

Builders love 3.1 70B for one reason: accumulated tooling. Every quantization stack, serving framework, and LoRA tutorial uses it as a reference, the function-calling format is thoroughly understood, and behavior is consistent across hosts. For a developer with a working 3.1-based system there is little urgency to migrate. For greenfield work you would pick 3.3 70B for instruction-following or Scout for vision/context. It is a solid, forgiving model with a narrow remaining moat and the unique advantage of a base checkpoint for serious fine-tunes.

API Ergonomics 7Tool/Agent Support 7Reliability 8

Pros

deepest tooling
base checkpoint
stable cross-host

Cons

superseded on capability
text-only

Right for: existing 3.1 builders, base-model fine-tuners

Avoid if: starting fresh

Power User6.5/10

“Same general feel as 3.3 but slightly looser on instructions — follows 'bullets, no exclamations' ~85% of the time vs 3.3's ~92%.”

End users get essentially the 3.3 70B experience minus a bit of instruction-following precision — a constraint like "use bullets, no exclamation marks" is honored ~85% of the time versus 3.3's ~92%. Refusal rates are sensible, latency is good (sub-second on Groq), and the conversation feel is competent. The December 2023 cutoff is visible to attentive users. For embedded SaaS assistants supporting a workflow, the gap is invisible; for flagship chatbots, the staleness and the slightly looser instruction adherence show.

Output Quality 6Speed 8Everyday Usefulness 7

Pros

fluent
fast
sensible refusals

Cons

slightly looser instructions than 3.3
stale cutoff

Right for: embedded assistants

Avoid if: personality or precision is the product

Skeptic6.5/10

“Honestly benchmarked and rock-solid — but Meta's own 3.3 beats it on every number at the same size. The only reason it still exists is the base checkpoint.”

Adversarially, 3.1 70B has no marketing to debunk — its claims were accurate and it was a genuinely strong 2024 model. The problem is that it is dominated by its own successor: 3.3 70B beats it on MMLU-Pro, GPQA, MATH, HumanEval, and IFEval at the identical footprint and cost. Add the December 2023 cutoff, no vision, no reasoning mode, and dense economics that lose to MoE, and there is little reason to start fresh on it. Its one honest, living justification is the base checkpoint that 3.3 lacks. A competent model that Meta itself rendered redundant.

Claim Accuracy 9Weakness Severity 5Hype vs Reality 8

Pros

honest
proven
base checkpoint

Cons

dominated by 3.3 at same cost
stale

Right for: skeptics who value only the base model

Avoid if: you want a current, competitive model

Strengths

Best-documented Llama 70B with 22 months of community tooling, recipes, and proven fine-tuning workflows.
Both base and Instruct checkpoints — full fine-tuning flexibility 3.3 70B cannot match.
Mature provider ecosystem with predictable cross-host behavior.
Permissive commercial license; single-node deployment on 4–8xH100.
Same dense-70B speed profile as 3.3, with Cerebras/Groq acceleration available.

Limitations

Beaten on every published benchmark by Llama 3.3 70B at the same footprint.
December 2023 cutoff; no vision, no reasoning mode.
4K output cap on most managed providers.
Dense architecture loses on $/Mtok to MoE alternatives (Scout, Maverick, Qwen MoE).
Functionally superseded — a migration target, not a new default.

Best use cases

Existing fine-tuned derivatives in production where quality is acceptable — there is no urgency to migrate. Fine-tuning when the workload needs a base checkpoint (continued pre-training, major instruction reformatting) that 3.3 70B's Instruct-only release does not support. Research baselines for 70B-scale open work. Cost-controlled text production where 3.3's instruction-following edge is not material.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture Capabilities Benchmark analysis Speed & latency Pricing analysis Deployment & access Safety & privacy Ecosystem & tooling

Architecture

A dense transformer: 70B parameters, 80 layers, Grouped-Query Attention, Llama 3 TikToken tokenizer (128,256 vocab). Trained on 15T+ tokens (~7M GPU-hours, roughly 7e24 FLOPs) with a December 2023 cutoff. No MoE, no vision. Architecturally it is the backbone that Llama 3.3 70B later refined via post-training — meaning the two share serving stacks, quantization paths, and most tooling, and 3.3's gains came from instruction tuning rather than a new architecture. The base checkpoint availability is the durable differentiator versus 3.3.

Capabilities

A solid text-only generalist. Instruction-following (cap_instruction_following 7.5): IFEval 87.5, good but behind 3.3 70B's 92.1. Reasoning (5.5): MMLU 86.0, MMLU-Pro 66.4, GPQA Diamond 46.7, BBH 73.0 — competent for 2024, behind frontier. Math (5.5): MATH 68.0, beaten by 3.3 70B (77.0). Coding (5.5): HumanEval 80.5, also beaten by 3.3 70B (88.4). Multilingual (6.0) across eight languages. No vision (0.0), no OCR (0.0), no reasoning mode, no real-time data (0.0). Function-calling (6.0) is well-understood. The pattern is consistent: on the same footprint, 3.3 70B is better on every published benchmark, so 3.1 70B's remaining value is its base checkpoint and its 22 months of accumulated tooling.

Benchmark analysis

Benchmark	Score	vs Successor (3.3 70B)	vs Sibling (3.1 405B)	Source
MMLU	86.0	= (3.3 70B 86.0)	-2.6 (405B 88.6)	eval details
MMLU-Pro	66.4	-2.5 (3.3 70B 68.9)	-6.9 (405B 73.3)	eval details
GPQA Diamond	46.7	-3.8 (3.3 70B 50.5)	-4.4 (405B 51.1)	eval details
MATH	68.0	-9.0 (3.3 70B 77.0)	-5.8 (405B 73.8)	eval details
HumanEval	80.5	-7.9 (3.3 70B 88.4)	-8.5 (405B 89.0)	eval details
IFEval	87.5	-4.6 (3.3 70B 92.1)	-1.1 (405B 88.6)	eval details
BBH	73.0	n/a	-8.3 (405B 81.3)	eval details
LMArena Elo	1247	below 3.3 70B (1257)	below 405B (1267)	LMArena
Artificial Analysis Index	13	-1 (3.3 70B 14)	-4 (405B 17)	AA

Speed & latency

Median output speed ~81.8 tokens/sec (same dense 70B class as 3.3 70B); Groq ~316 tps, SambaNova ~296, Cerebras 2,000+ tps for the fastest interactive feel. Time-to-first-token sub-half-second on the quickest hosts. Latency tier fast. Dense 70B economics: inference cost scales linearly with size, no MoE savings.

Pricing analysis

Surface	Cost	Notes
API input (representative)	~$0.40–$0.88 / 1M tok
API output (representative)	~$0.59–$0.90 / 1M tok
DeepInfra	$0.40 in	cheapest mainstream
Together	~$0.88 in / ~$0.88 out
Fireworks	$0.90 in / $0.90 out
Groq	$0.59 in / $0.79 out	316 tps
AWS Bedrock	$0.72–$0.90 / 1M tok	Standard / Latency-Optimized
Self-hosted	4–8x H100	FP8 fits 4xH100; INT4 ~40GB
Rate limits	provider-specific	generous on managed tiers

Open weights mean no single Meta price; figures are the May 2026 market.

Deployment & access

Open weights under the Llama 3 Community License, released as both base and Instruct checkpoints. Download from Hugging Face (meta-llama/Llama-3.1-70B base, -Instruct). Self-hostable on a single 8xH100 node in FP16, 4xH100 in FP8, or ~40GB at INT4 (single 48GB+ card or one H100). Managed on Bedrock, Vertex AI, Azure AI Foundry, OCI, IBM watsonx; inference providers include Together, Fireworks, Groq, DeepInfra, Cerebras, SambaNova, OpenRouter, Hyperbolic, and Novita. Twenty-two months of community tooling make it the single most documented 70B-class model. Commercial use permitted; separate Meta license required above 700M MAU.

Safety & privacy

No built-in moderation; Meta provides Llama Guard 3 (8B/1B) as an optional filter. Both base and Instruct released, so refusal behavior is fully tunable including from the base. "Trains on inputs" not applicable when self-hosted; Meta's own terms do not train on your data. No model-level compliance certifications. Governance under Meta's Responsible Use Guide.

Ecosystem & tooling

The single most-documented 70B-class model: native support across Hugging Face Transformers, vLLM, llama.cpp, Ollama, SGLang, TensorRT-LLM, MLX, torchtune, plus LangChain, LlamaIndex, and Unsloth. Available on Bedrock, Vertex AI, Azure AI Foundry, OCI, IBM watsonx, and on Together, Fireworks, Groq, DeepInfra, Cerebras, SambaNova, OpenRouter, Hyperbolic, and Novita. Large installed base of enterprise fine-tuned derivatives. Popularity is mainstream by installed base, though new adoption has shifted to 3.3 and Llama 4.

Buyer questions

What does it cost?

No single Meta price; representative inference is ~$0.40–$0.90 input and ~$0.59–$0.90 output per 1M tokens. Self-host on 4–8xH100.

Why pick it over 3.3 70B?

Almost only if you need the base checkpoint for continued pre-training — 3.3 ships Instruct-only. Otherwise 3.3 is a better drop-in.

Does it do vision?

No — text only.

Should I migrate existing 3.1 70B workloads?

No urgency if quality is acceptable; 3.3 70B is the easy Instruct upgrade and Scout the modernization path.

How current is its knowledge?

December 2023 cutoff — pair with retrieval for recent info.

What about safety/compliance?

No built-in moderation; add Llama Guard 3. Certifications come from your host/infra.

Any license limits?

Commercial use allowed; separate Meta license required above 700M MAU.

Comparable models

Llama 3.3 70B — same footprint, better instruction-following and benchmarks; drop-in replacement for Instruct workloads.

Llama 4 Scout — newer MoE with vision and 10M context at a similar deployment cost; the usual upgrade for new builds.

Qwen 3 32B / 72B — direct dense competitors, often cheaper and stronger on reasoning.

Sources

Primary references used to verify this review.

Model specs

Input price: $0.40 / Mtok
Output price: $0.59 / Mtok
Cached input: —
Batch (in/out): —
Context window: 128K tokens
Max output: 4K tokens
Knowledge cutoff: 2023-12
Released: 2024-07-22
Modalities: text → text
Output speed: ~81.8 tok/s
License: Open weights (Llama-3-Community)
Clouds: Bedrock, Vertex AI, Azure AI Foundry, GCP, OCI, IBM watsonx

Does not train on API inputs by default

Other Llama 3 versions

Last verified 2026-05-27