Llama 3.1 405B

by Meta · Llama 3 family · best as the largest open base for from-scratch fine-tuning

Open-Weights

6.3

AI Panel Score

Value 4.5/10

Llama 3.1 405B is the July 2024 release that broke the open-vs-closed performance ceiling: the first downloadable model to seriously rival the closed-API frontier (MMLU 88.6, within 0.1 of GPT-4o at the time) and, at release, the largest open dense transformer ever shipped. The one-sentence buyer takeaway in 2026: it is a landmark that has since been outclassed by both its smaller sibling (3.3 70B) and its successor (Maverick), and its only durable role now is as the largest available open base model for serious from-scratch fine-tuning.


What's new

First open-weights model to seriously rival closed frontier at release (MMLU 88.6).
Largest publicly downloadable dense transformer ever shipped at the time (405B params).
128K context, 8 languages, function-calling baked into the Instruct variant.
Released as both base and Instruct checkpoints — a major asset for the fine-tuning community that 3.3 70B later lacked.

Benchmarks

Benchmark	Score	Source
BBH	81.3%	Meta Llama 3.1 eval details (2024-07-23)
MMLU	88.6%	Meta Llama 3.1 eval details (2024-07-23)
IFEval	88.6%	Meta Llama 3.1 eval details (2024-07-23)
MATH	73.8%	Meta Llama 3.1 eval details (2024-07-23)
MMLU-Pro	73.3%	Meta Llama 3.1 eval details (2024-07-23)
HumanEval	89%	Meta Llama 3.1 eval details (2024-07-23)
LMArena Elo	1267	LMArena (2024)
GPQA Diamond	51.1%	Meta Llama 3.1 eval details (2024-07-23)
Artificial Analysis Index	17	Artificial Analysis (2026-05)

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker 6.5/10

“A landmark, now mostly legacy. The only reason to standardize on it today is if you're building a 100B+ derivative from the base checkpoint.”

For a buyer in May 2026, 405B is largely a legacy choice. Its strategic value at release was enormous, because it broke the open-vs-closed ceiling, but it is now outclassed by its smaller sibling (3.3 70B, better IFEval, one-sixth the cost) and its successor (Maverick, better on every modern benchmark, MoE-efficient). The one place it still wins is as a base checkpoint for serious 100B+ customization. Unless you are running a research lab or building a from-scratch derivative, this is not the model to standardize on; the roadmap clearly points to Llama 4.

Strategic Fit 6 · Vendor Risk 8 · Roadmap Confidence 6

Pros

largest open base checkpoint
well-documented
permissive license

Cons

outclassed and outpriced
legacy direction

Right for: labs/teams building 100B+ derivatives

Avoid if: you want a production application model

Domain Strategist 6/10

“Its square shrank to one tile: 'biggest open base you can pretrain-continue.' Everything else moved on.”

Strategically, 405B's positioning has collapsed to a single niche — the largest open base checkpoint for from-scratch fine-tuning. As an application model it has no defensible square: 3.3 70B and Maverick dominate it on capability, cost, and speed. Its differentiation is purely the 405B base availability and its historical role as the model that proved open weights could compete. Market timing has long passed; the open frontier is now MoE and reasoning models. A historically pivotal model with a narrow, fading strategic role.

Competitive Positioning 5 · Differentiation 7 · Market Timing 5

Pros

unique 405B open base
historical importance

Cons

no application moat
superseded direction

Right for: customization at extreme scale

Avoid if: you need a competitive deployed model

Finance Lead 5/10

“The worst TCO in the lineup. Bedrock charges 10x Maverick for worse benchmarks, and self-hosting a 405B dense model is a money pit unless it's fully utilized.”

The economics are the weakest in the Meta range. Bedrock runs $2.40–$5.32 per 1M tokens depending on tier (Standard vs. Latency-Optimized), over 10x Maverick on the same platform, and Azure reaches about $8.00. Self-hosting needs 8xH100 minimum and is expensive to keep utilized at the throughput a dense 405B demands, while delivering only ~29 tps. At 1B+ tokens/month the comparison against Maverick or Scout is unambiguous: 5–20x more expensive for worse scores. The only economic justification is amortizing sunk infrastructure or recovering a custom fine-tune investment.

Cost Efficiency 4 · Pricing Transparency 7 · Value per Dollar 4

Pros

predictable pricing where available

Cons

10–20x the cost of Maverick
slow
expensive to self-host

Right for: amortizing existing investments

Avoid if: starting fresh — Maverick/Scout win decisively

Domain Practitioner 6.5/10

“I rarely pick it to build on, but as a fine-tuning base it's the biggest open canvas there is — and Meta documented it better than anything else.”

For builders, 405B is rarely the right application model — dense 405B makes self-hosting painful, the chat template shows its age, and every modern feature (vision, MoE efficiency, the Llama 4 tool format) lives elsewhere. Where it still shines is as a fine-tuning base: the pretrained checkpoint is the largest open release that allows genuine continued pre-training or major instruction reformatting, and Meta's documentation (eval_details, torchtune recipes) is best-in-class. For application work you would pick 3.3 70B or Maverick; for serious customization at scale, this is the canvas.

API Ergonomics 6 · Tool/Agent Support 6 · Reliability 8

Pros

largest open base
superb documentation
stable

Cons

painful to self-host
aging template
no modern features

Right for: serious fine-tuners

Avoid if: you want a deployable application model

Power User 6/10

“Competent but visibly older — the December 2023 brain trips on recent libraries and product names within a few minutes.”

In daily use, 405B in 2026 feels noticeably dated. Responses are competent but the December 2023 cutoff shows — recent events, libraries, and product names trip it up. Conversation feel lacks the polish of any post-2025 model, and refusal patterns are more conservative than newer alternatives. Latency is also poor (~29 tps), which hurts interactive feel. For embedded production workloads users will not notice; for any user-facing chat product, the staleness and slowness are felt quickly.

Output Quality 6 · Speed 4 · Everyday Usefulness 6

Pros

competent core quality
strong on classic tasks

Cons

stale knowledge
slow
conservative refusals

Right for: back-end workloads users never see directly

Avoid if: user-facing or latency-sensitive

Skeptic 6/10

“Genuinely historic and honestly benchmarked — but in 2026 it's a slow, stale, expensive way to do what a 70B does better and cheaper.”

Adversarially, 405B is the rare model whose problem is not deceptive marketing — its release claims (MMLU 88.6, frontier-adjacent in 2024) were accurate and it set the open-weights bar. The issue is pure obsolescence: it is beaten by its own 3.3 70B sibling on IFEval and MATH at one-sixth the size and one-tenth the cost, beaten by Maverick everywhere, runs at a slow ~29 tps, and carries a 30-month-old cutoff. There is no benchmark gaming to call out — just a model that the market and Meta itself have moved past. The honest verdict: a museum piece with one living use (the base checkpoint).

Claim Accuracy 9 · Weakness Severity 4 · Hype vs Reality 8

Pros

honest claims
historic

Cons

obsolete on cost/speed/freshness

Right for: skeptics who value it only as a base checkpoint

Avoid if: you need a current, economical model

Strengths

Base pretrained checkpoint available — the strongest open foundation for from-scratch fine-tuning over 100B params.
Strong classic-benchmark reasoning, math, and code for its 2024 generation.
Exceptionally well-documented (eval_details repo, detailed model card).
Wide provider availability and 18+ months of tooling maturity.
Permissive commercial license; 128K context, 8 languages.

Limitations

Surpassed by 3.3 70B on instruction-following (IFEval) and MATH at one-sixth the size.
Surpassed by Maverick on every modern benchmark while costing 10–20x more to serve.
Slowest model in the lineup (~29 tps); impractical to self-host below an 8xH100 box.
December 2023 cutoff is ~30 months stale; no vision, no reasoning mode.
Worst TCO in the Meta lineup — the managed value case has effectively disappeared.

Best use cases

From-scratch fine-tuning and continued pre-training where you need the largest available open base model over 100B params.
Research workloads requiring a frontier-adjacent open baseline with full documentation.
Distillation source for smaller production models (the same role it plays inside Meta).
Backwards-compatibility for existing pipelines that have not migrated to 3.3 or Llama 4.

Deep dive

The full research notes behind this review — verified against primary sources.


Architecture

A dense transformer at extreme scale: 405B parameters, 126 layers, Grouped-Query Attention, the Llama 3 TikToken tokenizer (128,256 vocab). Trained on 15T+ tokens across 16,000+ H100 GPUs (~30.84M GPU-hours, roughly 3.8e25 FLOPs) with a December 2023 cutoff. No MoE — every parameter is active on every token, which is exactly why it is expensive to serve. No vision. Meta published unusually detailed architecture and eval documentation (the eval_details repo) and released both base and Instruct checkpoints, making it the most thoroughly documented large open model and the strongest foundation for continued pre-training at the 100B+ scale.
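The compute figure above can be sanity-checked with the common 6 × N × D rule of thumb (roughly six FLOPs per parameter per training token). That rule is an assumption brought in here for illustration, not something the source attributes to Meta. The parameter count, GPU-hours, and quoted FLOPs come from the paragraph above, and 15T is treated as a floor because the source says "15T+".

```python
# Rough training-compute check using the 6 * N * D rule of thumb.
# Assumption: ~6 FLOPs per parameter per token (forward + backward pass).

PARAMS = 405e9          # 405B dense parameters (from the review)
TOKENS_FLOOR = 15e12    # "15T+" training tokens, treated as a lower bound
QUOTED_FLOPS = 3.8e25   # training compute quoted in the review
GPU_HOURS = 30.84e6     # H100 GPU-hours quoted in the review


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * params * tokens


estimate = training_flops(PARAMS, TOKENS_FLOOR)
# Token count that would reproduce the quoted FLOPs under the same rule.
implied_tokens = QUOTED_FLOPS / (6.0 * PARAMS)
# Average sustained throughput per GPU implied by the quoted figures.
flops_per_gpu_second = QUOTED_FLOPS / (GPU_HOURS * 3600)

print(f"6ND estimate at 15T tokens: {estimate:.3e} FLOPs")
print(f"Tokens implied by the quoted FLOPs: {implied_tokens / 1e12:.1f}T")
print(f"Implied throughput: {flops_per_gpu_second / 1e12:.0f} TFLOP/s per GPU")
```

Under this approximation, the 15T floor lands slightly below the quoted ~3.8e25, which is consistent with the "+" in "15T+".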

Capabilities

At release, 405B was frontier-adjacent across the board; in 2026 it is competent but surpassed everywhere. Reasoning (6.0): MMLU 88.6, MMLU-Pro 73.3, GPQA Diamond 51.1, BBH 81.3; strong for 2024, behind today's frontier and Maverick. Math (6.0): MATH 73.8, which trailed only GPT-4o at release and is now beaten by its own 3.3 70B sibling (77.0). Coding (6.0): HumanEval 89.0, solid but with no agentic-coding pedigree. Instruction-following (7.5): IFEval 88.6, good but beaten by 3.3 70B (92.1) at one-sixth the size. Multilingual (6.0) across eight languages. No vision (0.0), no OCR (0.0), no reasoning mode, no real-time data (0.0). Its standout property is not a capability score but availability: it is the largest open base checkpoint you can fine-tune from scratch.

Benchmark analysis

Benchmark	Score	vs Sibling (3.3 70B)	vs Top Competitor	Source
MMLU	88.6	+2.6 (3.3 70B 86.0)	~ GPT-4o (0.1 behind) at release	eval details
MMLU-Pro	73.3	+4.4 (3.3 70B 68.9)	trails Maverick (80.5)	eval details
GPQA Diamond	51.1	+0.6 (3.3 70B 50.5)	trails 2026 frontier	eval details
MATH	73.8	-3.2 (3.3 70B 77.0)	trailed only GPT-4o at release	eval details
HumanEval	89.0	+0.6 (3.3 70B 88.4)	~ GPT-4o	eval details
IFEval	88.6	-3.5 (3.3 70B 92.1)	beaten by 3.3 70B	eval details
BBH	81.3	n/a	strong for 2024	eval details
LMArena Elo	1267	comparable	mid-pack today	LMArena
Artificial Analysis Index	17	+3 (3.3 70B 14)	below 2026 frontier	AA
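The "vs Sibling" column is plain subtraction, so it is easy to recompute. The sketch below uses only the scores quoted in the table; the dictionary layout and function name are illustrative choices, not part of any published tooling.

```python
# Recompute the "vs Sibling (3.3 70B)" deltas from the scores quoted in the table.

SCORES = {
    # benchmark: (Llama 3.1 405B, Llama 3.3 70B)
    "MMLU": (88.6, 86.0),
    "MMLU-Pro": (73.3, 68.9),
    "GPQA Diamond": (51.1, 50.5),
    "MATH": (73.8, 77.0),
    "HumanEval": (89.0, 88.4),
    "IFEval": (88.6, 92.1),
    "Artificial Analysis Index": (17.0, 14.0),
}


def deltas(scores: dict) -> dict:
    """Return 405B minus 70B for each benchmark, rounded to one decimal."""
    return {name: round(big - small, 1) for name, (big, small) in scores.items()}


# Benchmarks where the smaller sibling comes out ahead (negative delta).
sibling_wins = sorted(name for name, d in deltas(SCORES).items() if d < 0)

for name, d in deltas(SCORES).items():
    print(f"{name}: {d:+.1f}")
print("3.3 70B ahead on:", ", ".join(sibling_wins))
```

The negative deltas fall on IFEval and MATH, matching the two benchmarks the review singles out.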

Speed & latency

The slowest model in the Meta lineup: median ~29 tokens/sec, because every one of the 405B dense parameters fires on every token. Time-to-first-token is moderate (~0.7s) but sustained generation is slow, and high-throughput serving demands a large, expensive GPU fleet. SambaNova and Cerebras can accelerate it, but at premium cost. Latency tier is slow, which is a real factor for interactive use and a major contributor to its poor TCO.
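To see what these numbers mean for a single response, a simple model is time-to-first-token plus output length divided by decode speed. This is a rough sketch that assumes one uncontended stream and a constant decode rate; real providers vary with load and batch size. The token counts in the loop are arbitrary examples, apart from 4096, which mirrors the 4K max output listed in the specs.

```python
# Estimate wall-clock time for one streamed response.
# Assumption: constant decode rate and a single, uncontended request.

TTFT_SECONDS = 0.7        # median time-to-first-token quoted in the review
TOKENS_PER_SECOND = 29.0  # median output speed quoted in the review


def response_seconds(
    output_tokens: int,
    ttft: float = TTFT_SECONDS,
    tps: float = TOKENS_PER_SECOND,
) -> float:
    """Time until the last token arrives, in seconds."""
    return ttft + output_tokens / tps


for tokens in (100, 500, 4096):
    print(f"{tokens} output tokens: ~{response_seconds(tokens):.1f} s")
```

Because the decode term dominates, halving time-to-first-token would barely change the total for a long answer; only a faster decode rate would.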

Pricing analysis

Surface	Cost	Notes
API input (representative)	~$2.40–$3.50 / 1M tok
API output (representative)	~$3.00–$5.32 / 1M tok
Fireworks	$3.00 in / $3.00 out	cheapest managed
Together	$3.50 in / $3.50 out	405B Instruct Turbo
AWS Bedrock	$2.40–$5.32 / 1M tok	Standard / Latency-Optimized
Azure	~$8.00 / 1M tok	most expensive managed
Self-hosted	8xH100 minimum (FP8)	~230GB VRAM at INT4 to hold 405B
Rate limits	provider-specific	more restrictive than smaller tiers

Open weights mean no single Meta price; the figures above reflect May 2026 market prices. Note these are ~10–20x Maverick on the same platforms, which is the core reason the managed-deployment value case has collapsed.
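A monthly bill follows directly from the per-million-token prices. The sketch below uses the Fireworks and Together rows from the table; the 1B-token workload and its 80/20 input/output split are hypothetical. No Maverick price is quoted in this review, so the comparison range is derived only from the stated 10–20x ratio.

```python
# Monthly inference cost from per-1M-token prices.
# Assumption: a hypothetical workload of 1B tokens/month, 80% input / 20% output.

PRICES_PER_MTOK = {
    # provider: (input USD, output USD) per 1M tokens, from the pricing table
    "Fireworks": (3.00, 3.00),
    "Together": (3.50, 3.50),
}

INPUT_TOKENS = 800e6
OUTPUT_TOKENS = 200e6


def monthly_cost(input_tokens: float, output_tokens: float,
                 price_in: float, price_out: float) -> float:
    """Total USD per month for the given token volumes and prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


for provider, (p_in, p_out) in PRICES_PER_MTOK.items():
    cost = monthly_cost(INPUT_TOKENS, OUTPUT_TOKENS, p_in, p_out)
    # Range implied by the review's "10-20x Maverick" ratio, not a quoted price.
    print(f"{provider}: ${cost:,.0f}/month "
          f"(implied Maverick equivalent: ${cost / 20:,.0f} to ${cost / 10:,.0f})")
```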

Deployment & access

Open weights under the Llama 3.1 Community License, released as both base and Instruct checkpoints, which is the key differentiator. Download from Hugging Face (meta-llama/Llama-3.1-405B for the base model, meta-llama/Llama-3.1-405B-Instruct for the Instruct variant). Self-hosting requires an 8xH100 node minimum with FP8 quantization for single-node serving; INT4 needs ~230GB VRAM just to hold the parameters. Managed on Bedrock, Vertex AI, Azure AI Foundry, OCI, IBM watsonx; inference providers include Together, Fireworks, DeepInfra, SambaNova, OpenRouter, Hyperbolic, and Novita. Eighteen-plus months of serving and quantization tooling exist, but the economics make managed deployment hard to justify versus Maverick. Commercial use permitted; separate Meta license required above 700M MAU.
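The single-node requirement comes down to bytes per parameter. The following sketch estimates weight memory only, assuming 80 GB per H100 and ignoring KV cache, activations, and runtime overhead. Raw INT4 arithmetic gives a little over 200 GB, lower than the ~230GB quoted above; one plausible explanation is quantization metadata and runtime overhead, though the source does not say.

```python
# Weight-only memory estimate for a 405B dense model at different precisions.
# Assumptions: 80 GB per H100; KV cache, activations and overhead are ignored.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}
PARAMS = 405e9
GPUS_PER_NODE = 8
GB_PER_GPU = 80.0


def weight_gb(precision: str, params: float = PARAMS) -> float:
    """Gigabytes needed just to hold the weights."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def headroom_gb(precision: str) -> float:
    """Node memory left after loading weights; negative means it does not fit."""
    return GPUS_PER_NODE * GB_PER_GPU - weight_gb(precision)


for precision in BYTES_PER_PARAM:
    status = "fits" if headroom_gb(precision) > 0 else "does not fit"
    print(f"{precision}: {weight_gb(precision):.1f} GB weights, "
          f"{headroom_gb(precision):+.1f} GB headroom on one node ({status})")
```

At 16-bit precision the weights alone exceed one node's memory, which is why FP8 is the practical floor for single-node serving.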

Safety & privacy

No built-in moderation in the weights; Meta provides Llama Guard 3 (8B/1B) as an optional filter. Because both base and Instruct checkpoints are released, refusal behavior is fully tunable, including from the base model, which is precisely why it remains attractive for customization. The "trains on inputs" question does not apply when self-hosting, and Meta's own terms do not provide for training on your data. No model-level compliance certifications. Governance under Meta's Responsible Use Guide.
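Since moderation is not in the weights, an external filter has to be wired around the model. The sketch below shows one common shape for that wrapper: check the prompt, generate, then check the reply. The `generate` and `is_safe` callables are hypothetical stand-ins for a model client and a classifier such as Llama Guard 3; this does not reproduce Llama Guard's actual API or prompt format.

```python
# Minimal moderation wrapper: filter the prompt, generate, then filter the reply.
# Assumption: `generate` and `is_safe` are hypothetical callables supplied by
# the caller (for example, a model client and a safety classifier).
from typing import Callable

REFUSAL_MESSAGE = "Request blocked by the safety filter."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> str:
    """Return the model reply only if both prompt and reply pass the filter."""
    if not is_safe(prompt):
        return REFUSAL_MESSAGE
    reply = generate(prompt)
    return reply if is_safe(reply) else REFUSAL_MESSAGE


# Toy stand-ins so the sketch runs without any model or classifier.
def toy_is_safe(text: str) -> bool:
    return "forbidden" not in text.lower()


def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


print(guarded_generate("hello", toy_generate, toy_is_safe))
print(guarded_generate("something forbidden", toy_generate, toy_is_safe))
```

Keeping the filter outside the model means it can be swapped or retuned without touching a fine-tuned checkpoint.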

Ecosystem & tooling

Mature but increasingly legacy tooling: native support across Hugging Face Transformers, vLLM, llama.cpp, SGLang, TensorRT-LLM, torchtune, plus LangChain, LlamaIndex, and Unsloth. Available on Bedrock, Vertex AI, Azure AI Foundry, OCI, IBM watsonx, and on Together, Fireworks, DeepInfra, SambaNova, OpenRouter, Hyperbolic, and Novita. Played a central role in 2024-era Meta AI and remains a common distillation source. Popularity is now niche — its installed base is shrinking toward 3.3/Llama 4.

Buyer questions

What does it cost?

No single Meta price; representative inference is ~$2.40–$8.00 input and ~$3.00–$16.00 output per 1M tokens — roughly 10–20x Maverick. Self-host needs 8xH100 minimum.

Why would I still use it?

Almost solely as the largest open base checkpoint for from-scratch fine-tuning or continued pre-training at 100B+ scale.

Does it do vision?

No — text only.

How does it compare to 3.3 70B?

3.3 70B matches or beats it on most benchmarks at one-sixth the size and cost; pick 405B only if you specifically need the base model.

How current is its knowledge?

December 2023 cutoff — pair with retrieval for anything recent.

What about safety/compliance?

No built-in moderation; add Llama Guard 3. Certifications come from your host/infra.

Any license limits?

Commercial use allowed; separate Meta license required above 700M MAU.

Comparable models

Llama 3.3 70B — close on quality, one-sixth the cost, beats it on IFEval and MATH; almost always the better pick today unless you need the base checkpoint.

Llama 4 Maverick — newer, MoE-efficient, better on every modern benchmark, ~10x cheaper to serve, plus vision.

DeepSeek V3.x base — open MoE alternative at comparable or better quality and far better economics.


Model specs

Input price: $3 / Mtok
Output price: $3 / Mtok
Cached input: —
Batch (in/out): —
Context window: 128K tokens
Max output: 4K tokens
Knowledge cutoff: 2023-12
Released: 2024-07-23
Modalities: text → text
Output speed: ~29 tok/s
License: Open weights (Llama 3.1 Community License)
Clouds: Bedrock, Vertex AI (GCP), Azure AI Foundry, OCI, IBM watsonx

Does not train on API inputs by default


Last verified 2026-05-27