by Meta · Llama 3 family · best for: the largest open base for from-scratch fine-tuning
Llama 3.1 405B is the July 2024 release that broke the open-vs-closed performance ceiling: the first downloadable model to seriously rival the closed-API frontier (MMLU 88.6, within 0.1 of GPT-4o at the time) and the largest open dense transformer ever shipped. The one-sentence buyer takeaway in 2026: it is a landmark that has since been outclassed by both its smaller sibling (3.3 70B) and its successor (Maverick), and its only durable role now is as the largest available open base model for serious from-scratch fine-tuning.
- Provider: Meta
- Release: 2024-07-23 (GA, open weights, base + Instruct)
- Status: GA; superseded on every modern axis, defensible only as a fine-tuning base
- Context: 128,000 tokens
- Max output: 4,096 tokens (provider-dependent)
- Modalities: text only
- Knowledge cutoff: December 2023
- Headline price: ~$2.40–$8.00 in / ~$3.00–$16.00 out per 1M tokens
| Benchmark | Score | Source |
|---|---|---|
| BBH | 81.3% | Meta Llama 3.1 eval details (2024-07-23) |
| MMLU | 88.6% | Meta Llama 3.1 eval details (2024-07-23) |
| IFEval | 88.6% | Meta Llama 3.1 eval details (2024-07-23) |
| MATH-500 | 73.8% | Meta Llama 3.1 eval details (2024-07-23) |
| MMLU-Pro | 73.3% | Meta Llama 3.1 eval details (2024-07-23) |
| HumanEval | 89% | Meta Llama 3.1 eval details (2024-07-23) |
| LMArena Elo | 1267 | LMArena (2024) |
| GPQA Diamond | 51.1% | Meta Llama 3.1 eval details (2024-07-23) |
| Artificial Analysis Index | 17 | Artificial Analysis (2026-05) |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“A landmark, now mostly legacy. The only reason to standardize on it today is if you're building a 100B+ derivative from the base checkpoint.”
For a buyer in May 2026, 405B is largely a legacy choice. Its strategic value at release was enormous, because it broke the open-vs-closed ceiling, but it is now outclassed by its smaller sibling (3.3 70B: better IFEval, one-sixth the cost) and its successor (Maverick: better on every modern benchmark, MoE-efficient). The single place it still wins is as the largest available open base checkpoint for serious 100B+ customization. Unless you are running a research lab or building a from-scratch derivative, this is not the model to standardize on; the roadmap clearly points to Llama 4.
“Its square shrank to one tile: 'biggest open base you can pretrain-continue.' Everything else moved on.”
Strategically, 405B's positioning has collapsed to a single niche — the largest open base checkpoint for from-scratch fine-tuning. As an application model it has no defensible square: 3.3 70B and Maverick dominate it on capability, cost, and speed. Its differentiation is purely the 405B base availability and its historical role as the model that proved open weights could compete. Market timing has long passed; the open frontier is now MoE and reasoning models. A historically pivotal model with a narrow, fading strategic role.
“The worst TCO in the lineup. Bedrock charges 10x Maverick for worse benchmarks, and self-hosting a 405B dense model is a money pit unless it's fully utilized.”
The economics are the weakest in the Meta range. Bedrock input pricing runs $2.40–$5.32 per 1M tokens (over 10x Maverick on the same platform), and Azure tops $8.00. Self-hosting needs 8xH100 minimum and is expensive to keep utilized at the throughput a dense 405B demands, while delivering only ~29 tokens per second (tps). At 1B+ tokens/month the comparison against Maverick or Scout is unambiguous: 5–20x more expensive for worse scores. The only economic justification is amortizing sunk infrastructure or recovering a custom fine-tune investment.
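To put the 1B tokens/month figure in dollar terms, here is a minimal sketch that applies the headline price range quoted in this review. The 80/20 input/output split and the pairing of the low input price with the low output price (and high with high) are assumptions made for illustration; `PricePer1M` and `monthly_cost` are hypothetical helper names, not part of any provider SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricePer1M:
    """USD price per 1M tokens, split by direction."""
    input_usd: float
    output_usd: float


def monthly_cost(price: PricePer1M, total_tokens: int, input_share: float = 0.8) -> float:
    """Estimate monthly spend in USD for a given token volume.

    input_share is an assumed fraction of tokens that are prompt (input) tokens.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens * price.input_usd + output_tokens * price.output_usd) / 1_000_000


# Headline range from this review; the low/low and high/high pairing is an assumption.
LOW = PricePer1M(input_usd=2.40, output_usd=3.00)
HIGH = PricePer1M(input_usd=8.00, output_usd=16.00)

if __name__ == "__main__":
    volume = 1_000_000_000  # 1B tokens per month
    print(f"low estimate:  ${monthly_cost(LOW, volume):,.0f}")
    print(f"high estimate: ${monthly_cost(HIGH, volume):,.0f}")
```

Under those assumptions the monthly bill lands between roughly $2,520 and $9,600. Swapping in a competing model's prices gives the multiple directly.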
“I rarely pick it to build on, but as a fine-tuning base it's the biggest open canvas there is — and Meta documented it better than anything else.”
For builders, 405B is rarely the right application model — dense 405B makes self-hosting painful, the chat template shows its age, and every modern feature (vision, MoE efficiency, the Llama 4 tool format) lives elsewhere. Where it still shines is as a fine-tuning base: the pretrained checkpoint is the largest open release that allows genuine continued pre-training or major instruction reformatting, and Meta's documentation (eval_details, torchtune recipes) is best-in-class. For application work you would pick 3.3 70B or Maverick; for serious customization at scale, this is the canvas.
“Competent but visibly older — the December 2023 brain trips on recent libraries and product names within a few minutes.”
In daily use, 405B in 2026 feels noticeably dated. Responses are competent but the December 2023 cutoff shows — recent events, libraries, and product names trip it up. Conversation feel lacks the polish of any post-2025 model, and refusal patterns are more conservative than newer alternatives. Latency is also poor (~29 tps), which hurts interactive feel. For embedded production workloads users will not notice; for any user-facing chat product, the staleness and slowness are felt quickly.
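A quick back-of-the-envelope check shows why ~29 tps is felt in chat. The sketch below only divides output length by decode speed; it assumes a constant rate and ignores time-to-first-token, queueing, and network overhead, so real waits would be longer. `generation_seconds` is a hypothetical helper name.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response at a constant decode rate (idealized)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # 4,096 is the max output listed in the spec summary; the others are illustrative.
    for tokens in (256, 1024, 4096):
        seconds = generation_seconds(tokens, 29.0)
        print(f"{tokens:>5} tokens at ~29 tps: about {seconds:.0f} s")
```

At that rate a full 4,096-token reply takes a bit over two minutes to stream, which is acceptable for batch jobs but slow for a live conversation.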
“Genuinely historic and honestly benchmarked — but in 2026 it's a slow, stale, expensive way to do what a 70B does better and cheaper.”
Adversarially, 405B is the rare model whose problem is not deceptive marketing — its release claims (MMLU 88.6, frontier-adjacent in 2024) were accurate and it set the open-weights bar. The issue is pure obsolescence: it is beaten by its own 3.3 70B sibling on IFEval and MATH at one-sixth the size and one-tenth the cost, beaten by Maverick everywhere, runs at a slow ~29 tps, and carries a 30-month-old cutoff. There is no benchmark gaming to call out — just a model that the market and Meta itself have moved past. The honest verdict: a museum piece with one living use (the base checkpoint).
- From-scratch fine-tuning and continued pre-training where you need the largest available open base model over 100B params.
- Research workloads requiring a frontier-adjacent open baseline with full documentation.
- Distillation source for smaller production models (the same role it plays inside Meta).
- Backwards-compatibility for existing pipelines that have not migrated to 3.3 or Llama 4.
No single Meta price; representative inference is ~$2.40–$8.00 input and ~$3.00–$16.00 output per 1M tokens — roughly 10–20x Maverick. Self-host needs 8xH100 minimum.
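The 8xH100 figure can be sanity-checked with weight-memory arithmetic. This is a rough sketch, assuming 80 GB per H100 and counting only model weights (no KV cache, activations, or framework overhead, all of which need additional headroom). The helper names and the precision table are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billion * BYTES_PER_PARAM[dtype]


def weights_fit(params_billion: float, dtype: str, gpus: int = 8, gb_per_gpu: float = 80.0) -> bool:
    """True if the weights alone fit in the combined GPU memory (assumed 80 GB per GPU)."""
    return weight_gb(params_billion, dtype) <= gpus * gb_per_gpu


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = weight_gb(405, dtype)
        verdict = "fits" if weights_fit(405, dtype) else "does not fit"
        print(f"{dtype}: {size:.1f} GB of weights, {verdict} in 8 x 80 GB")
```

Under these assumptions, 16-bit weights (810 GB) exceed the 640 GB available on a single 8xH100 node, while 8-bit weights (405 GB) fit with room left for the KV cache. That suggests the single-node minimum presumes quantized weights, with full-precision serving needing more than one node.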
Almost solely as the largest open base checkpoint for from-scratch fine-tuning or continued pre-training at 100B+ scale.
No — text only.
3.3 70B matches or beats it on most benchmarks at one-sixth the size and cost; pick 405B only if you specifically need the base model.
December 2023 cutoff — pair with retrieval for anything recent.
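As an illustration of what pairing with retrieval can look like, the sketch below prepends dated snippets to the question so the model can lean on them for post-cutoff facts. It assumes the snippets have already been fetched by some search or vector store, which is out of scope here; `Snippet` and `build_prompt` are hypothetical names, and the prompt wording is only one possible choice.

```python
from dataclasses import dataclass
from datetime import date

# End of the stated knowledge-cutoff month.
KNOWLEDGE_CUTOFF = date(2023, 12, 31)


@dataclass(frozen=True)
class Snippet:
    source: str
    published: date
    text: str


def build_prompt(question: str, snippets: list[Snippet], max_snippets: int = 3) -> str:
    """Build a prompt that surfaces the newest post-cutoff snippets first."""
    recent = sorted(
        (s for s in snippets if s.published > KNOWLEDGE_CUTOFF),
        key=lambda s: s.published,
        reverse=True,
    )[:max_snippets]
    context = "\n".join(
        f"[{s.source}, {s.published.isoformat()}] {s.text}" for s in recent
    )
    return (
        "Use the context below for anything after December 2023.\n\n"
        f"Context:\n{context or '(none)'}\n\n"
        f"Question: {question}"
    )
```

Filtering by publication date is a simplification; a production retriever would rank by relevance first and treat recency as one signal among several.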
No built-in moderation; add Llama Guard 3. Certifications come from your host/infra.
Commercial use allowed; separate Meta license required above 700M MAU.
Does not train on API inputs by default.
Last verified 2026-05-27