Llama 3.1 405B

GA

by Meta · Llama 3 family · best as the largest open base for from-scratch fine-tuning

Open-Weights
6.3
AI Panel Score
Value 4.5/10

Llama 3.1 405B is the July 2024 release that broke the open-vs-closed performance ceiling: the first downloadable model to seriously rival the closed-API frontier (MMLU 88.6, within 0.1 of GPT-4o at the time) and the largest open dense transformer ever shipped. The one-sentence buyer takeaway in 2026: it is a landmark that has since been outclassed by both its smaller sibling (3.3 70B) and its successor (Maverick), and its only durable role now is as the largest available open base model for serious from-scratch fine-tuning.

  • Provider: Meta
  • Release: 2024-07-23 (GA, open weights, base + Instruct)
  • Status: GA; superseded on every modern axis, defensible only as a fine-tuning base
  • Context: 128,000 tokens
  • Max output: 4,096 tokens (provider-dependent)
  • Modalities: text only
  • Knowledge cutoff: December 2023
  • Headline price: ~$2.40–$8.00 in / ~$3.00–$16.00 out per 1M tokens

What's new

  • First open-weights model to seriously rival closed frontier at release (MMLU 88.6).
  • Largest publicly downloadable dense transformer at the time of release (405B params).
  • 128K context, 8 languages, function-calling baked into the Instruct variant.
  • Released as both base and Instruct checkpoints; the base checkpoint is a major asset for the fine-tuning community, and one that 3.3 70B later lacked.

Benchmarks

Benchmark                  | Score | Source                      | Date
BBH                        | 81.3% | Meta Llama 3.1 eval details | 2024-07-23
MMLU                       | 88.6% | Meta Llama 3.1 eval details | 2024-07-23
IFEval                     | 88.6% | Meta Llama 3.1 eval details | 2024-07-23
MATH-500                   | 73.8% | Meta Llama 3.1 eval details | 2024-07-23
MMLU-Pro                   | 73.3% | Meta Llama 3.1 eval details | 2024-07-23
HumanEval                  | 89%   | Meta Llama 3.1 eval details | 2024-07-23
LMArena Elo                | 1267  | LMArena                     | 2024
GPQA Diamond               | 51.1% | Meta Llama 3.1 eval details | 2024-07-23
Artificial Analysis Index  | 17    | Artificial Analysis         | 2026-05

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker · 6.5/10
A landmark, now mostly legacy. The only reason to standardize on it today is if you're building a 100B+ derivative from the base checkpoint.

For a buyer in May 2026, 405B is largely a legacy choice. Its strategic value at release was enormous, since it broke the open-vs-closed ceiling, but it is now outclassed by its smaller sibling (3.3 70B, better IFEval, one-sixth the cost) and its successor (Maverick, better on every modern benchmark, MoE-efficient). The one place it still wins is as the base checkpoint for serious 100B+ customization. Unless you are running a research lab or building a from-scratch derivative, this is not the model to standardize on; the roadmap clearly points to Llama 4.

Strategic Fit 6 · Vendor Risk 8 · Roadmap Confidence 6
Pros
  • largest open base checkpoint
  • well-documented
  • permissive license
Cons
  • outclassed and outpriced
  • legacy direction
Right for: labs/teams building 100B+ derivatives
Avoid if: you want a production application model
Domain Strategist · 6/10
Its square shrank to one tile: 'biggest open base you can pretrain-continue.' Everything else moved on.

Strategically, 405B's positioning has collapsed to a single niche — the largest open base checkpoint for from-scratch fine-tuning. As an application model it has no defensible square: 3.3 70B and Maverick dominate it on capability, cost, and speed. Its differentiation is purely the 405B base availability and its historical role as the model that proved open weights could compete. Market timing has long passed; the open frontier is now MoE and reasoning models. A historically pivotal model with a narrow, fading strategic role.

Competitive Positioning 5 · Differentiation 7 · Market Timing 5
Pros
  • unique 405B open base
  • historical importance
Cons
  • no application moat
  • superseded direction
Right for: customization at extreme scale
Avoid if: you need a competitive deployed model
Finance Lead · 5/10
The worst TCO in the lineup. Bedrock charges 10x Maverick for worse benchmarks, and self-hosting a 405B dense model is a money pit unless it's fully utilized.

The economics are the weakest in the Meta range. Bedrock runs $2.40–$5.32 input — over 10x Maverick on the same platform — and Azure tops $8.00. Self-hosting needs 8xH100 minimum and is expensive to keep utilized at the throughput a dense 405B demands, while delivering only ~29 tps. At 1B+ tokens/month the comparison against Maverick or Scout is unambiguous: 5–20x more expensive for worse scores. The only economic justification is amortizing sunk infrastructure or recovering a custom fine-tune investment.

Cost Efficiency 4 · Pricing Transparency 7 · Value per Dollar 4
Pros
  • predictable pricing where available
Cons
  • 10–20x Maverick
  • slow
  • expensive to self-host
Right for: amortizing existing investments
Avoid if: starting fresh — Maverick/Scout win decisively
Domain Practitioner · 6.5/10
I rarely pick it to build on, but as a fine-tuning base it's the biggest open canvas there is — and Meta documented it better than anything else.

For builders, 405B is rarely the right application model — dense 405B makes self-hosting painful, the chat template shows its age, and every modern feature (vision, MoE efficiency, the Llama 4 tool format) lives elsewhere. Where it still shines is as a fine-tuning base: the pretrained checkpoint is the largest open release that allows genuine continued pre-training or major instruction reformatting, and Meta's documentation (eval_details, torchtune recipes) is best-in-class. For application work you would pick 3.3 70B or Maverick; for serious customization at scale, this is the canvas.

API Ergonomics 6 · Tool/Agent Support 6 · Reliability 8
Pros
  • largest open base
  • superb documentation
  • stable
Cons
  • painful to self-host
  • aging template
  • no modern features
Right for: serious fine-tuners
Avoid if: you want a deployable application model
Power User · 6/10
Competent but visibly older — the December 2023 brain trips on recent libraries and product names within a few minutes.

In daily use, 405B in 2026 feels noticeably dated. Responses are competent but the December 2023 cutoff shows: recent events, libraries, and product names trip it up. Conversation feel lacks the polish of any post-2025 model, and refusal patterns are more conservative than newer alternatives. Latency is also poor (~29 tps), which hurts interactive feel. In embedded production workloads, users will not notice; in any user-facing chat product, the staleness and slowness are felt quickly.
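To make the ~29 tps figure concrete, here is a rough sketch of how long a reply takes to stream at that speed. The reply lengths are illustrative assumptions, and the estimate ignores time-to-first-token, queueing, and provider variation, so treat it as a lower bound rather than a measurement.

```python
# Approximate output speed quoted in this review, in tokens per second.
OUTPUT_TOKENS_PER_SECOND = 29.0


def generation_seconds(output_tokens: int, tps: float = OUTPUT_TOKENS_PER_SECOND) -> float:
    """Seconds needed to stream `output_tokens` at a steady `tps`.

    Ignores time-to-first-token and queueing, so real latency is higher.
    """
    return output_tokens / tps


# Hypothetical reply lengths; 4,096 is the max output listed for this model.
for label, tokens in [("short reply", 150), ("long answer", 800), ("max output", 4096)]:
    print(f"{label}: {generation_seconds(tokens):.1f} s")
```

Under these assumptions a full 4,096-token response takes well over two minutes to finish, which is why the speed matters far more in chat than in batch jobs.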

Output Quality 6 · Speed 4 · Everyday Usefulness 6
Pros
  • competent core quality
  • strong on classic tasks
Cons
  • stale knowledge
  • slow
  • conservative refusals
Right for: backstage workloads
Avoid if: user-facing or latency-sensitive
Skeptic · 6/10
Genuinely historic and honestly benchmarked — but in 2026 it's a slow, stale, expensive way to do what a 70B does better and cheaper.

Adversarially, 405B is the rare model whose problem is not deceptive marketing — its release claims (MMLU 88.6, frontier-adjacent in 2024) were accurate and it set the open-weights bar. The issue is pure obsolescence: it is beaten by its own 3.3 70B sibling on IFEval and MATH at one-sixth the size and one-tenth the cost, beaten by Maverick everywhere, runs at a slow ~29 tps, and carries a 30-month-old cutoff. There is no benchmark gaming to call out — just a model that the market and Meta itself have moved past. The honest verdict: a museum piece with one living use (the base checkpoint).

Claim Accuracy 9 · Weakness Severity 4 · Hype vs Reality 8
Pros
  • honest claims
  • historic
Cons
  • obsolete on cost/speed/freshness
Right for: skeptics who value it only as a base checkpoint
Avoid if: you need a current, economical model

Strengths

  • Base pretrained checkpoint available — the strongest open foundation for from-scratch fine-tuning over 100B params.
  • Strong classic-benchmark reasoning, math, and code for its 2024 generation.
  • Exceptionally well-documented (eval_details repo, detailed model card).
  • Wide provider availability and 18+ months of tooling maturity.
  • Permissive commercial license; 128K context, 8 languages.

Limitations

  • Surpassed by 3.3 70B on instruction-following (IFEval) and MATH at one-sixth the size.
  • Surpassed by Maverick on every modern benchmark while costing 10–20x more to serve.
  • Slowest model in the lineup (~29 tps); impractical to self-host below an 8xH100 box.
  • December 2023 cutoff is ~30 months stale; no vision, no reasoning mode.
  • Worst TCO in the Meta lineup — the managed value case has effectively disappeared.
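The 8xH100 floor follows from simple weight-memory arithmetic, sketched below. The 80 GB per-GPU figure and the bytes-per-parameter values are assumptions for illustration, and the calculation counts weights only; KV cache and activations need additional headroom on top.

```python
PARAMS = 405e9          # parameter count, from the model name
H100_MEMORY_GB = 80     # assumption: 80 GB H100 variant
GPUS_PER_NODE = 8

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(precision: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9


def fits_on_one_node(precision: str) -> bool:
    """True if the weights alone fit in the node's combined GPU memory."""
    return weight_memory_gb(precision) < H100_MEMORY_GB * GPUS_PER_NODE


for precision in BYTES_PER_PARAM:
    print(precision, f"{weight_memory_gb(precision):.0f} GB", fits_on_one_node(precision))
```

With these numbers, 16-bit weights exceed the 640 GB of a single 8-GPU node, while 8-bit weights fit with room left for the cache, so a single-node deployment implies quantized weights.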

Best use cases

  • From-scratch fine-tuning and continued pre-training where you need the largest available open base model over 100B params.
  • Research workloads requiring a frontier-adjacent open baseline with full documentation.
  • Distillation source for smaller production models (the same role it plays inside Meta).
  • Backwards-compatibility for existing pipelines that have not migrated to 3.3 or Llama 4.

Buyer questions

What does it cost?

No single Meta price; representative inference is ~$2.40–$8.00 input and ~$3.00–$16.00 output per 1M tokens — roughly 10–20x Maverick. Self-host needs 8xH100 minimum.
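As a rough sketch of what that price band means for a monthly bill, the snippet below multiplies the quoted low and high rates by a token volume. The 1B-token workload and its 80/20 input/output split are hypothetical; substitute your own traffic mix.

```python
# Price bands quoted above, in USD per 1M tokens: (low, high).
INPUT_PRICE_PER_MTOK = (2.40, 8.00)
OUTPUT_PRICE_PER_MTOK = (3.00, 16.00)


def monthly_cost_range(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (low, high) monthly spend in USD for the given token volumes."""
    in_mtok = input_tokens / 1_000_000
    out_mtok = output_tokens / 1_000_000
    low = in_mtok * INPUT_PRICE_PER_MTOK[0] + out_mtok * OUTPUT_PRICE_PER_MTOK[0]
    high = in_mtok * INPUT_PRICE_PER_MTOK[1] + out_mtok * OUTPUT_PRICE_PER_MTOK[1]
    return round(low, 2), round(high, 2)


# Hypothetical workload: 1B tokens/month, split 80% input / 20% output.
low, high = monthly_cost_range(800_000_000, 200_000_000)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

For that assumed workload the band works out to roughly $2,520 at the cheapest quoted rates and $9,600 at the most expensive.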

Why would I still use it?

Almost solely as the largest open base checkpoint for from-scratch fine-tuning or continued pre-training at 100B+ scale.

Does it do vision?

No — text only.

How does it compare to 3.3 70B?

3.3 70B matches or beats it on most benchmarks at one-sixth the size and cost; pick 405B only if you specifically need the base model.

How current is its knowledge?

December 2023 cutoff — pair with retrieval for anything recent.
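One way to picture "pair with retrieval" is the minimal sketch below, which picks the most relevant snippets and places them ahead of the question. The function names, the word-overlap scorer, and the sample snippets are all hypothetical placeholders; a real system would typically use a search index or embeddings instead.

```python
def rank_snippets(question: str, snippets: list[str], top_k: int = 2) -> list[str]:
    """Rank snippets by how many words they share with the question (toy scorer)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        snippets,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    """Prepend the best-matching snippets so the model answers from fresh context."""
    context = "\n".join(f"- {s}" for s in rank_snippets(question, snippets))
    return (
        "Answer using only the context below. "
        "If the context does not cover the question, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Placeholder documents standing in for post-cutoff material.
docs = [
    "examplelib released version 3 in 2025",
    "cats sleep a lot",
    "version 3 of examplelib drops the legacy client",
]
print(build_grounded_prompt("what changed in examplelib version 3", docs))
```

The resulting string is what you would send as the user message, so the model reads current facts from the prompt rather than relying on its December 2023 training data.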

What about safety/compliance?

No built-in moderation; add Llama Guard 3. Certifications come from your host/infra.

Any license limits?

Commercial use allowed; separate Meta license required above 700M MAU.

Comparable models

Llama 3.3 70B — close on quality, one-sixth the cost, beats it on IFEval and MATH; almost always the better pick today unless you need the base checkpoint.
Llama 4 Maverick — newer, MoE-efficient, better on every modern benchmark, ~10x cheaper to serve, plus vision.
DeepSeek V3.x base — open MoE alternative at comparable or better quality and far better economics.

Model specs

Input price
$3 / Mtok
Output price
$3 / Mtok
Cached input
Batch (in/out)
Context window
128K tokens
Max output
4K tokens
Knowledge cutoff
2023-12
Released
2024-07-23
Modalities
text → text
Output speed
~29 tok/s
License
Open weights (Llama 3.1 Community License)
Clouds
Bedrock, Vertex AI, Azure AI Foundry, GCP, OCI, IBM watsonx

Does not train on API inputs by default

Last verified 2026-05-27