Qwen3-235B-A22B

GALatest Large

by Alibaba Cloud · Qwen3 family · best for frontier-adjacent reasoning at open-weight prices

FrontierReasoningCodingOpen-Weights
8.5
AI Panel Score
Value 9.5/10

Qwen3-235B-A22B is Alibaba's open-weight flagship, shipped 2025-04-29 under Apache 2.0. It is a Mixture-of-Experts model — 235B total parameters but only 22B activated per token (the "A22B" suffix) — that brings frontier-adjacent math, code, and reasoning into the open-weight tier with a hybrid "thinking / non-thinking" mode toggle in a single set of weights. For a buyer, the one-sentence pitch is: DeepSeek-R1-class reasoning and broad multilingual quality, redistributable under Apache 2.0, at roughly 1/30th the per-token cost of a Western frontier model. - Provider: Alibaba Cloud (Qwen Team) - Released: 2025-04-29 (GA) - Tier: Large open-weight MoE flagship - Context: 131,072 tokens - Max output: 32,768 tokens - Modalities: text in, text out - Knowledge cutoff: approx. 2024-10 - Headline price: $0.20 in / $0.60 out per 1M tokens (Together AI)

What's new

  • First Qwen flagship built as a true large-scale Mixture-of-Experts: 235B total / 22B active, 128 experts with 8 routed per token, 94 layers.
  • Hybrid thinking toggle in one model — the same weights serve fast chat (thinking off) and long chain-of-thought (thinking on) via a chat-template flag, so a single deployment covers both workloads.
  • Context window jumped to 131K across the Qwen3 line (most Qwen2.5 dense models were 32K native).
  • Pre-training scaled to approx. 36 trillion tokens across 119 languages — roughly double Qwen2.5's 18T — lifting non-English math and code reasoning materially.
  • Apache 2.0 across the whole Qwen3 open release, including this 235B MoE — a cleaner license than the Qwen License that governed the Qwen2.5-72B flagship.

Benchmarks

BenchmarkScoreSource
MMLU-Pro82.8%Qwen3 Technical Report (arXiv 2505.09388)2025-05-14T00:00:00.000Z
AIME 202581.5%Qwen3 Technical Report (arXiv 2505.09388)2025-05-14T00:00:00.000Z
LMArena Elo1431LMArena Text leaderboard (qwen3-235b-a22b-instruct-2507 checkpoint)2025-08-01T00:00:00.000Z
GPQA Diamond70%Qwen3 Technical Report (arXiv 2505.09388)2025-05-14T00:00:00.000Z
LiveCodeBench70.7%Qwen3 Technical Report (arXiv 2505.09388), LiveCodeBench v52025-05-14T00:00:00.000Z

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker8/10
Apache-2.0 frontier-adjacent reasoning is the leverage point — it ends single-vendor dependence for our reasoning tier.

Open weights under Apache 2.0 with MMLU-Pro 82.8 and AIME'25 81.5 is exactly the strategic wedge buyers wanted: it removes lock-in to Anthropic/OpenAI for reasoning and lands within striking distance on capability. The MoE design keeps unit economics sane. The genuine risks are governance, not capability — the DashScope mainland endpoint routes through China (disqualifying for some public-sector/regulated EU work), and PRC content alignment is baked in. Self-host on Together, Fireworks, DeepInfra, or your own cluster and the concern collapses to "weights originated in China," increasingly normalized in 2026.

Strategic Fit 9Vendor Risk 6Roadmap Confidence 8
Pros
  • License clarity
  • capability
  • serving economics
Cons
  • China data-residency optics
  • content alignment
Right for: teams building a sovereign or multi-vendor reasoning tier
Avoid if: you need US-jurisdiction compliance guarantees from the model vendor itself
Domain Strategist8.5/10
Qwen owns the open-weight multilingual frontier — nobody else pairs this reasoning quality with this Asian-language depth.

In market terms, Qwen3-235B's moat is the intersection of frontier-adjacent reasoning and best-in-class Asian-language coverage, under a permissive license. DeepSeek competes on English reasoning and price; Llama competes on US-aligned content and ecosystem; neither matches Qwen on Chinese/Japanese/Korean/Arabic breadth at this capability tier. That positioning is durable for any product targeting Asian or genuinely global audiences. The competitive risk is internal — Alibaba's own newer Qwen3.6/3.7 line and the 2507 refresh keep cannibalizing the original 235B, so timing favors treating it as a stable, well-understood base rather than the bleeding edge.

Competitive Positioning 9Differentiation 9Market Timing 7
Pros
  • Unique multilingual+reasoning combo
  • permissive license
Cons
  • Fast-moving internal successors
Right for: global/Asian-market builders
Avoid if: you only serve English and want the single best English model
Finance Lead9.5/10
At $0.20/$0.60 it's roughly 30x cheaper than a Western flagship — and self-host has zero per-token fee.

This is where Qwen3-235B is hard to argue with. Together's $0.20 in / $0.60 out is roughly 25-40x cheaper than GPT-4o-class and far below Claude Opus on equivalent reasoning. Self-hosting on 8x H100 (~$15-30/hr) breaks even against API at roughly 2-3M tokens/hr of sustained throughput — easy for any production agent. No license fees, no MAU thresholds, no per-seat overhead. Pricing is flat and predictable across the major providers, so bill modeling is clean. The only nuance is thinking-mode output-token inflation — gate it to control spend.

Cost Efficiency 10Pricing Transparency 9Value per Dollar 10
Pros
  • Order-of-magnitude cheaper than frontier
  • no license fee
  • predictable
Cons
  • Thinking mode inflates output tokens
  • 8x H100 capex for self-host
Right for: cost-modeled GenAI at scale
Avoid if: volume is too low to amortize self-host and you need the absolute cheapest tiny model
Domain Practitioner9/10
Day-one HF availability, clean thinking-mode template, fine-tunes converge fast — this is the builder's open weight.

Hugging Face availability is best-in-class — Instruct, Base, FP8, AWQ, GPTQ, and GGUF shipped at launch. Fine-tuning works cleanly with Transformers, vLLM, SGLang, Axolotl, and LLaMA-Factory; LoRA/QLoRA recipes are well documented. The hybrid thinking mode is a single chat-template flag, a clean abstraction. Tool-use and structured JSON are first-class, so agents don't need brittle scaffolding. Multilingual SFT (Chinese/Japanese/Arabic) converges faster on Qwen3 base than on Llama. The 235B MoE is impractical on a single dev box, so most practitioners iterate on Qwen3-32B and deploy the 235B — a friction point, but a well-trodden path.

API Ergonomics 8Tool/Agent Support 9Reliability 9
Pros
  • Comprehensive HF artifacts
  • clean tooling
  • fine-tune friendly
Cons
  • Too large for single-GPU iteration
Right for: teams fine-tuning open weights for verticals
Avoid if: you want a turnkey hosted API and never touch weights
Power User7.5/10
On math and code it competes with paid frontier chat; on creative writing and US idiom it still trails Claude.

Side-by-side against free-tier ChatGPT, Claude, or Gemini, Qwen3-235B on chat.qwen.ai or self-hosted is genuinely competitive — often better on math, sometimes weaker on creative writing. Response quality is high; refusal rate is Western-comparable except on PRC-political topics where it deflects. Latency is acceptable in non-thinking mode (sub-2s first token typical) and slow in thinking mode. For everyday global/Asian use the quality-per-dollar is unmatched; for US consumer apps where political-topic handling and Western idiom matter, the gaps are real.

Output Quality 8Speed 7Everyday Usefulness 8
Pros
  • Strong math/code
  • excellent multilingual
  • free web UI
Cons
  • Creative/US-idiom gap
  • thinking-mode latency
  • political refusals
Right for: power users in technical or multilingual workflows
Avoid if: you want the best creative-writing daily driver
Skeptic7/10
Benchmark-strong, but the headline numbers are thinking-mode and the 2507 Elo isn't the model you'd actually deploy.

The capability is real, but read the fine print. Published scores (82.8 MMLU-Pro, 81.5 AIME'25, 70.7 LiveCodeBench) are thinking-mode-on — non-thinking quality is materially lower, and most product surfaces can't afford always-on reasoning. The LMArena Elo of 1431 is the refreshed 2507 checkpoint, not the original April weights, so quoting it for "Qwen3-235B" conflates two models. China data residency on the mainland API is a genuine governance exposure, and PRC content alignment is demonstrable, not theoretical. None of this makes it a bad model — it makes the marketing framing optimistic. Self-hosted, with thinking gated and the right checkpoint pinned, it is excellent.

Claim Accuracy 7Weakness Severity 6Hype vs Reality 7
Pros
  • Genuinely strong
  • verifiable architecture
Cons
  • Thinking-mode/checkpoint caveats
  • governance optics
Right for: skeptics who pin checkpoints and self-host
Avoid if: you take headline benchmarks at face value for a latency-bound product

Strengths

  • Best open-weight math/reasoning at release; rivals dedicated reasoning models with thinking mode on.
  • Apache 2.0 — no MAU caps, full commercial and redistribution rights, clean fine-tuning.
  • 22B active params keep serving cost/latency near a 30B dense, not a 235B.
  • Multilingual breadth (esp. Chinese/Japanese/Korean/Arabic/Indic) exceeds DeepSeek and Llama.
  • Native tool-use, parallel function calls, structured JSON out of the box.

Limitations

  • FP16 self-hosting needs ~470GB VRAM (8x H100-class) — out of reach for solo developers.
  • Knowledge cutoff approx. October 2024; weaker on 2025-2026 facts.
  • Thinking-mode latency variance is large; same prompt can take 2s or 30s.
  • Western cultural/brand-voice polish trails Claude and GPT.
  • PRC-aligned content alignment and DashScope-mainland data residency are real considerations for some buyers.

Best use cases

- Multilingual production workloads — any pipeline mixing English with Chinese, Japanese, Korean, Arabic, or Indic languages where DeepSeek and Llama leave quality on the table. - Self-hosted reasoning — DeepSeek-R1-class reasoning under Apache 2.0 with no monthly API fee. - High-volume coding agents — LiveCodeBench plus native tool-use make it credible for autonomous loops at a fraction of frontier pricing. - Asian-market consumer apps — Chinese-first quality where Western brand polish matters less.

Buyer questions

How is it priced?

Open weights, so you pay an inference provider (Together $0.20/$0.60, DeepInfra ~$0.27 blended) or self-host on ~8x H100. No license fee.

Can I use it commercially?

Yes — Apache 2.0, no MAU threshold, full commercial and redistribution rights.

What about China data residency?

The official DashScope mainland endpoint routes through Alibaba Cloud in China; use the international endpoint, a US/EU-hosted provider, or self-host to avoid it.

How hard is setup?

Hosted is a drop-in OpenAI-compatible API. Self-hosting needs an 8x H100-class node and vLLM/SGLang — non-trivial but well-documented.

Does it reason?

Yes, via an optional hybrid thinking mode with full visible CoT; toggle it per request.

How does it handle non-English?

Best-in-class among open weights for Chinese, Japanese, Korean, Arabic, and Indic languages across 119 supported languages.

Should I worry about content alignment?

On PRC-sensitive political topics, expect stricter refusals/deflection than Western models. For most workloads this is irrelevant; for political/news products, test it.

Comparable models

DeepSeek-V3 (671B MoE) — larger, edges Qwen3 on pure English reasoning; Qwen3 wins on multilingual breadth and Apache-2.0 license clarity.
Llama 4 Maverick (MoE) — comparable open-weight flagship; Llama wins on US-aligned content and ecosystem, Qwen3 wins on math/code and Asian languages.
GPT-4o / Claude Opus-class — closed-source; both more polished on Western content, both 25-50x more expensive per token, neither self-hostable.

Model specs

Input price
$0.20 / Mtok
Output price
$0.60 / Mtok
Cached input
Batch (in/out)
Context window
131K tokens
Max output
33K tokens
Knowledge cutoff
2024-10
Released
2025-04-28
Modalities
text → text
Output speed
~68 tok/s
License
Open weights (Apache-2.0)
Clouds
GCP

Does not train on API inputs by default

Other Qwen3 versions

Last verified 2026-05-27