by Alibaba Cloud · Qwen3 family · best for frontier-adjacent reasoning at open-weight prices
Qwen3-235B-A22B is Alibaba's open-weight flagship, shipped 2025-04-29 under Apache 2.0. It is a Mixture-of-Experts model — 235B total parameters but only 22B activated per token (the "A22B" suffix) — that brings frontier-adjacent math, code, and reasoning into the open-weight tier with a hybrid "thinking / non-thinking" mode toggle in a single set of weights. For a buyer, the one-sentence pitch is: DeepSeek-R1-class reasoning and broad multilingual quality, redistributable under Apache 2.0, at roughly 1/30th the per-token cost of a Western frontier model. - Provider: Alibaba Cloud (Qwen Team) - Released: 2025-04-29 (GA) - Tier: Large open-weight MoE flagship - Context: 131,072 tokens - Max output: 32,768 tokens - Modalities: text in, text out - Knowledge cutoff: approx. 2024-10 - Headline price: $0.20 in / $0.60 out per 1M tokens (Together AI)
| Benchmark | Score | Source |
|---|---|---|
| MMLU-Pro | 82.8% | Qwen3 Technical Report (arXiv 2505.09388)2025-05-14T00:00:00.000Z |
| AIME 2025 | 81.5% | Qwen3 Technical Report (arXiv 2505.09388)2025-05-14T00:00:00.000Z |
| LMArena Elo | 1431 | LMArena Text leaderboard (qwen3-235b-a22b-instruct-2507 checkpoint)2025-08-01T00:00:00.000Z |
| GPQA Diamond | 70% | Qwen3 Technical Report (arXiv 2505.09388)2025-05-14T00:00:00.000Z |
| LiveCodeBench | 70.7% | Qwen3 Technical Report (arXiv 2505.09388), LiveCodeBench v52025-05-14T00:00:00.000Z |
Six personas, six verdicts — the same panel that reviews every product on TopReviewed.
“Apache-2.0 frontier-adjacent reasoning is the leverage point — it ends single-vendor dependence for our reasoning tier.”
Open weights under Apache 2.0 with MMLU-Pro 82.8 and AIME'25 81.5 is exactly the strategic wedge buyers wanted: it removes lock-in to Anthropic/OpenAI for reasoning and lands within striking distance on capability. The MoE design keeps unit economics sane. The genuine risks are governance, not capability — the DashScope mainland endpoint routes through China (disqualifying for some public-sector/regulated EU work), and PRC content alignment is baked in. Self-host on Together, Fireworks, DeepInfra, or your own cluster and the concern collapses to "weights originated in China," increasingly normalized in 2026.
“Qwen owns the open-weight multilingual frontier — nobody else pairs this reasoning quality with this Asian-language depth.”
In market terms, Qwen3-235B's moat is the intersection of frontier-adjacent reasoning and best-in-class Asian-language coverage, under a permissive license. DeepSeek competes on English reasoning and price; Llama competes on US-aligned content and ecosystem; neither matches Qwen on Chinese/Japanese/Korean/Arabic breadth at this capability tier. That positioning is durable for any product targeting Asian or genuinely global audiences. The competitive risk is internal — Alibaba's own newer Qwen3.6/3.7 line and the 2507 refresh keep cannibalizing the original 235B, so timing favors treating it as a stable, well-understood base rather than the bleeding edge.
“At $0.20/$0.60 it's roughly 30x cheaper than a Western flagship — and self-host has zero per-token fee.”
This is where Qwen3-235B is hard to argue with. Together's $0.20 in / $0.60 out is roughly 25-40x cheaper than GPT-4o-class and far below Claude Opus on equivalent reasoning. Self-hosting on 8x H100 (~$15-30/hr) breaks even against API at roughly 2-3M tokens/hr of sustained throughput — easy for any production agent. No license fees, no MAU thresholds, no per-seat overhead. Pricing is flat and predictable across the major providers, so bill modeling is clean. The only nuance is thinking-mode output-token inflation — gate it to control spend.
“Day-one HF availability, clean thinking-mode template, fine-tunes converge fast — this is the builder's open weight.”
Hugging Face availability is best-in-class — Instruct, Base, FP8, AWQ, GPTQ, and GGUF shipped at launch. Fine-tuning works cleanly with Transformers, vLLM, SGLang, Axolotl, and LLaMA-Factory; LoRA/QLoRA recipes are well documented. The hybrid thinking mode is a single chat-template flag, a clean abstraction. Tool-use and structured JSON are first-class, so agents don't need brittle scaffolding. Multilingual SFT (Chinese/Japanese/Arabic) converges faster on Qwen3 base than on Llama. The 235B MoE is impractical on a single dev box, so most practitioners iterate on Qwen3-32B and deploy the 235B — a friction point, but a well-trodden path.
“On math and code it competes with paid frontier chat; on creative writing and US idiom it still trails Claude.”
Side-by-side against free-tier ChatGPT, Claude, or Gemini, Qwen3-235B on chat.qwen.ai or self-hosted is genuinely competitive — often better on math, sometimes weaker on creative writing. Response quality is high; refusal rate is Western-comparable except on PRC-political topics where it deflects. Latency is acceptable in non-thinking mode (sub-2s first token typical) and slow in thinking mode. For everyday global/Asian use the quality-per-dollar is unmatched; for US consumer apps where political-topic handling and Western idiom matter, the gaps are real.
“Benchmark-strong, but the headline numbers are thinking-mode and the 2507 Elo isn't the model you'd actually deploy.”
The capability is real, but read the fine print. Published scores (82.8 MMLU-Pro, 81.5 AIME'25, 70.7 LiveCodeBench) are thinking-mode-on — non-thinking quality is materially lower, and most product surfaces can't afford always-on reasoning. The LMArena Elo of 1431 is the refreshed 2507 checkpoint, not the original April weights, so quoting it for "Qwen3-235B" conflates two models. China data residency on the mainland API is a genuine governance exposure, and PRC content alignment is demonstrable, not theoretical. None of this makes it a bad model — it makes the marketing framing optimistic. Self-hosted, with thinking gated and the right checkpoint pinned, it is excellent.
- Multilingual production workloads — any pipeline mixing English with Chinese, Japanese, Korean, Arabic, or Indic languages where DeepSeek and Llama leave quality on the table. - Self-hosted reasoning — DeepSeek-R1-class reasoning under Apache 2.0 with no monthly API fee. - High-volume coding agents — LiveCodeBench plus native tool-use make it credible for autonomous loops at a fraction of frontier pricing. - Asian-market consumer apps — Chinese-first quality where Western brand polish matters less.
Open weights, so you pay an inference provider (Together $0.20/$0.60, DeepInfra ~$0.27 blended) or self-host on ~8x H100. No license fee.
Yes — Apache 2.0, no MAU threshold, full commercial and redistribution rights.
The official DashScope mainland endpoint routes through Alibaba Cloud in China; use the international endpoint, a US/EU-hosted provider, or self-host to avoid it.
Hosted is a drop-in OpenAI-compatible API. Self-hosting needs an 8x H100-class node and vLLM/SGLang — non-trivial but well-documented.
Yes, via an optional hybrid thinking mode with full visible CoT; toggle it per request.
Best-in-class among open weights for Chinese, Japanese, Korean, Arabic, and Indic languages across 119 supported languages.
On PRC-sensitive political topics, expect stricter refusals/deflection than Western models. For most workloads this is irrelevant; for political/news products, test it.
Does not train on API inputs by default
Last verified 2026-05-27