Is 3.5 Flash actually better than 3.1 Pro?

On agentic and coding/multimodal benchmarks Google publishes (MCP Atlas, Terminal-Bench, CharXiv, MMMU-Pro), yes. On pure reasoning (HLE, ARC-AGI-2) and long-context recall, no — 3.1 Pro wins. Pick by workload.

What does it cost vs prior Flash?

$1.50/$9 flat — about 3x Gemini 3 Flash Preview and 6x Gemini 3.1 Flash-Lite. The lift buys frontier-adjacent quality and 4x speed; re-model unit economics if migrating up from cheap Flash.

Is the 1M context reliable?

The window is 1M, but recall drops to 26.6% at 1M (77.3% at 128K). Treat ~128K-256K as the reliable working range and chunk beyond that.

Artificial Analysis (AA) measures a ~61% hallucination rate on the Omniscience knowledge eval. Use Google Search grounding or retrieval for factual workloads.

How fast is it really?

~203 tok/s sustained (4x frontier peers). The high AA TTFT reflects thinking-on; lower the thinking budget for snappy first tokens.

Open weights and self-hosting are not available: Gemini is closed-weights and accessible only through the Gemini API or Vertex AI.

Gemini 3.5 Flash Review — Benchmarks, Pricing & AI Panel Verdict

Is migration from 2.x Flash hard?

No — it's a model-name swap; the SDK surface is identical across the Gemini line.

Benchmark	Score	Source	Date
Humanity's Last Exam	40.2%	deepmind.google	2026-05-19
MMMU-Pro	83.6%	deepmind.google	2026-05-19
GPQA Diamond	92.2%	buildfastwithai.com	2026-05-20
Terminal-Bench 2.1	76.2%	deepmind.google	2026-05-19
MRCR v2 Long Context (128K)	77.3%	deepmind.google	2026-05-19
SWE-bench Verified	80.8%	benchlm.ai	2026-05-24
Artificial Analysis Index	55	artificialanalysis.ai	2026-05-28

Architecture

Sparse mixture-of-experts (Gemini family). Parameter count, expert count, layer count, and attention design are all undisclosed, so they are left unreported here. Disclosed and verifiable: a 1M-token context window, 65,536 max output tokens, a January 2025 knowledge cutoff, and native multimodal input (text, image, audio, video). It exposes a configurable thinking budget; the high AA TTFT (~18.9s) reflects thinking being on by default in that eval. Long-context behavior is the architecture's clearest limitation: MRCR v2 is 77.3% at 128K but collapses to 26.6% at the full 1M, so the window is larger than its reliable working memory.
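One way to act on the 128K recall figure is to split oversized inputs into overlapping chunks before sending them. The sketch below is illustrative only: it assumes the input is already tokenized, and both the `chunk_tokens` helper and the 2,000-token overlap are hypothetical choices, not values from Google's documentation.

```python
from typing import Iterator, List, Sequence


def chunk_tokens(
    tokens: Sequence[int],
    max_tokens: int = 128_000,  # context size where MRCR v2 recall is still 77.3%
    overlap: int = 2_000,       # assumed overlap so facts at a boundary appear twice
) -> Iterator[Sequence[int]]:
    """Yield consecutive windows of at most max_tokens, each sharing
    `overlap` tokens with the previous window."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    step = max_tokens - overlap
    start = 0
    while start < len(tokens):
        yield tokens[start:start + max_tokens]
        # Stop once the current window already reaches the end of the input.
        if start + max_tokens >= len(tokens):
            break
        start += step


def chunk_lengths(total_tokens: int) -> List[int]:
    """Convenience helper: chunk sizes for an input of total_tokens tokens."""
    return [len(c) for c in chunk_tokens(range(total_tokens))]
```

With these defaults, a 300,000-token input becomes three chunks of 128,000, 128,000, and 48,000 tokens. Whether per-chunk answers are then merged or re-ranked is a separate design decision that this sketch leaves open.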

Capabilities

Gemini 3.5 Flash is built for agents (capability scores: agentic 9.5, function calling 9.5). It leads the field on MCP Atlas (83.6%) and posts 76.2% on Terminal-Bench 2.1, 78.4% on OSWorld-Verified, and 56.5% on Toolathlon, holding tool state and routing across long loops better than any other Flash-tier model. Coding is strong (8.5): 55.1% on SWE-Bench Pro per the official card and ~80.8% on SWE-bench Verified per aggregators. Reasoning (8.5) and math (8.0) are good but trail 3.1 Pro (HLE 40.2% vs 44.4%, ARC-AGI-2 72.1% vs 77.1%).

Vision and document/OCR lead (vision 9.5, document/OCR 9.0), with MMMU-Pro at 83.6% (#1) and CharXiv at 84.2%. Long context (8.5) offers the 1M window but weak recall at 1M (MRCR 26.6%). Real-time data (9.0) comes via Google Search grounding.

Creative writing (7.5) is workmanlike; instruction following (8.5) and multilingual (8.5) are solid. Safety calibration (8.0) mirrors 3.1 Pro, but the AA-measured ~61% hallucination rate makes grounding important for factual tasks.

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
MCP Atlas	83.6%	New high for Flash	#1 field-wide, beats 3.1 Pro (69.2%)	card
Terminal-Bench 2.1	76.2%	New high for Flash	Beats 3.1 Pro; 2nd behind GPT-5.5	card
CharXiv	84.2%	Significant gain	Field leader on chart reasoning	card
MMMU-Pro	83.6%	Up from 2.5 Flash	#1 on llm-stats MMMU-Pro board	card
SWE-Bench Pro (Public)	55.1%	New	Trails Opus 4.7 by ~9 pts	card
SWE-bench Verified	~80.8%	Up ~10-15 pts vs 3 Flash	Ahead of Opus 4.6 per aggregator	BenchLM
OSWorld-Verified	78.4%	New	Strong computer use	card
GDPval-AA	1656 Elo	New	Beats 3.1 Pro	card
Humanity's Last Exam	40.2%	New for Flash	Behind 3.1 Pro (44.4%), Opus 4.7	card
ARC-AGI-2	72.1%	Strong	Behind 3.1 Pro (77.1%), GPT-5.5 (84.6%)	card
GPQA Diamond	~92.2%	Up from 2.5 Flash	Trails 3.1 Pro (94.3%)	buildfastwithai.com
MRCR v2 (128K / 1M)	77.3% / 26.6%	Trade-off vs Pro	3.1 Pro wins long-context recall	card
Artificial Analysis Index	55	New	#8 of 148	AA

(MMLU-Pro, AIME 2025, LiveCodeBench, IFEval, SimpleQA, and LMArena Elo are either absent from Google's model card or not yet stable, so they are left unreported.)

Speed & latency

Sustained ~203.5 output tokens/sec (peaks above 280) makes 3.5 Flash roughly 4x faster than other frontier-class models — its core selling point for agent fleets and interactive UX. The ~18.9s AA TTFT looks high but reflects thinking enabled by default in that benchmark; with a low or zero thinking budget, first-token latency drops substantially and the model feels genuinely snappy. For high-throughput agent loops, browser automation, and streaming applications, the speed/quality ratio is the best in Google's lineup.
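To see how first-token latency and throughput combine, a back-of-the-envelope estimate helps. The sketch below assumes end-to-end time is simply TTFT plus streaming time at the sustained rate; the `response_time_s` helper is hypothetical and the model ignores network overhead and rate variation.

```python
def response_time_s(
    output_tokens: int,
    ttft_s: float,
    tokens_per_s: float = 203.5,  # sustained output rate quoted above
) -> float:
    """Rough end-to-end latency: time to first token plus streaming time."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    if output_tokens < 0 or ttft_s < 0:
        raise ValueError("output_tokens and ttft_s must be non-negative")
    return ttft_s + output_tokens / tokens_per_s


# A 1,000-token answer at the AA-measured 18.9 s TTFT (thinking on).
with_thinking = response_time_s(1_000, ttft_s=18.9)

# Share of total time spent waiting for the first token.
ttft_share = 18.9 / with_thinking
```

Under these assumptions the 1,000-token answer takes about 23.8 s, of which roughly four fifths is the wait for the first token. That is why the thinking budget, not raw throughput, dominates perceived latency for short responses.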

Pricing analysis

Surface	Cost	Notes
API input	$1.50 / 1M tok	Flat, no 200K tier
API output	$9.00 / 1M tok	Thinking tokens billed as output
Cached input	$0.15 / 1M tok	90% discount; + $1.00/1M tok/hour storage
Batch (in/out)	$0.75 / $4.50 per 1M tok	~50% off; async
Search grounding	5,000 prompts/mo free (shared across Gemini 3), then $14 / 1,000 queries	Live Google Search
Direct UI	$19.99/mo (AI Pro); $100 & $200/mo (Ultra)
Free tier	"Free of charge" on AI Studio	RPD/RPM caps
Rate limits	Tiered	Limits rise as spend thresholds are met
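The table translates into per-request costs with simple arithmetic. The sketch below is a rough estimator, not an official calculator: the `Usage`, `request_cost`, and `grounding_cost` names are hypothetical, the batch discount is treated as exactly 50% and applied to the whole request, cache storage fees are ignored, and one grounded prompt is assumed to count as one billable query.

```python
from dataclasses import dataclass

# USD per 1M tokens, copied from the pricing table.
INPUT_PER_M = 1.50
OUTPUT_PER_M = 9.00
CACHED_INPUT_PER_M = 0.15
BATCH_DISCOUNT = 0.5  # table says "~50% off"; assumed exact here


@dataclass
class Usage:
    input_tokens: int         # uncached prompt tokens
    output_tokens: int        # visible answer tokens
    thinking_tokens: int = 0  # billed at the output rate per the table
    cached_tokens: int = 0    # prompt tokens served from cache


def request_cost(usage: Usage, batch: bool = False) -> float:
    """Estimate the USD cost of one request (cache storage excluded)."""
    billed_output = usage.output_tokens + usage.thinking_tokens
    cost = (
        usage.input_tokens * INPUT_PER_M
        + usage.cached_tokens * CACHED_INPUT_PER_M
        + billed_output * OUTPUT_PER_M
    ) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost


def grounding_cost(prompts_per_month: int, free_prompts: int = 5_000,
                   usd_per_1k: float = 14.0) -> float:
    """Estimate monthly Search grounding spend after the free allowance."""
    billable = max(0, prompts_per_month - free_prompts)
    return billable * usd_per_1k / 1_000
```

For example, `request_cost(Usage(input_tokens=20_000, output_tokens=1_000, thinking_tokens=2_000))` comes to about $0.057, of which $0.018 is thinking tokens. Because thinking is billed as output, the thinking budget affects cost as well as latency.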

Deployment & access

Proprietary, closed-weights. Available via the Gemini API (Google AI Studio), Vertex AI, Gemini CLI, Google Antigravity, and Vertex AI Agent Builder; resold through OpenRouter. Vertex AI provides VPC-SC, CMEK, audit logging, regional pinning (US, EU, Asia), and data residency. No open weights or self-hosting. Migration from 3.1 Pro or 2.x Flash is a model-name swap — the SDK surface is identical across the Gemini line. The pricing advantage (flat $1.50/$9, no 200K cliff) plus 4x speed makes it the natural default for production agents that don't need 3.1 Pro's reasoning ceiling.

Safety & privacy

Google Frontier Safety Framework with configurable filters. Paid API and Vertex inputs are not used to train models; inputs on the free AI Studio tier may be. Opt-out available. Compliance: SOC 2, HIPAA, GDPR, ISO 27001, FedRAMP, CCPA. Built-in content moderation. Refusal behavior mirrors 3.1 Pro and is stricter than OpenAI/Anthropic on some categories. The AA-measured ~61% hallucination rate on the Omniscience eval is a real caution: pair factual workloads with Search grounding or retrieval.

Ecosystem & tooling

SDKs in Python, TypeScript, Go, Java, Dart; integrations with LangChain, LlamaIndex, Vercel AI SDK, Genkit, Google ADK, and native Model Context Protocol. Powers the Gemini app default mode, Google AI Pro, Gemini CLI, Google Antigravity, and Vertex AI Agent Builder. Distribution across Google's surfaces plus its agent positioning make it mainstream and rising fast.
