Gemini 3.1 Pro

Preview · Latest Pro

by Google · Gemini 3 family · best for frontier reasoning + long-context on Google Cloud

Frontier · Reasoning · Long-Context · Multimodal
AI Panel Score: 8.7 · Value: 8.5/10

Gemini 3.1 Pro is Google DeepMind's flagship reasoning model, launched 2026-02-19 in preview to validate the release before general availability. As of 2026-05-28 it remains the headline model in the Gemini app (Google AI Pro and Ultra) and the top reasoning option on the Gemini API and Vertex AI, even though its API/Vertex surface is still governed by Pre-GA Offerings Terms. It posts the highest public GPQA Diamond score of any proprietary model (94.3%, no tools), pairs that with a real 1M-token context (2M on Vertex for enterprise), and grounds answers in live Google Search. For a buyer: if you want frontier reasoning plus the deepest enterprise-cloud and live-data integration, this is Google's answer, provided you accept that the API is technically still pre-GA.

  • Provider: Google (DeepMind)
  • Released: 2026-02-19 (preview; no GA date announced)
  • Status: preview (Pre-GA terms on API/Vertex; production-default in the consumer Gemini app)
  • Context window: 1,048,576 tokens (2,097,152 / 2M on Vertex AI)
  • Max output: 65,536 tokens
  • Modalities: text, image, audio, video in; text out
  • Knowledge cutoff: January 2025
  • Headline price: $2.00 in / $12.00 out per 1M tokens (<=200K prompt)

What's new

  • Largest reasoning leap in Gemini history: GPQA Diamond 94.3% (no tools), Humanity's Last Exam 44.4% (51.4% with search+code), ARC-AGI-2 77.1% (verified) — more than double Gemini 3 Pro's ARC-AGI-2.
  • SWE-bench Verified jumps to 80.6% (single attempt), making this Google's first genuinely frontier coding agent.
  • LiveCodeBench Pro Elo 2887, top of the public competitive-coding board.
  • "Deep Think" extended-reasoning mode for hard math, science, and long-horizon agent tasks.
  • 2M-token context mode rolling out on Vertex AI for enterprise; tighter Workspace and Google Cloud grounding (Docs, Sheets, Gmail, BigQuery, AlloyDB).

Benchmarks

| Benchmark | Score | Source |
| --- | --- | --- |
| Humanity's Last Exam | 44.4% | deepmind.google, 2026-02-19 |
| MMLU | 92.6% | deepmind.google, 2026-02-19 |
| MMMU | 80.5% | deepmind.google, 2026-02-19 |
| TAU-bench | 90.8% | deepmind.google, 2026-02-19 |
| LMArena Elo | 1501 | facebook.com, 2026 |
| GPQA Diamond | 94.3% | deepmind.google, 2026-02-19 |
| LiveCodeBench Pro (Elo) | 2887 | deepmind.google, 2026-02-19 |
| Terminal-Bench | 68.5% | deepmind.google, 2026-02-19 |
| MRCR Long Context (128K) | 84.9% | deepmind.google, 2026-02-19 |
| SWE-bench Verified | 80.6% | deepmind.google, 2026-02-19 |
| Artificial Analysis Index | 57 | artificialanalysis.ai, 2026-05-28 |

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker · 9/10
The strongest case yet to standardize on Google's AI stack — if we can live with a flagship that's still technically pre-GA.

Gemini 3.1 Pro gives a CTO frontier reasoning plus the deepest enterprise integration on the market: Vertex VPC-SC, CMEK, residency controls, audit logging, and Workspace/BigQuery grounding no rival matches. The 1M-2M context removes RAG complexity for many workloads. Two strategic risks: it remains under Pre-GA terms (support and stability caveats), and adopting Vertex deepens Google Cloud lock-in. At $2/$12 with cache discounts and the 2M context, long-context TCO beats Claude Opus and GPT-5 Pro. Roadmap confidence is high given 3.5 Pro is already incoming.

Strategic Fit 9 · Vendor Risk 7 · Roadmap Confidence 9
Pros
  • Frontier reasoning, deepest enterprise grounding, strong long-context TCO
Cons
  • Pre-GA terms, Google Cloud lock-in
Right for: Google Cloud enterprises wanting frontier capability with governance
Avoid if: You need a hard GA SLA today or are multi-cloud-neutral
Domain Strategist · 9/10
Google's wedge is live data plus long context — 3.1 Pro turns the Search and Workspace moat into a model-level advantage.

In market terms, 3.1 Pro competes at the very top on reasoning (GPQA 94.3%) while owning a differentiation axis rivals can't copy: native Google Search grounding and Workspace/BigQuery integration. That positions it as the default for any org already inside Google's ecosystem and for use cases where freshness and citation matter. Its competitive moat is distribution (the Gemini app, Workspace, Android, Cloud) more than raw benchmark leads, several of which Opus 4.6 and GPT-5.3-Codex contest. Market timing is strong, but 3.5 Pro looming may stall procurement decisions.

Competitive Positioning 9 · Differentiation 9 · Market Timing 8
Pros
  • Unmatched live-data + ecosystem moat, frontier reasoning
Cons
  • Some benchmark leads are contested
  • 3.5 Pro overhang
Right for: Ecosystem-aligned buyers, freshness-sensitive use cases
Avoid if: You want a pure benchmark king regardless of ecosystem
Finance Lead · 8.5/10
Cleanest frontier pricing in the category, until a prompt crosses 200K and the input rate quietly doubles.

At $2/$12 per 1M tokens (input/output), the standard tier is competitive with GPT-5 and ~40% under Claude Opus 4.7 for long-context work. Explicit caching cuts cached input reads to $0.20 per 1M tokens, and batch halves rates. The catch is the two-tier model: once a prompt exceeds 200K tokens, input/output jump to $4/$18, and cached storage runs $4.50 per 1M tokens per hour, which is easy to under-model in RAG and streaming pipelines. Thinking tokens bill as output and can balloon spend on Deep Think. Vertex billing folds into existing Google Cloud invoices, simplifying procurement. Costs are predictable once the tier boundary and thinking-token behavior are understood.

Cost Efficiency 8 · Pricing Transparency 8 · Value per Dollar 8
Pros
  • Cheaper long-context than Opus, clean caching/batch discounts, unified Cloud billing
Cons
  • 200K price cliff, thinking-token output costs
Right for: Long-context workloads on Google Cloud
Avoid if: Budgets can't absorb variable thinking-token output
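
The caching trade-off in this verdict can be put in numbers. The sketch below is a minimal estimate, assuming the under-200K rates quoted above, storage billed pro rata by the hour, and no charge modeled for the initial cache write. The function names are illustrative and not part of any Google SDK.

```python
# Rates quoted in this review for prompts at or under 200K tokens (USD).
INPUT_PER_MTOK = 2.00          # uncached input, per 1M tokens
CACHED_READ_PER_MTOK = 0.20    # cached input read, per 1M tokens
STORAGE_PER_MTOK_HOUR = 4.50   # cache storage, per 1M tokens per hour


def hourly_cost_uncached(context_tokens: int, reads_per_hour: float) -> float:
    """Hourly input cost when the shared context is re-sent on every call."""
    mtok = context_tokens / 1_000_000
    return mtok * reads_per_hour * INPUT_PER_MTOK


def hourly_cost_cached(context_tokens: int, reads_per_hour: float) -> float:
    """Hourly cost of cached reads plus one hour of cache storage.

    Assumption: the one-time cost of writing the cache is ignored.
    """
    mtok = context_tokens / 1_000_000
    return mtok * (reads_per_hour * CACHED_READ_PER_MTOK + STORAGE_PER_MTOK_HOUR)


def break_even_reads_per_hour() -> float:
    """Reads per hour at which caching costs the same as re-sending the context."""
    return STORAGE_PER_MTOK_HOUR / (INPUT_PER_MTOK - CACHED_READ_PER_MTOK)


if __name__ == "__main__":
    print(f"break-even: {break_even_reads_per_hour():.2f} reads/hour")
    for reads in (2, 3):
        print(reads,
              round(hourly_cost_uncached(100_000, reads), 4),
              round(hourly_cost_cached(100_000, reads), 4))
```

Under these assumptions the break-even is independent of context size: caching pays off only when the same context is read more than about 2.5 times per hour, because each cached read saves $1.80 per 1M tokens against $4.50 per 1M tokens of hourly storage.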
Domain Practitioner · 8.5/10
Function calling and structured output just work, and 1M context lets me skip half my RAG plumbing.

For a builder, the Gemini API and Vertex surface is the smoothest Google has shipped — clean SDKs (Python, TS, Go, Java, Dart), reliable function calling, response-schema structured output, built-in code execution, and Search grounding. The 1M-2M context collapses many agent loops into a single call. Genkit and Google ADK ease orchestration. Friction points: preview-tier RPD caps until spend gates clear, and high TTFT makes tight interactive loops painful. AI Studio's prompt history helps debug tool calls. Migration across Gemini tiers is a model-name swap.

API Ergonomics 9 · Tool/Agent Support 9 · Reliability 8
Pros
  • Clean SDKs, reliable structured output, huge context, code execution
Cons
  • Preview rate caps, slow TTFT
Right for: Agent and long-context builders on Google
Avoid if: You need low-latency interactive turns from this tier
Power User · 8.5/10
Brilliant on hard problems and live questions; just don't expect it to answer fast in Deep Think.

In the Gemini app on AI Pro/Ultra, 3.1 Pro is genuinely excellent for research, planning, and analysis, and Search grounding makes it strong on current events where most rivals guess. Conversation quality is high and multimodal understanding (PDFs, images, video) is a daily advantage. Downsides users feel: Deep Think latency (10-30s), stricter refusals on edgy prompts than Claude or ChatGPT, and a tone that reads slightly clinical. The 2026 UX overhaul (native macOS app, cleaner mobile) fixed many prior complaints; Trustpilot remains mixed, mostly about caps and policy.

Output Quality 9 · Speed 7 · Everyday Usefulness 8.5
Pros
  • Top-tier answers, live data, strong multimodal
Cons
  • Slow in Deep Think, stricter refusals, clinical tone
Right for: Researchers and power users wanting depth + freshness
Avoid if: You prize instant responses or maximum creative latitude
Skeptic · 7.5/10
A 'flagship' still under Pre-GA terms, with a 200K price cliff and TTFT north of 30 seconds — read the asterisks.

The GPQA 94.3% headline is real, but several touted leads are contested: Opus 4.6 edges SWE-bench, GPT-5.3-Codex leads specialized coding, and 3.5 Flash beats 3.1 Pro on agentic benchmarks Google itself publishes. "Released" overstates it: this is a preview meant to validate the model before GA, with limited support. Long-context marketing glosses over MRCR recall decay near 1M and the higher >200K pricing (input doubles to $4, output rises 50% to $18). Artificial Analysis (AA) hallucination data on the family is non-trivial, so Search grounding is doing real work to cover knowledge gaps. None of this makes it bad; it makes the "best at everything" framing marketing, not fact.

Claim Accuracy 7 · Weakness Severity 6 · Hype vs Reality 7
Pros
  • Genuinely frontier reasoning, honest about thinking/grounding
Cons
  • Pre-GA, contested leads, price cliff, slow
Right for: Buyers who verify claims against the official card
Avoid if: You take "GA flagship, best at everything" at face value

Strengths

  • Highest public GPQA Diamond (94.3%) of any proprietary model.
  • True 1M-token context (2M on Vertex) with usable long-context retrieval (MRCR 128K 84.9%).
  • Frontier coding (SWE-bench Verified 80.6%, LiveCodeBench Pro 2887).
  • Deepest live-data integration via Google Search grounding.
  • Unmatched Workspace and Google Cloud (BigQuery, AlloyDB) grounding.
  • Enterprise-grade governance and broad compliance on Vertex.

Limitations

  • Still pre-GA on API/Vertex (Pre-GA Offerings Terms) — limited support guarantees.
  • Output is text-only; image/video generation needs Imagen 4 / Veo 3.1 handoff.
  • Long-context pricing jumps above 200K tokens (input doubles to $4, output rises 50% to $18), an easy cost surprise.
  • High TTFT (~30s) and slow output (~143 tok/s) — poor fit for snappy interactive UX.
  • Tight preview rate limits until cumulative-spend gates clear.
  • Stricter refusals than peers on some prompts.

Best use cases

  • Long-document agent workflows: legal review, research synthesis, codebase-wide refactors that exploit 1M-2M context.
  • Frontier coding agents needing both SWE-bench strength and reliable tool/function calling.
  • Scientific and research assistants where GPQA-class reasoning is the deciding factor.
  • Multimodal pipelines fusing PDFs, screenshots, audio, and video in one prompt.
  • Google Cloud enterprises wanting Vertex governance plus Workspace and BigQuery grounding.

Buyer questions

Is Gemini 3.1 Pro generally available?

Not formally. It launched 2026-02-19 in preview to validate before GA, and as of 2026-05-28 the API/Vertex surface is still under Pre-GA Offerings Terms. It is, however, the production-default model in the consumer Gemini app.

What does the long-context tier cost?

Prompts over 200K tokens bill at $4.00 input / $18.00 output per 1M (vs $2/$12 under 200K). Cached reads rise to $0.40 and storage is $4.50/1M tokens/hour.
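
As a rough illustration of that tier boundary, the sketch below estimates the cost of a single request. It assumes the whole request (input and output) bills at the long-context rate once the prompt exceeds 200K tokens, and that thinking tokens bill as output, as noted in the Finance Lead review; caching and batch discounts are left out. The names `Rates`, `request_cost`, and `TIER_BOUNDARY_TOKENS` are illustrative and not part of any Google SDK.

```python
from dataclasses import dataclass

# Prompts larger than this bill at the long-context rate (per this review).
TIER_BOUNDARY_TOKENS = 200_000


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens."""
    input_per_mtok: float
    output_per_mtok: float


STANDARD = Rates(input_per_mtok=2.00, output_per_mtok=12.00)
LONG_CONTEXT = Rates(input_per_mtok=4.00, output_per_mtok=18.00)


def request_cost(prompt_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimate the USD cost of one uncached, non-batch request.

    Assumption: the tier is chosen by prompt size alone and applies to
    both input and output; thinking tokens are billed as output.
    """
    rates = LONG_CONTEXT if prompt_tokens > TIER_BOUNDARY_TOKENS else STANDARD
    billed_output = output_tokens + thinking_tokens
    total = prompt_tokens * rates.input_per_mtok + billed_output * rates.output_per_mtok
    return total / 1_000_000


if __name__ == "__main__":
    # The same request, one prompt token either side of the boundary.
    print(round(request_cost(200_000, 2_000), 4))
    print(round(request_cost(200_001, 2_000), 4))
```

Under these assumptions, a 200,000-token prompt with 2,000 output tokens costs about $0.42, while a single extra prompt token pushes the same request to about $0.84.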

How do I get the 2M-token window?

The extended 2M context is rolling out on Vertex AI for enterprise; the standard Gemini API exposes 1M.

Does Google train on my data?

Not for paid API and Vertex inputs. The free AI Studio tier may use inputs to improve products; opt-out is available.

How does it compare to 3.5 Flash for agents?

3.5 Flash beats 3.1 Pro on Terminal-Bench, MCP Atlas, and CharXiv at lower cost and 1.5x speed. Use Pro when pure reasoning, HLE-class problems, or long-context recall dominate.

What about live/current data?

Google Search grounding gives 5,000 free prompts/month (shared across Gemini 3), then $14 per 1,000 queries — the deepest real-time integration of any frontier model.
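
A back-of-the-envelope estimate of monthly grounding spend follows from those two figures. The sketch assumes each grounded prompt counts as one billable query and that usage past the free allowance is billed pro rata rather than in blocks of 1,000; both are simplifications, since the figures above mix "prompts" and "queries". The function name is illustrative.

```python
FREE_PROMPTS_PER_MONTH = 5_000   # free allowance, shared across Gemini 3 models
PRICE_PER_1000_QUERIES = 14.00   # USD, charged beyond the free allowance


def monthly_grounding_cost(grounded_queries: int) -> float:
    """Estimate monthly Search grounding spend in USD.

    Assumptions: one billable query per grounded prompt, pro-rata billing,
    and no other Gemini 3 workload consuming the shared free allowance.
    """
    billable = max(0, grounded_queries - FREE_PROMPTS_PER_MONTH)
    return billable * PRICE_PER_1000_QUERIES / 1000


if __name__ == "__main__":
    for volume in (4_000, 25_000, 100_000):
        print(volume, monthly_grounding_cost(volume))
```

Because the free allowance is shared across Gemini 3 models, a team running several models against the same project would reach the paid tier sooner than this per-model estimate suggests.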

Can I self-host or get the weights?

No. Gemini is closed-weights, API/Vertex only.

Comparable models

**Claude Opus 4.7** — Stronger creative tone and edges SWE-bench/expert tasks; weaker live-data access and long-context cost. Opus leads LMArena (~1504); 3.1 Pro Preview sits just behind.
**GPT-5.5** — Comparable frontier reasoning and ranks above 3.5 Flash on the AA Index; weaker native video ingestion and no Workspace/Search-grounding moat. The two trade frontier leads benchmark by benchmark.
**Gemini 3.5 Flash** — Google's own sibling beats 3.1 Pro on agentic/coding tasks (Terminal-Bench 76.2%, MCP Atlas 83.6%) at lower price and higher speed, but loses on pure reasoning, HLE, and long-context recall. The honest internal split: Flash for agents, Pro for reasoning.

Model specs

  • Input price: $2 / Mtok
  • Output price: $12 / Mtok
  • Cached input: $0.20 / Mtok
  • Batch (in/out): $1 / $6 per Mtok
  • Context window: 1.0M tokens
  • Max output: 66K tokens
  • Knowledge cutoff: 2025-01
  • Released: 2026-02-19
  • Modalities: text, image, audio, video → text
  • Output speed: ~142.7 tok/s
  • License: Proprietary
  • Clouds: Vertex AI, GCP

Does not train on API inputs by default

Last verified 2026-05-27