Gemini 2.5 Flash-Lite

by Google · Gemini 2.5 family · best for cheapest GA model with 1M context

Cost-Optimized · Edge / On-Device · Long-Context

AI Panel Score: 7.7

Value 9.5/10

Gemini 2.5 Flash-Lite is the lowest-cost GA model in Google's lineup — $0.10 in / $0.40 out per 1M tokens — built for ultra-high-volume, ultra-low-latency workloads while still carrying a full 1M-token context. GA on 2025-07-22, it is Google's officially designated migration target for teams leaving Gemini 2.0 Flash and 2.0 Flash-Lite before the 2026-06-01 shutdown. As of 2026-05-28 it remains the safe, cheap, GA-stable workhorse for classification, extraction, and bulk transformation. For a buyer: when cost-per-call and GA stability dominate and you don't need deep reasoning, this is the pick; if you can tolerate preview, 3.1 Flash-Lite reasons far better.

What's new

Lowest-cost GA model in the Gemini 2.5 family, optimized for ultra-low latency.
1M-token context preserved despite the Flash-Lite tier.
Output token limit raised to ~65K (vs 8K on earlier Flash-Lite previews).
Multimodal input (text, image, audio).
Designated by Google as the official migration target for Gemini 2.0 Flash/Flash-Lite before the 2026-06-01 shutdown.

Benchmarks

Benchmark	Score	Source
MMLU-Pro	72.4%	artificialanalysis.ai 2025
GPQA Diamond	47.4%	artificialanalysis.ai 2025
Artificial Analysis Index	13	artificialanalysis.ai 2026-05-28

AI Panel Review

Six personas, six verdicts — the same panel that reviews every product on TopReviewed.

Decision Maker 8/10

“The safe, cheap, GA-stable workhorse — and the migration target Google itself names for 2.0 Flash.”

For high-volume, low-complexity workloads where GA stability and rock-bottom price are the priority, 2.5 Flash-Lite is the right Google choice today. Vertex governance is identical to higher tiers, and it's the official migration target Google recommends for the 2026-06-01 Gemini 2.0 shutdown. The strategic decision is the trade vs 3.1 Flash-Lite: if you can tolerate preview, that model reasons far better at slightly higher cost; for mission-critical paths where preview isn't acceptable, 2.5 Flash-Lite wins on stability. Lock-in is the usual Google Cloud consideration.

Strategic Fit 8 · Vendor Risk 8 · Roadmap Confidence 8

Pros

Cheapest GA, stable, official 2.0 migration target, broad cloud availability

Cons

Weak reasoning
Preview 3.1 Flash-Lite is tempting

Right for: Mission-critical high-volume simple tasks

Avoid if: You need reasoning and can accept preview status

Domain Strategist 7.5/10

“It owns the floor — cheapest GA model with a 1M context — a defensible niche even as 3.1 Flash-Lite looms.”

Strategically, 2.5 Flash-Lite owns the value floor: the cheapest GA model in the lineup that still carries a 1M context. Its competitive position is stability plus price for bulk infrastructure, not capability. Differentiation is the GA label and multi-cloud reach (including OCI) at a moment when the stronger 3.1 Flash-Lite is still preview. Market timing is favorable thanks to the 2.0 deprecation wave pushing migrations its way. The risk is that once 3.1 Flash-Lite reaches GA, this model's niche narrows to pure cost-minimization.

Competitive Positioning 8 · Differentiation 7 · Market Timing 8

Pros

Owns the GA value floor, multi-cloud, migration-wave tailwind

Cons

Niche narrows once 3.1 Flash-Lite is GA

Right for: Bulk-infrastructure buyers wanting GA + lowest price

Avoid if: You want capability headroom

Finance Lead 9/10

“The cheapest GA model in the lineup — at $0.10/$0.40, high-volume unit economics basically disappear as a concern.”

This is the cost leader among GA models — $0.10 input, $0.40 output. For pipelines processing millions of items per day, the gap vs 3.1 Flash-Lite ($0.25/$1.50) compounds materially, and caching ($0.01) plus batch (~50%) push it lower still. The free AI Studio tier is generous for prototyping. For procurement it's the safest choice: GA, stable pricing, official migration target. The one finance caveat is the capability shortfall vs 3.1 Flash-Lite — if quality forces upgrades, you face a rebuy decision within 6-12 months. Audio at 3x text is a minor watch-item.

Cost Efficiency 10 · Pricing Transparency 9 · Value per Dollar 10

Pros

Cheapest GA, deep cache discount, stable pricing

Cons

Capability shortfall may force a rebuy
Audio premium

Right for: Million-item-per-day pipelines

Avoid if: Quality needs will push you up the stack anyway

Domain Practitioner 8/10

“Same SDK as the whole family, 1M context at the bottom of the price list, and it's blisteringly fast — just respect the ceiling.”

For builders, ergonomics are identical to the rest of Gemini — same SDK, same Vertex surface, same function-calling patterns — so migration off 2.0 Flash is a one-line change. The 1M context at this tier is genuinely unusual and useful. Output speed is excellent for streaming. The hard constraint is the capability ceiling: don't build serious coding or multi-hop agent workflows on Flash-Lite. For high-throughput classification, extraction, and transformation, the speed and price are class-leading and the reliability is GA-grade.

API Ergonomics 8 · Tool/Agent Support 7 · Reliability 9

Pros

Zero-migration SDK, 1M context, very fast, GA-reliable

Cons

Low capability ceiling

Right for: High-throughput simple pipelines

Avoid if: You need coding or complex agents

Power User 7/10

“I almost never see it directly — it's infrastructure — but where it surfaces it's fast and fine for simple questions.”

End users almost never meet 2.5 Flash-Lite directly; it lives inside apps and embedded assistants rather than as a Gemini app default. Where it surfaces, responses are fast and adequate for simple questions, but noticeably thinner than Pro or 3.5 Flash on anything complex. Refusals follow Google's standard policy stack. Its user-facing impact is essentially latency, where it shines, rather than depth.

Output Quality 6 · Speed 9 · Everyday Usefulness 7

Pros

Very fast, fine for simple tasks, current via grounding

Cons

Thin on complex tasks
Not a consumer default

Right for: Embedded fast assistants

Avoid if: You want depth per answer

Skeptic 7/10

“A 47.4% GPQA tells you what this is — a cheap classifier, not a reasoner; the 1M context is its only headline that survives scrutiny.”

No overclaiming here, which is refreshing — Google positions it honestly as cheap and fast. The skeptical notes are about fit, not hype: GPQA 47.4% and AA Index 13 confirm it's a classifier-grade model, so treating it as a general assistant will disappoint. The 1M context is real but, as with the rest of the family, recall figures are unpublished — likely degrading earlier at this tier. The audio premium (3x) and lack of video quietly narrow its multimodal story. It's exactly what it claims to be; the only mistake a buyer can make is asking it to reason.

Claim Accuracy 8 · Weakness Severity 6 · Hype vs Reality 8

Pros

Honestly positioned, genuinely cheap and fast, real 1M context

Cons

Classifier-grade reasoning
Unpublished recall
Audio premium
No video

Right for: Skeptics who scope it to bulk simple tasks

Avoid if: You expect general-assistant quality

Strengths

Lowest GA price in the Gemini lineup ($0.10/$0.40).
1M context preserved at the Flash-Lite tier.
65K output tokens — usable for long-form drafts.
Production-stable; Google's official Gemini 2.0 migration target.
Among the fastest GA models (up to ~887 tok/s burst); multimodal input including audio.

Limitations

Reasoning is meaningfully below 3.1 Flash-Lite — GPQA 47.4% vs 86.9%.
Not suitable for complex agent loops or serious coding.
No video input (unlike the rest of the family).
January 2025 knowledge cutoff.
Output tone is workmanlike, not creative.
Audio input bills at 3x text ($0.30 vs $0.10).

Best use cases

High-volume classification and extraction (intent, sentiment, entity, routing).
Lightweight in-app chat assistants where cost-per-call dominates.
Bulk transformations: summarization, translation, reformatting.
Production migration off Gemini 2.0 Flash for cost-sensitive teams needing GA stability.
Multimodal triage (image moderation, audio routing) at scale.

Deep dive

The full research notes behind this review — verified against primary sources.

Architecture

Sparse mixture-of-experts (Gemini family); parameter counts, expert counts, layer counts, and attention details are undisclosed. Verifiable: a 1M-token context window, 65,535 max output tokens, January 2025 knowledge cutoff, and multimodal input across text, image, and audio (no video, unlike the rest of the family). It offers a thinking toggle, off by default for lowest latency. The headline engineering achievement is preserving the full 1M context and a 65K output ceiling at the cheapest GA price point in the lineup.

Capabilities

Gemini 2.5 Flash-Lite is built for ultra-high-volume, ultra-low-latency work: classification, extraction, simple chat, and bulk transformation. It deliberately trades reasoning depth for speed and cost: GPQA Diamond of 47.4% (reasoning capability score 5.5) is well below 3.1 Flash-Lite's 86.9%, and coding and agentic capability are correspondingly modest (5.0 each). What it keeps is the full 1M context (long-context score 7.5), unusual at this price, plus solid instruction following (7.0), function calling (7.0), and multilingual support (7.0) for templated work. Vision and document/OCR (7.0) are adequate for triage. Real-time data (9.0) comes via Search grounding. The 65K output limit makes it usable for long-form generation, not just short responses. Safety (8.0) follows Google's standard stack.

Benchmark analysis

Benchmark	Score	vs Predecessor	vs Top Competitor	Source
MMLU-Pro	72.4%	Solid for tier	Behind 3.1 Flash-Lite	AA
GPQA Diamond	47.4%	Modest	Well behind 3.1 Flash-Lite (86.9%)	AA
Artificial Analysis Index	13	New for tier	Below average; built for cost not intelligence	AA

(SWE-bench, LiveCodeBench, MMMU, AIME, and LMArena results are unpublished for this model and are omitted. The model is optimized for cost and latency, not benchmark leadership.)

Speed & latency

Among the fastest GA models available: ~215 output tokens/sec on Artificial Analysis, with third-party tests reporting bursts up to ~887 tok/s under favorable conditions. TTFT is the lowest in the 2.5 family. This speed is the model's defining, user-visible benefit — for high-throughput pipelines processing millions of items, decode speed and low latency translate directly into lower infrastructure cost and snappier embedded UX.
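To put the throughput figures in concrete terms, here is a rough back-of-the-envelope sketch. It assumes decode time is simply output tokens divided by tokens per second and ignores TTFT and network overhead, since the review gives no numbers for those. The function name and the 500-token reply size are illustrative choices, not measurements.

```python
# Rough decode-time estimate from the throughput figures quoted above.
# Assumption: time = output_tokens / tokens_per_second. TTFT and network
# overhead are ignored because the review gives no figures for them.

SUSTAINED_TOK_PER_S = 215  # Artificial Analysis figure quoted in the review
BURST_TOK_PER_S = 887      # third-party burst figure quoted in the review


def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds spent generating `output_tokens` at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical 500-token reply at each quoted rate.
    for label, rate in [("sustained", SUSTAINED_TOK_PER_S), ("burst", BURST_TOK_PER_S)]:
        print(f"500-token reply, {label}: {decode_seconds(500, rate):.2f} s")
```

Real pipelines will see higher wall-clock times once queueing, TTFT, and retries are included, so treat the result as a lower bound.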

Pricing analysis

Surface	Cost	Notes
API input	$0.10 / 1M tok	Text/image/video; audio $0.30
API output	$0.40 / 1M tok	Cheapest GA output in the lineup
Cached input	$0.01 / 1M tok	90% discount; audio $0.03; + $1.00/1M tok/hour storage
Batch (in/out)	$0.05 / $0.20	~50% off; audio in $0.15
Search grounding	1,500 RPD free, then $35 / 1,000 grounded prompts	Older grounding pricing
Free tier	"Free of charge" on AI Studio	RPD/RPM caps
Direct UI	Not the consumer default	3.x models default in app
Rate limits	High RPD on paid tiers	Suited to bulk workloads
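The rates in the table can be turned into a quick estimate for a given workload. The sketch below uses only the per-1M-token prices listed above. As simplifying assumptions, it treats cache and batch discounts as separate modes that never stack, and it leaves out cache storage fees and Search grounding charges. Function names and the sample workload are hypothetical.

```python
# Illustrative cost model using only the rates listed in the table above.
# Assumptions (not confirmed by the review): cache and batch discounts are
# modeled separately and never stacked; cache storage fees and Search
# grounding charges are left out.

TOKENS_PER_UNIT = 1_000_000  # all rates are quoted per 1M tokens

RATES_USD = {
    "input": 0.10,
    "output": 0.40,
    "cached_input": 0.01,
    "audio_input": 0.30,
    "batch_input": 0.05,
    "batch_output": 0.20,
}


def standard_cost(input_tokens, output_tokens, cached_fraction=0.0, audio_tokens=0):
    """USD cost of synchronous calls.

    `cached_fraction` is the share of text input tokens served from cache;
    `audio_tokens` are billed at the audio input rate in addition to text input.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = (
        fresh * RATES_USD["input"]
        + cached * RATES_USD["cached_input"]
        + audio_tokens * RATES_USD["audio_input"]
        + output_tokens * RATES_USD["output"]
    )
    return total / TOKENS_PER_UNIT


def batch_cost(input_tokens, output_tokens):
    """USD cost of the same text workload sent through batch mode."""
    total = (
        input_tokens * RATES_USD["batch_input"]
        + output_tokens * RATES_USD["batch_output"]
    )
    return total / TOKENS_PER_UNIT


if __name__ == "__main__":
    # Hypothetical workload: 1B input tokens and 100M output tokens.
    workload = (1_000_000_000, 100_000_000)
    print(f"standard:    ${standard_cost(*workload):,.2f}")
    print(f"half cached: ${standard_cost(*workload, cached_fraction=0.5):,.2f}")
    print(f"batch:       ${batch_cost(*workload):,.2f}")
```

Keeping the rates in one dictionary makes it easy to update the estimate if Google changes the price list.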

Deployment & access

Proprietary, closed-weights. Available via the Gemini API (Google AI Studio), Vertex AI, and Oracle OCI Generative AI, and resold through OpenRouter. Vertex AI provides VPC-SC, CMEK, audit logging, regional pinning (US, EU, Asia), and data residency. No open weights or self-hosting. The SDK surface is identical across the Gemini line, so migration off Gemini 2.0 Flash is a model-name swap, which fits its role as Google's designated replacement. Its multi-cloud availability (including OCI) is slightly broader than the newer 3.x Flash-Lite preview.
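A minimal sketch of what a model-name swap could look like in practice follows. The model ID strings follow Google's public naming but are not given in this review, and the helper names and prompt are hypothetical. The API call assumes the `google-genai` Python package and an API key in the environment; the import is deferred so the mapping helper works without the SDK installed.

```python
# Sketch of the "model-name swap" migration described above.
# Assumptions: the model ID strings below, the helper names, and the prompt
# are illustrative; the SDK call assumes the `google-genai` package.

REPLACEMENTS = {
    "gemini-2.0-flash": "gemini-2.5-flash-lite",
    "gemini-2.0-flash-lite": "gemini-2.5-flash-lite",
}


def migrate_model_name(model: str) -> str:
    """Return the replacement model ID, or the input unchanged if none applies."""
    return REPLACEMENTS.get(model, model)


def classify_sentiment(text: str, model: str = "gemini-2.0-flash") -> str:
    """Send one classification prompt; only the model string changes on migration."""
    from google import genai  # imported lazily so the mapping works without the SDK

    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(
        model=migrate_model_name(model),
        contents=(
            "Classify the sentiment of this text as positive, negative, "
            f"or neutral: {text}"
        ),
    )
    return response.text or ""
```

Centralizing the mapping in one place means existing call sites can keep passing the old model name until they are cleaned up.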

Safety & privacy

Google Frontier Safety Framework with configurable filters. Paid API and Vertex inputs are not used to train models; inputs on the free AI Studio tier may be. Opt-out available. Compliance: SOC 2, HIPAA, GDPR, ISO 27001, FedRAMP, CCPA. Built-in content moderation. Refusals follow Google's standard policy stack.

Ecosystem & tooling

SDKs in Python, TypeScript, Go, Java, Dart; integrations with LangChain, LlamaIndex, Vercel AI SDK, Genkit, and Google ADK. Available on Vertex AI, the Gemini API, and Oracle OCI. Used as bulk infrastructure for classification pipelines and embedded assistants rather than as a consumer default. Mainstream by deployment volume, given the Gemini 2.0 migration wave.

Buyer questions

Is it generally available?

Yes — GA since 2025-07-22, with no announced deprecation date as of 2026-05-28. It's the GA-stable choice in the Flash-Lite tier.

Why pick it over 3.1 Flash-Lite?

GA stability and lower price ($0.10/$0.40 vs $0.25/$1.50). 3.1 Flash-Lite reasons far better (GPQA 86.9% vs 47.4%) but is still preview.
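To see how the list-price gap plays out at volume, the following sketch compares daily spend for a hypothetical bulk pipeline. The prices are the per-1M-token figures quoted in this answer; the workload shape (1M items per day, 500 input and 50 output tokens each) and the function name are assumptions for illustration only.

```python
# List-price comparison for a hypothetical bulk pipeline. Prices are the
# per-1M-token figures quoted above; the workload shape is an assumption.

PRICES_USD_PER_1M = {
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),            # (input, output)
    "Gemini 3.1 Flash-Lite (preview)": (0.25, 1.50),  # (input, output)
}


def daily_cost(model, items_per_day, input_tokens_per_item, output_tokens_per_item):
    """USD per day at list price, with no caching or batch discounts applied."""
    price_in, price_out = PRICES_USD_PER_1M[model]
    input_millions = items_per_day * input_tokens_per_item / 1_000_000
    output_millions = items_per_day * output_tokens_per_item / 1_000_000
    return input_millions * price_in + output_millions * price_out


if __name__ == "__main__":
    # Hypothetical workload: 1M items/day, 500 input and 50 output tokens each.
    for name in PRICES_USD_PER_1M:
        print(f"{name}: ${daily_cost(name, 1_000_000, 500, 50):,.2f} per day")
```

The comparison covers price only; it says nothing about whether the cheaper model meets the quality bar for a given task.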

Should I use it to migrate off Gemini 2.0 Flash?

Yes — it's Google's officially designated migration target ahead of the 2026-06-01 shutdown, and migration is a model-name swap.

What can't it do well?

Complex reasoning, serious coding, and multi-hop agents. It's a classifier/extraction/bulk-transformation engine, plus it lacks video input.

Does audio cost extra?

Yes — audio input is $0.30/1M (3x the $0.10 text rate).

Can I self-host?

No. Gemini is closed-weights, API/Vertex/OCI only.

Comparable models

Gemini 3.1 Flash-Lite (preview)

Far stronger reasoning (GPQA 86.9% vs 47.4%) at $0.25/$1.50; preview status is the main hesitation vs this GA-stable model.

Gemini 2.0 Flash-Lite

The predecessor, reaching EOL 2026-06-01; 2.5 Flash-Lite is the official replacement Google designates.

GPT-5.4 nano / Claude Haiku 4.5

Comparable cheap tier; weaker multimodal and smaller context, and no Search grounding. 2.5 Flash-Lite wins on context size and price; the others may edge it on ecosystem fit.

Sources

Primary references used to verify this review.

Model specs

Input price: $0.10 / Mtok
Output price: $0.40 / Mtok
Cached input: $0.01 / Mtok
Batch (in/out): $0.05 / $0.20
Context window: 1.0M tokens
Max output: ~65K tokens (65,535)
Knowledge cutoff: 2025-01
Released: 2025-07-22
Modalities: text, image, audio → text
Output speed: ~215 tok/s
License: Proprietary
Clouds: Vertex AI, GCP, OCI

Does not train on API inputs by default

Last verified 2026-05-27