Google Gemini · Cost Optimization · Enterprise AI · Token Budget · AI Infrastructure

Gemini 3.5 Flash: Cut AI Token Costs 45% (Caching + Batch)

Google CEO: companies 'blowing through token budgets.' Gemini 3.5 Flash cuts costs 45-50% via caching and batch API. Real enterprise workload math.

By Rajesh Beri·June 4, 2026·9 min read

THE DAILY BRIEF

At Google I/O 2026, Google's CEO made a blunt admission from the keynote stage: enterprise customers are "already blowing through their annual token budgets." That line wasn't marketing hyperbole. It was validation of what CFOs and CTOs have been quietly escalating internally for months. AI inference costs are spiraling, and the math doesn't work at scale.

Google's response is Gemini 3.5 Flash — a production-grade model positioned as the cost-optimized middle ground between experimental Flash preview models and premium Pro-tier offerings. But Flash isn't just about cheaper per-token pricing. The real savings come from two enterprise features most teams aren't fully leveraging: context caching and batch API.

If you're running AI in production and haven't optimized for these, you're leaving 45-50% cost savings on the table. Here's how the economics actually work.

The Token Budget Crisis No One Talks About

Most enterprises treat AI inference costs like cloud compute in 2010 — they know it's expensive, but nobody's tracking unit economics. Finance teams see monthly invoices climbing from $50K to $200K to $500K, but can't tie those numbers back to specific workloads, cost per task, or ROI.

The problem compounds when you move from experimentation to production. A chatbot pilot serving 100 employees might cost $2,000/month. Scale that same architecture to 10,000 employees and you're looking at $200,000/month — before accounting for retry logic, fallback models, or multi-turn conversations that inflate context windows.

Google's CEO wasn't exaggerating. Multiple conversations with enterprise AI leaders confirm the same pattern: teams budget $500K for the year, blow through it by Q2, then scramble to justify emergency budget extensions. Why? Because they're optimizing for model quality, not cost per successful task.

Gemini 3.5 Flash addresses this directly. At $1.50 per million input tokens and $9.00 per million output tokens, it's priced 25% below Gemini 3.1 Pro ($2.00/$12.00) while maintaining production-grade stability that preview models can't guarantee.

But here's the kicker: standard pricing is just the baseline. The actual cost per task depends entirely on whether you're using caching and batch processing.

Context Caching: The 90% Discount Most Teams Ignore

Context caching is the single biggest cost lever in Gemini 3.5 Flash, yet most enterprise deployments don't use it. The math is straightforward: cache hits cost $0.15 per million tokens instead of $1.50 — a 90% reduction on input costs.

Here's how it works in practice. Your AI agent includes a 10,000-token system prompt with instructions, few-shot examples, tool definitions, and response schemas. Without caching, you pay $1.50 per million tokens every single time a user makes a request. For a workload running 5,000 requests per day, that's 50 million prompt tokens, or $75/day just for the system prompt.

With context caching enabled, you pay the full $1.50 rate once to populate the cache, then $0.15 per million tokens for every subsequent cache hit. Same workload: $7.50/day instead of $75/day. That's $2,025 saved per month on system prompts alone, before cache storage fees.

The catch: cache storage costs $1.00 per hour, so caching only pays off when the cached content is reused often enough within each storage hour. For a 10,000-token prompt, each cache hit saves about $0.0135, which puts break-even at roughly 75 requests per hour. For most production workloads (chatbots, coding agents, document analysis pipelines) this threshold is trivial: if you're processing more than 100 requests per hour using the same base context, caching pays for itself.
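To make the break-even point concrete, here is a minimal sketch using the rates quoted above. It treats the $1.00 storage fee as a flat hourly charge, which is an assumption about how the fee is applied; the function name and parameters are illustrative and not part of any Google SDK.

```python
import math

INPUT_RATE = 1.50        # $ per million uncached input tokens (from the article)
CACHED_RATE = 0.15       # $ per million cached input tokens (from the article)
STORAGE_PER_HOUR = 1.00  # $ per hour of cache storage (assumed to be a flat fee)


def breakeven_requests_per_hour(cached_tokens: int) -> int:
    """Smallest hourly request count whose cache savings cover the storage fee."""
    # Dollars saved each time a request reads the shared prompt from cache.
    saving_per_hit = cached_tokens * (INPUT_RATE - CACHED_RATE) / 1_000_000
    return math.ceil(STORAGE_PER_HOUR / saving_per_hit)


# A 10,000-token system prompt, as in the example above.
print(breakeven_requests_per_hour(10_000))
```

Larger cached prefixes break even sooner, because each hit saves more while the assumed storage fee stays the same.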

Real-world example from a Fortune 500 company running a coding agent: 5,000 daily sessions, 10,000-token input per session (code context + instructions), 3,000-token output (generated code). Total input tokens: 50 million per day. Without caching: $75/day input cost. With 80% cache hit rate: $15/day for uncached input + $6/day for cached input = $21/day total. That's a 72% reduction in input costs, or $1,620 saved per month.

Output tokens still cost $9.00 per million (no caching benefit), so the total monthly cost drops from $6,300 to $4,680 — a 26% savings just from caching input. And that's before batch API optimization.
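The coding-agent arithmetic can be reproduced with a short sketch. The function and parameter names are hypothetical, the rates are the ones quoted in this article, and cache storage fees are deliberately left out.

```python
def daily_cost(sessions, input_tokens, output_tokens, cache_hit_rate=0.0,
               input_rate=1.50, cached_rate=0.15, output_rate=9.00):
    """Return (input_cost, output_cost) in dollars per day.

    Rates are dollars per million tokens. Storage fees are not modeled.
    """
    total_input = sessions * input_tokens
    cached = total_input * cache_hit_rate      # tokens billed at the cached rate
    uncached = total_input - cached            # tokens billed at the full rate
    input_cost = (uncached * input_rate + cached * cached_rate) / 1_000_000
    output_cost = sessions * output_tokens * output_rate / 1_000_000
    return input_cost, output_cost


# 5,000 sessions/day, 10,000 input and 3,000 output tokens per session.
for hit_rate in (0.0, 0.8):
    inp, out = daily_cost(5_000, 10_000, 3_000, cache_hit_rate=hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${inp:.2f} input + ${out:.2f} output "
          f"= ${(inp + out) * 30:,.2f}/month")
```

Because output is billed at the same rate either way, the cache hit rate only moves the input term, which is why a 72% input reduction shows up as a 26% reduction in the total bill.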

Batch API: The 50% Discount for Non-Urgent Work

If you're running AI workloads that don't need real-time responses, Google offers a 50% discount via the Batch API. Same model, same quality, half the cost — as long as you can tolerate higher latency.

Batch pricing for Gemini 3.5 Flash: $0.75 per million input tokens (vs $1.50 standard), $4.50 per million output tokens (vs $9.00 standard). The trade-off: batch requests may take minutes to hours instead of seconds.

This makes batch API ideal for:

Nightly data processing pipelines
Bulk document classification
Content moderation queues
Evaluation and testing workflows
Any workload where latency isn't user-facing

Real-world example: A document analysis pipeline processing 500 documents per day. Each document averages 100,000 tokens input (long-form content), 2,000 tokens output (summary + metadata). Total daily tokens: 50 million input, 1 million output.

Standard pricing: $75/day input + $9/day output = $84/day ($2,520/month).
Batch pricing: $37.50/day input + $4.50/day output = $42/day ($1,260/month).

That's a 50% cost reduction, $1,260 saved monthly, with zero quality degradation. The only requirement: results don't need to be available in real time.
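For readers who want to plug in their own volumes, the standard-versus-batch comparison reduces to a few lines. This is a sketch with hypothetical names; the rates are the Gemini 3.5 Flash figures quoted in this article, and a 30-day month is assumed.

```python
# Dollars per million tokens, as quoted in the article.
RATES = {
    "standard": {"input": 1.50, "output": 9.00},
    "batch": {"input": 0.75, "output": 4.50},
}


def monthly_cost(docs_per_day, input_tokens, output_tokens, tier, days=30):
    """Monthly dollar cost of a pipeline at the given pricing tier."""
    rate = RATES[tier]
    daily = (docs_per_day * input_tokens * rate["input"]
             + docs_per_day * output_tokens * rate["output"]) / 1_000_000
    return daily * days


# 500 documents/day, 100,000 input and 2,000 output tokens each.
standard = monthly_cost(500, 100_000, 2_000, "standard")
batch = monthly_cost(500, 100_000, 2_000, "batch")
print(f"standard ${standard:,.2f}  batch ${batch:,.2f}  saved ${standard - batch:,.2f}")
```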

Combining caching and batch compounds the savings. If the same document pipeline has 60% shared context across documents (common templates, instructions, schemas), caching those tokens at the 90% discount reduces input costs by another 54% (0.6 × 0.9). Assuming the two discounts stack fully, the daily cost becomes $17.25 input + $4.50 output = $21.75, or about $653 per month instead of $2,520. That's roughly 74% cheaper than the naive implementation.
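A rough sketch of how the two levers combine is shown here. It assumes the cache discount applies on top of batch pricing for the shared fraction of input, which is an assumption; how a provider actually bills cached tokens inside batch jobs may differ, so check the current pricing rules before relying on the result. All names are illustrative.

```python
def combined_daily_cost(input_tokens, output_tokens, shared_fraction,
                        batch_discount=0.50, cache_discount=0.90,
                        input_rate=1.50, output_rate=9.00):
    """Daily dollar cost with batch pricing plus caching of shared input."""
    batch_in = input_rate * (1 - batch_discount)
    batch_out = output_rate * (1 - batch_discount)
    # Assumption: shared tokens get the cache discount on top of the batch rate.
    effective_in = batch_in * (1 - shared_fraction * cache_discount)
    return (input_tokens * effective_in + output_tokens * batch_out) / 1_000_000


# 50M input and 1M output tokens per day, 60% of input shared across documents.
naive_monthly = (50_000_000 * 1.50 + 1_000_000 * 9.00) / 1_000_000 * 30
optimized_monthly = combined_daily_cost(50_000_000, 1_000_000, 0.60) * 30
print(f"${optimized_monthly:,.2f}/month, "
      f"{1 - optimized_monthly / naive_monthly:.0%} below ${naive_monthly:,.2f}")
```

If cached tokens in a batch job are billed at the ordinary cached rate rather than a further-discounted one, the savings land a few points lower; the structure of the calculation stays the same.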

The Output Token Problem: Why Cheaper Input Isn't Enough

Most cost optimization guides focus on input token costs, but output tokens are the real budget killer. In Gemini 3.5 Flash, output costs 6x as much as input ($9.00 vs $1.50 per million tokens). For workloads with heavy code generation, long explanations, or multi-step reasoning, output tokens dominate total cost.

Example: A coding agent workload with 10,000-token input and 3,000-token output per session. Input: $0.015 per session. Output: $0.027 per session. Output represents 64% of total cost.

The fix: optimize output length first, input length second. Strategies include:

Set the output token limit (max_output_tokens in the Gemini API) to the minimum viable for your task (don't default to 4,096)
Use structured output schemas to constrain response format
For classification tasks, return enum values instead of explanations
For extraction tasks, return only extracted fields (no preamble or summary)

Reducing average output length by 20% (3,000 tokens → 2,400 tokens) saves $810/month on a 5,000-session-per-day workload: 3 million fewer output tokens per day at $9.00 per million. That's more savings than most teams get from switching to a cheaper model.
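The effect of trimming output can be checked with a small calculation. This sketch uses the article's Gemini 3.5 Flash rates and a 30-day month; the helper names are hypothetical.

```python
INPUT_RATE = 1.50   # $ per million input tokens
OUTPUT_RATE = 9.00  # $ per million output tokens


def output_share(input_tokens, output_tokens):
    """Fraction of a request's cost that comes from output tokens."""
    input_cost = input_tokens * INPUT_RATE
    output_cost = output_tokens * OUTPUT_RATE
    return output_cost / (input_cost + output_cost)


def monthly_output_cost(sessions_per_day, output_tokens, days=30):
    """Monthly dollar spend on output tokens alone."""
    return sessions_per_day * output_tokens * OUTPUT_RATE * days / 1_000_000


print(f"output share: {output_share(10_000, 3_000):.0%}")
before = monthly_output_cost(5_000, 3_000)
after = monthly_output_cost(5_000, 2_400)
print(f"monthly output cost: ${before:,.2f} -> ${after:,.2f} "
      f"(saves ${before - after:,.2f})")
```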

Workload Routing: Don't Use Flash for Everything

Not every AI task needs Gemini 3.5 Flash. Routing simpler workloads to cheaper models — and complex tasks to more expensive models — can reduce total cost while improving quality.

Recommended routing logic:

Workload Type	Model	Why
Simple classification	Gemini 3.1 Flash Lite ($0.25/$1.50)	6x cheaper, good enough for binary decisions
Standard extraction	Gemini 3 Flash Preview ($0.50/$3.00)	3x cheaper, handles structured tasks
Agent sub-steps	Gemini 3.5 Flash ($1.50/$9.00)	GA stability, better reasoning
Complex reasoning	Gemini 3.1 Pro ($2.00/$12.00)	Higher quality for hard tasks
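The routing table could be expressed in code roughly as follows. This is a sketch, not a production router: the workload keys are hypothetical labels, the model names are display names from the table rather than API model IDs, and defaulting unknown workloads to Gemini 3.5 Flash is a design assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    input_rate: float   # $ per million input tokens
    output_rate: float  # $ per million output tokens


# Prices copied from the routing table above.
ROUTES = {
    "simple_classification": Model("Gemini 3.1 Flash Lite", 0.25, 1.50),
    "standard_extraction": Model("Gemini 3 Flash Preview", 0.50, 3.00),
    "agent_substep": Model("Gemini 3.5 Flash", 1.50, 9.00),
    "complex_reasoning": Model("Gemini 3.1 Pro", 2.00, 12.00),
}
DEFAULT_ROUTE = "agent_substep"  # assumption: unknown work goes to the GA model


def route(workload_type: str) -> Model:
    """Pick the model for a workload type, falling back to the default."""
    return ROUTES.get(workload_type, ROUTES[DEFAULT_ROUTE])


def task_cost(workload_type: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task on the model it is routed to."""
    model = route(workload_type)
    return (input_tokens * model.input_rate
            + output_tokens * model.output_rate) / 1_000_000


for kind in ROUTES:
    print(f"{kind}: {route(kind).name}, ${task_cost(kind, 2_000, 200):.6f} per task")
```

Keeping prices next to the routing decision makes it easy to log an estimated cost per task alongside each request.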

The anti-pattern: using the same model for everything. A Fortune 500 company I talked to was running all AI workloads — from simple sentiment tagging to complex contract analysis — on Gemini 3.1 Pro. Monthly cost: $47,000. After implementing workload routing (80% of tasks moved to Flash or Flash Lite), monthly cost dropped to $18,000. Same output quality on the tasks that mattered.

The key metric isn't cost per token — it's cost per successful task. A cheaper model that fails 30% of the time and requires retries can cost more than a premium model that succeeds on the first attempt.
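One simple way to estimate cost per successful task is sketched below. It assumes failed attempts are retried on the same model until one succeeds and that each attempt succeeds independently with the same probability, so the expected number of attempts is 1 / success rate. The per-attempt prices are made-up illustration values, not measurements.

```python
def cost_per_successful_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected dollars spent per success when failures are retried."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Attempts until the first success are geometric with mean 1 / success_rate.
    return cost_per_attempt / success_rate


# Hypothetical numbers: a cheaper model that fails 30% of the time
# versus a pricier model that succeeds on the first attempt.
cheap = cost_per_successful_task(0.010, 0.70)
premium = cost_per_successful_task(0.013, 1.00)
print(f"cheap: ${cheap:.4f} per success, premium: ${premium:.4f} per success")
```

In this illustration the nominally cheaper model ends up more expensive per success, which is the effect described above.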

Hidden Costs: Retries, Fallbacks, and Context Growth

Three cost factors most teams miss when budgeting AI inference:

Retry costs. If 10% of requests fail validation and require retrying, add 10% to your token budget. For agent workflows with multi-step chains, retry costs compound across steps.
Fallback to stronger models. If Gemini 3.5 Flash can't handle 5% of requests and you fall back to Gemini 3.1 Pro for those, factor in Pro-tier pricing for the fallback volume.
Context growth in agentic workflows. Multi-turn agent sessions accumulate conversation history. A 10-turn coding session might start with 10,000 input tokens and end with 50,000 tokens by turn 10. If you're not pruning context or summarizing prior turns, input costs scale quadratically.

Real example: A customer support chatbot averaging 8 turns per session. Turn 1: 2,000 tokens input. Turn 8: 14,000 tokens input (cumulative context). Total input per session: 64,000 tokens across all eight turns. Without context pruning: $0.096 per session. With pruning (keep only last 3 turns): $0.018 per session, or about 12,000 input tokens. That's 81% savings, or $7,800 saved on a 100,000-session/month workload.
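The quadratic growth is easy to see in a small model. The sketch below assumes each turn resends the full history and that the history grows by a fixed number of tokens per turn; both are simplifying assumptions, and the function names are hypothetical. The pruned per-session cost is taken from the example above rather than simulated.

```python
INPUT_RATE = 1.50  # $ per million input tokens


def session_input_tokens(first_turn: float, growth_per_turn: float, turns: int) -> float:
    """Total input tokens billed across a session with linearly growing context.

    Turn k (starting at 0) sends first_turn + k * growth_per_turn tokens, so the
    total has a term proportional to turns squared.
    """
    return turns * first_turn + growth_per_turn * turns * (turns - 1) / 2


def input_cost(tokens: float) -> float:
    return tokens * INPUT_RATE / 1_000_000


# 2,000 tokens at turn 1 rising to 14,000 at turn 8.
growth = (14_000 - 2_000) / 7
total = session_input_tokens(2_000, growth, 8)
print(f"8 turns: {total:,.0f} tokens, ${input_cost(total):.3f} per session")

# Doubling the session length far more than doubles the billed input.
print(f"16 turns: {session_input_tokens(2_000, growth, 16):,.0f} tokens")

# Monthly savings using the pruned per-session cost quoted in the example.
pruned_cost = 0.018
print(f"saved: ${(input_cost(total) - pruned_cost) * 100_000:,.0f} per 100,000 sessions")
```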

What CFOs and CTOs Should Ask This Week

If you're managing AI budgets, these are the questions that matter:

What percentage of our input tokens are cached? If it's below 50%, you're leaving money on the table. Most production workloads should cache 70-80% of input.
Which workloads can run on batch API? Anything non-user-facing — analytics, bulk processing, evaluation — should use batch pricing. A 50% discount with zero quality trade-off.
Are we routing by workload complexity? If every task uses the same model, you're overpaying for simple work and underpaying for complex reasoning.
What's our cost per successful task? Track this separately for each workload. It's the only metric that accounts for retries, fallbacks, and quality.
How are we handling context growth in multi-turn sessions? If you're not pruning or summarizing, costs scale quadratically with conversation length.

Google's CEO was right: companies are blowing through token budgets. But the fix isn't cutting AI adoption. It's using the cost levers already available — caching, batch processing, workload routing, output optimization — and treating inference costs like any other cloud spend: measurable, trackable, optimizable.

Gemini 3.5 Flash makes this easier than previous generations. At $1.50/$9.00 baseline pricing with 50-90% discounts available via caching and batch, it's positioned for production workloads where cost matters as much as quality. The question isn't whether your AI budget will grow. It's whether you're getting 45-50% more value from the budget you already have.

THE DAILY BRIEF

Enterprise AI insights for technology and business leaders, twice weekly.

beri.net

Subscribe at beri.net/subscribe for twice-weekly AI insights delivered to your inbox.

LinkedIn: linkedin.com/in/rberi | X: x.com/rajeshberi

Photo by Brett Sayles on Pexels

Frequently Asked Questions

What is Gemini 3.5 Flash and how does it help with AI token costs?

Gemini 3.5 Flash is a production-grade AI model that offers cost optimization features, including context caching and batch API, which can reduce token costs by 45-50% for enterprises.

How does context caching work in Gemini 3.5 Flash?

Context caching lets enterprises cut input costs by charging only $0.15 per million tokens for cache hits, compared to $1.50 for uncached input: a 90% discount on every cached token.

What are the benefits of using the Batch API with Gemini 3.5 Flash?

The Batch API provides a 50% discount on token costs for non-urgent workloads, charging $0.75 per million input tokens and $4.50 per million output tokens, making it ideal for tasks that can tolerate higher latency.

Why are output tokens a significant cost factor in Gemini 3.5 Flash?

Output tokens are six times more expensive than input tokens in Gemini 3.5 Flash, making them a major contributor to total costs, especially in workloads involving heavy code generation or multi-step reasoning.

What is the recommended approach for routing AI workloads to different models?

Enterprises should route simpler tasks to cheaper models like Gemini 3.1 Flash Lite, use Gemini 3.5 Flash for agent sub-steps, and reserve Gemini 3.1 Pro for complex reasoning, optimizing both cost and quality.
