Gemini 3.5 Flash: Cut AI Token Costs 45% (Caching + Batch)

Google CEO: companies 'blowing through token budgets.' Gemini 3.5 Flash cuts costs 45-50% via caching and batch API. Real enterprise workload math.

By Rajesh Beri·June 4, 2026·9 min read

THE DAILY BRIEF

Google Gemini · Cost Optimization · Enterprise AI · Token Budget · AI Infrastructure

At Google I/O 2026, Google's CEO made a blunt admission from the keynote stage: enterprise customers are "already blowing through their annual token budgets." That line wasn't marketing hyperbole. It was validation of what CFOs and CTOs have been quietly escalating internally for months. AI inference costs are spiraling, and the math doesn't work at scale.

Google's response is Gemini 3.5 Flash — a production-grade model positioned as the cost-optimized middle ground between experimental Flash preview models and premium Pro-tier offerings. But Flash isn't just about cheaper per-token pricing. The real savings come from two enterprise features most teams aren't fully leveraging: context caching and batch API.

If you're running AI in production and haven't optimized for these, you're leaving 45-50% cost savings on the table. Here's how the economics actually work.

The Token Budget Crisis No One Talks About

Most enterprises treat AI inference costs like cloud compute in 2010 — they know it's expensive, but nobody's tracking unit economics. Finance teams see monthly invoices climbing from $50K to $200K to $500K, but can't tie those numbers back to specific workloads, cost per task, or ROI.

The problem compounds when you move from experimentation to production. A chatbot pilot serving 100 employees might cost $2,000/month. Scale that same architecture to 10,000 employees and you're looking at $200,000/month — before accounting for retry logic, fallback models, or multi-turn conversations that inflate context windows.

Google's CEO wasn't exaggerating. Multiple conversations with enterprise AI leaders confirm the same pattern: teams budget $500K for the year, blow through it by Q2, then scramble to justify emergency budget extensions. Why? Because they're optimizing for model quality, not cost per successful task.

Gemini 3.5 Flash addresses this directly. At $1.50 per million input tokens and $9.00 per million output tokens, it is priced 25% below Gemini 3.1 Pro ($2.00/$12.00) while maintaining production-grade stability that preview models can't guarantee.

But here's the kicker: standard pricing is just the baseline. The actual cost per task depends entirely on whether you're using caching and batch processing.

Context Caching: The 90% Discount Most Teams Ignore

Context caching is the single biggest cost lever in Gemini 3.5 Flash, yet most enterprise deployments don't use it. The math is straightforward: cache hits cost $0.15 per million tokens instead of $1.50 — a 90% reduction on input costs.

Here's how it works in practice. Your AI agent includes a 10,000-token system prompt with instructions, few-shot examples, tool definitions, and response schemas. Without caching, you pay $1.50 per million tokens every single time a user makes a request. For a workload running 100,000 requests per day, that's 1 billion system-prompt tokens, or $1,500/day just for the system prompt.

With context caching enabled, you pay the full $1.50 rate once to populate the cache, then $0.15 per million tokens for every subsequent cache hit. Same workload: about $150/day instead of $1,500/day. That's roughly $40,500 saved per 30-day month on system prompts alone, before cache storage fees.

The catch: cache storage costs $1.00 per hour. This means caching only delivers ROI when cached content is reused frequently within each storage hour. For most production workloads (chatbots, coding agents, document analysis pipelines) this threshold is trivial. Each cache hit on a 10,000-token prompt saves $0.0135, so about 75 hits per hour cover the storage charge. If you're processing more than 100 requests per hour using the same base context, caching pays for itself.
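The break-even point can be checked with a few lines of arithmetic. This is a minimal sketch that uses the prices quoted above and assumes storage is billed as a flat $1.00 per hour, as stated here; the function names are illustrative and not part of any Google SDK.

```python
import math

# Prices quoted in the article, in USD per million tokens.
INPUT_PRICE = 1.50
CACHED_INPUT_PRICE = 0.15
# Assumption: storage is a flat hourly charge, as the article states it.
STORAGE_PER_HOUR = 1.00


def saving_per_hit(cached_tokens: int) -> float:
    """USD saved each time a request reads the cached prefix instead of paying full price."""
    return cached_tokens * (INPUT_PRICE - CACHED_INPUT_PRICE) / 1_000_000


def break_even_hits_per_hour(cached_tokens: int) -> int:
    """Smallest number of cache hits per hour that covers the hourly storage charge."""
    return math.ceil(STORAGE_PER_HOUR / saving_per_hit(cached_tokens))


print(break_even_hits_per_hour(10_000))  # 75
```

If storage is instead billed per token held in the cache, the break-even count changes, so the billing unit is worth confirming before relying on the threshold.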

Real-world example from a Fortune 500 company running a coding agent: 5,000 daily sessions, 10,000-token input per session (code context + instructions), 3,000-token output (generated code). Total input tokens: 50 million per day. Without caching: $75/day input cost. With 80% cache hit rate: $15/day for uncached input + $6/day for cached input = $21/day total. That's a 72% reduction in input costs, or $1,620 saved per month.

Output tokens still cost $9.00 per million (no caching benefit), so the total monthly cost drops from $6,300 to $4,680 — a 26% savings just from caching input. And that's before batch API optimization.
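The coding-agent figures above can be reproduced with a small cost model. This is a sketch under the article's prices; the `Workload` class and `daily_cost` function are hypothetical names, and cache storage fees are left out to match the example.

```python
from dataclasses import dataclass

# USD per million tokens, as quoted in the article.
INPUT_PRICE = 1.50
CACHED_INPUT_PRICE = 0.15
OUTPUT_PRICE = 9.00


@dataclass
class Workload:
    sessions_per_day: int
    input_tokens: int   # per session
    output_tokens: int  # per session


def daily_cost(w: Workload, cache_hit_rate: float = 0.0) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD per day for a given cache hit rate."""
    input_millions = w.sessions_per_day * w.input_tokens / 1_000_000
    output_millions = w.sessions_per_day * w.output_tokens / 1_000_000
    cached = input_millions * cache_hit_rate
    uncached = input_millions - cached
    input_cost = uncached * INPUT_PRICE + cached * CACHED_INPUT_PRICE
    return input_cost, output_millions * OUTPUT_PRICE


agent = Workload(sessions_per_day=5_000, input_tokens=10_000, output_tokens=3_000)
for rate in (0.0, 0.8):
    input_cost, output_cost = daily_cost(agent, rate)
    monthly = (input_cost + output_cost) * 30
    print(f"hit rate {rate:.0%}: input ${input_cost:.2f}/day, total ${monthly:,.0f}/month")
```

Changing `cache_hit_rate` shows how quickly the input line shrinks while the output line stays fixed.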

Batch API: The 50% Discount for Non-Urgent Work

If you're running AI workloads that don't need real-time responses, Google offers a 50% discount via the Batch API. Same model, same quality, half the cost — as long as you can tolerate higher latency.

Batch pricing for Gemini 3.5 Flash: $0.75 per million input tokens (vs $1.50 standard), $4.50 per million output tokens (vs $9.00 standard). The trade-off: batch requests may take minutes to hours instead of seconds.

This makes batch API ideal for:

  • Nightly data processing pipelines
  • Bulk document classification
  • Content moderation queues
  • Evaluation and testing workflows
  • Any workload where latency isn't user-facing

Real-world example: A document analysis pipeline processing 500 documents per day. Each document averages 100,000 tokens input (long-form content), 2,000 tokens output (summary + metadata). Total daily tokens: 50 million input, 1 million output.

Standard pricing: $75/day input + $9/day output = $84/day ($2,520/month).
Batch pricing: $37.50/day input + $4.50/day output = $42/day ($1,260/month).

That's a 50% cost reduction — $1,260 saved monthly — with zero quality degradation. The only requirement: results don't need to be available in real-time.
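The same comparison as a short sketch, using only the standard and batch prices quoted above. The `PRICES` table and `monthly_cost` helper are illustrative names, and a 30-day month is assumed.

```python
# (input, output) prices in USD per million tokens, as quoted in the article.
PRICES = {
    "standard": (1.50, 9.00),
    "batch": (0.75, 4.50),
}


def monthly_cost(docs_per_day: int, input_tokens: int, output_tokens: int,
                 tier: str, days: int = 30) -> float:
    """Monthly USD cost for a document pipeline on the given pricing tier."""
    input_price, output_price = PRICES[tier]
    input_millions = docs_per_day * input_tokens / 1_000_000
    output_millions = docs_per_day * output_tokens / 1_000_000
    return (input_millions * input_price + output_millions * output_price) * days


standard = monthly_cost(500, 100_000, 2_000, "standard")
batch = monthly_cost(500, 100_000, 2_000, "batch")
print(standard, batch, standard - batch)  # 2520.0 1260.0 1260.0
```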

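Combining caching and batch: roughly 74% total savings. If the same document pipeline has 60% shared context across documents (common templates, instructions, schemas) and the cache discount stacks with batch pricing, context caching reduces input costs another 54% (60% of tokens at a 90% discount). Batch input drops from $37.50/day to $17.25/day, and with $4.50/day of output the final monthly cost is about $652.50 instead of $2,520. That's 74% cheaper than the naive implementation.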

The Output Token Problem: Why Cheaper Input Isn't Enough

Most cost optimization guides focus on input token costs, but output tokens are the real budget killer. In Gemini 3.5 Flash, output costs 6x more than input ($9.00 vs $1.50 per million tokens). For workloads with heavy code generation, long explanations, or multi-step reasoning, output tokens dominate total cost.

Example: A coding agent workload with 10,000-token input and 3,000-token output per session. Input: $0.015 per session. Output: $0.027 per session. Output represents 64% of total cost.
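A per-session breakdown makes the split explicit. This sketch uses the article's standard prices; `session_costs` is an illustrative helper name.

```python
# USD per million tokens, as quoted in the article.
INPUT_PRICE = 1.50
OUTPUT_PRICE = 9.00


def session_costs(input_tokens: int, output_tokens: int) -> tuple[float, float, float]:
    """Return (input_cost, output_cost, output_share_of_total) for one session."""
    input_cost = input_tokens * INPUT_PRICE / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE / 1_000_000
    return input_cost, output_cost, output_cost / (input_cost + output_cost)


input_cost, output_cost, share = session_costs(10_000, 3_000)
print(f"input ${input_cost:.3f}, output ${output_cost:.3f}, output share {share:.0%}")
```

Output is less than a quarter of the tokens in this session but nearly two thirds of the bill, which is why trimming it pays off first.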

The fix: optimize output length first, input length second. Strategies include:

  • Set the output token limit (max_output_tokens in the Gemini API) to the minimum viable for your task (don't default to 4,096)
  • Use structured output schemas to constrain response format
  • For classification tasks, return enum values instead of explanations
  • For extraction tasks, return only extracted fields (no preamble or summary)

Reducing average output length by 20% (3,000 tokens → 2,400 tokens) removes 600 tokens × 5,000 sessions × 30 days = 90 million output tokens, which saves $810/month on a 5,000-session-per-day workload. That's more savings than most teams get from switching to a cheaper model.

Workload Routing: Don't Use Flash for Everything

Not every AI task needs Gemini 3.5 Flash. Routing simpler workloads to cheaper models — and complex tasks to more expensive models — can reduce total cost while improving quality.

Recommended routing logic:

Workload Type          | Model (input/output price per million tokens) | Why
Simple classification  | Gemini 3.1 Flash Lite ($0.25/$1.50)           | 6x cheaper, good enough for binary decisions
Standard extraction    | Gemini 3 Flash Preview ($0.50/$3.00)          | 3x cheaper, handles structured tasks
Agent sub-steps        | Gemini 3.5 Flash ($1.50/$9.00)                | GA stability, better reasoning
Complex reasoning      | Gemini 3.1 Pro ($2.00/$12.00)                 | Higher quality for hard tasks
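The routing table can be expressed as a simple lookup. This is a sketch only: the workload keys, the `Route` type, and the choice of Gemini 3.5 Flash as the default are assumptions made for illustration, and the model labels are display names from the table rather than API model IDs.

```python
from typing import NamedTuple


class Route(NamedTuple):
    model: str           # display name from the table, not an API model ID
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens


ROUTES = {
    "simple_classification": Route("Gemini 3.1 Flash Lite", 0.25, 1.50),
    "standard_extraction": Route("Gemini 3 Flash Preview", 0.50, 3.00),
    "agent_substep": Route("Gemini 3.5 Flash", 1.50, 9.00),
    "complex_reasoning": Route("Gemini 3.1 Pro", 2.00, 12.00),
}
DEFAULT_WORKLOAD = "agent_substep"  # assumption: unknown work goes to the mid tier


def route(workload_type: str) -> Route:
    """Pick the model tier for a workload type, falling back to the default tier."""
    return ROUTES.get(workload_type, ROUTES[DEFAULT_WORKLOAD])


def task_cost(workload_type: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one task on the tier chosen for its workload type."""
    r = route(workload_type)
    return (input_tokens * r.input_price + output_tokens * r.output_price) / 1_000_000


print(route("simple_classification").model)
print(task_cost("simple_classification", 1_000, 10))
print(task_cost("complex_reasoning", 1_000, 10))
```

Keeping the table in one place makes it easy to re-tier a workload when measured quality or pricing changes.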

The anti-pattern: using the same model for everything. A Fortune 500 company I talked to was running all AI workloads — from simple sentiment tagging to complex contract analysis — on Gemini 3.1 Pro. Monthly cost: $47,000. After implementing workload routing (80% of tasks moved to Flash or Flash Lite), monthly cost dropped to $18,000. Same output quality on the tasks that mattered.

The key metric isn't cost per token — it's cost per successful task. A cheaper model that fails 30% of the time and requires retries can cost more than a premium model that succeeds on the first attempt.
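One way to make this concrete is to divide the cost of an attempt by the success rate, which gives the expected spend per successful task when failures are retried until one passes. The sketch below assumes independent attempts and uses illustrative success rates (70% for the cheaper model, 100% for the premium one); only the prices come from the article.

```python
def attempt_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """USD cost of a single attempt; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_per_successful_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected USD per success when failed attempts are retried until one passes."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Same 10,000-in / 3,000-out task on two tiers; success rates are assumptions.
flash = cost_per_successful_task(attempt_cost(10_000, 3_000, 1.50, 9.00), 0.70)
pro = cost_per_successful_task(attempt_cost(10_000, 3_000, 2.00, 12.00), 1.00)
print(f"Flash at 70% success: ${flash:.3f}  Pro at 100% success: ${pro:.3f}")
```

With these inputs the cheaper tier comes out at about $0.060 per success against $0.056 for the premium tier, so the per-token discount is wiped out by retries.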

Hidden Costs: Retries, Fallbacks, and Context Growth

Three cost factors most teams miss when budgeting AI inference:

  1. Retry costs. If 10% of requests fail validation and require retrying, add at least 10% to your token budget (about 11% if retries fail at the same rate). For agent workflows with multi-step chains, retry costs compound across steps.

  2. Fallback to stronger models. If Gemini 3.5 Flash can't handle 5% of requests and you fall back to Gemini 3.1 Pro for those, factor in Pro-tier pricing for the fallback volume.

  3. Context growth in agentic workflows. Multi-turn agent sessions accumulate conversation history. A 10-turn coding session might start with 10,000 input tokens and end with 50,000 tokens by turn 10. If you're not pruning context or summarizing prior turns, input costs scale quadratically.
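The first two factors can be folded into a budget estimate. This is a rough sketch under stated assumptions: a failed chain is rerun from the start until every step passes, failures are independent, and fallback requests pay for both the primary attempt and the stronger model. The function names are illustrative.

```python
def retry_multiplier(failure_rate: float, steps: int = 1) -> float:
    """Expected number of full runs when a chain of `steps` is rerun until all steps pass."""
    chain_success = (1.0 - failure_rate) ** steps
    return 1.0 / chain_success


def cost_with_fallback(primary_cost: float, fallback_cost: float,
                       fallback_rate: float) -> float:
    """Average USD per request when a share of requests is also sent to a stronger model."""
    return primary_cost + fallback_rate * fallback_cost


print(round(retry_multiplier(0.10), 3))           # single step
print(round(retry_multiplier(0.10, steps=5), 3))  # five-step chain
print(round(cost_with_fallback(0.042, 0.056, 0.05), 4))
```

A 10% failure rate costs about 11% extra on a single step but closer to 69% extra on a five-step chain under these assumptions, which is the compounding the list refers to.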

Real example: A customer support chatbot averaging 8 turns per session. Turn 1: 2,000 tokens input. Turn 8: 14,000 tokens input (cumulative context). Total input per session across all 8 turns: 64,000 tokens. Without context pruning: $0.096 per session. With pruning (keep only last 3 turns): $0.018 per session, which corresponds to about 12,000 input tokens. That's 81% savings on a 100,000-session/month workload, or $7,800 saved.
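The session figures follow from summing the input of every turn. The sketch below assumes per-turn input grows linearly from the first turn to the last, which reproduces the 64,000-token figure; the pruned cost of $0.018 is taken from the example as given rather than derived.

```python
INPUT_PRICE = 1.50  # USD per million input tokens, as quoted in the article


def session_input_tokens(first_turn: int, last_turn: int, turns: int) -> int:
    """Total input tokens when per-turn input grows linearly from first_turn to last_turn."""
    return turns * (first_turn + last_turn) // 2


def input_cost(tokens: int) -> float:
    """USD cost of the given number of input tokens at the standard rate."""
    return tokens * INPUT_PRICE / 1_000_000


unpruned_tokens = session_input_tokens(2_000, 14_000, turns=8)
unpruned_cost = input_cost(unpruned_tokens)
pruned_cost = 0.018  # per-session figure quoted in the example
monthly_saving = (unpruned_cost - pruned_cost) * 100_000
saving_pct = (unpruned_cost - pruned_cost) / unpruned_cost * 100

print(unpruned_tokens)
print(f"${unpruned_cost:.3f} per session, ${monthly_saving:,.0f}/month saved ({saving_pct:.0f}%)")
```

Because each turn resends everything before it, doubling the number of turns roughly quadruples the total, which is why pruning or summarizing has such a large effect.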

What CFOs and CTOs Should Ask This Week

If you're managing AI budgets, these are the questions that matter:

  1. What percentage of our input tokens are cached? If it's below 50%, you're leaving money on the table. Most production workloads should cache 70-80% of input.

  2. Which workloads can run on batch API? Anything non-user-facing — analytics, bulk processing, evaluation — should use batch pricing. A 50% discount with zero quality trade-off.

  3. Are we routing by workload complexity? If every task uses the same model, you're overpaying for simple work and underpaying for complex reasoning.

  4. What's our cost per successful task? Track this separately for each workload. It's the only metric that accounts for retries, fallbacks, and quality.

  5. How are we handling context growth in multi-turn sessions? If you're not pruning or summarizing, costs scale quadratically with conversation length.
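Questions 1 and 4 can be answered from request logs. The sketch below is one possible shape for that report; the `RequestLog` fields are hypothetical and would need to be mapped from whatever usage data your gateway or SDK records, and the prices are the article's standard rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# USD per million tokens, as quoted in the article.
INPUT_PRICE, CACHED_INPUT_PRICE, OUTPUT_PRICE = 1.50, 0.15, 9.00


@dataclass
class RequestLog:  # hypothetical log record, not an SDK type
    workload: str
    input_tokens: int
    cached_tokens: int  # portion of input_tokens served from cache
    output_tokens: int
    succeeded: bool


def summarize(logs: list[RequestLog]) -> dict[str, dict[str, float]]:
    """Per workload: share of input tokens cached and USD cost per successful task."""
    totals = defaultdict(lambda: {"input": 0, "cached": 0, "cost": 0.0, "ok": 0})
    for r in logs:
        t = totals[r.workload]
        uncached = r.input_tokens - r.cached_tokens
        t["input"] += r.input_tokens
        t["cached"] += r.cached_tokens
        t["cost"] += (uncached * INPUT_PRICE + r.cached_tokens * CACHED_INPUT_PRICE
                      + r.output_tokens * OUTPUT_PRICE) / 1_000_000
        t["ok"] += int(r.succeeded)
    return {
        name: {
            "cached_share": t["cached"] / t["input"] if t["input"] else 0.0,
            # Failed requests still cost money, so they stay in the numerator.
            "cost_per_success": t["cost"] / t["ok"] if t["ok"] else float("inf"),
        }
        for name, t in totals.items()
    }


logs = [
    RequestLog("support", 10_000, 8_000, 3_000, True),
    RequestLog("support", 10_000, 8_000, 3_000, False),
]
print(summarize(logs))
```

Reporting both numbers per workload keeps a high cache rate from hiding a workload that fails often.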

Google's CEO was right: companies are blowing through token budgets. But the fix isn't cutting AI adoption. It's using the cost levers already available — caching, batch processing, workload routing, output optimization — and treating inference costs like any other cloud spend: measurable, trackable, optimizable.

Gemini 3.5 Flash makes this easier than previous generations. At $1.50/$9.00 baseline pricing with 50-90% discounts available via caching and batch, it's positioned for production workloads where cost matters as much as quality. The question isn't whether your AI budget will grow. It's whether you're getting 45-50% more value from the budget you already have.


THE DAILY BRIEF

Enterprise AI insights for technology and business leaders, twice weekly.

thedailybrief.com

Subscribe at thedailybrief.com/subscribe for weekly AI insights delivered to your inbox.

LinkedIn: linkedin.com/in/rberi  |  X: x.com/rajeshberi

© 2026 Rajesh Beri. All rights reserved.

Photo by Brett Sayles on Pexels
