Forrester: 22% of Production AI Agents Lose Money

Forrester says 22% of AI agents that reach production deliver negative ROI at 12 months. The root cause isn't model quality. Here's how CIOs and CFOs fix it.

By Rajesh Beri·May 13, 2026·15 min read

THE DAILY BRIEF

Enterprise AI · AI Agents · AI ROI · Forrester · CIO · AI Governance


Forrester just published the number every CIO and CFO has been afraid to ask about: 22% of AI agents that successfully reach production deliver negative ROI at the 12-month mark. Not the 88% of agent pilots that die in development — those are well-documented. This is the harder problem: agents that survived the pilot gauntlet, got executive sign-off, made it into live workflows, and still lost money. The kicker, from Forrester's root-cause analysis: 41% of those failures traced back to unclear success criteria, 33% to insufficient tool or data access, and 26% to evaluation drift. None were fundamentally model-quality problems.

For enterprises that have already crossed the production threshold, this changes the conversation. The question is no longer "Can we build an agent that works?" It's "Did we define success the day we wrote the budget — or are we measuring vanity metrics 12 months in?" After dozens of conversations with CIOs running agent programs, I've seen the pattern Forrester documents: technically successful, financially indefensible. The enterprise CFO ROI-measurement gap is the financial side of the same problem. This piece breaks down why it happens, what the data says about who avoids it, and two frameworks you can apply this week before your next agent ships.

What Changed: The 22% Number and What Sits Behind It

Forrester and Anaconda's 2026 enterprise survey put concrete numbers on a problem that's been mostly anecdotal. Across the agent deployments that successfully shipped to production:

  • 18% achieve positive payback within 6 months
  • 41% achieve positive payback within 12 months (a figure that includes the 6-month group)
  • 22% report negative ROI at 12 months
  • The remaining ~37% report flat or marginal returns

Read that again: only 41% of agents that reach production deliver clear positive ROI inside a year. The conventional industry framing — that 88% of pilots fail to graduate to production, per Forrester and Anaconda 2026 research — hides the second wave of failures that happen after you've already counted the win.

The root-cause split is what makes this important. From Forrester's analysis of the 22%:

  • 41% — unclear success criteria. The agent did what it was built to do. Nobody had defined what "success" meant in dollar terms, so 12 months later there was no defensible ROI story.
  • 33% — insufficient tool or data access. The agent could reason but couldn't act on enough live systems to complete the work end-to-end. Humans kept absorbing the last-mile cost.
  • 26% — evaluation coverage drift. The agent passed launch eval, then real production data diverged from training distribution. No automated evaluation ran in production, so drift went unmeasured until business outcomes degraded.

None of these are GPT-vs-Claude-vs-Gemini problems. They're operating-model problems. That matters because most enterprise AI budgets are still allocated as if model selection is the lever. It isn't.

Three other 2026 data points reinforce the picture:

The convergence is what's striking. Different methodologies, different sample frames, same conclusion: the gap between AI agent capability and AI agent ROI is now the dominant enterprise risk in the category — and it isn't closing on its own.

Why This Matters: The CIO and CFO Read

This is a dual-audience problem because the failure modes split across the technical and financial sides of the house.

Technical Implications (CIO, CTO, VP Engineering)

The 33% "insufficient tool or data access" bucket is the most fixable but also the most consistently underestimated. Agents that look great in demo environments — where they get clean API access to a small set of systems — degrade quickly when they hit production reality: legacy ERPs without modern APIs, authentication tokens that expire mid-workflow, rate limits, data quality drift, role-based access controls that the agent doesn't have credentials for.

BCG and Forrester's 2026 production-blocker survey ranked the top blockers cited by CIOs: evaluation and observability (64%), governance and compliance (57%), and model reliability and non-determinism (51%). All three sit downstream of the same architectural decision — whether you treated the agent as deployable software with versioning, rollback, and continuous evaluation, or as a prompt-engineered experiment you scaled.

The 41% rollback rate in the same dataset is telling: 41% of enterprises report at least one production rollback within 12 months. Agents with automated evaluation coverage roll back at 9%; agents without it roll back at 47%. That five-fold difference is the cost of skipping evaluation infrastructure.

Business Implications (CFO, CMO, COO)

For finance leaders, the 22% number lands differently. The question is no longer "Will this work?" — it's "Can I defend this on the next earnings call?" Three patterns from the data:

  • Productivity-ROI disconnect. WRITER's data shows individual AI super-users save up to 9 hours per week, but only 29% of organizations see significant organizational ROI from generative AI and 23% from agents. Hours saved at the individual level rarely roll up to P&L improvement at the company level unless the organization restructures around the freed capacity.
  • Time-to-value varies 3x by function. BCG and Forrester 2026 data put SDR/outbound agents at 3.4 months payback, customer service at 4.7 months, finance and ops at 8.9 months, HR at 9.4 months, legal and compliance at 11.2 months. If your CFO is benchmarking everything against the SDR result, your finance-ops agent is going to look like a failure 8 months in.
  • Industry adoption gaps amplify competitive risk. Banking and insurance are at 47% production adoption; software/internet at 44%; healthcare at 18%; government at 14%. The competitive cost of falling behind your industry's median is now measurable, not theoretical — see the pilot-to-production gap in banking for an industry deep-dive.

Both audiences end up at the same conclusion: the gating constraint isn't model quality. It's measurement infrastructure, evaluation discipline, and operating-model maturity.

Market Context: Who's Building the Infrastructure to Solve This

A new category of vendors emerged in 2025-2026 specifically to address the post-production failure modes. The six platforms anchoring the agent observability and evaluation market in 2026, per industry coverage of LangSmith, Langfuse, and Arize Phoenix:

  • LangSmith (LangChain-native, deepest framework integration) — best for teams already on LangChain who want execution timeline tracing and custom evaluators.
  • Langfuse (open-source leader, self-hostable) — best for regulated industries that need data residency control.
  • Arize Phoenix (ML-grade rigor, OpenTelemetry-based) — best for organizations with existing ML platform discipline.
  • Helicone (drop-in proxy, simplest install) — best for teams that want observability before committing to an agent framework.
  • Datadog LLM Observability — best for Datadog shops looking to unify APM and LLM telemetry.
  • Honeycomb LLM Observability — best for event-driven architectures that need deep tracing.

Analyst guidance from Forrester's 2026 enterprise software predictions and Gartner's Hype Cycle for Agentic AI converges on the same point: production monitoring should be a continuous cycle of monitoring → evaluation datasets → experimentation → redeployment, not a one-time launch eval. Forrester also expects 30% of enterprise application vendors to launch Model Context Protocol servers in 2026, and half of enterprise ERP vendors to introduce autonomous governance modules — meaning the infrastructure layer for closing this loop is being commoditized as we speak.

The infrastructure exists. The data shows most enterprises haven't bought it. Only 31% of organizations have implemented a measurement framework for agentic AI, despite 52% of executives in gen-AI-using organizations already having AI agents in production. That gap — 52% in production, 31% with measurement — is roughly the gap that produces the 22% negative-ROI failures.

Framework #1: The Success Criteria Definition Worksheet

The single highest-leverage intervention against the 41% "unclear success criteria" failure mode is forcing every agent deployment through a written success-criteria document before budget approval. This worksheet operationalizes that.

Score each dimension 1–5. Total possible score: 25. Below 15 = do not deploy. 15–19 = remediate before deployment. 20+ = deploy with confidence.

The five dimensions, with anchors for scores of 1 (fail), 3 (acceptable), and 5 (strong):

  • Baseline cost is documented. 1: no baseline measured. 3: baseline estimated from intuition. 5: baseline measured in dollars, hours, or errors before the agent ships.
  • Primary KPI is dollarized. 1: "save time" or "improve experience." 3: one KPI is dollarized. 5: every primary KPI converts to dollars within 30 days.
  • Success threshold is pre-committed. 1: no threshold defined. 3: threshold defined post-launch. 5: threshold signed by the exec sponsor before launch.
  • Attribution method is automated. 1: manual attribution at quarterly review. 3: manual attribution monthly. 5: usage and outcomes auto-attributed via telemetry.
  • Kill criteria are defined. 1: no kill criteria. 3: kill criteria defined informally. 5: kill criteria signed off, with a named owner and timeline.

A worked example: an enterprise rolling out a finance reconciliation agent.

  • Baseline: 4.2 FTE × $145,000 fully-loaded = $609,000/year in current reconciliation cost. Average cycle: 6.3 days. (Score: 5)
  • Dollarized KPI: Target — reduce FTE allocation to 1.5 ($217,500), accept agent cost of $90,000/year. Net target savings: $301,500/year. (Score: 5)
  • Pre-committed threshold: Must hit $200K+ annualized savings by Month 9 or program enters remediation. Signed by CFO. (Score: 5)
  • Automated attribution: Agent emits per-transaction telemetry tagged with reconciliation outcome and human-touched flag, piped to Datadog. (Score: 4)
  • Kill criteria: If Month 9 savings < $150K annualized, agent is sunset within 60 days; owner is the VP Finance Transformation. (Score: 4)

Total: 23/25. This program survives the 12-month review because "did it work?" was answered before the agent shipped, not after.
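If you want to wire this gate into a budget-approval or deployment pipeline rather than a spreadsheet, the math is small enough to encode directly. Here is a minimal Python sketch using the figures from the worked example above; the variable names and score schema are illustrative, not a prescribed format.

```python
# Minimal sketch of the success-criteria worksheet gate.
# Dollar figures mirror the worked reconciliation-agent example;
# names and structure are illustrative, not a prescribed schema.

FULLY_LOADED_COST = 145_000           # per FTE, per year
BASELINE_FTE = 4.2
TARGET_FTE = 1.5
AGENT_COST = 90_000                   # per year

baseline_cost = BASELINE_FTE * FULLY_LOADED_COST            # $609,000
target_cost = TARGET_FTE * FULLY_LOADED_COST + AGENT_COST   # $307,500
net_target_savings = baseline_cost - target_cost            # $301,500

# Worksheet scores (1-5) for the five dimensions.
scores = {
    "baseline_documented": 5,
    "kpi_dollarized": 5,
    "threshold_precommitted": 5,
    "attribution_automated": 4,
    "kill_criteria_defined": 4,
}
total = sum(scores.values())  # 23

if total < 15:
    verdict = "do not deploy"
elif total < 20:
    verdict = "remediate before deployment"
else:
    verdict = "deploy with confidence"

print(f"Net target savings: ${net_target_savings:,.0f}/year")
print(f"Worksheet score: {total}/25 -> {verdict}")
```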

Compare to the typical pattern that produces the 22% failures: "We're deploying an AI agent to improve finance ops. We'll measure the impact and iterate." That scores about 8/25, which makes it a near-certain candidate for the negative-ROI bucket.

This worksheet should sit between your AI program management office and any budget approval. If it's not signed before money moves, you're funding optionality, not an investment.

Framework #2: The 90-Day Pre-Production Readiness Checklist

If the success-criteria worksheet addresses the 41% failure mode, this checklist addresses the 33% "insufficient tool/data access" and 26% "evaluation drift" modes. Run this 90 days before your intended production date. If you can't check 12+ of 15 boxes, push the launch.

Technical Readiness (5 items)

  • API/tool inventory is complete. Every system the agent will read from or write to is identified, with access pattern, rate limit, and failure mode documented.
  • Authentication is production-grade. Service accounts (not personal OAuth tokens) are provisioned, with rotation schedule and monitoring for expiration.
  • Data access matches role. The agent operates with least-privilege RBAC; access patterns are audit-logged for SOC2/HIPAA/GDPR review.
  • Rate limits and cost ceilings are coded. Hard daily/monthly budget caps prevent the 47,000-call infinite-retry scenario that ate one startup's payroll budget overnight (a minimal guard sketch follows this list).
  • Rollback procedure is rehearsed. Kill-switch tested in staging; named on-call rotation owns rollback authority.
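On the cost-ceiling item, here is a minimal sketch of what a hard budget cap can look like as a guard around the agent's model calls. The CostLedger class, the cap values, and the call wrapper are illustrative assumptions, not any vendor's API; the point is that the check happens before the call, so a runaway retry loop hits the ceiling instead of the invoice.

```python
# Minimal sketch of a hard daily/monthly spend guard wrapped around an
# agent's model calls. The ledger, caps, and wrapper are illustrative
# assumptions, not a specific platform's API.
from dataclasses import dataclass

@dataclass
class CostLedger:
    daily_spend_usd: float = 0.0
    monthly_spend_usd: float = 0.0
    daily_cap_usd: float = 200.0
    monthly_cap_usd: float = 4_000.0

    def allow(self, estimated_cost_usd: float) -> bool:
        # Refuse any call that would push spend past either cap.
        return (self.daily_spend_usd + estimated_cost_usd <= self.daily_cap_usd
                and self.monthly_spend_usd + estimated_cost_usd <= self.monthly_cap_usd)

    def record(self, cost_usd: float) -> None:
        self.daily_spend_usd += cost_usd
        self.monthly_spend_usd += cost_usd

class BudgetExceededError(RuntimeError):
    pass

def guarded_call(ledger: CostLedger, estimated_cost_usd: float, call):
    """Run `call` only if it fits under the caps; otherwise fail loudly."""
    if not ledger.allow(estimated_cost_usd):
        raise BudgetExceededError("Daily or monthly agent budget cap reached")
    result = call()
    ledger.record(estimated_cost_usd)
    return result
```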

Evaluation Readiness (5 items)

  • Production eval set exists. Minimum 200 representative examples from real (or anonymized) production data, not synthetic.
  • Automated evals run on every change. Prompt changes, model upgrades, and tool changes all gate on eval pass rate ≥ launch threshold (see the sketch after this list).
  • Drift detection is live. Statistical monitoring of input distribution, tool-call patterns, and output quality runs continuously, not weekly.
  • Human review sampling is scheduled. 5-10% of production runs flow to a human reviewer with a defined cadence and reviewer assignment.
  • Eval coverage matches risk surface. High-risk workflows (financial commitments, customer-facing communication, irreversible actions) have stricter eval gates than low-risk ones.
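For the "automated evals on every change" item, the gating logic itself is simple; the real work is the eval set and the grading rubric. A minimal sketch, assuming you already have a harness function that runs the candidate agent and a grader that scores each output (both passed in as callables here, because they are yours, not a library's):

```python
# Minimal sketch of gating a prompt/model/tool change on eval pass rate.
# run_agent and grade are assumed harness hooks supplied by your team;
# the 0.92 launch threshold is illustrative.
from typing import Callable, Iterable

LAUNCH_THRESHOLD = 0.92

def eval_pass_rate(run_agent: Callable[[str], str],
                   grade: Callable[[str, str], bool],
                   eval_set: Iterable[tuple[str, str]]) -> float:
    """eval_set is (input, expected) pairs drawn from production data."""
    examples = list(eval_set)
    passed = sum(1 for inp, expected in examples if grade(run_agent(inp), expected))
    return passed / len(examples)

def gate_change(pass_rate: float, threshold: float = LAUNCH_THRESHOLD) -> bool:
    """Block the change if the candidate falls below launch quality."""
    if pass_rate < threshold:
        print(f"Blocked: {pass_rate:.1%} < launch threshold {threshold:.0%}")
        return False
    print(f"Cleared: {pass_rate:.1%}")
    return True
```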

Organizational Readiness (5 items)

  • Named agent owner. One person, one role, with accountability and budget authority. Per Forrester data, 94% of successful production agents have named owners; the unsuccessful ones often don't.
  • Executive sponsor signs success criteria. The CFO, COO, or business unit leader who will defend the ROI on the next quarterly review.
  • Scope is single-workflow. 81% of successful production agents are scoped to one workflow with binary success criteria, per the same data. Multi-workflow scope sharply increases failure odds.
  • Change management plan exists. End users have been trained, escalation paths are documented, and feedback loops are open.
  • Sunset criteria are documented. What conditions trigger sunsetting the agent? Who decides? Over what timeline?

The combination — 94% of successful agents have named owners, 87% have automated evals, 81% are single-workflow — is so consistent across Forrester, BCG, and Anaconda 2026 data that it functions as a profile. Match the profile or carry the risk of joining the 22%.
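One way to keep the 90-day gate honest is to hold the checklist as data in the deployment pipeline and fail the launch step when fewer than 12 of the 15 boxes are true. A minimal sketch, with item names paraphrased from the lists above and the pass/fail states purely illustrative:

```python
# Minimal sketch of the 12-of-15 readiness gate. Item names paraphrase
# the checklist above; the True/False states are illustrative.
checklist = {
    "technical": {
        "api_tool_inventory_complete": True,
        "production_grade_auth": True,
        "least_privilege_data_access": True,
        "rate_limits_and_cost_ceilings": True,
        "rollback_rehearsed": False,
    },
    "evaluation": {
        "production_eval_set": True,
        "automated_evals_on_every_change": True,
        "drift_detection_live": False,
        "human_review_sampling": True,
        "eval_coverage_matches_risk": True,
    },
    "organizational": {
        "named_agent_owner": True,
        "exec_sponsor_signed_criteria": True,
        "single_workflow_scope": True,
        "change_management_plan": True,
        "sunset_criteria_documented": False,
    },
}

checked = sum(done for section in checklist.values() for done in section.values())
total = sum(len(section) for section in checklist.values())

print(f"{checked}/{total} boxes checked")
print("Proceed toward production" if checked >= 12 else "Push the launch date")
```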

Case Study: XPO Logistics — What the 41% Looks Like When It Works

The most instructive successful example in the 2026 data is XPO Logistics' AI routing optimization deployment. It's a useful counter-case because XPO did the unglamorous work that the 22% skipped.

What they built: AI routing optimization for linehaul operations, with strict bottom-line attribution.

Outcomes (12-month):

  • 80% reduction in linehaul diversions
  • 12% compression in empty miles
  • $29 million per efficiency point gained
  • Operating ratio improved 180 basis points

Why it worked: XPO built the measurement infrastructure before the AI deployment. Every routing decision was already tagged with a cost basis from their pre-existing logistics analytics. When the AI agent went live, the optimization action and the measurement action were the same transaction. There was no separate "measurement project" running in parallel that could drift or get deprioritized.

This matters because it inverts the typical enterprise sequence. Most organizations: (1) deploy agent, (2) hope ROI shows up, (3) build measurement to prove ROI six months in. XPO: (1) build measurement, (2) deploy agent into the measurement, (3) compound the optimization.

The contrast with the failure modes is precise. XPO's success criteria were dollarized before deployment. Their tool access was complete because the agent operated inside their existing logistics analytics surface. Their evaluation drift was bounded because every decision was attributed in real time. They hit all three Forrester root-cause prevention vectors by architecture, not by discipline. The same pattern shows up in 171% ROI case studies from the agents that succeed — measurement-by-architecture beats measurement-by-project every time.

The lesson, repeated across the data: enterprises with a decade-plus investment in data foundations (XPO, Bank of America, large insurers) realize agent ROI 2-3x faster than enterprises starting from scratch. This is not a fairness statement — it's a sequencing one. The infrastructure work the 22% want to skip is the work that produces the 41% positive-ROI outcomes.

What to Do About It

For CIOs

  • Mandate the Success Criteria Worksheet on every agent budget request, starting with the next funding cycle. Don't sign a budget whose worksheet scores below 20.
  • Invest in agent observability and evaluation infrastructure before scaling agent count. Two production agents with real measurement beat 20 with launch evals only.
  • Set a 90-day pre-production gate using the readiness checklist. Treat it as a hard gate, not a guideline.

For CFOs

  • Stop accepting "hours saved" as the primary ROI metric for AI agents. Demand dollarized KPIs, named accountability, and pre-committed thresholds.
  • Build a portfolio view: which agents have crossed 6 months in production, what's the realized vs. projected ROI, what's the trend? The 22% becomes visible only when you look at the portfolio, not individual programs.
  • Pre-commit kill criteria for every agent program. Running an agent that should have been sunset for another 18 months because no one owns the decision is a hidden line item in most enterprise AI budgets.

For Business Unit Leaders

  • Insist on single-workflow scope for your first three agents. Multi-workflow ambition is the #1 predictor of post-production failure.
  • Name the owner before the budget. If no one will put their name on accountability for the outcome, the program shouldn't be funded.
  • Plan for a one-quarter measurement gap. Every realistic deployment has a quarter where measurement infrastructure runs ahead of business outcome. Budget for that gap — don't pretend it doesn't exist.

The 22% problem is not a model problem. It's a measurement-and-discipline problem. The 41% of agents that deliver positive ROI inside 12 months aren't using smarter models. They're using the same models on top of better-designed operating systems. The infrastructure to do this exists in 2026. The data shows most enterprises haven't yet decided to buy it.


Running an AI agent program with measurable ROI — or fighting the 22% problem? I'd love to compare notes on what's working in production. Connect with me on LinkedIn or Twitter/X.

THE DAILY BRIEF

Enterprise AI insights for technology and business leaders, twice weekly.

thedailybrief.com

Subscribe at thedailybrief.com/subscribe for weekly AI insights delivered to your inbox.

LinkedIn: linkedin.com/in/rberi  |  X: x.com/rajeshberi

© 2026 Rajesh Beri. All rights reserved.
