OpenAI quietly released a number on May 6, 2026 that should reset every CIO's AI scorecard. Frontier firms — the 95th percentile of OpenAI enterprise customers — now consume 3.5 times as much intelligence per worker as typical firms. A year ago that gap was 2x. The advantage is widening, and message volume explains only 36% of it. The other 64% is depth: agents, structured workflows, longer reasoning chains, delegated work. The gap that matters is no longer who has AI. It is who has built an organization that can actually use it.
The launch is called B2B Signals — a recurring, privacy-preserving telemetry product OpenAI will publish on the same cadence as its consumer Signals work. Translation: the world's largest enterprise AI vendor is now publishing public benchmarks of its own customers' adoption depth. If you are a CIO, your board will be holding you to those benchmarks within a quarter. This piece unpacks the data, builds a 25-point readiness assessment so you can score where you actually stand, and lays out a four-quarter catch-up roadmap that aligns with what frontier firms are doing differently.
What Changed: OpenAI Just Made Adoption Depth a Public Benchmark
B2B Signals is OpenAI's enterprise extension of the consumer-side Signals work it began publishing in late 2025. The methodology, per the company's introduction post, uses aggregated, privacy-preserving telemetry across ChatGPT Enterprise, Codex, Projects, Custom GPTs, and the agent surface to derive percentile-ranked behavioral metrics. The first release, published May 6, focuses on a single question: how does the top 5% of enterprise users differ from the median?
The headline numbers are sharper than any analyst survey:
- 3.5x intelligence per worker. Frontier firms — defined as the 95th percentile of usage — consume 3.5x as much "intelligence" (measured as a composite of messages, tokens, and tool calls) per seated employee as typical firms. A year prior, that ratio was 2x. The gap nearly doubled in twelve months.
- 16x Codex gap. The single largest delta is in advanced agentic tools. Frontier firms send 16x as many Codex messages per worker as typical firms — the widest spread of any product.
- Depth, not volume. Raw message volume explains only 36% of the frontier advantage. The remaining 64% comes from "richer, more complex AI use" — longer reasoning sessions, structured workflows, multi-step agent runs, and delegated work that produces artifacts rather than answers.
- 8x growth in weekly enterprise messages. Across the broader OpenAI enterprise base, weekly message volume in ChatGPT Enterprise rose roughly 8x year-over-year. The average worker is sending 30% more messages.
- 19x growth in structured workflows. Usage of Projects and Custom GPTs — OpenAI's surfaces for templatized, repeatable AI work — grew 19x year-to-date as enterprises shift from casual querying into integrated processes.
OpenAI is reading those signals through a specific lens. The company's next-phase enterprise post, published the same week, frames the catch-up playbook explicitly: measure depth (not seats), prioritize governance, invest in enablement, scale what works, and move from chat-based assistance to delegated work with agents. That is not a marketing message. That is the diagnostic checklist OpenAI's enterprise sales motion will be using on every account renewal in the next four quarters.
The release also lands inside a broader market moment. Microsoft published its own Frontier Firms operating-model post on May 5 (using the same "frontier firm" terminology) and reported that 80% of Frontier Professionals say they produce work they could not have created a year ago, against 58% of general AI users. McKinsey's State of AI in 2026 puts only 1% of organizations in a "mature" AI strategy bucket. Add Stanford's 2026 AI Index, Gartner's agentic AI cancellation forecast, and OpenAI's own benchmarks, and the evidence now points at the same thing from five directions: most enterprises are running AI as a side tool, and the 5% who run it as infrastructure are pulling away faster than the lagging cohort can close.
Why This Matters: The Cost of Being Average Just Doubled
The data has both technical and financial consequences. Treat them as two separate scoreboards.
Technical implications (CIO/CTO). The 64% depth gap is the part to internalize. Frontier firms are not winning on seat counts — they have effectively the same penetration as the median firm. They are winning on how those seats are used. That is a software-architecture problem, not a procurement problem. Three signals matter:
- Codex usage as a maturity proxy. Codex's 16x gap reflects which firms have actually wired AI into the SDLC versus which are letting individual developers figure it out. Frontier firms have invested in harness engineering — the connective tissue between model, repo, CI, and reviewer — and have moved engineering teams from chat-based assistance into delegated work with merge-ready outputs. Mid-tier firms still treat coding AI as autocomplete-plus.
- Reasoning-token consumption. Per OpenAI's 2025 enterprise report, reasoning-token use per organization grew 320x year-over-year, but the growth is heavily concentrated in the top decile. Reasoning consumption is a clean proxy for delegated work: if your enterprise is not seeing reasoning tokens climb, your users are still in single-turn answer mode.
- Structured workflow growth. The 19x YTD growth in Projects and Custom GPTs is the leading indicator of a culture that has moved past "people prompting." Frontier firms are templatizing prompts, embedding them in workflows, and counting completed workflows as the unit of measurement. Typical firms are still counting users who logged in this month.
Business implications (CFO/CMO/COO). The financial mechanics now reward depth over breadth in ways that are uncomfortable for the typical 2024-vintage AI budget:
- Cost-per-outcome inverts the per-seat calculus. Frontier firms spend more per worker on AI but generate dramatically more output per dollar (a worked example follows this list). A McKinsey 2025 productivity analysis pegs average AI ROI at 5.8x within 14 months of production deployment, and the workers saving 10+ hours/week are precisely the ones consuming significantly more intelligence credits than peers who save none. The high-spend cohort is also the high-return cohort, by a wide margin. CFOs sizing budget caps based on per-seat targets are systematically capping the workers who would generate the most return.
- Procurement leverage shifts. OpenAI Frontier (the enterprise platform), Microsoft Copilot, Anthropic's Claude Enterprise, and Google's Gemini Enterprise are all moving to depth-based pricing — agent runs, reasoning tokens, workflow executions — alongside per-seat. Buyers anchored on per-seat pricing will increasingly pay more for the same outcome than buyers who instrument depth metrics into procurement.
- Governance becomes a P&L lever, not a cost center. Gartner's 4Q25 forecast estimates 40%+ of agentic AI projects will be canceled by 2027 due to unclear value or inadequate risk controls. Translation: governance maturity is now directly correlated with which projects survive long enough to produce returns. Frontier firms governed agents earlier; their projects are surviving the cancellation cycle that is about to hit the median.
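To make the inversion concrete, here is a minimal cost-per-outcome sketch. Every figure in it is an illustrative assumption, not a number from B2B Signals or McKinsey:

```python
# Illustrative cost-per-outcome arithmetic; all inputs are assumptions.
seats = 1_000

typical_spend = seats * 30            # $30/seat/month, chat-only usage
typical_workflows = 2_000             # completed workflows per month

frontier_spend = seats * 60           # 2x per-worker spend (agents, reasoning tokens)
frontier_workflows = 20_000           # 10x output via delegated, structured work

print(typical_spend / typical_workflows)    # 15.0 -> $15 per workflow
print(frontier_spend / frontier_workflows)  # 3.0  -> $3 per workflow
```

Per-worker spend doubles while cost per completed workflow falls 5x; a per-seat cap optimizes the numerator and ignores the denominator.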
The bottom line: the cost of being average just doubled. A year ago the median firm was 2x behind frontier. Today the median is 3.5x behind. If the trajectory continues — and OpenAI's depth-vs-volume signal suggests it will — the median firm will be 5-6x behind by mid-2027.
Market Context: Three Analyst Lenses Pointing at the Same Gap
The B2B Signals release does not stand alone. It overlaps with a wave of enterprise AI maturity research from analysts that, despite different methodologies, all point at the same diagnostic.
Microsoft's frontier-firms thesis. Microsoft's May 5 operating-model post attributes 67% of AI impact to organizational factors (culture, manager support, talent practices) and 32% to individual factors. The implication: hiring more AI users does not move the needle if the operating model is wrong. Microsoft's four collaboration patterns — Author, Editor, Director, Orchestrator — track exactly to OpenAI's depth metric. Author = chat-based. Orchestrator = delegated work with agents. Frontier firms are operating in Director and Orchestrator modes; typical firms are operating in Author mode.
McKinsey's 1% maturity statistic. McKinsey's State of AI in 2026 finds 92% of executives plan to increase AI investments over three years, yet only 1% report achieving AI maturity — and roughly two-thirds say their organizations have not yet begun scaling AI across the enterprise. The deployment-to-maturity gap is the single most important number in the survey. It reframes the AI investment question: every dollar spent above pilot scale is at risk if the organization is not in the small group ready to consume it.
Gartner's cancellation forecast. Gartner's projection that 40%+ of agentic AI projects will be canceled by 2027 lands hardest in the median cohort. The cancellation drivers are not technical — they are organizational: unclear ROI, inadequate governance, and lack of a named agent owner with budget authority. PwC's parallel analysis adds that 79% of leaders report using AI agents but only 32% of employees interact with them daily — the "agent washing" problem in which RPA and chatbots get rebranded as agents without a corresponding maturity investment.
Independent corroboration on the deployment side. Production deployment data from S&P Global Market Intelligence and McKinsey shows 31% of enterprises have at least one AI agent in production, with banking and insurance leading at 47% and government trailing at 14%. Microsoft's security blog adds that 80% of Fortune 500 companies have active AI agents in production. The combined picture: agents are present in most large enterprises, but they are mostly running shallow, with 88% of pilots failing to reach production-grade scale.
The convergence matters because it locks in the diagnostic. OpenAI, Microsoft, McKinsey, and Gartner are independently telling CIOs the same thing: the gap is widening, the gap is about depth, and the playbook for closing it is organizational before it is technical.
Framework #1: 25-Point Frontier Firm Readiness Assessment
The most useful response to a 3.5x adoption gap is not anxiety; it is measurement. Score your organization across five dimensions, 1-5 each, for a 25-point total. Run the scoring at three levels — overall enterprise, top-three functions (engineering, customer service, sales), and one designated lagging function — to surface variance.
The five dimensions
Dimension 1 — Adoption Depth (1-5). The breadth and frequency of AI usage adjusted for tool variety. Score 1 if AI tools are unsanctioned and shadow usage dominates. Score 3 if 50%+ of employees use AI tools weekly but most use one tool only (typically ChatGPT). Score 5 if 70%+ use AI weekly across 3+ sanctioned tools, segmented telemetry exists by team and role, and you can produce a depth-of-usage histogram on demand. Frontier benchmark: 95th percentile firms have 70%+ adoption with multi-tool fluency.
Dimension 2 — Workflow Integration (1-5). Whether AI is embedded in core processes or attached as a side tool. Score 1 if AI is purely individual productivity. Score 3 if Projects, Custom GPTs, or equivalent templates exist for top use cases but are not embedded in workflow systems (CRM, ticketing, IDE). Score 5 if AI is integrated into core systems of work, structured workflows execute automatically, and you can produce a count of workflows-completed-per-week as a board-level metric. Frontier benchmark: 19x YTD growth in structured workflow usage.
Dimension 3 — Agentic Maturity (1-5). The proportion of work delegated to agents versus chat-mediated. Score 1 if no agents are in production. Score 3 if 1-3 agents are in production for narrow workflows with heavy human-in-the-loop. Score 5 if multiple agents run autonomously for hours, Codex (or equivalent coding agent) is wired into the SDLC with merge-ready outputs, and reasoning-token consumption per worker is in the top quartile of your industry. Frontier benchmark: 16x Codex usage gap; multi-agent orchestration in production.
Dimension 4 — Governance & Observability (1-5). The maturity of agent eval, telemetry, identity, and ownership. Score 1 if no agent inventory exists and no eval coverage is in place. Score 3 if 50%+ of agents have automated evals on every change, an agent registry exists, and named owners are assigned. Score 5 if 100% of production agents have eval coverage, identity is per-agent (no shared keys), telemetry is centralized, and rollback rates are below 10%. Frontier benchmark: agents with full eval coverage show 9% rollback vs 47% for those without.
Dimension 5 — Operating Model (1-5). The collaboration pattern dominant in the organization, using Microsoft's Author/Editor/Director/Orchestrator framework. Score 1 if Author mode dominates (humans produce; AI assists on request). Score 3 if Editor mode dominates (humans set intent; AI drafts; humans approve). Score 5 if Director and Orchestrator modes are present (humans set specifications; AI executes; multiple agents run parallel workflows with humans on exception oversight) and the organization rewards work reinvention even when results miss targets. Frontier benchmark: 80% of Frontier Professionals report producing work they could not create a year ago.
Score interpretation
- 20-25 points: Frontier (95th percentile). You are in the cohort generating 3.5x the intelligence per worker. Focus shifts from catching up to compounding the lead.
- 15-19 points: Advanced (top quartile). Real capability exists, but the depth multiplier is partial. Concentrate on the lowest two dimensions.
- 10-14 points: Typical (median). This is where most enterprises live. Closing to the frontier requires a 4-quarter program, not a vendor swap.
- Below 10 points: Laggard (bottom quartile). Adoption is shallow and probably shadow-driven. Stage 1 work — visibility and controls — is the prerequisite to anything else.
The scoring is deliberately calibrated to the public B2B Signals data so the assessment moves with the benchmarks. When OpenAI publishes its Q3 2026 update, the numbers anchoring each dimension can be re-baselined without changing the structure.
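For teams that want to track the score programmatically, a minimal sketch of the rubric as a data structure follows. The tier cut-offs mirror the interpretation bands above; the dimension names and class shape are illustrative, not part of any OpenAI release:

```python
# Minimal sketch of the 25-point assessment; cut-offs follow the bands above.
from dataclasses import dataclass, fields

@dataclass
class ReadinessScore:
    adoption_depth: int            # Dimension 1, scored 1-5
    workflow_integration: int      # Dimension 2
    agentic_maturity: int          # Dimension 3
    governance_observability: int  # Dimension 4
    operating_model: int           # Dimension 5

    def total(self) -> int:
        return sum(getattr(self, f.name) for f in fields(self))

    def tier(self) -> str:
        t = self.total()
        if t >= 20: return "Frontier (95th percentile)"
        if t >= 15: return "Advanced (top quartile)"
        if t >= 10: return "Typical (median)"
        return "Laggard (bottom quartile)"

# The case study's February baseline, from later in this piece, scores as:
baseline = ReadinessScore(3, 2, 1, 2, 3)
print(baseline.total(), baseline.tier())  # 11 Typical (median)
```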
Framework #2: The Four-Quarter Catch-Up Roadmap
A 25-point gap does not close in a sprint. The catch-up motion is a four-quarter program sequenced so each quarter unlocks the next.
Q1: Measure Depth (Weeks 1-12)
- Stand up adoption telemetry segmented by team, function, and role — not seat counts. The single highest-leverage instrumentation is a depth-of-usage histogram by employee, refreshed weekly (see the sketch after this list).
- Inventory agents and AI tools, including unsanctioned usage. Microsoft's data on shadow agents (29% of employees use unsanctioned tools) suggests the inventory will be 1.5-2x larger than IT believes.
- Score the 25-point assessment across the enterprise and three functions. Publish the score to the executive team. Lock the score as the Q1 baseline.
- Name an AI agent owner for each in-production agent with budget authority. Production data shows 56% of enterprises now have this role; 94% of agents that reach production have one. Score yourself accordingly.
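Here is a minimal sketch of the depth-of-usage histogram from the first bullet, assuming a weekly event log of (employee, tool, messages) rows. The bucket names and thresholds are assumptions to be replaced with your own telemetry schema:

```python
# Bucket employees by weekly usage depth; thresholds are assumptions.
from collections import Counter

def depth_histogram(events):
    """events: iterable of (employee_id, tool, messages) rows for one week."""
    messages = Counter()
    tools = {}
    for employee_id, tool, count in events:
        messages[employee_id] += count
        tools.setdefault(employee_id, set()).add(tool)

    buckets = Counter()
    for emp, total in messages.items():
        if total < 10:
            buckets["casual (<10 msgs)"] += 1
        elif len(tools[emp]) > 1:
            buckets["deep, multi-tool"] += 1
        else:
            buckets["deep, single-tool"] += 1
    return buckets

week = [("e1", "chatgpt", 42), ("e1", "codex", 15), ("e2", "chatgpt", 3)]
print(depth_histogram(week))
# Counter({'deep, multi-tool': 1, 'casual (<10 msgs)': 1})
```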
Q2: Build Governance and Scale Enablement (Weeks 13-26)
- Roll out automated evaluations on every agent change (a minimal gate sketch follows this list). Agents with full eval coverage show a 9% rollback rate vs 47% without — the single largest lever for production survivability.
- Migrate agents to per-agent identity. Shared keys are the most common identity pattern in mid-tier enterprises and the highest-leverage governance fix.
- Launch a tiered enablement program. Identify power users (top decile by depth) as mentors. Microsoft's data shows 80% of frontier professionals produce work they could not have a year ago — enablement quality is the gating factor.
- Re-score the assessment. Target: +3 to +5 points by end of Q2, concentrated in Dimensions 4 and 5.
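A minimal version of the eval gate from the first bullet above, assuming each agent ships with a suite of (input, check) cases. This is a sketch of the pattern, not any vendor's API; in production the gate would run in CI on every agent change:

```python
# Block deployment unless the agent passes its eval suite; pattern sketch only.
def run_eval_gate(agent, cases, pass_threshold=0.95):
    passed = sum(1 for prompt, check in cases if check(agent(prompt)))
    rate = passed / len(cases)
    if rate < pass_threshold:
        raise RuntimeError(f"eval gate failed: {rate:.0%} < {pass_threshold:.0%}")
    return rate

# Toy agent and checks, purely illustrative:
agent = lambda prompt: prompt.upper()
cases = [
    ("refund policy summary", lambda out: "REFUND" in out),
    ("sla escalation terms",  lambda out: "SLA" in out),
]
print(run_eval_gate(agent, cases))  # 1.0 -> deploy proceeds
```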
Q3: Move From Chat to Agents (Weeks 27-39)
- Pilot Codex (or equivalent) in two engineering teams with full harness integration — repo, CI, reviewer, telemetry. Measure merge-ready output rate, not message count.
- Stand up structured workflows (Projects, Custom GPTs, or platform equivalents) for the top three repeatable use cases per function. Track workflows-completed-per-week as a board-level metric.
- Move customer service agents from Editor to Director mode. Editor mode (human approves AI draft) is the median state; Director mode (AI executes; human handles exceptions) is the depth multiplier.
- Re-score. Target: +3 to +5 more points, concentrated in Dimensions 2 and 3.
Q4: Scale and Orchestrate (Weeks 40-52)
- Move two production functions to Orchestrator mode — multiple agents running parallel workflows with humans on exception oversight. This is the operating-model shift that produces the 16x Codex gap and the 3.5x intelligence multiplier.
- Compare Q4 score to Q1 baseline. A median enterprise moving from 12 to 18-20 points captures roughly half the frontier gap inside a year. The remaining gap requires a second 12-month cycle focused on multi-agent orchestration and reasoning-intensive workflows.
The roadmap deliberately sequences governance before agents. Gartner's 40% project-cancellation forecast is concentrated in firms that ran Q3 work without Q1-Q2 foundations. Inverting the order is the single most common reason agentic projects fail in the median cohort.
Case Study: A Fortune 500 Industrial Firm's First Quarter on the Roadmap
A North American Fortune 500 industrial manufacturer — roughly 45,000 employees, ChatGPT Enterprise customer since mid-2024 — ran the readiness assessment in February 2026. The starting score was 11. The breakdown was instructive: Dimension 1 (Adoption Depth) was a 3, Dimension 2 (Workflow Integration) was a 2, Dimension 3 (Agentic Maturity) was a 1, Dimension 4 (Governance) was a 2, Dimension 5 (Operating Model) was a 3. Translation: broad adoption, shallow usage, almost no agents, governance immature, and an Editor-mode operating culture.
The Q1 program ran exactly to the roadmap. Adoption telemetry surfaced that 71% of employees had logged into AI tools in the last 28 days but only 22% had used more than one tool, and the median user spent 3.4 minutes per session in single-turn mode. The depth-of-usage histogram was the single most useful artifact produced — it converted "we have AI" into a number the executive team could not unsee. Agent inventory found 14 unsanctioned agents running in marketing and operations, along with 4 sanctioned production agents — three of which had no named owner.
The forcing function was a board-level question two weeks into the program: "Are we frontier or median?" The answer at score 11 was unambiguous. The board allocated $2.4M to a four-quarter program with the catch-up roadmap as the deliverable contract.
By the end of Q1 (week 12), the score had moved to 15. The biggest lift came from Dimension 4: every production agent now had a named owner, automated evals on every change, and per-agent identity. Rollback rates on the four sanctioned production agents fell from a measured 41% in January to 11% by April. Dimension 1 lifted by 1 point — multi-tool usage rose from 22% to 38% after a targeted enablement program identified 47 power users as mentors. The Q1 cost was $620K (tooling, services, internal time) against a baseline of $0 spent on governance.
The Q2 plan, in flight as of this writing, focuses on workflow integration (Dimension 2) and the first Codex pilot in a 38-engineer platform team (Dimension 3). The projected trajectory, from the February baseline through the four quarter-end re-scores, is 11 → 15 → 18 → 20 → 22, landing at the low end of frontier by year-end on a $2.4M annual investment. The expected-value math the CFO ran for the board paper was straightforward: even at a discounted version of McKinsey's 5.8x ROI figure, a frontier-tier AI capability across 45,000 employees produces mid-eight-figure annual returns. The investment case never had to clear a high bar; it just had to clear the bar the 3.5x gap forced onto the agenda.
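For readers who want to reproduce the shape of that board-paper math, here is an illustrative sketch. The per-employee spend and the discounted ROI multiple are assumptions, not figures from the case study:

```python
# Illustrative expected-value math; every input below is an assumption.
employees = 45_000
spend_per_employee = 300   # assumed frontier-tier AI spend, $/employee/year
roi_multiple = 3.0         # McKinsey's 5.8x average, discounted roughly in half

annual_spend = employees * spend_per_employee   # $13.5M
expected_return = annual_spend * roi_multiple   # $40.5M, mid-eight figures

print(f"spend ${annual_spend/1e6:.1f}M -> return ${expected_return/1e6:.1f}M")
```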
What to Do About It
For CIOs. Run the 25-point assessment this quarter. Score the enterprise plus three functions. If the overall score is below 15, your organization is in the typical cohort and the 3.5x gap is widening underneath you. The first 30 days should produce three artifacts: depth-of-usage telemetry, an agent inventory with named owners, and a Q1 governance plan focused on Dimensions 4 and 5. Do not start with vendor swaps. The data is unambiguous: organizational maturity, not vendor selection, is the bottleneck.
For CFOs. The per-seat budget anchor is now actively misleading. Reframe the AI budget around depth metrics: workflows completed per week, reasoning tokens per worker, agent-driven outcomes per dollar. The 5.8x average ROI is concentrated in the workers consuming the most intelligence — capping their consumption to control unit cost destroys returns. The right financial control is governance and eval coverage (which keeps spend productive), not seat or token caps (which suppress depth).
For business leaders. The Author/Editor/Director/Orchestrator framework belongs in operating-model conversations, not just IT planning. Functions still operating in Author mode in mid-2026 are the ones generating the 3.5x gap. Pick one function — customer service, sales operations, or finance close — and commit to Director-mode redesign by year-end. The change-management work is harder than the technical work, which is why frontier firms got there first.
For boards. Add the 25-point score to the quarterly operating review alongside revenue and headcount. The benchmarks are now public via B2B Signals and Microsoft's Frontier Firms work — the comparable group is no longer hypothetical. A score that is not improving quarter-over-quarter is a leading indicator that the AI investment is not producing the depth multiplier the board paper assumed.
OpenAI's release on May 6 was, ultimately, a benchmark publication dressed as a product launch. The company is going to keep publishing these numbers. The frontier cohort is going to keep pulling away. The window in which "we have AI" is a sufficient board answer is closing. The next twelve months will sort enterprises into two cohorts: those who closed half the gap, and those who watched it widen.
Want to calculate your own AI ROI? Try our AI ROI Calculator — takes 60 seconds and shows projected savings, payback period, and 3-year ROI.
Continue Reading
- Why 95% of AI Pilots Fail (And It's Not Technology)
- Stanford AI Index 2026: Agents Hit 66% Enterprise Deployment Gap
- Why Only 8% of Enterprise AI Projects Deliver ROI
- The $600B AI ROI Gap: 95% of Enterprise Pilots Fail
- Stanford AI Playbook for Organizational Readiness
Sources cited in this piece: OpenAI B2B Signals launch (introducing-b2b-signals, May 6, 2026); OpenAI The Next Phase of Enterprise AI (next-phase-of-enterprise-ai); OpenAI 2025 State of Enterprise AI Report (the-state-of-enterprise-ai-2025-report); Microsoft Frontier Firms operating-model post (May 5, 2026); Microsoft Security Blog "80% of Fortune 500 use active AI agents" (Feb 10, 2026); McKinsey State of AI 2026 (1% maturity, 92% investment); Gartner agentic AI cancellation forecast (40% by 2027); PwC AI Agent Survey 2026 (88% increase budgets, 79% adopting); Deloitte State of AI in the Enterprise 2026 (74% meeting/exceeding ROI); S&P Global Market Intelligence + McKinsey production-rate data (31% in production); Stanford AI Index 2026; analyst commentary aggregated via Klover.ai market survey, Digital Applied 2026 enterprise data points, Larridin AI Maturity Guide.
