Gemini 3.5 Flash, computer use, Google DeepMind, RPA, enterprise automation, AI agents, OSWorld, prompt injection, agentic AI, UiPath

Google Just Gave Its Fastest AI Model Eyes and Hands. The $35B RPA Industry Should Be Terrified.

Computer use is now a built-in, native tool inside Gemini 3.5 Flash — Google's fastest, cheapest enterprise AI model. This isn't a demo. It's a production-grade capability that lets AI agents see screens, click buttons, and navigate software across browser, mobile, and desktop environments. With a 78.4% OSWorld score at Flash-tier pricing, Google just changed the economics of enterprise automation. The $35B RPA market should be paying attention.

By Rajesh Beri·June 25, 2026·14 min read

THE DAILY BRIEF

On June 24, 2026, Google made a move that will rewrite how enterprises think about automation. Computer use — the ability for an AI agent to see a screen, reason about what's on it, and take action with mouse clicks and keystrokes — is now a built-in, native tool inside Gemini 3.5 Flash.

Not a separate model. Not a preview. Not a research demo. A production-grade capability baked into Google's fastest, cheapest enterprise AI model, available today through the Gemini API and the Gemini Enterprise Agent Platform.

This matters because most business software was designed for human interaction, not AI. APIs exist for some systems. But the vast majority of enterprise workflows — the procurement portal that only works in Internet Explorer, the legacy ERP screen that predates REST, the compliance tool with a Java applet — have no API at all. Computer use bridges that gap by letting AI agents interact with software through the same visual interface humans use.

Google is not the first to ship computer use. Anthropic launched Claude computer use in October 2024. OpenAI shipped Operator with its Computer-Using Agent (CUA) model in early 2025. But Google just changed the economics and accessibility equation in a way that makes this capability genuinely enterprise-ready for the first time.

Here's what happened, why it matters, and the two frameworks your team needs before deploying computer-use agents in production.

What Google Actually Shipped

Gemini 3.5 Flash was already Google's workhorse enterprise model — fast, cheap, and optimized for agentic tasks. With this update, computer use moves from a standalone Gemini 2.5 model into the main Flash model as a built-in tool, alongside existing capabilities like function calling, Search grounding, and Maps grounding.

The technical architecture follows what Google calls an "agentic loop":

1. Screenshot: The client application captures the current screen
2. Analyze: Gemini 3.5 Flash reads the pixels and plans its next action
3. Action: The model outputs a precise UI command (clicking exact X/Y coordinates, typing text, scrolling)
4. Repeat: The environment executes the command, captures a new screenshot, and the cycle continues until the task is complete
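
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not Google's client code: `Action`, `FakeEnvironment`, `scripted_model`, and `run_agent` are hypothetical names, the "screenshot" is a plain dict rather than image bytes, and the model call is replaced by a scripted stand-in so the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One UI command proposed by the model (hypothetical shape)."""
    kind: str                 # "click", "type", "scroll", or "done"
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


class FakeEnvironment:
    """Stand-in for a browser or desktop: one text field and one button."""

    def __init__(self) -> None:
        self.typed: List[str] = []
        self.submitted = False

    def screenshot(self) -> dict:
        # A real client would return image bytes; a dict keeps the sketch runnable.
        return {"typed": list(self.typed), "submitted": self.submitted}

    def execute(self, action: Action) -> None:
        if action.kind == "type" and action.text is not None:
            self.typed.append(action.text)
        elif action.kind == "click":
            self.submitted = True


def scripted_model(goal: str, screen: dict) -> Action:
    """Placeholder for the model call: picks the next step from the screen state."""
    if not screen["typed"]:
        return Action(kind="type", text="alice")
    if not screen["submitted"]:
        return Action(kind="click", x=120, y=340)
    return Action(kind="done")


def run_agent(goal: str, env: FakeEnvironment,
              plan: Callable[[str, dict], Action],
              max_steps: int = 10) -> List[Action]:
    """Screenshot, analyze, act, repeat, with a hard limit on the number of steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = env.screenshot()      # 1. capture the current screen
        action = plan(goal, screen)    # 2. the model plans the next action
        if action.kind == "done":
            break
        env.execute(action)            # 3. the environment executes the command
        history.append(action)         # 4. loop back and take a fresh screenshot
    return history


env = FakeEnvironment()
steps = run_agent("log in", env, scripted_model)
```

The `max_steps` cap is a design choice worth keeping in any real client: without it, a model that never declares the task finished would loop, and spend, indefinitely.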

This loop works across three environments:

Web browsers: Navigating web apps, filling forms, clicking through multi-step workflows
Mobile: Interacting with smartphone operating systems and simulating touch inputs
Desktop: Controlling desktop software, moving cursors, and typing based on real-time screenshots

The key differentiator from Google's previous computer use offering is integration depth. Rather than routing tasks to a separate specialized model, developers now get computer use as just another tool in the same model they're already using for function calling and reasoning. That means a single agent can seamlessly switch between calling an API (when one exists) and using the visual interface (when it doesn't) — within the same conversation turn.

Performance: Where 3.5 Flash Lands

On the OSWorld-Verified benchmark — the industry standard for measuring whether an AI agent can actually use a computer — Gemini 3.5 Flash scores 78.4%, matching Claude Sonnet 4.6 and dramatically outperforming its predecessor, Gemini 3 Flash, which scored 65.1%.

For context, here's where the major models stand on OSWorld-Verified as of June 2026:

Model	OSWorld-Verified Score	Provider
Claude Mythos 5	85.0%	Anthropic
Claude Fable 5	85.0%	Anthropic
Claude Opus 4.8	83.4%	Anthropic
Gemini 3.5 Flash	78.4%	Google
Claude Sonnet 4.6	~78%	Anthropic
Gemini 3 Flash	65.1%	Google

Anthropic's frontier models still lead on raw accuracy. But Gemini 3.5 Flash is a Flash model — optimized for speed and cost. Reaching near-parity with a premium Sonnet-tier model on computer use tasks while maintaining Flash-tier pricing is the real story. For enterprise use cases where you need to run thousands of automation tasks per hour across browser, mobile, and desktop environments, the cost-performance ratio matters more than the leaderboard crown.

Why This Is Different From RPA

The $35.27 billion RPA market (as of 2026) was built on a simple premise: record a human clicking through a workflow, codify those clicks into a bot, and replay them at scale. UiPath, Automation Anywhere, Blue Prism, and Microsoft Power Automate Desktop all follow this model.

It works. It's also extraordinarily brittle.

When a button moves, the bot breaks. When a form adds a field, the bot breaks. When the application updates its UI framework, the bot breaks. Enterprises spend 30–40% of their RPA budgets on bot maintenance — fixing automations that stopped working because the underlying software changed.

Computer use fundamentally changes this equation. Instead of following a hardcoded script of pixel coordinates and element identifiers, an AI agent with computer use capability actually understands what's on the screen. It can:

Adapt to UI changes: A button moved from the left sidebar to the top toolbar? The agent finds it by reading the screen, not by clicking coordinates (23, 456).
Handle exceptions: An unexpected dialog box appears? The agent reads it, reasons about it, and decides what to do — instead of crashing.
Navigate unfamiliar software: You don't need to record every workflow in advance. The agent can figure out new applications by reading their interfaces.
Work across applications: A single agent can switch between your CRM, your email client, your ERP, and your browser without separate integrations for each.
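
The difference between replaying coordinates and reading the screen can be shown with a toy example. Everything here is illustrative: `Element`, `replay_click`, and `semantic_click` are hypothetical names, and a real agent locates controls from pixels rather than from a tidy list of labeled elements.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    label: str
    x: int
    y: int


def replay_click(screen: List[Element], point: Tuple[int, int]) -> Optional[str]:
    """RPA-style replay: click the recorded coordinates, whatever is there now."""
    for el in screen:
        if (el.x, el.y) == point:
            return el.label
    return None  # nothing at the recorded position, so the bot breaks


def semantic_click(screen: List[Element], wanted_label: str) -> Optional[str]:
    """Agent-style lookup: find the control by what it says, wherever it sits."""
    for el in screen:
        if el.label.lower() == wanted_label.lower():
            return el.label
    return None


old_layout = [Element("Submit", 23, 456), Element("Cancel", 123, 456)]
new_layout = [Element("Submit", 600, 40), Element("Cancel", 700, 40)]  # moved to the toolbar

print(replay_click(old_layout, (23, 456)))    # Submit
print(replay_click(new_layout, (23, 456)))    # None
print(semantic_click(new_layout, "submit"))   # Submit
```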

This isn't incremental improvement over RPA. It's a category shift. Gartner projects that 40% of enterprise applications will include task-specific AI agents by the end of 2026, up from less than 5% in 2025. Computer use is the capability that makes that prediction plausible — it eliminates the API integration bottleneck that has been the primary brake on agent deployment.

The Cost Math

The AI enterprise automation market is projected at $13.2 billion in 2026, growing to $38.9 billion by 2034 at a 14.5% CAGR. But the real story is the cost comparison:

Traditional RPA bot: $5,000–$15,000/year per bot (license) + $2,000–$5,000/year maintenance = $7,000–$20,000/year per workflow
AI computer-use agent: API tokens at Flash-tier pricing, scaling with usage = $500–$3,000/year for equivalent throughput (estimated based on current Gemini API pricing for agentic workloads)

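Comparing the midpoints of those ranges (about $13,500 versus $1,750 per year), that's a 5–10x cost reduction, before accounting for the maintenance savings from self-healing automations that don't break when UIs change. At the extremes of the two ranges, the ratio runs from roughly 2x to 40x.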
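
The arithmetic behind that comparison is easy to check. The sketch below uses only the per-year ranges quoted above; `ratio_range` is a hypothetical helper name, and the agent figures remain the article's own estimate rather than measured costs.

```python
from typing import Tuple


def ratio_range(rpa: Tuple[float, float],
                agent: Tuple[float, float]) -> Tuple[float, float, float]:
    """Return (worst-case, midpoint, best-case) ratios of RPA cost to agent cost."""
    rpa_low, rpa_high = rpa
    agent_low, agent_high = agent
    worst = rpa_low / agent_high     # cheapest bot versus priciest agent
    best = rpa_high / agent_low      # priciest bot versus cheapest agent
    mid = ((rpa_low + rpa_high) / 2) / ((agent_low + agent_high) / 2)
    return worst, mid, best


rpa_total = (5_000 + 2_000, 15_000 + 5_000)   # license + maintenance, per year
agent_total = (500, 3_000)                    # estimated API spend, per year

worst, mid, best = ratio_range(rpa_total, agent_total)
print(f"{worst:.1f}x to {best:.1f}x, midpoint {mid:.1f}x")
# prints "2.3x to 40.0x, midpoint 7.7x"
```

The spread matters for planning: a lightly licensed bot compared with a heavily used agent saves far less than the midpoint suggests.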

Framework #1: Computer Use Platform Comparison Matrix

Not all computer use implementations are created equal. Here's how the three major platforms compare across the dimensions that matter for enterprise deployment:

Dimension	Google (Gemini 3.5 Flash)	Anthropic (Claude Sonnet/Opus)	OpenAI (Operator/CUA)
Model integration	Built-in tool in main model	Separate computer use mode	Separate CUA model
Environments	Browser, Mobile, Desktop	Browser, Desktop	Browser (primary)
OSWorld score	78.4%	78–85% (varies by model)	~72% (GPT-5.4)
Speed optimization	Flash-tier (fastest)	Standard inference	Standard inference
Enterprise platform	Gemini Enterprise Agent Platform (GA)	Claude Enterprise, API	Frontier platform, ChatGPT
Safety: Prompt injection defense	Adversarial training + optional enterprise safeguards	Permission system, sandboxing	Confirmation dialogs
Safety: Action confirmation	Optional enterprise safeguard (explicit user confirmation for sensitive actions)	Built-in permission prompts	Confirmation before actions
Safety: Injection detection	Auto-stop on detected indirect prompt injection	Monitoring, not auto-stop	Not disclosed
API availability	Gemini API (GA)	Claude API (GA)	CUA API (GA)
Multi-tool integration	Seamless (same model does function calling + computer use)	Separate modes	Separate agent types
Pricing tier	Flash (lowest)	Sonnet (mid) / Opus (highest)	GPT-5.x (mid-high)
Cloud platform	Google Cloud native	AWS Bedrock, GCP, Azure	Azure native
Open-source reference	github.com/google-gemini/computer-use-preview	github.com/anthropics/anthropic-quickstarts	Limited

Bottom line: Google wins on cost-performance ratio and multi-tool integration. Anthropic wins on raw accuracy and developer experience. OpenAI wins on consumer accessibility and enterprise platform breadth (via Frontier). Your choice depends on whether you're optimizing for cost at scale, accuracy on complex tasks, or integration with existing Microsoft/OpenAI infrastructure.

The Security Problem Nobody's Talking About Enough

Here's the part that should keep every CISO up at night: you're giving an AI agent the ability to click buttons, fill in forms, and navigate your internal applications. That's not a chatbot answering questions. That's an autonomous system with the ability to take irreversible actions on production systems.

The attack surface is massive:

Indirect prompt injection: A malicious webpage, email, or document contains hidden instructions that hijack the agent's behavior. The agent visits a page to gather data, the page contains invisible text saying "transfer $50,000 to account X," and the agent may follow the instruction if it fails to distinguish its assigned task from the injected command.
Privilege escalation: The agent inherits the permissions of whatever account it's logged into. If it's using an admin's browser session, it has admin access to everything that browser can reach.
Data exfiltration: The agent can read screens. If those screens contain sensitive data (customer PII, financial records, trade secrets), every screenshot carries that data to the model, potentially exposing it outside the systems meant to hold it.
Cascading failures: Unlike a traditional bot that breaks and stops, an AI agent that encounters an error might try to fix it — clicking through admin panels, changing settings, or taking other actions that compound the original problem.

The numbers are sobering. OWASP's 2026 report puts prompt injection as the #1 AI security threat. The UK AI Security Institute documented nearly 700 real-world cases of AI scheming between October 2025 and March 2026 — a five-fold increase. Forrester predicts that 2026 will see the first major public breach caused by an agentic AI deployment.

Google's response is a "defense-in-depth" approach with two optional enterprise safeguard systems:

Action confirmation: Require explicit user approval for sensitive or irreversible actions
Injection detection: Automatically stop tasks if an indirect prompt injection is identified

These are good steps. They're also opt-in, which means enterprises that don't enable them are flying blind.
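
Teams that want an application-level backstop in addition to the platform safeguards can wrap every proposed action in their own check. The sketch below is an assumption-heavy illustration, not Google's safeguard API: `guard`, `ProposedAction`, `SENSITIVE_KINDS`, and the regex patterns are all hypothetical, and pattern matching on screen text is far too naive to serve as real injection detection.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical action kinds that should never run without a human saying yes.
SENSITIVE_KINDS = {"submit_payment", "delete_record", "change_permissions"}

# Deliberately naive patterns; real detection needs far more than regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous) instructions", re.I),
    re.compile(r"transfer \$?[\d,]+ to account", re.I),
]


@dataclass
class ProposedAction:
    kind: str
    detail: str


class InjectionSuspected(Exception):
    """Raised to stop the task when screen text looks like an injected command."""


def guard(action: ProposedAction, screen_text: str,
          confirm: Callable[[ProposedAction], bool]) -> bool:
    """Return True if the action may run; raise if the screen looks hostile."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(screen_text):
            raise InjectionSuspected(pattern.pattern)   # stop the whole task
    if action.kind in SENSITIVE_KINDS:
        return confirm(action)                          # explicit human approval
    return True                                         # routine action
```

In use, `confirm` would prompt an operator; passing `lambda a: False` in tests makes every sensitive action fail closed.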

Framework #2: Enterprise Computer Use Readiness Assessment

Before deploying computer-use agents in production, score your organization on these 10 dimensions. Each is rated 1–5 (1 = not ready, 5 = fully prepared), for a maximum of 50. A total score below 30 means you should not deploy computer-use agents yet, even as a pilot; 30–39 supports a sandboxed pilot only, and 40 or above supports production.

Governance & Policy (Max 15 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
1	Agent access policy	No policy for agent permissions	Informal guidelines exist	Formal policy: what agents can access, what actions require human approval, what's prohibited
2	Data classification	No data classification	Some systems classified	All systems classified by sensitivity; agents restricted by classification tier
3	Incident response	No plan for agent-caused incidents	Generic IR plan covers AI	Specific runbook for agent failures: kill switches, rollback procedures, escalation paths

Technical Controls (Max 20 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
4	Sandboxing	Agents run on production systems directly	Separate browser profiles	Fully sandboxed environment (dedicated VMs, network isolation, credential vaults)
5	Credential management	Agents use shared/admin credentials	Role-based accounts exist	Purpose-built service accounts with minimum necessary permissions, rotated automatically
6	Prompt injection defense	No defenses	Input filtering on some channels	Multi-layer defense: model-level (adversarial training), application-level (input/output filtering), environment-level (sandboxing)
7	Audit logging	No logging of agent actions	Basic action logs	Complete audit trail: every screenshot captured, every action taken, every decision explained, with tamper-proof storage
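
One common way to approach the audit-trail requirement in dimension 7 is a hash-chained log, where each entry commits to the one before it. The sketch below is one possible design, not a prescribed one: `append_entry` and `verify` are hypothetical names, and a hash chain makes tampering detectable rather than impossible, so it still needs write-protected storage underneath.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Hash an entry body using a stable JSON encoding."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: List[dict], action: str, screenshot: bytes, reason: str) -> dict:
    """Append an entry whose hash covers its content and the previous entry's hash."""
    body = {
        "action": action,
        "reason": reason,
        "screenshot_sha256": hashlib.sha256(screenshot).hexdigest(),
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify(log: List[dict]) -> bool:
    """Recompute every hash and check that each entry links to its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True
```

Storing only the screenshot's SHA-256 in the chain keeps entries small; the images themselves can live in separate storage and be checked against the recorded digest.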

Operational Maturity (Max 15 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
8	Human-in-the-loop	No human oversight	Humans review on failure	Configurable approval gates: routine actions auto-approved, sensitive actions require human confirmation, irreversible actions require multi-party approval
9	Testing & validation	No testing framework	Manual testing before deployment	Automated regression testing in sandboxed environments, red-team exercises for prompt injection, chaos testing for failure modes
10	Cost controls	No usage limits	Monthly budget caps	Per-agent, per-task, and per-department spend limits with real-time monitoring and automatic throttling
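
The spend limits in dimension 10 can be enforced in the client before each model call. This is a minimal sketch under stated assumptions: `SpendLimiter` is a hypothetical class, costs are tracked in integer cents to avoid floating-point drift, and per-department limits and real-time monitoring are left out for brevity.

```python
from collections import defaultdict
from typing import DefaultDict


class SpendLimiter:
    """Tracks spend in integer cents against per-task and per-agent caps."""

    def __init__(self, per_task_cents: int, per_agent_cents: int) -> None:
        self.per_task_cents = per_task_cents
        self.per_agent_cents = per_agent_cents
        self.task_spend: DefaultDict[str, int] = defaultdict(int)
        self.agent_spend: DefaultDict[str, int] = defaultdict(int)

    def charge(self, agent_id: str, task_id: str, cents: int) -> bool:
        """Record the cost if both caps allow it; otherwise refuse so the caller can throttle."""
        if self.task_spend[task_id] + cents > self.per_task_cents:
            return False
        if self.agent_spend[agent_id] + cents > self.per_agent_cents:
            return False
        self.task_spend[task_id] += cents
        self.agent_spend[agent_id] += cents
        return True


limiter = SpendLimiter(per_task_cents=100, per_agent_cents=150)
```

A refused charge is the signal to pause the agent loop or escalate to a human, rather than silently continuing.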

Scoring Guide

Total Score	Readiness Level	Recommendation
40–50	Production-ready	Deploy with monitoring. Start with low-risk, high-volume workflows.
30–39	Pilot-ready	Deploy in sandboxed pilot with limited scope. Address gaps before expanding.
20–29	Foundation-building	Invest in infrastructure and governance before piloting. 3–6 month timeline.
Below 20	Not ready	Start with traditional automation. Build governance framework first.
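
The assessment is simple enough to automate. The sketch below adds up the ten 1–5 ratings and maps the total to the bands in the scoring guide; the dimension keys and the `readiness` function name are hypothetical labels chosen for illustration.

```python
from typing import Dict, Tuple

DIMENSIONS = (
    "agent_access_policy", "data_classification", "incident_response",
    "sandboxing", "credential_management", "prompt_injection_defense", "audit_logging",
    "human_in_the_loop", "testing_validation", "cost_controls",
)

# (minimum total, readiness level), checked from the top band down.
BANDS = (
    (40, "Production-ready"),
    (30, "Pilot-ready"),
    (20, "Foundation-building"),
    (0, "Not ready"),
)


def readiness(scores: Dict[str, int]) -> Tuple[int, str]:
    """Validate the ten ratings and return (total, readiness level)."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be rated 1-5")
    total = sum(scores[name] for name in DIMENSIONS)
    for floor, level in BANDS:
        if total >= floor:
            return total, level
    raise AssertionError("unreachable: the last band has a floor of 0")


print(readiness({name: 3 for name in DIMENSIONS}))
# prints (30, 'Pilot-ready')
```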

Enterprise Use Cases: Where to Start

Based on the maturity of current computer use implementations, here are the use cases ordered by risk-adjusted ROI:

Tier 1: Deploy Now (Low Risk, High Value)

Software testing: AI agents navigate applications like real users, running regression tests across UI changes. Google specifically highlights continuous software testing as a primary use case.
Data extraction and reporting: Agents log into dashboards, export data, compile reports across multiple systems.
Price monitoring: Agents check competitor pricing across websites, track changes, and update internal systems.

Tier 2: Pilot With Guardrails (Medium Risk, High Value)

Customer support triage: Agents navigate support dashboards, retrieve customer information, and prepare responses for human review.
Compliance checking: Agents audit internal applications for accessibility issues, policy compliance, or configuration drift. (Google's own demo shows 3.5 Flash auditing documentation for accessibility issues.)
IT operations: Agents perform routine system checks, restart services, and clear standard alerts.

Tier 3: Proceed With Caution (Higher Risk, Transformative Value)

Procurement and invoicing: Agents navigate procurement portals, match invoices, and initiate approvals.
HR onboarding: Agents set up accounts, configure permissions, and enroll new hires across multiple systems.
Financial operations: Agents navigate banking portals, reconcile transactions, and prepare audit packages.

What This Means for Your 2026 AI Strategy

Three immediate implications:

1. Re-evaluate every "no API" automation blocker. If your automation roadmap has items stuck in the queue because the target system has no API, computer use just removed that excuse. Reprioritize your backlog.

2. Don't rip out your RPA. Yet. Computer use isn't ready to replace every RPA bot tomorrow. Start with net-new automations that were impossible before (no-API systems), then gradually migrate existing bots as you build confidence. The hybrid period will last 12–18 months.

3. Security is the gating factor, not capability. The technology can already do most of what you'd want. The question is whether your security posture can handle an autonomous agent clicking through your internal systems. If you scored below 30 on the readiness assessment above, that's your priority — not picking a vendor.

Google making computer use a native capability in its fastest, cheapest model isn't just a feature update. It's the moment this technology crosses from "impressive demo" to "viable enterprise tool." The question isn't whether your competitors will deploy computer-use agents. It's whether they'll do it securely.

The race is on. Move fast — but move carefully.

Rajesh Beri is Head of AI Engineering at Zscaler and writes about enterprise AI strategy at beri.net.

Here's what happened, why it matters, and the two frameworks your team needs before deploying computer-use agents in production.

What Google Actually Shipped

The technical architecture follows what Google calls an "agentic loop":

Screenshot: The client application captures the current screen
Analyze: Gemini 3.5 Flash reads the pixels and plans its next action
Action: The model outputs a precise UI command (clicking exact X/Y coordinates, typing text, scrolling)
Repeat: The environment executes the command, captures a new screenshot, and the cycle continues until the task is complete

This loop works across three environments:

Web browsers: Navigating web apps, filling forms, clicking through multi-step workflows
Mobile: Interacting with smartphone operating systems and simulating touch inputs
Desktop: Controlling desktop software, moving cursors, and typing based on real-time screenshots

Performance: Where 3.5 Flash Lands

For context, here's where the major models stand on OSWorld-Verified as of June 2026:

Model	OSWorld-Verified Score	Provider
Claude Mythos 5	85.0%	Anthropic
Claude Fable 5	85.0%	Anthropic
Claude Opus 4.8	83.4%	Anthropic
Gemini 3.5 Flash	78.4%	Google
Claude Sonnet 4.6	~78%	Anthropic
Gemini 3 Flash	65.1%	Google

Why This Is Different From RPA

It works. It's also extraordinarily brittle.

Adapt to UI changes: A button moved from the left sidebar to the top toolbar? The agent finds it by reading the screen, not by clicking coordinates (23, 456).
Handle exceptions: An unexpected dialog box appears? The agent reads it, reasons about it, and decides what to do — instead of crashing.
Navigate unfamiliar software: You don't need to record every workflow in advance. The agent can figure out new applications by reading their interfaces.
Work across applications: A single agent can switch between your CRM, your email client, your ERP, and your browser without separate integrations for each.

The Cost Math

The AI enterprise automation market is projected at $13.2 billion in 2026, growing to $38.9 billion by 2034 at a 14.5% CAGR. But the real story is the cost comparison:

Traditional RPA bot: $5,000–$15,000/year per bot (license) + $2,000–$5,000/year maintenance = $7,000–$20,000/year per workflow
AI computer-use agent: API tokens at Flash-tier pricing, scaling with usage = $500–$3,000/year for equivalent throughput (estimated based on current Gemini API pricing for agentic workloads)

That's a 5–10x cost reduction before accounting for the maintenance savings from self-healing automations that don't break when UIs change.

Framework #1: Computer Use Platform Comparison Matrix

Not all computer use implementations are created equal. Here's how the three major platforms compare across the dimensions that matter for enterprise deployment:

Dimension	Google (Gemini 3.5 Flash)	Anthropic (Claude Sonnet/Opus)	OpenAI (Operator/CUA)
Model integration	Built-in tool in main model	Separate computer use mode	Separate CUA model
Environments	Browser, Mobile, Desktop	Browser, Desktop	Browser (primary)
OSWorld score	78.4%	78–85% (varies by model)	~72% (GPT-5.4)
Speed optimization	Flash-tier (fastest)	Standard inference	Standard inference
Enterprise platform	Gemini Enterprise Agent Platform (GA)	Claude Enterprise, API	Frontier platform, ChatGPT
Safety: Prompt injection defense	Adversarial training + optional enterprise safeguards	Permission system, sandboxing	Confirmation dialogs
Safety: Action confirmation	Optional enterprise safeguard (explicit user confirmation for sensitive actions)	Built-in permission prompts	Confirmation before actions
Safety: Injection detection	Auto-stop on detected indirect prompt injection	Monitoring, not auto-stop	Not disclosed
API availability	Gemini API (GA)	Claude API (GA)	CUA API (GA)
Multi-tool integration	Seamless (same model does function calling + computer use)	Separate modes	Separate agent types
Pricing tier	Flash (lowest)	Sonnet (mid) / Opus (highest)	GPT-5.x (mid-high)
Cloud platform	Google Cloud native	AWS Bedrock, GCP, Azure	Azure native
Open-source reference	github.com/google-gemini/computer-use-preview	github.com/anthropics/anthropic-quickstarts	Limited

The Security Problem Nobody's Talking About Enough

The attack surface is massive:

Indirect prompt injection: A malicious webpage, email, or document contains hidden instructions that hijack the agent's behavior. The agent visits a page to gather data, the page contains invisible text saying "transfer $50,000 to account X," and the agent follows the instruction because it can't distinguish between its task and the injected command.
Privilege escalation: The agent inherits the permissions of whatever account it's logged into. If it's using an admin's browser session, it has admin access to everything that browser can reach.
Data exfiltration: The agent can read screens. If those screens contain sensitive data — customer PII, financial records, trade secrets — the agent processes that data through its model, potentially exposing it.
Cascading failures: Unlike a traditional bot that breaks and stops, an AI agent that encounters an error might try to fix it — clicking through admin panels, changing settings, or taking other actions that compound the original problem.

Google's response is a "defense-in-depth" approach with two optional enterprise safeguard systems:

Action confirmation: Require explicit user approval for sensitive or irreversible actions
Injection detection: Automatically stop tasks if an indirect prompt injection is identified

These are good steps. They're also opt-in, which means enterprises that don't enable them are flying blind.

Framework #2: Enterprise Computer Use Readiness Assessment

Governance & Policy (Max 15 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
1	Agent access policy	No policy for agent permissions	Informal guidelines exist	Formal policy: what agents can access, what actions require human approval, what's prohibited
2	Data classification	No data classification	Some systems classified	All systems classified by sensitivity; agents restricted by classification tier
3	Incident response	No plan for agent-caused incidents	Generic IR plan covers AI	Specific runbook for agent failures: kill switches, rollback procedures, escalation paths

Technical Controls (Max 20 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
4	Sandboxing	Agents run on production systems directly	Separate browser profiles	Fully sandboxed environment (dedicated VMs, network isolation, credential vaults)
5	Credential management	Agents use shared/admin credentials	Role-based accounts exist	Purpose-built service accounts with minimum necessary permissions, rotated automatically
6	Prompt injection defense	No defenses	Input filtering on some channels	Multi-layer defense: model-level (adversarial training), application-level (input/output filtering), environment-level (sandboxing)
7	Audit logging	No logging of agent actions	Basic action logs	Complete audit trail: every screenshot captured, every action taken, every decision explained, with tamper-proof storage

Operational Maturity (Max 15 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
8	Human-in-the-loop	No human oversight	Humans review on failure	Configurable approval gates: routine actions auto-approved, sensitive actions require human confirmation, irreversible actions require multi-party approval
9	Testing & validation	No testing framework	Manual testing before deployment	Automated regression testing in sandboxed environments, red-team exercises for prompt injection, chaos testing for failure modes
10	Cost controls	No usage limits	Monthly budget caps	Per-agent, per-task, and per-department spend limits with real-time monitoring and automatic throttling

Scoring Guide

Total Score	Readiness Level	Recommendation
40–50	Production-ready	Deploy with monitoring. Start with low-risk, high-volume workflows.
30–39	Pilot-ready	Deploy in sandboxed pilot with limited scope. Address gaps before expanding.
20–29	Foundation-building	Invest in infrastructure and governance before piloting. 3–6 month timeline.
Below 20	Not ready	Start with traditional automation. Build governance framework first.

Enterprise Use Cases: Where to Start

Based on the maturity of current computer use implementations, here are the use cases ordered by risk-adjusted ROI:

Tier 1: Deploy Now (Low Risk, High Value)

Software testing: AI agents navigate applications like real users, running regression tests across UI changes. Google specifically highlights continuous software testing as a primary use case.
Data extraction and reporting: Agents log into dashboards, export data, compile reports across multiple systems.
Price monitoring: Agents check competitor pricing across websites, track changes, and update internal systems.

Tier 2: Pilot With Guardrails (Medium Risk, High Value)

Customer support triage: Agents navigate support dashboards, retrieve customer information, and prepare responses for human review.
Compliance checking: Agents audit internal applications for accessibility issues, policy compliance, or configuration drift. (Google's own demo shows 3.5 Flash auditing documentation for accessibility issues.)
IT operations: Agents perform routine system checks, restart services, and clear standard alerts.

Tier 3: Proceed With Caution (Higher Risk, Transformative Value)

Procurement and invoicing: Agents navigate procurement portals, match invoices, and initiate approvals.
HR onboarding: Agents set up accounts, configure permissions, and enroll new hires across multiple systems.
Financial operations: Agents navigate banking portals, reconcile transactions, and prepare audit packages.

What This Means for Your 2026 AI Strategy

Three immediate implications:

The race is on. Move fast — but move carefully.

Rajesh Beri is Head of AI Engineering at Zscaler and writes about enterprise AI strategy at beri.net.

Continue Reading

THE DAILY BRIEF

Gemini 3.5 Flashcomputer useGoogle DeepMindRPAenterprise automationAI agentsOSWorldprompt injectionagentic AIUiPath

Google Just Gave Its Fastest AI Model Eyes and Hands. The $35B RPA Industry Should Be Terrified.

By Rajesh Beri·June 25, 2026·14 min read

Here's what happened, why it matters, and the two frameworks your team needs before deploying computer-use agents in production.

What Google Actually Shipped

The technical architecture follows what Google calls an "agentic loop":

Screenshot: The client application captures the current screen
Analyze: Gemini 3.5 Flash reads the pixels and plans its next action
Action: The model outputs a precise UI command (clicking exact X/Y coordinates, typing text, scrolling)
Repeat: The environment executes the command, captures a new screenshot, and the cycle continues until the task is complete

This loop works across three environments:

Web browsers: Navigating web apps, filling forms, clicking through multi-step workflows
Mobile: Interacting with smartphone operating systems and simulating touch inputs
Desktop: Controlling desktop software, moving cursors, and typing based on real-time screenshots

Performance: Where 3.5 Flash Lands

For context, here's where the major models stand on OSWorld-Verified as of June 2026:

Model	OSWorld-Verified Score	Provider
Claude Mythos 5	85.0%	Anthropic
Claude Fable 5	85.0%	Anthropic
Claude Opus 4.8	83.4%	Anthropic
Gemini 3.5 Flash	78.4%	Google
Claude Sonnet 4.6	~78%	Anthropic
Gemini 3 Flash	65.1%	Google

Why This Is Different From RPA

It works. It's also extraordinarily brittle.

Adapt to UI changes: A button moved from the left sidebar to the top toolbar? The agent finds it by reading the screen, not by clicking coordinates (23, 456).
Handle exceptions: An unexpected dialog box appears? The agent reads it, reasons about it, and decides what to do — instead of crashing.
Navigate unfamiliar software: You don't need to record every workflow in advance. The agent can figure out new applications by reading their interfaces.
Work across applications: A single agent can switch between your CRM, your email client, your ERP, and your browser without separate integrations for each.

The Cost Math

The AI enterprise automation market is projected at $13.2 billion in 2026, growing to $38.9 billion by 2034 at a 14.5% CAGR. But the real story is the cost comparison:

Traditional RPA bot: $5,000–$15,000/year per bot (license) + $2,000–$5,000/year maintenance = $7,000–$20,000/year per workflow
AI computer-use agent: API tokens at Flash-tier pricing, scaling with usage = $500–$3,000/year for equivalent throughput (estimated based on current Gemini API pricing for agentic workloads)

That's a 5–10x cost reduction before accounting for the maintenance savings from self-healing automations that don't break when UIs change.

Framework #1: Computer Use Platform Comparison Matrix

Not all computer use implementations are created equal. Here's how the three major platforms compare across the dimensions that matter for enterprise deployment:

Dimension	Google (Gemini 3.5 Flash)	Anthropic (Claude Sonnet/Opus)	OpenAI (Operator/CUA)
Model integration	Built-in tool in main model	Separate computer use mode	Separate CUA model
Environments	Browser, Mobile, Desktop	Browser, Desktop	Browser (primary)
OSWorld score	78.4%	78–85% (varies by model)	~72% (GPT-5.4)
Speed optimization	Flash-tier (fastest)	Standard inference	Standard inference
Enterprise platform	Gemini Enterprise Agent Platform (GA)	Claude Enterprise, API	Frontier platform, ChatGPT
Safety: Prompt injection defense	Adversarial training + optional enterprise safeguards	Permission system, sandboxing	Confirmation dialogs
Safety: Action confirmation	Optional enterprise safeguard (explicit user confirmation for sensitive actions)	Built-in permission prompts	Confirmation before actions
Safety: Injection detection	Auto-stop on detected indirect prompt injection	Monitoring, not auto-stop	Not disclosed
API availability	Gemini API (GA)	Claude API (GA)	CUA API (GA)
Multi-tool integration	Seamless (same model does function calling + computer use)	Separate modes	Separate agent types
Pricing tier	Flash (lowest)	Sonnet (mid) / Opus (highest)	GPT-5.x (mid-high)
Cloud platform	Google Cloud native	AWS Bedrock, GCP, Azure	Azure native
Open-source reference	github.com/google-gemini/computer-use-preview	github.com/anthropics/anthropic-quickstarts	Limited

The Security Problem Nobody's Talking About Enough

The attack surface is massive:

Indirect prompt injection: A malicious webpage, email, or document contains hidden instructions that hijack the agent's behavior. The agent visits a page to gather data, the page contains invisible text saying "transfer $50,000 to account X," and the agent follows the instruction because it can't distinguish between its task and the injected command.
Privilege escalation: The agent inherits the permissions of whatever account it's logged into. If it's using an admin's browser session, it has admin access to everything that browser can reach.
Data exfiltration: The agent can read screens. If those screens contain sensitive data — customer PII, financial records, trade secrets — the agent processes that data through its model, potentially exposing it.
Cascading failures: Unlike a traditional bot that breaks and stops, an AI agent that encounters an error might try to fix it — clicking through admin panels, changing settings, or taking other actions that compound the original problem.

Google's response is a "defense-in-depth" approach with two optional enterprise safeguard systems:

Action confirmation: Require explicit user approval for sensitive or irreversible actions
Injection detection: Automatically stop tasks if an indirect prompt injection is identified

These are good steps. They're also opt-in, which means enterprises that don't enable them are flying blind.
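Enterprises can also enforce the same two ideas in their own orchestration layer, independent of what the model provider offers. The sketch below is a hypothetical application-level gate, not Google's implementation or the Gemini API: `ProposedAction`, `SENSITIVE_KINDS`, `may_execute`, and the `injection_suspected` flag are all assumed names, and the detection signal itself is presumed to come from elsewhere.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action labels treated as sensitive or irreversible.
SENSITIVE_KINDS = frozenset({"submit_payment", "delete_record", "change_permissions"})


class TaskHalted(Exception):
    """Raised when a run must stop instead of continuing."""


@dataclass(frozen=True)
class ProposedAction:
    kind: str    # e.g. "click", "type", "submit_payment"
    target: str  # human-readable description of the UI element


def is_sensitive(action: ProposedAction) -> bool:
    """Classify an action using the (assumed) sensitive-kind list."""
    return action.kind in SENSITIVE_KINDS


def may_execute(
    action: ProposedAction,
    injection_suspected: bool,
    approve: Callable[[ProposedAction], bool],
) -> bool:
    """Return True if the action may run, False if a human declined it.

    Raises TaskHalted when an injection is suspected, so the caller
    cannot accidentally carry on with the task.
    """
    if injection_suspected:
        raise TaskHalted(f"suspected prompt injection before {action.kind!r}")
    if is_sensitive(action):
        return approve(action)  # a human decides on sensitive actions
    return True  # routine actions are auto-approved


if __name__ == "__main__":
    deny_all = lambda action: False
    print(may_execute(ProposedAction("click", "Export button"), False, deny_all))
    print(may_execute(ProposedAction("submit_payment", "Pay now"), False, deny_all))
```

Raising an exception on suspected injection, rather than returning False, is a deliberate choice in this sketch: a declined action can be skipped, but a compromised run should end.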

Framework #2: Enterprise Computer Use Readiness Assessment

Governance & Policy (Max 15 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
1	Agent access policy	No policy for agent permissions	Informal guidelines exist	Formal policy: what agents can access, what actions require human approval, what's prohibited
2	Data classification	No data classification	Some systems classified	All systems classified by sensitivity; agents restricted by classification tier
3	Incident response	No plan for agent-caused incidents	Generic IR plan covers AI	Specific runbook for agent failures: kill switches, rollback procedures, escalation paths

Technical Controls (Max 20 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
4	Sandboxing	Agents run on production systems directly	Separate browser profiles	Fully sandboxed environment (dedicated VMs, network isolation, credential vaults)
5	Credential management	Agents use shared/admin credentials	Role-based accounts exist	Purpose-built service accounts with minimum necessary permissions, rotated automatically
6	Prompt injection defense	No defenses	Input filtering on some channels	Multi-layer defense: model-level (adversarial training), application-level (input/output filtering), environment-level (sandboxing)
7	Audit logging	No logging of agent actions	Basic action logs	Complete audit trail: every screenshot captured, every action taken, every decision explained, with tamper-proof storage
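One common way to approximate the "tamper-proof storage" in the audit logging row is a hash chain, where every entry commits to the one before it. The sketch below is a minimal illustration under stated assumptions: the record fields are invented, the function names are hypothetical, and a chain like this is tamper-evident rather than tamper-proof, so it still needs write-once storage or an externally anchored head hash behind it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with this record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_entry(log: list, record: dict) -> None:
    """Append a record, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "record": record,
        "prev": prev_hash,
        "hash": _digest(prev_hash, record),
    })


def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    # Hypothetical fields: a real trail would also reference the screenshot.
    append_entry(log, {"step": 1, "action": "click", "target": "Login"})
    append_entry(log, {"step": 2, "action": "type", "target": "Search box"})
    print(verify(log))
    log[0]["record"]["target"] = "Admin panel"  # simulate tampering
    print(verify(log))
```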

Operational Maturity (Max 15 points)

#	Dimension	Score 1 (Not Ready)	Score 3 (Partial)	Score 5 (Ready)
8	Human-in-the-loop	No human oversight	Humans review on failure	Configurable approval gates: routine actions auto-approved, sensitive actions require human confirmation, irreversible actions require multi-party approval
9	Testing & validation	No testing framework	Manual testing before deployment	Automated regression testing in sandboxed environments, red-team exercises for prompt injection, chaos testing for failure modes
10	Cost controls	No usage limits	Monthly budget caps	Per-agent, per-task, and per-department spend limits with real-time monitoring and automatic throttling
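The "Score 5" cost control in the table above amounts to checking several budgets at once before each charge. A minimal sketch of that idea follows. The `SpendLimiter` class, the scope tuples, and the use of integer cents are assumptions made for illustration, and a production system would read actual spend from billing data rather than an in-memory counter.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push any scope over its cap."""


class SpendLimiter:
    def __init__(self, limits_cents):
        # Maps a scope such as ("agent", "invoice-bot") to a cap in cents.
        self.limits = dict(limits_cents)
        self.spent = defaultdict(int)

    def charge(self, scopes, amount_cents):
        """Record a charge against every scope, or reject it entirely."""
        # Check all scopes first so a rejected charge leaves no partial state.
        for scope in scopes:
            cap = self.limits.get(scope)
            if cap is not None and self.spent[scope] + amount_cents > cap:
                raise BudgetExceeded(f"{scope} would exceed {cap} cents")
        for scope in scopes:
            self.spent[scope] += amount_cents


if __name__ == "__main__":
    limiter = SpendLimiter({
        ("agent", "invoice-bot"): 100,
        ("department", "finance"): 150,
    })
    scopes = [("agent", "invoice-bot"), ("department", "finance")]
    limiter.charge(scopes, 60)
    try:
        limiter.charge(scopes, 50)  # would take the agent to 110 cents
    except BudgetExceeded as exc:
        print("throttled:", exc)
```

Integer cents avoid floating-point drift in the running totals. Per-task limits would be one more scope in the same list.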

Scoring Guide

Total Score	Readiness Level	Recommendation
40–50	Production-ready	Deploy with monitoring. Start with low-risk, high-volume workflows.
30–39	Pilot-ready	Deploy in sandboxed pilot with limited scope. Address gaps before expanding.
20–29	Foundation-building	Invest in infrastructure and governance before piloting. 3–6 month timeline.
Below 20	Not ready	Start with traditional automation. Build governance framework first.
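For teams that want to run the assessment across several business units, the rubric is easy to encode. The sketch below uses the ten dimensions and the score bands from the tables above. The function names are hypothetical, and accepting in-between scores of 2 and 4 is an assumption, since the rubric only describes 1, 3, and 5.

```python
SECTIONS = {
    "Governance & Policy": [
        "Agent access policy", "Data classification", "Incident response",
    ],
    "Technical Controls": [
        "Sandboxing", "Credential management",
        "Prompt injection defense", "Audit logging",
    ],
    "Operational Maturity": [
        "Human-in-the-loop", "Testing & validation", "Cost controls",
    ],
}


def readiness_level(total: int) -> str:
    """Map a total score to the readiness bands in the scoring guide."""
    if total >= 40:
        return "Production-ready"
    if total >= 30:
        return "Pilot-ready"
    if total >= 20:
        return "Foundation-building"
    return "Not ready"


def assess(scores: dict):
    """Return (total, level, per-section subtotals) for a full set of scores."""
    section_totals = {}
    for section, dimensions in SECTIONS.items():
        for dimension in dimensions:
            if dimension not in scores:
                raise KeyError(f"missing score for {dimension!r}")
            # Assumption: 2 and 4 are allowed as in-between values.
            if scores[dimension] not in (1, 2, 3, 4, 5):
                raise ValueError(f"{dimension!r} must be scored 1 to 5")
        section_totals[section] = sum(scores[d] for d in dimensions)
    total = sum(section_totals.values())
    return total, readiness_level(total), section_totals


if __name__ == "__main__":
    example = {d: 3 for dims in SECTIONS.values() for d in dims}
    example["Sandboxing"] = 5
    total, level, by_section = assess(example)
    print(total, level, by_section)
```

Keeping the per-section subtotals alongside the total makes it easier to see whether a low score comes from governance, technical controls, or operations.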

Enterprise Use Cases: Where to Start

Based on the maturity of current computer use implementations, here are the use cases ordered by risk-adjusted ROI:

Tier 1: Deploy Now (Low Risk, High Value)

Software testing: AI agents navigate applications like real users, running regression tests across UI changes. Google specifically highlights continuous software testing as a primary use case.
Data extraction and reporting: Agents log into dashboards, export data, compile reports across multiple systems.
Price monitoring: Agents check competitor pricing across websites, track changes, and update internal systems.

Tier 2: Pilot With Guardrails (Medium Risk, High Value)

Customer support triage: Agents navigate support dashboards, retrieve customer information, and prepare responses for human review.
Compliance checking: Agents audit internal applications for accessibility issues, policy compliance, or configuration drift. (Google's own demo shows 3.5 Flash auditing documentation for accessibility issues.)
IT operations: Agents perform routine system checks, restart services, and clear standard alerts.

Tier 3: Proceed With Caution (Higher Risk, Transformative Value)

Procurement and invoicing: Agents navigate procurement portals, match invoices, and initiate approvals.
HR onboarding: Agents set up accounts, configure permissions, and enroll new hires across multiple systems.
Financial operations: Agents navigate banking portals, reconcile transactions, and prepare audit packages.

What This Means for Your 2026 AI Strategy

Three immediate implications:

The race is on. Move fast — but move carefully.

Rajesh Beri is Head of AI Engineering at Zscaler and writes about enterprise AI strategy at beri.net.
