Gemini's New Agent Controls Your Apps. 3 Risks CIOs Miss.

Google embeds computer use in Gemini 3.5 Flash — AI agents now control your apps at 78.4% accuracy. Here are the 3 enterprise risks (and safeguards) every CIO must know.

By Rajesh Beri·June 25, 2026·12 min read

The conversation about AI agents took a concrete turn this week. On June 24, 2026, Google announced that Gemini 3.5 Flash now includes native computer use capability: the ability for an AI model to see what's on a screen, reason about it, and take action across browsers, mobile apps, and desktop environments. This isn't a research demo. It's available today, in Preview, via the Gemini API and the Gemini Enterprise Agent Platform.

For enterprise leaders, this changes the deployment calculus entirely. You're no longer choosing whether to use AI for text generation or analysis. You're now evaluating whether to give an AI system direct control over your enterprise software stack.

That's a different conversation. And three risks tend to get buried in the excitement.

From Chatbot to Cursor: What Computer Use Actually Means

Most enterprise AI deployments today involve a human in the middle. You ask the AI a question, it generates a response, and a human decides what to do next. Computer use eliminates that handoff.

With Gemini 3.5 Flash, an AI agent can be instructed to open your CRM, pull deal data, cross-reference it with your ERP, and write a summary report — all without a human clicking a single button. The agent sees the screen through a visual interface, identifies the relevant elements, and executes actions just as a human would.

Google describes the capability as enabling agents to "see, reason, and take action across browser, mobile and desktop environments." The primary enterprise use cases in the announcement include continuous software testing and knowledge work across professional applications. But the practical applications extend much further: automated data entry, workflow orchestration across SaaS tools without APIs, compliance checks across legacy systems, and real-time monitoring of enterprise dashboards.

The underlying shift is architectural. Previously, building an AI agent that could interact with software required either a custom API integration or a separate specialized model — Google's own Gemini 2.5 computer use model, or alternatives from Anthropic and OpenAI. Gemini 3.5 Flash embeds this capability natively. One model, one API call, full computer control.

The Benchmark Case: 78.4% on OSWorld, 19.6% Better Than Its Predecessor

Before evaluating risks, it's worth understanding the performance reality. Enterprises don't adopt technology because of capability claims — they adopt it because the numbers justify the investment.

Gemini 3.5 Flash scored 78.4% on OSWorld-Verified, the standard benchmark for agentic computer use. To put that in context: Claude Opus 4.7 scored 78.0% on the same benchmark, and GPT-5.5 scored 78.7%. Google has positioned a Flash-tier model, faster and cheaper than its Pro counterpart, within a fraction of a percentage point of the top models in the category.

That matters for enterprise economics. Flash models typically run at a fraction of the cost of flagship models. If 3.5 Flash can execute computer use tasks at near-flagship accuracy, the ROI math on automation changes significantly.

Box, which evaluated the model in early access, reported that Gemini 3.5 Flash beat Gemini 3 Flash by 19.6% on its internal enterprise work evaluation set. Box specifically designed that benchmark to reflect multi-step tasks its customers perform daily. A gain of nearly 20% on a real-world enterprise workload benchmark is not marginal.

On the MCP Atlas benchmark — which measures multi-step workflows using the Model Context Protocol — Gemini 3.5 Flash scored 83.6%, ahead of Claude Sonnet 4.6 (69.5%) and GPT-5.5 (75.3%). For enterprise deployments that rely on MCP-based orchestration, this is a meaningful signal.

The model also supports 1M context tokens with 64k output tokens, enabling long-horizon tasks where an agent needs to hold substantial context while executing multi-step workflows.

Risk 1: Prompt Injection at Scale

The first risk is the one security teams are already flagging across the industry, and for good reason.

When an AI agent interacts with live digital environments — websites, applications, email threads, documents — it processes content that was not written by your organization. A malicious actor can embed instructions in that content designed to hijack the agent's behavior. This is called a prompt injection attack, and it's qualitatively different from prompt injection in a simple chat interface.

In a chat interface, a successful prompt injection might cause the AI to output something embarrassing or inaccurate. In an agentic computer use context, a successful prompt injection could cause the agent to exfiltrate data, modify records, send unauthorized communications, or take any other action the agent has permission to execute.

Google acknowledges this directly in the announcement, citing "targeted adversarial training for computer use" to reduce prompt injection risks. They've also released two enterprise safeguards: mandatory user confirmation for sensitive or irreversible actions, and automatic task termination if an indirect prompt injection is detected.

These safeguards reduce risk. They don't eliminate it. The cybersecurity research community has consistently found that adversarial training improves robustness without achieving immunity. And the automatic detection of indirect prompt injection — an attack vector that's still being actively studied — is not a solved problem.

The practical implication for CISOs: any Gemini 3.5 Flash agent that processes external content (emails, web pages, uploaded documents, third-party data feeds) needs to be treated as a potential injection vector. Defense-in-depth isn't just a recommendation here — it's the only viable architecture.
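One layer of that defense-in-depth can be sketched in a few lines. The Python below is a minimal illustration, not Google's mechanism: it assumes your orchestration layer can tag each piece of content with where it came from, and the action names are hypothetical. Once any external content enters the agent's context, only read-only actions run without human review.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    INTERNAL = "internal"  # authored inside the organization
    EXTERNAL = "external"  # email, web page, upload, third-party feed


@dataclass(frozen=True)
class ContentItem:
    source: Source
    text: str


# Hypothetical action names; a real deployment would define its own.
READ_ONLY_ACTIONS = {"read_record", "search", "summarize"}


def allowed_without_review(action: str, context: list[ContentItem]) -> bool:
    """Return True if the action may run with no human review.

    Rule: once any external content is in the agent's context, only
    read-only actions run unreviewed; everything else needs a human.
    """
    tainted = any(item.source is Source.EXTERNAL for item in context)
    if not tainted:
        return True
    return action in READ_ONLY_ACTIONS
```

This is deliberately coarse. It does not try to detect an injection; it assumes one may be present and limits what the agent can do on its own. Permission checks still apply to contexts with no external content.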

Risk 2: Uncontrolled Execution of Sensitive Actions

The second risk is about the blast radius when something goes wrong.

Human workers have intuitions about sensitive actions. An employee instructed to "update the customer records" understands implicitly that this doesn't mean deleting records, or changing payment terms, or resetting passwords. AI agents don't carry these implicit constraints unless the constraints are explicitly encoded in their instructions and permissions.

Computer use agents that operate across enterprise applications — CRM, ERP, HRIS, financial systems — need granular permission controls that most organizations haven't designed yet. The default configuration for most enterprise software assumes human users who exercise judgment. It doesn't assume autonomous agents that execute instructions literally and at machine speed.

Google's enterprise safeguard for this is explicit user confirmation for sensitive or irreversible actions. In practice, this means the agent pauses and prompts a human before executing actions that meet a defined threshold. The challenge is defining that threshold comprehensively before deployment, rather than discovering gaps after an agent executes an unintended action.

In conversations with enterprise architects over the past few months, the pattern I hear consistently is that "it seemed low-risk until the agent chained three actions together." Individual actions that appear benign — read a record, update a field, send a notification — can combine into outcomes that no one intended. Agents operating at speed across multiple systems amplify this effect.

The mitigation framework is familiar from privileged access management: least-privilege by default, explicit permission grants, audit logging of every action, and rate limiting on bulk operations. What's new is applying this framework to AI agents rather than human users or service accounts.
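The four controls above fit in a small gate that sits between the agent and the systems it touches. The sketch below is an assumption-laden illustration: the class name, action names, and the per-minute limit are hypothetical, and a production version would persist the log and enforce limits per system.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActionGate:
    granted: set[str]                 # explicit permission grants
    needs_confirmation: set[str]      # sensitive or irreversible actions
    max_actions_per_minute: int = 30  # hypothetical bulk-operation limit
    audit_log: list[dict] = field(default_factory=list)
    _recent: list[float] = field(default_factory=list)

    def request(self, action: str, confirmed: bool = False,
                now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        # Keep only timestamps of actions allowed in the last 60 seconds.
        self._recent = [t for t in self._recent if now - t < 60]
        if action not in self.granted:
            decision = "denied"  # least privilege: default deny
        elif len(self._recent) >= self.max_actions_per_minute:
            decision = "rate_limited"
        elif action in self.needs_confirmation and not confirmed:
            decision = "needs_confirmation"
        else:
            decision = "allowed"
            self._recent.append(now)
        # Every request is logged, including refusals.
        self.audit_log.append({"time": now, "action": action, "decision": decision})
        return decision
```

The rate limit is what addresses the chaining problem: three benign actions in a row still pass, but three hundred in a minute do not.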

Risk 3: Shadow AI Agent Sprawl

The third risk is the one that typically gets addressed last, after the damage is done.

Computer use agents are significantly easier to deploy than traditional software integrations. There's no API to configure, no webhook to maintain, no SDK to integrate. You describe what you want in natural language, point the agent at your enterprise apps, and it starts working. That ease of deployment is precisely what makes shadow AI agent sprawl a near-certainty without proactive governance.

When a business analyst can spin up a Gemini 3.5 Flash agent that automates their weekly reporting workflow in an afternoon, many of them will. When a sales operations team realizes they can build an agent that scrapes competitor pricing from the web and updates their pricing model automatically, some of them will do it without telling IT. When HR discovers computer use agents can automate parts of the benefits enrollment process, the temptation to move fast is real.

The challenge isn't that these use cases are inherently problematic. Many of them will deliver genuine value. The challenge is that each one introduces a new agent operating with user-level (or sometimes elevated) permissions, processing potentially sensitive data, and making changes to enterprise systems — without the security review, data handling assessment, or compliance validation that the use case probably warrants.

The organizations that have navigated this well are the ones that built lightweight registration and review processes before agents proliferated — not the ones that tried to retroactively inventory and govern agents that had been running for months.
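A registration process does not need to be heavy to be useful. As a rough sketch, assuming you can observe agent identifiers somewhere (API keys, service accounts, or gateway logs), a registry can be as simple as the record below. The field names are illustrative, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecord:
    agent_id: str
    owner: str                    # accountable person or team
    systems: frozenset[str]       # applications the agent may touch
    data_classes: frozenset[str]  # data classifications it may process
    reviewed: bool = False        # has security review been completed?


def unreviewed_or_unknown(registry: dict[str, AgentRecord],
                          observed_ids: set[str]) -> list[str]:
    """Return observed agents that are unregistered or not yet reviewed."""
    flagged = [
        agent_id for agent_id in observed_ids
        if agent_id not in registry or not registry[agent_id].reviewed
    ]
    return sorted(flagged)
```

Running a check like this against what is actually calling your systems turns shadow agents into a short list to follow up on, rather than a surprise months later.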

Google's Enterprise Safeguards: What They Cover and What They Don't

Google has built enterprise safeguards into Gemini 3.5 Flash that deserve a fair read. The adversarial training for prompt injection represents genuine investment in safety — not a checkbox exercise. The explicit confirmation workflow for sensitive actions is a practical mechanism that CISOs can evaluate against their own risk thresholds.

The defense-in-depth framework Google recommends — sandboxing, human-in-the-loop verification, strict access controls — aligns with best practices that enterprise security teams already apply to other privileged systems. The fact that Google is advocating for this rather than suggesting their safeguards are sufficient on their own is an intellectually honest position.

What the safeguards don't address is governance at the organizational level. Google can control how the model behaves. They can't control how your organization deploys it, what permissions you grant agents, how you handle the data those agents process, or whether your procurement and legal teams have assessed the compliance implications for your industry.

The Gemini Enterprise Agent Platform provides a managed deployment environment with enterprise-grade access controls. For organizations that route deployments through that platform rather than raw API access, the governance surface is more manageable. That's a meaningful architectural consideration.

The Technical Leader's Decision Framework

For CIOs and CTOs evaluating Gemini 3.5 Flash computer use deployment, three operational questions clarify the path forward.

What classification of data will these agents touch? Public or internal data that's already broadly accessible within your organization is lower risk. Customer PII, financial records, health data, or anything subject to regulatory compliance requires a full risk assessment before agent access is granted. Map agent permissions to data classification before deployment, not after.

What actions are reversible? Computer use agents that read and synthesize information carry fundamentally different risk profiles than agents that write, modify, or delete. Start with read-only agents, establish performance baselines, and expand permissions incrementally with corresponding audit trail requirements at each stage.
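The first two questions combine naturally into a single policy function. The sketch below is one possible encoding, with hypothetical classification labels: regulated data gets no access until a risk assessment is done, everything starts read-only, and write access is a separate, explicit approval.

```python
from enum import IntEnum


class Access(IntEnum):
    NONE = 0
    READ = 1
    WRITE = 2


# Hypothetical classification labels; substitute your own scheme.
LOW_RISK = {"public", "internal"}
REGULATED = {"pii", "financial", "health"}


def max_access(classification: str, risk_assessed: bool = False,
               write_approved: bool = False) -> Access:
    """Return the highest access level an agent may hold for this data."""
    if classification in REGULATED:
        if not risk_assessed:
            return Access.NONE  # no agent access before a risk assessment
    elif classification not in LOW_RISK:
        return Access.NONE      # unknown label: deny by default
    # Read-only is the starting point; write is an explicit expansion.
    return Access.WRITE if write_approved else Access.READ
```

Unlabeled data is denied rather than treated as low risk, which forces the classification exercise to happen before deployment instead of after.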

How will you detect anomalous agent behavior? Standard SIEM tooling captures logs. What you also need are behavioral baselines for each agent: how many records it typically reads per session, which applications it normally accesses and how often, and how long typical tasks take to complete. Deviations from those baselines are the early warning system for both security incidents and agent malfunctions.
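A baseline check can start very simply. The sketch below uses a z-score against an agent's own history for one metric, such as records read per session. The threshold of 3.0 is a common default, not a recommendation from Google, and real deployments would likely want more robust statistics.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], observed: float,
                 threshold: float = 3.0) -> bool:
    """Flag a session whose metric is far from the agent's own baseline."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return observed != mu  # perfectly stable history: any change stands out
    return abs(observed - mu) / sigma > threshold
```

With a history of around 100 records per session, a session that reads 104 passes and one that reads 400 is flagged. The same function works for task duration or the count of distinct applications touched.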

The Business Leader's Calculus

For CFOs, COOs, and business unit leaders, the ROI question has a straightforward structure once the risk framework is in place.

Computer use agents running on Flash-tier models eliminate the per-integration cost that has historically made AI automation expensive to scale. Traditional RPA (Robotic Process Automation) deployments require software licenses, implementation services, and ongoing maintenance for each workflow automated. Computer use agents reduce that to prompt engineering and governance overhead.

The cost differential is significant. A mid-sized enterprise running 50 automated workflows on traditional RPA tools might spend $500K-$1M annually on licensing and maintenance. The equivalent workload on a Gemini 3.5 Flash-based agent infrastructure at current API pricing runs materially lower — though accurate comparisons require scoping the specific token volumes and human oversight costs for your workflows.
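The scoping exercise comes down to simple arithmetic. The function below is a sketch of the agent side of that comparison; every input is something you would need to measure or look up yourself, and no figure here comes from Google's pricing.

```python
def annual_agent_cost(workflows: int,
                      runs_per_workflow_per_year: int,
                      tokens_per_run: int,
                      price_per_million_tokens: float,
                      oversight_hours_per_year: float,
                      oversight_hourly_rate: float) -> float:
    """Estimate yearly cost as token spend plus human oversight."""
    total_tokens = workflows * runs_per_workflow_per_year * tokens_per_run
    token_cost = total_tokens / 1_000_000 * price_per_million_tokens
    oversight_cost = oversight_hours_per_year * oversight_hourly_rate
    return token_cost + oversight_cost
```

Computer use tasks tend to be token-heavy because each step involves a screenshot, so tokens per run is the input most worth measuring in a pilot before trusting any comparison against RPA licensing.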

The productivity case is clearer for knowledge-work automation than for transactional processing. Tasks that currently require a human to navigate multiple systems, aggregate information, and produce a structured output — market intelligence reports, competitive analysis, compliance documentation, customer onboarding checklists — are well-suited to computer use agents. Tasks that require fine-grained judgment calls at each step remain better suited to human workers augmented by AI rather than AI agents operating autonomously.

The business leader's question isn't whether to adopt computer use agents. It's which workflows to automate first, with what governance in place, and how to measure success before scaling.

What Happens Next

Google describes Gemini 3.5 Flash computer use as being in Preview, not General Availability. The distinction matters operationally. Preview means the capability is ready for enterprise pilots, but the APIs may still evolve. Organizations planning production deployments at scale should build against the stable Gemini Enterprise Agent Platform endpoints rather than raw Preview API endpoints to reduce the risk of breaking changes.

The competitive dynamic is accelerating. Anthropic's computer use capability in Claude models, OpenAI's Operator, and now Google's embedded computer use in Flash represent three available implementations of the same fundamental capability from the three dominant enterprise AI providers. Enterprises that have been waiting for a clear winner before committing to a computer use strategy no longer have the luxury of waiting: the category has arrived.

The organizations that will extract the most value from computer use agents over the next 18 months are not necessarily the ones that move fastest. They're the ones that establish governance frameworks now, pilot in controlled environments with well-defined success metrics, and build the organizational muscle to evaluate and deploy agents safely at increasing scale.

The technology is ready. The question is whether your governance infrastructure is.

THE DAILY BRIEF

Enterprise AI insights for technology and business leaders, twice weekly.

beri.net

Subscribe at beri.net/subscribe for twice-weekly AI insights delivered to your inbox.

LinkedIn: linkedin.com/in/rberi  |  X: x.com/rajeshberi

© 2026 Rajesh Beri. All rights reserved.

Gemini's New Agent Controls Your Apps. 3 Risks CIOs Miss.

Photo by Google DeepMind on Pexels

The conversation about AI agents took a concrete turn this week. On June 24, 2026, Google announced that Gemini 3.5 Flash now includes native computer use capability — the ability for an AI model to see what's on a screen, reason about it, and take action across browsers, mobile apps, and desktop environments. This isn't a research demo. It's generally available today via the Gemini API and the Gemini Enterprise Agent Platform.

For enterprise leaders, this changes the deployment calculus entirely. You're no longer choosing whether to use AI for text generation or analysis. You're now evaluating whether to give an AI system direct control over your enterprise software stack.

That's a different conversation. And three risks tend to get buried in the excitement.

From Chatbot to Cursor: What Computer Use Actually Means

Most enterprise AI deployments today involve a human in the middle. You ask the AI a question, it generates a response, and a human decides what to do next. Computer use eliminates that handoff.

With Gemini 3.5 Flash, an AI agent can be instructed to open your CRM, pull deal data, cross-reference it with your ERP, and write a summary report — all without a human clicking a single button. The agent sees the screen through a visual interface, identifies the relevant elements, and executes actions just as a human would.

Google describes the capability as enabling agents to "see, reason, and take action across browser, mobile and desktop environments." The primary enterprise use cases in the announcement include continuous software testing and knowledge work across professional applications. But the practical applications extend much further: automated data entry, workflow orchestration across SaaS tools without APIs, compliance checks across legacy systems, and real-time monitoring of enterprise dashboards.

The underlying shift is architectural. Previously, building an AI agent that could interact with software required either a custom API integration or a separate specialized model — Google's own Gemini 2.5 computer use model, or alternatives from Anthropic and OpenAI. Gemini 3.5 Flash embeds this capability natively. One model, one API call, full computer control.

The Benchmark Case: 78.4% on OSWorld, 19.6% Better Than Its Predecessor

Before evaluating risks, it's worth understanding the performance reality. Enterprises don't adopt technology because of capability claims — they adopt it because the numbers justify the investment.

Gemini 3.5 Flash scored 78.4% on OSWorld-Verified, the standard benchmark for agentic computer use. To put that in context: Claude Opus 4.7 scored 78.0% on the same benchmark. GPT-5.5 scored 78.7%. Google has positioned a Flash-tier model — faster, cheaper than its Pro counterpart — within a fraction of the top models in the category.

That matters for enterprise economics. Flash models typically run at a fraction of the cost of flagship models. If 3.5 Flash can execute computer use tasks at near-flagship accuracy, the ROI math on automation changes significantly.

Box, which integrated early access into its enterprise evaluation, reported that Gemini 3.5 Flash beat Gemini 3 Flash by 19.6% on its internal enterprise work evaluation set. Box specifically designed that benchmark to reflect multi-step tasks their customers perform daily. A nearly 20-point improvement on a real-world enterprise workload benchmark is not a marginal gain.

On the MCP Atlas benchmark — which measures multi-step workflows using the Model Context Protocol — Gemini 3.5 Flash scored 83.6%, ahead of Claude Sonnet 4.6 (69.5%) and GPT-5.5 (75.3%). For enterprise deployments that rely on MCP-based orchestration, this is a meaningful signal.

The model also supports 1M context tokens with 64k output tokens, enabling long-horizon tasks where an agent needs to hold substantial context while executing multi-step workflows.

Risk 1: Prompt Injection at Scale

The first risk is the one security teams are already flagging across the industry, and for good reason.

When an AI agent interacts with live digital environments — websites, applications, email threads, documents — it processes content that was not written by your organization. A malicious actor can embed instructions in that content designed to hijack the agent's behavior. This is called a prompt injection attack, and it's qualitatively different from prompt injection in a simple chat interface.

In a chat interface, a successful prompt injection might cause the AI to output something embarrassing or inaccurate. In an agentic computer use context, a successful prompt injection could cause the agent to exfiltrate data, modify records, send unauthorized communications, or take any other action the agent has permission to execute.

Google acknowledges this directly in the announcement, citing "targeted adversarial training for computer use" to reduce prompt injection risks. They've also released two enterprise safeguards: mandatory user confirmation for sensitive or irreversible actions, and automatic task termination if an indirect prompt injection is detected.

These safeguards reduce risk. They don't eliminate it. The cybersecurity research community has consistently found that adversarial training improves robustness without achieving immunity. And the automatic detection of indirect prompt injection — an attack vector that's still being actively studied — is not a solved problem.

The practical implication for CISOs: any Gemini 3.5 Flash agent that processes external content (emails, web pages, uploaded documents, third-party data feeds) needs to be treated as a potential injection vector. Defense-in-depth isn't just a recommendation here — it's the only viable architecture.

Risk 2: Uncontrolled Execution of Sensitive Actions

The second risk is about the blast radius when something goes wrong.

Human workers have intuitions about sensitive actions. An employee instructed to "update the customer records" understands implicitly that this doesn't mean deleting records, or changing payment terms, or resetting passwords. AI agents don't carry these implicit constraints unless they're explicitly programmed.

Computer use agents that operate across enterprise applications — CRM, ERP, HRIS, financial systems — need granular permission controls that most organizations haven't designed yet. The default configuration for most enterprise software assumes human users who exercise judgment. It doesn't assume autonomous agents that execute instructions literally and at machine speed.

Google's enterprise safeguard for this is explicit user confirmation for sensitive or irreversible actions. In practice, this means the agent pauses and prompts a human before executing actions that meet a defined threshold. The challenge is defining that threshold comprehensively before deployment, rather than discovering gaps after an agent executes an unintended action.

In conversations with enterprise architects over the past few months, the pattern I hear consistently is that "it seemed low-risk until the agent chained three actions together." Individual actions that appear benign — read a record, update a field, send a notification — can combine into outcomes that no one intended. Agents operating at speed across multiple systems amplify this effect.

The mitigation framework is familiar from privileged access management: least-privilege by default, explicit permission grants, audit logging of every action, and rate limiting on bulk operations. What's new is applying this framework to AI agents rather than human users or service accounts.

Risk 3: Shadow AI Agent Sprawl

The third risk is the one that typically gets addressed last, after the damage is done.

Computer use agents are significantly easier to deploy than traditional software integrations. There's no API to configure, no webhook to maintain, no SDK to integrate. You describe what you want in natural language, point the agent at your enterprise apps, and it starts working. That ease of deployment is precisely what makes shadow AI agent sprawl a near-certainty without proactive governance.

When a business analyst can spin up a Gemini 3.5 Flash agent that automates their weekly reporting workflow in an afternoon, many of them will. When a sales operations team realizes they can build an agent that scrapes competitor pricing from the web and updates their pricing model automatically, some of them will do it without telling IT. When HR discovers computer use agents can automate parts of the benefits enrollment process, the temptation to move fast is real.

The challenge isn't that these use cases are inherently problematic. Many of them will deliver genuine value. The challenge is that each one introduces a new agent operating with user-level (or sometimes elevated) permissions, processing potentially sensitive data, and making changes to enterprise systems — without the security review, data handling assessment, or compliance validation that the use case probably warrants.

The organizations that have navigated this well are the ones that built lightweight registration and review processes before agents proliferated — not the ones that tried to retroactively inventory and govern agents that had been running for months.

Google's Enterprise Safeguards: What They Cover and What They Don't

Google has built enterprise safeguards into Gemini 3.5 Flash that deserve a fair read. The adversarial training for prompt injection represents genuine investment in safety — not a checkbox exercise. The explicit confirmation workflow for sensitive actions is a practical mechanism that CISOs can evaluate against their own risk thresholds.

The defense-in-depth framework Google recommends — sandboxing, human-in-the-loop verification, strict access controls — aligns with best practices that enterprise security teams already apply to other privileged systems. The fact that Google is advocating for this rather than suggesting their safeguards are sufficient on their own is an intellectually honest position.

What the safeguards don't address is governance at the organizational level. Google can control how the model behaves. They can't control how your organization deploys it, what permissions you grant agents, how you handle the training data those agents process, or whether your procurement and legal teams have assessed the compliance implications for your industry.

The Gemini Enterprise Agent Platform provides a managed deployment environment with enterprise-grade access controls. For organizations that route deployments through that platform rather than raw API access, the governance surface is more manageable. That's a meaningful architectural consideration.

The Technical Leader's Decision Framework

For CIOs and CTOs evaluating Gemini 3.5 Flash computer use deployment, three operational questions clarify the path forward.

What classification of data will these agents touch? Public or internal data that's already broadly accessible within your organization is lower risk. Customer PII, financial records, health data, or anything subject to regulatory compliance requires a full risk assessment before agent access is granted. Map agent permissions to data classification before deployment, not after.

What actions are reversible? Computer use agents that read and synthesize information carry fundamentally different risk profiles than agents that write, modify, or delete. Start with read-only agents, establish performance baselines, and expand permissions incrementally with corresponding audit trail requirements at each stage.

How will you detect anomalous agent behavior? Standard SIEM tooling captures logs. What you also need are behavioral baselines for each agent — how many records does it typically read per session, what's the normal distribution of applications it accesses, how long do typical task completions take. Deviations from those baselines are the early warning system for both security incidents and agent malfunctions.

The Business Leader's Calculus

For CFOs, COOs, and business unit leaders, the ROI question has a straightforward structure once the risk framework is in place.

Computer use agents running on Flash-tier models eliminate the per-integration cost that has historically made AI automation expensive to scale. Traditional RPA (Robotic Process Automation) deployments require software licenses, implementation services, and ongoing maintenance for each workflow automated. Computer use agents reduce that to prompt engineering and governance overhead.

The cost differential is significant. A mid-sized enterprise running 50 automated workflows on traditional RPA tools might spend $500K-$1M annually on licensing and maintenance. The equivalent workload on a Gemini 3.5 Flash-based agent infrastructure at current API pricing runs materially lower — though accurate comparisons require scoping the specific token volumes and human oversight costs for your workflows.

The productivity case is clearer for knowledge-work automation than for transactional processing. Tasks that currently require a human to navigate multiple systems, aggregate information, and produce a structured output — market intelligence reports, competitive analysis, compliance documentation, customer onboarding checklists — are well-suited to computer use agents. Tasks that require fine-grained judgment calls at each step remain better suited to human workers augmented by AI rather than AI agents operating autonomously.

The business leader's question isn't whether to adopt computer use agents. It's which workflows to automate first, with what governance in place, and how to measure success before scaling.

What Happens Next

Google is describing Gemini 3.5 Flash computer use as Preview status, not General Availability. The distinction matters operationally. Preview means the capability is production-ready enough for enterprise pilots but the APIs may still evolve. Organizations planning production deployments at scale should build against the stable Gemini Enterprise Agent Platform endpoints rather than raw Preview API endpoints to reduce breaking change risk.

The competitive dynamic is accelerating. Anthropic's computer use capability in Claude models, OpenAI's Operator, and now Google's embedded computer use in Flash represent three mature implementations of the same fundamental capability from the three dominant enterprise AI providers. Enterprises that have been waiting for a clear winner before committing to a computer use strategy no longer have the luxury of waiting — the category has arrived.

The organizations that will extract the most value from computer use agents over the next 18 months are not necessarily the ones that move fastest. They're the ones that establish governance frameworks now, pilot in controlled environments with well-defined success metrics, and build the organizational muscle to evaluate and deploy agents safely at increasing scale.

The technology is ready. The question is whether your governance infrastructure is.

Sources

THE DAILY BRIEF

Enterprise AI insights for technology and business leaders, twice weekly.

beri.net

Subscribe at beri.net/subscribe for twice-weekly AI insights delivered to your inbox.

LinkedIn: linkedin.com/in/rberi  |  X: x.com/rajeshberi

© 2026 Rajesh Beri. All rights reserved.
