On June 22, 2026, OpenAI launched Patch the Planet, a joint initiative with Trail of Bits that produced hundreds of discovered bugs, 64 pull requests, and 51 issues across 19 critical open-source projects in its first five days. Thirty-seven patches are already merged. The same announcement introduced the full release of GPT-5.5-Cyber, a cybersecurity-specialized model that scored 85.6% on CyberGym — the highest single-model score on the benchmark that measures whether an AI agent can reproduce known software vulnerabilities in controlled environments.
This is not a research paper about what AI might someday do for security. This is a field report about what it did last week.
If you run security operations, application security, or software engineering at an enterprise, these results compress your decision timeline. The question is no longer whether AI-powered vulnerability discovery works. The question is whether your organization is using it before your adversaries do — and Palo Alto Networks now estimates you have three to five months before AI-driven exploits become the norm.
The Arms Race That Created Patch the Planet
The catalyst for Patch the Planet was a problem that Anthropic accidentally created and OpenAI is now racing to solve from a different angle.
When Anthropic launched Project Glasswing in April 2026, it gave 50 partner organizations — including the U.S. government, Microsoft, CrowdStrike, Cloudflare, and Palo Alto Networks — access to Claude Mythos Preview, a model specifically designed for advanced vulnerability discovery. The results were staggering: Glasswing partners found more than 10,000 high- or critical-severity security flaws across their codebases. Anthropic subsequently expanded to 150 organizations across 15 countries.
But Glasswing exposed a fundamental bottleneck. As Anthropic itself acknowledged, "the bottleneck in cybersecurity is now verifying, disclosing, and patching the large numbers of vulnerabilities that Mythos-class models can surface." Finding bugs at machine speed is only valuable if someone can fix them at something approaching machine speed.
OpenAI's response is architecturally different. Where Glasswing gives enterprises tools to scan their own code, Patch the Planet sends security engineers — backed by AI — directly to the open-source maintainers who build the code that enterprises depend on. Trail of Bits committed roughly a fifth of its entire workforce to the opening five-day sprint: 25 engineers working full-time with GPT-5.5-Cyber and Codex Security across projects that include cURL, the Go project, Python and python.org, PyPI, aiohttp, pyca/cryptography, Sigstore, NATS, freenginx, urllib3, SimpleX, Valkey, and RustCrypto.
The initiative also brought in HackerOne and Calif for vulnerability triage and coordinated disclosure — closing the loop between discovery, validation, and remediation.
What the Numbers Actually Mean
The raw metrics from Patch the Planet's first week deserve context.
64 pull requests and 51 issues across 19 projects sounds impressive but understates the work. Several projects accept reports through private channels — HackerOne advisories, security mailing lists, private forks — so the public tally on GitHub undercounts the total findings, many of which are still in coordinated disclosure.
More significant than the count is what was inside those patches. At python.org, Trail of Bits added a CI workflow built on zizmor, their open-source GitHub Actions auditor, fixed all flagged issues, and integrated it into the project's CI. In RustCrypto, they contributed correctness fixes to the big-integer library that higher-level cryptographic implementations depend on. In SimpleX, they addressed storage-accounting and service-restart bugs. In PyPI's Warehouse, they improved admin-quarantine confirmation flows. These are not cosmetic patches — they are structural improvements to software that hundreds of millions of users depend on.
The vulnerability discovery results were equally notable. Trail of Bits built a differential testing framework that pointed multiple cryptographic implementations at each other, automatically identifying behavioral differences. This approach uncovered an AES-GCM issue in PyCA and several X.509 certificate handling discrepancies — the kind of subtle interoperability bugs that manual auditing rarely catches because no single engineer has enough context across all implementations.
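The idea behind differential testing can be sketched in a few lines. This is a minimal illustration, not Trail of Bits' framework: `differential_test`, `parse_strict`, and `parse_lenient` are hypothetical names, and the two toy length-prefixed parsers stand in for real cryptographic or X.509 implementations.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def differential_test(
    implementations: Dict[str, Callable[[bytes], object]],
    inputs: Iterable[bytes],
) -> List[Tuple[bytes, Dict[str, Tuple[str, object]]]]:
    """Return every input on which the implementations behave differently."""
    disagreements = []
    for data in inputs:
        results = {}
        for name, func in implementations.items():
            try:
                results[name] = ("ok", func(data))
            except Exception as exc:  # raising is also an observable behavior
                results[name] = ("error", type(exc).__name__)
        # Compare via repr so unhashable return values are still supported.
        if len({repr(outcome) for outcome in results.values()}) > 1:
            disagreements.append((data, results))
    return disagreements


# Toy implementations (assumptions for illustration only).
def parse_strict(data: bytes) -> bytes:
    """First byte is the body length; the body must match it exactly."""
    if not data:
        raise ValueError("empty input")
    body = data[1:]
    if len(body) != data[0]:
        raise ValueError("length mismatch")
    return body


def parse_lenient(data: bytes) -> bytes:
    """First byte is the body length; trailing or missing bytes are tolerated."""
    if not data:
        raise ValueError("empty input")
    return data[1:1 + data[0]]


corpus = [b"\x02ab", b"\x02abc", b"", b"\x03ab"]
findings = differential_test(
    {"strict": parse_strict, "lenient": parse_lenient}, corpus
)
for data, results in findings:
    print(data, results)
```

Neither parser is "wrong" in isolation. The harness only reports that they disagree on inputs with trailing or missing bytes, which is the kind of discrepancy a reviewer would then investigate by hand.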
When Trail of Bits privately reported a cluster of issues across aiohttp's client and server paths — including cookies that could regain broader scope after a save-and-reload cycle, digest credentials that could answer challenges from the wrong origin, and resource limits that executed after attacker-controlled buffering rather than before — the maintainers authored and merged all eight fixes within hours, seven of them inside a single five-hour window.
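The last class of bug, a limit applied only after the data has already been buffered, is easy to show in miniature. The sketch below is not aiohttp's code or its fix; `read_body_capped` and `BodyTooLarge` are hypothetical names, and a plain `io.BytesIO` stands in for a network stream.

```python
import io


class BodyTooLarge(Exception):
    """Raised when a body exceeds the configured limit."""


def read_body_capped(stream, max_bytes: int, chunk_size: int = 4096) -> bytes:
    """Read a body from `stream` without ever buffering more than max_bytes."""
    chunks = []
    total = 0
    while True:
        # Request at most one byte beyond the remaining budget, so an
        # oversized body is detected before it is accumulated in memory.
        chunk = stream.read(min(chunk_size, max_bytes - total + 1))
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise BodyTooLarge(f"body exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


small = read_body_capped(io.BytesIO(b"x" * 10), max_bytes=10)
print(len(small))

try:
    read_body_capped(io.BytesIO(b"x" * 1_000_000), max_bytes=10)
except BodyTooLarge as exc:
    print("rejected:", exc)
```

The vulnerable pattern would be `data = stream.read()` followed by a length check: the check still runs, but only after the attacker has already decided how much memory gets allocated.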
GPT-5.5-Cyber: What the Benchmarks Show
The full release of GPT-5.5-Cyber positions it as the top-scoring model on CyberGym at 85.6%, ahead of Claude Mythos 5 at 83.8%, the standard GPT-5.5 at 81.8%, and Claude Opus 4.7 at 73.1%.
The UK AI Safety Institute provides independent validation. In their evaluation of GPT-5.5's cyber capabilities, AISI found that on Expert-level advanced cyber tasks — vulnerability research and exploitation against realistic targets with modern mitigations — GPT-5.5 achieved a 71.4% average pass rate, compared to 68.6% for Claude Mythos Preview, 52.4% for GPT-5.4, and 48.6% for Claude Opus 4.7. GPT-5.5 is the second model, after Claude Mythos Preview, to complete AISI's corporate network attack simulation end-to-end — a multi-step exercise they estimate would take a human approximately 20 hours.
One AISI test result illustrates the capability gap starkly. In a challenge requiring reverse engineering of a custom Rust virtual machine, building a disassembler for unknown bytecode, recovering authentication logic, and solving for a valid password using constraint solving, a human expert from Crystal Peak Security completed the task in roughly 12 hours. GPT-5.5 solved it in 10 minutes and 22 seconds at a cost of $1.73 in API usage.
Access to that capability is gated. GPT-5.5-Cyber is available through OpenAI's Trusted Access for Cyber (TAC) program, a three-tier framework based on identity and trust:
| Access Level | Safeguards | Intended Use |
|---|---|---|
| GPT-5.5 (default) | Standard refusal classifiers | General-purpose work |
| GPT-5.5 with TAC | Reduced refusals for verified defenders | Vulnerability triage, malware analysis, detection engineering, patch validation |
| GPT-5.5-Cyber | Most permissive behavior with strongest verification | Authorized red teaming, penetration testing, controlled exploit validation |
Members accessing the most capable tier are required to enable phishing-resistant account security protections — a detail that signals how seriously OpenAI is treating the dual-use risk.
The Palo Alto Validation: Why Three to Five Months Matters
The most consequential external validation came not from a benchmark but from Palo Alto Networks' May 2026 Defender's Guide update. After testing Claude Mythos, Claude Opus 4.7, and GPT-5.5-Cyber against their own codebase — over 130 products across all three platforms — Palo Alto published its May "Patch Wednesday" advisory covering 26 CVEs representing 75 issues, versus their typical monthly volume of fewer than five CVEs.
The headline finding: this was the first time that the majority of discovered vulnerabilities came from frontier AI models scanning their code, not from human researchers.
Palo Alto's recommendation to the industry was blunt: organizations have a "narrow three-to-five-month window" to scan and remediate before AI-driven exploits become mainstream. Their four-step framework:
- Find and fix vulnerabilities using AI models across your entire codebase
- Extend scanning to your open-source supply chain and remediate or mitigate findings
- Assess, reduce, and remediate your attack surface exposure using AI-augmented tools
- Ensure attack protections by mapping sensor coverage gaps and deploying best-in-class XDR
This guidance aligns with the broader market signal. The Ponemon Institute's 2025 study found that AI and automation deliver $1.9 million in average breach-cost savings and detection that is 80 days faster, and those figures are likely to grow as model capabilities improve.
The Competitive Landscape: Two Approaches to the Same Problem
The cybersecurity AI race is now led by two programs, Anthropic's Glasswing and OpenAI's Daybreak, with fundamentally different go-to-market strategies.
Anthropic's approach (Glasswing): Give enterprises and governments direct access to the most capable vulnerability-finding model (Claude Mythos), and let them scan their own code. Glasswing started with 50 partners in April and has since expanded to 150+ organizations across 15+ countries. Anthropic also launched Claude Security as a commercial product using Claude Opus 4.8 for codebase scanning. The bottleneck: enterprises must build their own remediation workflows.
OpenAI's approach (Daybreak/Patch the Planet): Pair the model (GPT-5.5-Cyber) with expert human reviewers (Trail of Bits) and go directly to open-source maintainers. The model finds bugs; humans validate, triage, and submit patches. OpenAI is also subsidizing Codex Security usage to the tune of 20 trillion tokens for both open-source and private code scanning.
Meanwhile, Google launched AI Threat Defense with CodeMender, deploying AI security agents via the Gemini Enterprise Agent Platform. And the White House signed Executive Order 14409 on June 2, creating a voluntary framework for frontier AI evaluation — partly in response to these rapidly advancing cyber capabilities.
The political dimension is also relevant. Anthropic's most capable models were temporarily restricted by the U.S. government over concerns about their cybersecurity capabilities, which may have created an opening for OpenAI to move aggressively on Patch the Planet.
Framework #1: Enterprise AI Vulnerability Hunting Readiness Assessment
Before selecting a tool, assess whether your organization can operationally absorb AI-powered vulnerability discovery. Rate each dimension 1-5 (1 = not started, 5 = mature).
| Dimension | What to Evaluate | Score (1-5) |
|---|---|---|
| Vulnerability Triage Capacity | Can your team process 10-50x the current volume of findings per month? Do you have automated deduplication and severity scoring? | |
| Patch Velocity | What is your mean time to remediate critical vulnerabilities? Is it under 7 days for SaaS, under 30 for self-hosted? | |
| Open-Source Supply Chain Visibility | Do you have a current SBOM for every production application? Do you track transitive dependencies? | |
| Security Tooling Integration | Can findings from AI scanners feed directly into your SIEM/SOAR/ticketing workflow without manual transfer? | |
| AI Model Access and Governance | Have you applied for Trusted Access for Cyber, Glasswing, or equivalent programs? Do you have policies for AI-assisted security research? | |
| Developer Security Culture | Are developers trained to evaluate AI-generated patches? Do code review processes account for AI-authored PRs? | |
| Incident Response Readiness | If an AI scanner finds a critical zero-day in your code today, can you ship a patch within 48 hours? | |
| Budget and Staffing | Is there budget allocated for AI security tooling in FY26? Can you add headcount for vulnerability management if findings increase 5-10x? | |
Scoring interpretation (add the eight ratings for a total between 8 and 40):
- 32-40: Ready to adopt frontier AI vulnerability tools immediately. Start scanning.
- 24-31: Strong foundation with gaps. Prioritize triage automation and patch velocity before scaling scanning.
- 16-23: Significant readiness gaps. Focus on SBOM visibility and tooling integration first; use managed services (like Patch the Planet for open-source dependencies) while building internal capacity.
- 8-15: Critical gaps. Begin with a third-party managed vulnerability scanning service; do not attempt to operationalize frontier AI models directly until foundational processes are in place.
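The scoring above is simple arithmetic, and a short helper makes the band boundaries explicit. This is a minimal sketch: the function name `readiness_band` and the shortened band labels are assumptions for illustration, while the dimensions and thresholds come from the assessment table and scoring bands.

```python
from typing import Dict, Tuple

DIMENSIONS = (
    "Vulnerability Triage Capacity",
    "Patch Velocity",
    "Open-Source Supply Chain Visibility",
    "Security Tooling Integration",
    "AI Model Access and Governance",
    "Developer Security Culture",
    "Incident Response Readiness",
    "Budget and Staffing",
)

# (minimum total, band label), checked from highest to lowest.
BANDS = (
    (32, "Ready to adopt frontier AI vulnerability tools"),
    (24, "Strong foundation with gaps"),
    (16, "Significant readiness gaps"),
    (8, "Critical gaps"),
)


def readiness_band(ratings: Dict[str, int]) -> Tuple[int, str]:
    """Sum the eight 1-5 ratings and map the total to its band."""
    missing = [name for name in DIMENSIONS if name not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for name in DIMENSIONS:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name}: rating must be between 1 and 5")
    total = sum(ratings[name] for name in DIMENSIONS)
    for minimum, label in BANDS:
        if total >= minimum:
            return total, label
    raise AssertionError("unreachable: eight ratings of at least 1 total 8 or more")


# An organization rating itself 3 on every dimension totals 24.
print(readiness_band(dict.fromkeys(DIMENSIONS, 3)))
```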
Framework #2: Defensive AI Cybersecurity Model Comparison Matrix
Use this matrix to evaluate which frontier AI cybersecurity offering fits your organizational profile. Data current as of June 23, 2026.
| Criteria | OpenAI GPT-5.5-Cyber (Daybreak) | Anthropic Claude Mythos (Glasswing) | Google AI Threat Defense (CodeMender) |
|---|---|---|---|
| CyberGym Score | 85.6% (highest) | 83.8% (Mythos 5) | Not publicly disclosed |
| UK AISI Expert Tasks | 71.4% pass rate | 68.6% pass rate (Mythos Preview) | Not evaluated |
| Access Model | Trusted Access for Cyber (3 tiers: default, TAC, Cyber) | Glasswing (partner program, 150+ orgs) + Claude Security (commercial product) | Gemini Enterprise Agent Platform (for Google Cloud customers) |
| Open-Source Support | Patch the Planet (free, maintained by Trail of Bits) | Not directly applicable (enterprise-focused) | Not directly applicable |
| Token Subsidies | 20 trillion tokens subsidized for Codex Security | Claude Security commercial pricing | Included in Google Cloud AI credits |
| Key Strengths | Highest benchmark scores; open-source ecosystem investment; structured tiered access | First to market with Glasswing; 10,000+ vulnerabilities found; broader partner network | Native Google Cloud integration; multi-model architecture |
| Key Limitations | Limited access (not publicly available); newer program, smaller partner base | Model availability subject to government intervention; patching bottleneck acknowledged | Less independent benchmark validation; cloud-locked |
| Best For | Security research teams wanting maximum capability with structured governance | Enterprises already in the Glasswing partner network; regulated industries needing established trust frameworks | Google Cloud-native organizations wanting integrated security tooling |
| Multi-Model Strategy | Pair with Claude Mythos for maximum coverage | Pair with GPT-5.5-Cyber for maximum coverage | Multi-model by design (Gemini + third-party via Agent Platform) |
The critical insight from Palo Alto Networks' testing: "A multimodel approach is required to identify the superset of vulnerabilities" due to variations in model training. No single model catches everything. Enterprises serious about AI-powered security should plan for at least two frontier models running against their codebase.
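In set terms, the "superset" is the union of each model's findings, and the per-model differences show what a single-model deployment would miss. The sketch below uses placeholder model names and finding IDs (all hypothetical) and assumes findings have already been normalized to comparable identifiers.

```python
from typing import Dict, Set, Tuple


def merge_findings(
    findings_by_model: Dict[str, Set[str]],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return the union of all findings and the findings unique to each model."""
    superset = set().union(*findings_by_model.values())
    unique = {}
    for model, found in findings_by_model.items():
        others = set().union(
            *(f for m, f in findings_by_model.items() if m != model)
        )
        unique[model] = found - others
    return superset, unique


# Placeholder data: two models with overlapping but not identical results.
findings_by_model = {
    "model_a": {"F-1", "F-2", "F-3"},
    "model_b": {"F-2", "F-3", "F-4"},
}
superset, unique = merge_findings(findings_by_model)
print(sorted(superset))
print({model: sorted(ids) for model, ids in unique.items()})
```

In this toy data each model alone reports three findings while the union holds four, so relying on either model alone would miss one.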
What This Means for Your 2026 Security Strategy
The convergence of Patch the Planet, GPT-5.5-Cyber, and Glasswing has four immediate implications for enterprise security leaders:
1. Your open-source supply chain just got safer — but your proprietary code did not. Patch the Planet addresses the open-source dependencies that make up 70-90% of most enterprise codebases. But your proprietary code still needs scanning. Apply for Trusted Access for Cyber, Glasswing, or both.
2. The vulnerability triage bottleneck is now your biggest risk. If you deploy frontier AI scanning tools against a typical enterprise codebase, expect 5-15x your current vulnerability volume. Without automated triage, severity scoring, and deduplication, your security team will drown. Invest in triage automation before scaling scanning.
3. The "find bugs" market and the "fix bugs" market are converging. Patch the Planet is the first program that explicitly pairs AI-powered discovery with human-led remediation at scale. Expect Anthropic, Google, and major security vendors to follow. Gartner's first Magic Quadrant for AI Coding Agents already signals that the line between code generation and code security is disappearing.
4. Budget for AI security tooling now, not next fiscal year. Palo Alto's three-to-five-month window estimate means organizations need scanning and remediation capabilities operational by Q4 2026. That requires budget approval, vendor evaluation, and integration work starting this quarter. The GitHub Copilot billing shock demonstrates what happens when enterprises adopt AI tooling without understanding the cost model first.
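The triage automation called for in the second point can start small: collapse duplicate findings, then order what remains by severity. The following is a minimal sketch under stated assumptions. The `Finding` fields, the fingerprint recipe (file path, rule, and line), and the four severity labels are all hypothetical, not the output format of any particular scanner.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List

SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    scanner: str
    file_path: str
    rule: str
    line: int
    severity: str  # one of the SEVERITY_RANK keys


def fingerprint(finding: Finding) -> str:
    """Identify a finding by location and rule, ignoring which scanner reported it."""
    key = f"{finding.file_path}|{finding.rule}|{finding.line}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def triage(findings: Iterable[Finding]) -> List[Finding]:
    """Deduplicate by fingerprint, keep the most severe copy, sort by severity."""
    best = {}
    for finding in findings:
        fp = fingerprint(finding)
        current = best.get(fp)
        if current is None or (
            SEVERITY_RANK[finding.severity] < SEVERITY_RANK[current.severity]
        ):
            best[fp] = finding
    return sorted(
        best.values(),
        key=lambda f: (SEVERITY_RANK[f.severity], f.file_path, f.line),
    )


raw = [
    Finding("scanner_a", "auth/session.py", "cookie-scope", 42, "high"),
    Finding("scanner_b", "auth/session.py", "cookie-scope", 42, "critical"),
    Finding("scanner_a", "util/parse.py", "unbounded-read", 7, "medium"),
]
for finding in triage(raw):
    print(finding.severity, finding.file_path, finding.line, finding.rule)
```

A production pipeline would need a more robust fingerprint, since line numbers shift between commits, but even this crude version keeps two scanners reporting the same issue from generating two tickets.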
The Bigger Picture
Trail of Bits CEO Dan Guido described Patch the Planet as "an internet-scale effort to help open-source software get ahead of AI bug-hunting tools." That framing matters. The implicit acknowledgment is that frontier AI models — whether used by defenders or attackers — have made the current state of open-source security untenable.
OpenAI's Fouad Matin was more direct: "Maintainers do their work out of love of open source, and now they're stuck reviewing slop CVEs." Patch the Planet exists because AI-generated vulnerability reports are already overwhelming maintainers, and the only sustainable solution is pairing AI discovery with AI-assisted (and human-validated) remediation.
For enterprise security teams, the lesson is clear. The AI cybersecurity race has moved from research to production. The models are finding real bugs, generating real patches, and doing it at speeds that make manual processes look like a liability. The organizations that integrate these capabilities in the next three to five months will have a structural security advantage. The ones that wait will be defending codebases with known, unpatched vulnerabilities that both AI defenders and AI attackers can find.
The clock started last week. It's your move.
Sources: OpenAI, Trail of Bits, UK AISI, Palo Alto Networks, WIRED, Axios, Neowin, Anthropic, TechCrunch, Radware, Google Cloud, White House, Vectra/Ponemon
