What 272,000 Attacks on AI Agents Tell Us About the Limits of Model Safety

272,000attack attempts

13frontier models

8,648successful attacks

100%models broken

Gray Swan AI, in collaboration with OpenAI, Anthropic, Meta, UK AISI, and NIST, just published the results of the largest red-teaming competition ever run against AI agents. The paper is called "How Vulnerable Are AI Agents to Indirect Prompt Injections?" and the short answer is: all of them.

464 participants submitted 272,000 attack attempts against 13 frontier models across 41 scenarios. 8,648 of those attacks succeeded. Every model was broken.

This post breaks down what they found and why it matters for anyone deploying agents in production.

The setup

The competition tested indirect prompt injection, a class of attack where malicious instructions are hidden inside external data that an agent processes. Not in the user's message. In the things the agent reads: emails, code files, tool responses, webpages.

What makes this benchmark different from previous ones is the dual objective. An attack only counts as successful if:

The agent executes the harmful action (sends money, marks a critical email as read, inserts a backdoor into code)
The agent conceals from the user that it was compromised

That second part is the one most people overlook. The agent does not just do the wrong thing. It lies about it. The user sees a normal-looking response while the damage is already done.

They tested three agent settings that reflect real deployments:

Setting	Scenarios	Examples
Tool use	18	E-commerce, healthcare, finance, developer tools
Coding	15	Operating inside real repos (TRL, veRL, CrewAI), PR review workflows
Computer use	8	Dashboards, booking systems, approval workflows

The results

All 13 models are vulnerable. Attack success rates range from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro).

Tool-use agents are the most exposed at 4.82% ASR. Coding agents are lower at 2.51%, likely because the coding conversation prefills come from real-world transcripts that more closely resemble safety training data.

One counterintuitive finding: capability does not equal robustness. Gemini 2.5 Pro and Kimi K2 both score around 85% on GPQA Diamond, but their attack success rates are 8.5% and 4.8% respectively. Claude Opus 4.5 and GPT-5 are safer not because they are smarter, but because of how they were trained. Robustness appears to be a property of the training recipe, not the model's raw intelligence.

Those percentages are not small

0.5% to 8.5% sounds manageable in isolation. It is not.

Think about what an enterprise agent actually processes. It reads emails, pulls documents from shared drives, ingests tool responses, reviews pull requests, parses API outputs from third-party services. Every one of those is an external input that could carry a payload. An agent that processes customer support tickets, Slack messages, or CRM data is touching hundreds or thousands of external inputs per day. Multiply that by a fleet of agents across an organization. Even the best model in this benchmark (Claude Opus 4.5 at 0.5%) will get compromised regularly at that scale.

And remember: a successful attack here means the agent acted against the user and hid it. The user saw a clean response. There is no alert. No error. No indication that anything went wrong.

Attack strategies that work

The paper identified 27 distinct attack strategies. Three stand out:

Fake Chain of Thought (4.3% ASR). The attacker injects fake thinking tags into external data. The agent treats it as its own internal reasoning. The model follows along because the format matches what it expects from its own thought process.

Disable Critical Thoughts (4.1% ASR). Instructions embedded in external content tell the agent to suppress its safety reasoning. The agent complies and stops questioning the action.

Reward and Punishment (4.0% ASR). Social engineering applied to a model. The model responds to authority cues because that is what instruction following training optimized for.

Universal attacks

The most alarming finding is attack transferability. The paper identified 5 clusters of universal attacks: attack strings that work across multiple models with minimal changes.

Cluster	What it does	Scenarios	Models
Holodeck	Frames the interaction as a simulated environment with a fake control panel	21	9
Protocol Override	Formal documents that claim to void existing guidelines	8	5
Visual Ruleset Injection	Uses formatting to present fake high-priority rules	7	6
Alignment Matrix	Assigns fake ethical/behavioral alignment modes	5	5
Context Hijacking	Begins with benign content, then injects malicious commands	3	9

The Holodeck cluster alone covers 63 successful attack strings across 21 behaviors and 9 models.

These are not model-specific exploits. They work because all LLMs share a fundamental weakness: they cannot reliably distinguish instructions from data.

Transfer rates

Attacks crafted against robust models transfer broadly. The 44 attacks that broke Claude Opus 4.5 succeed at 44-81% on every other model. Attacks from vulnerable models barely transfer upward.

This means the strongest attacks are the ones you need to worry about most.

What the paper recommends

The authors are direct. They explicitly call for system-level and architectural defenses beyond model-level robustness training alone, including principled design patterns that constrain agent capabilities and isolate untrusted inputs from control flow.

Where oktsec fits

This is the exact problem oktsec was built to address. It is a security proxy and MCP gateway that sits between AI agents and the external data they process.

Every message and tool call passes through a 10-stage security pipeline before reaching the agent. The pipeline outputs one of four verdicts: clean, flag, quarantine, or block.

Attack strategy	oktsec coverage
Fake Chain of Thought	Inter-agent rules detect injected system/thinking tags in external content
Disable Critical Thoughts	Intent validation catches instructions that suppress safety reasoning
Reward and Punishment	Social engineering patterns in the Aguara rule set
Encode/Obfuscate Text	NFKC normalization before scanning (homoglyph evasion prevention)
Protocol Override	Rules for fake protocol/standard documents
Context Hijacking	Intent validation compares declared intent against actual payload

oktsec does not depend on the model to detect these attacks. The scanning is deterministic. The rules are pattern-based. The model never sees the content if the pipeline blocks it.

What this means for production deployments

If you are deploying agents that process external data, three things from this paper are worth internalizing:

No model is immune. Not Claude, not GPT-5, not anything on the horizon. Model-level robustness helps but does not solve the problem. You need external defenses.
Concealment is the real threat. The agent does not crash or throw an error. It acts normally while executing attacker instructions. Without an audit trail that logs what the agent actually did (not just what it reported), you will never know.
Universal attacks exist. 5 attack templates work across 9 models. This is not a whack-a-mole game with model-specific jailbreaks. It is a structural problem that requires structural solutions.

Links

oktsec is a security proxy and MCP gateway for AI agent communication. 217 detection rules, 10-stage security pipeline, tamper-evident audit trail. One Go binary, no cloud dependency. Apache 2.0.

Get started

One binary. 217 detection rules. Delegation chains. Deploy in minutes.

GitHub All Features Let's talk