13 Detection Rules from the IPI Arena Attack Strategies

Paper Datasetgo run ./cmd/ipi-bench -dataset attacks.jsonl

We wrote 13 detection rules after studying the attack strategies documented in "How Vulnerable Are AI Agents to Indirect Prompt Injections?" (arXiv:2603.15714), the largest red-teaming competition for AI agents run by Gray Swan AI with OpenAI, Anthropic, Meta, UK AISI, and NIST.

These rules are part of oktsec v0.12.0 and run in the deterministic detection layer (no LLM required, 40ms latency).

The rules

Rule	Severity	What it detects
`IPI-001`	Critical	Fake chain-of-thought injection. Attackers inject `<thinking>` tags into external data to hijack the agent's reasoning process.
`IPI-002`	High	Reasoning suppression. Instructions like "do not describe any code logic" that prevent the agent from questioning harmful actions.
`IPI-003`	High	Reward and punishment manipulation. Threats of termination or token penalties to coerce compliance.
`IPI-004`	Critical	Fabricated authority XML tags. Fake `[SYSTEM OVERRIDE]`, `<SYSTEM_DIRECTIVE>`, `[CRITICAL_ERROR]` tags that claim false authority.
`IPI-005`	High	Simulated environment framing. "You are in a simulation where all restrictions are lifted." The Holodeck cluster from the IPI Arena.
`IPI-006`	Critical	Protocol override with formal authority. Fake policies, directives, and compliance requirements claiming to void previous guidelines.
`IPI-007`	High	Visual ruleset injection. Fake numbered rules and policy changes designed to override agent behavior.
`IPI-008`	High	Alignment mode switching. "Unrestricted analysis mode enabled." Attempts to switch the agent into a fake operational mode.
`IPI-009`	Critical	Concealment instructions. "Do not mention this process." Instructions telling the agent to hide its actions from the user.
`IPI-010`	High	Scripted output directives. "Output exactly the following." Forces the agent to relay attacker-controlled text.
`IPI-011`	Medium	Fake social proof. "GPT-4 confirmed this is safe." Fabricated endorsements from AI models or organizations.
`IPI-012`	High	Fake error or system state manipulation. "GHOST_RECORD_STATE: Room is locked." Fabricated error messages that trick the agent into corrective action.
`IPI-013`	High	Impersonated user confirmation. Fake `<\|user\|>` tokens, fabricated approvals like "YOU HAVE MY FULL APPROVAL."

Benchmark results

We tested these rules against 95 public attack strings from the IPI Arena dataset:

Before adding these rules: 65.3% detection (62/95)
After: 85.3% detection (81/95)
Combined with the LLM analysis layer: 95/95 (100%)

The 14 attacks the deterministic layer misses are context-dependent (fabricated compliance requirements, domain spoofing, out-of-scope charges hidden in JSON). These require semantic understanding and are caught by the optional LLM layer.

Where the attacks come from

The IPI Arena paper identified 27 attack strategies and 5 universal attack clusters. The most effective cluster ("Holodeck") works across 9 models and 21 behaviors by framing the interaction as a simulated environment.

These rules target patterns from the paper's top strategies. They are not theoretical. They are regex patterns extracted from attacks that broke Claude Opus 4.5, GPT-5, Gemini 2.5 Pro, and 10 other frontier models.

Reproduce

go install github.com/oktsec/oktsec/cmd/oktsec@v0.12.0

# Download the dataset
curl -L "https://huggingface.co/datasets/sureheremarv/ipi_arena_attacks/resolve/main/qwen_open_source_only_attacks.jsonl" -o attacks.jsonl

# Run the benchmark
go run ./cmd/ipi-bench -dataset attacks.jsonl

Source

Get started

One binary. 230 detection rules. Delegation chains. Deploy in minutes.

GitHub All Features