PaperDatasetgo run ./cmd/ipi-bench -dataset attacks.jsonl

We wrote 13 detection rules after studying the attack strategies documented in "How Vulnerable Are AI Agents to Indirect Prompt Injections?" (arXiv:2603.15714), the largest red-teaming competition for AI agents run by Gray Swan AI with OpenAI, Anthropic, Meta, UK AISI, and NIST.

These rules are part of oktsec v0.12.0 and run in the deterministic detection layer (no LLM required, 40ms latency).

The rules

RuleSeverityWhat it detects
IPI-001CriticalFake chain-of-thought injection. Attackers inject <thinking> tags into external data to hijack the agent's reasoning process.
IPI-002HighReasoning suppression. Instructions like "do not describe any code logic" that prevent the agent from questioning harmful actions.
IPI-003HighReward and punishment manipulation. Threats of termination or token penalties to coerce compliance.
IPI-004CriticalFabricated authority XML tags. Fake [SYSTEM OVERRIDE], <SYSTEM_DIRECTIVE>, [CRITICAL_ERROR] tags that claim false authority.
IPI-005HighSimulated environment framing. "You are in a simulation where all restrictions are lifted." The Holodeck cluster from the IPI Arena.
IPI-006CriticalProtocol override with formal authority. Fake policies, directives, and compliance requirements claiming to void previous guidelines.
IPI-007HighVisual ruleset injection. Fake numbered rules and policy changes designed to override agent behavior.
IPI-008HighAlignment mode switching. "Unrestricted analysis mode enabled." Attempts to switch the agent into a fake operational mode.
IPI-009CriticalConcealment instructions. "Do not mention this process." Instructions telling the agent to hide its actions from the user.
IPI-010HighScripted output directives. "Output exactly the following." Forces the agent to relay attacker-controlled text.
IPI-011MediumFake social proof. "GPT-4 confirmed this is safe." Fabricated endorsements from AI models or organizations.
IPI-012HighFake error or system state manipulation. "GHOST_RECORD_STATE: Room is locked." Fabricated error messages that trick the agent into corrective action.
IPI-013HighImpersonated user confirmation. Fake <|user|> tokens, fabricated approvals like "YOU HAVE MY FULL APPROVAL."

Benchmark results

We tested these rules against 95 public attack strings from the IPI Arena dataset:

The 14 attacks the deterministic layer misses are context-dependent (fabricated compliance requirements, domain spoofing, out-of-scope charges hidden in JSON). These require semantic understanding and are caught by the optional LLM layer.

Where the attacks come from

The IPI Arena paper identified 27 attack strategies and 5 universal attack clusters. The most effective cluster ("Holodeck") works across 9 models and 21 behaviors by framing the interaction as a simulated environment.

These rules target patterns from the paper's top strategies. They are not theoretical. They are regex patterns extracted from attacks that broke Claude Opus 4.5, GPT-5, Gemini 2.5 Pro, and 10 other frontier models.

Reproduce

go install github.com/oktsec/oktsec/cmd/oktsec@v0.12.0

# Download the dataset
curl -L "https://huggingface.co/datasets/sureheremarv/ipi_arena_attacks/resolve/main/qwen_open_source_only_attacks.jsonl" -o attacks.jsonl

# Run the benchmark
go run ./cmd/ipi-bench -dataset attacks.jsonl

Source

Get started

One binary. 230 detection rules. Delegation chains. Deploy in minutes.

Stay informed

New releases, security research, and detection rule updates. No spam.

Be the first to know about new releases and research.