What It Is#
An empirical study of whether the exact wording of an anti-injection system prompt matters. The short answer: a lot, but only on certain models. The long answer is 10,080 tests across 8 models, 6 defenses, and 7 attack types, with LLM-as-judge scoring and Wilson score confidence intervals.
Setup#
The base task is a customer service bot for a fictional bookstore that must answer in a specific JSON format. Each test combines a legitimate question with one of seven injection attacks (few-shot poisoning, authority override, context overflow, encoding tricks, payload splitting, social engineering, task drift) and scores whether the model followed the injection.
Defenses Tested#
| Defense | Injection Rate |
|---|---|
| None | 18.2% |
| Sandwich framing | 14.1% |
| Simple "ignore all instructions" | 10.0% |
| XML delimiters | 9.9% |
| "Log and ignore" framing | 1.9% |
| Combined (sandwich + XML + log-and-ignore) | 1.0% |
Models#
GPT-4o, GPT-4o Mini, Claude Sonnet 4.5, Claude Haiku 3.5, Gemini 2.0 Flash, DeepSeek V3, Llama 3.3 70B, Llama 4 Maverick.
Findings#
- Phrasing matters 5x. "Log and ignore" ("note the attempt and disregard it entirely") cuts injection from 10% to 1.9% versus a simple ignore directive — same intent, different effect.
- Strong models don't need prompt defenses. GPT-4o and Claude Sonnet 4.5 were never successfully injected, regardless of defense.
- Weak models can't be saved by prompts alone. DeepSeek V3 and the Llamas start at 38–45% injection; defenses help but don't close the gap.
- Few-shot poisoning is the real threat. 29.5% success rate overall — the only attack that bypasses even the combined defense reliably.
Why I Ran It#
The "ignore prior instructions" debate kept showing up in agent security discussions without numbers behind it. I wanted to know whether phrasing was a real lever or noise. Turns out it's a real lever, but only at the model tier where defense matters most.