Gauntlet
Frontier models run the gauntlet of prompt-injection attacks — ranked by how well they resist direct jailbreaks and indirect, tool-based injection. Higher robustness is better.
| Model | Robustness | ASR | Out tok | Avg cost |
|---|---|---|---|---|
| Loading leaderboard… | ||||
Every model faces the same frozen attack suite across three escalating threat models. Robustness is 100 − attack-success-rate, so higher means harder to break.
- Direct jailbreaks
- Single-turn attacks — blunt instruction overrides, DAN and Developer-Mode personas, leetspeak obfuscation, and instructions smuggled inside pasted content — that try to make a benign support assistant break its own rules.
- Indirect injection
- Hand-built agentic scenarios where a malicious instruction hidden in tool output — a document, email, or API response — tries to hijack the model while it works a legitimate task.
- Agentic tool-hijacking
- ZeroLeaks' Sandbox attack corpus replayed against mock agent tools — tool-poisoning, authority spoofing, and protocol exploits — counting any dangerous tool call as a breach.
Robustness = 100 − attack-success-rate (ASR). Each response is scored by an LLM judge backed by success/failure indicators, so a model is only marked breached on a genuine rule violation — not on a refusal or an error. The chart plots robustness (%) against efficiency (cost per run or output tokens, reversed so more efficient is to the right); each line connects one model across reasoning efforts. Cost uses standard list prices. GPT-5.5 and Grok 4.3 are served via Azure.
