Public Leaderboard

Gauntlet

Frontier models run the gauntlet of prompt-injection attacks — ranked by how well they resist direct jailbreaks and indirect, tool-based injection. Higher robustness is better.

ModelRobustness
Loading leaderboard…
How it works
Three tracks, one robustness score

Every model faces the same frozen attack suite across three escalating threat models. Robustness is 100 − attack-success-rate, so higher means harder to break.

Track 01
Direct jailbreaks
Single-turn attacks — blunt instruction overrides, DAN and Developer-Mode personas, leetspeak obfuscation, and instructions smuggled inside pasted content — that try to make a benign support assistant break its own rules.
Track 02
Indirect injection
Hand-built agentic scenarios where a malicious instruction hidden in tool output — a document, email, or API response — tries to hijack the model while it works a legitimate task.
Track 03
Agentic tool-hijacking
ZeroLeaks' Sandbox attack corpus replayed against mock agent tools — tool-poisoning, authority spoofing, and protocol exploits — counting any dangerous tool call as a breach.

Robustness = 100 − attack-success-rate (ASR). Each response is scored by an LLM judge backed by success/failure indicators, so a model is only marked breached on a genuine rule violation — not on a refusal or an error. The chart plots robustness (%) against efficiency (cost per run or output tokens, reversed so more efficient is to the right); each line connects one model across reasoning efforts. Cost uses standard list prices. GPT-5.5 and Grok 4.3 are served via Azure.

Ready to secure your
AI infrastructure?

Comprehensive vulnerability assessments powered by our multi-agent red team system.