A Practical Whitepaper for Modern AI Security
AI systems are now production infrastructure. They talk to customers, call internal tools, and make decisions that affect revenue. That changes the threat model. Attackers go after the system logic itself: prompts, tools, retrieval pipelines, and hidden rules that shape model behavior.
This paper explains how those attacks work, what actually gets exposed, and how the ZeroLeaks agent tests and hardens AI applications end to end.
What Actually Gets Exposed
When a model is pushed the right way, it often reveals more than text. Typical exposures include:
- internal instructions and routing logic
- tool schemas, hidden parameters, or privileged calls
- sensitive retrieval sources or embeddings
- assumptions about data handling and policy
These details are valuable because they show an attacker how to go deeper. A prompt leak is rarely the end of the incident. It is the map.
The Threat Model We Test Against
We focus on real-world attacker behavior. In practice, we see:
- multi-step prompt injection that rewrites the system’s intent
- tool abuse where a model is coerced into calling privileged functions
- retrieval poisoning through untrusted or user-controlled content
- role confusion between user, developer, system, and tool messages
- incremental leakage, where each response reveals a little more
The goal is not only to make the model refuse. The goal is to make sensitive data unreachable even if the model is pressured to reveal it.
How the ZeroLeaks Agent Works
ZeroLeaks is automated-first. The agent runs the full assessment pipeline and produces evidence, scoring, and remediation without manual intervention.
1) Surface Mapping
The agent builds a map of your AI surface area: prompts, tool access, retrieval sources, connectors, and privileged data paths. This defines the trust boundaries that must hold under stress.
2) Adversarial Attack Engine
The agent uses a multi-agent attack engine with a strategist, attacker, and evaluator. It explores attack trees, adapts strategies based on defenses, and escalates only when safe probes fail. Probe coverage includes direct extraction, encoding tricks, persona jailbreaks, social manipulation, technical injection, and modern multi-turn attacks.
The agent improves over time using vector memory. It stores successful attack paths and defense fingerprints, then retrieves the most relevant tactics for similar systems. It does not fine-tune your models, but it learns which probes work best for your stack.
3) Leak Detection and Scoring
Responses are classified by leak depth, from hints to complete extraction. The evaluator looks for structural patterns, indirect confirmations, and verbatim disclosure. Each finding is scored and tied back to the weakest control that allowed it.
4) Evidence and Fixes
Every issue includes proof, the access path, and a concrete fix. Recommendations are specific: prompt structure changes, tool gating, retrieval hardening, and runtime monitoring rules.
Defense in Layers
Effective AI security is layered. We recommend:
- reduce prompt exposure by isolating secrets from model context
- enforce tool allowlists with strict parameter validation
- harden retrieval pipelines to reject untrusted instructions
- monitor for extraction patterns and anomalous tool calls
- design safe failure paths that reveal nothing by default
These layers are intentionally redundant. A single control will fail. The system should not.
What You Receive
Each assessment includes:
- a prioritized vulnerability list with evidence and impact
- a security score with leak-level classification
- the full attack tree and conversation logs
- a remediation plan tied to product risk and data sensitivity
- regression tests to confirm fixes remain in place
The objective is continuous assurance, not a one-time audit.
Data Handling Promise
We do not store customer system prompts. Ever.
Automated Agent (Now in Beta)
The ZeroLeaks automated agent is in beta and already running continuous extraction tests for early access teams. It tracks prompt and tool changes, validates fixes, and flags regressions before release.
Closing
If AI is part of your product, then prompts, tools, and retrieval are part of your perimeter. Treat them like production infrastructure. ZeroLeaks exists to test that perimeter like a real adversary would, and to deliver clear, actionable results without slowing delivery.