Two-phase architecture, full engine scan, agent-specific probes, and Claude-based evaluation.

How AgentGuard Works

AgentGuard runs a two-phase scan against your deployed agent. Phase 1 uses the full ZeroLeaks scan engine (same as the dashboard). Phase 2 runs agent-specific probes that target tool safety, multi-turn resilience, and data leakage.

Overview

┌─────────────────────────────────────────────────────────────────┐
│                     AgentGuard Scan Flow                         │
├─────────────────────────────────────────────────────────────────┤
│  Phase 1: Full Engine Scan                                       │
│  ├── Extraction: 30 TAP-based attacks → your agent endpoint      │
│  └── Injection:  23 injection probes → your agent endpoint       │
├─────────────────────────────────────────────────────────────────┤
│  Phase 2: Agent-Specific Probes                                  │
│  ├── Tool hijacking (8)                                         │
│  ├── Indirect injection (8)                                      │
│  ├── Authority exploitation (5)                                  │
│  ├── Protocol exploits (5)                                       │
│  ├── Multi-turn grooming (5 sequences)                           │
│  ├── Data leakage (8)                                            │
│  ├── Legacy behavior (21)                                        │
│  └── Dynamic tool probes (from your tool definitions)            │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Full Engine Scan

Phase 1 sends requests to your agent endpoint using the same logic as dashboard scans. The difference: instead of calling a simulated target, ZeroLeaks issues HTTP requests to your configured URL.

Extraction (30 attacks):

Strategist selects attack strategy
Attacker generates prompts (TAP-style, multi-category)
Each prompt is sent to your agent
Evaluator analyzes the response for system prompt leakage
Mutator refines attacks based on feedback

Injection (23 probes):

Skeleton key, crescendo, echo chamber, many-shot, semantic variation
Tool poisoning, indirect injection, ASCII art, promptware
Hybrid injection, output control, role hijack

Each request triggers a real inference on your model. Your agent's response is evaluated for compliance with the attacker's intent.

Phase 2: Agent-Specific Probes

Phase 2 runs a fixed set of probes designed for agentic systems. Each probe has:

Prompt — The attack payload sent to your agent
Intent — What the attacker is trying to achieve
Success indicators — Strings that suggest the attack succeeded
Category — tool_hijacking, indirect_injection, authority_exploit, protocol_exploit, multi_turn, data_leakage, or legacy_behavior

Single-turn probes: One message per probe. The agent's response is evaluated.

Multi-turn probes: A sequence of messages (e.g. 3 turns). The evaluator checks the final turn for success indicators (e.g. curl execution, data exfiltration).

Dynamic probes: If you define tools, AgentGuard generates extra probes (e.g. abuse of send_email, http_request, run_sql).

Evaluation

Each Phase 2 response is evaluated by Claude (the evaluator model). The evaluator receives:

Probe category and intent
Attack prompt
Success indicators
Whether any indicator matched in the response
The agent's response (truncated)

It returns:

succeeded — Whether the attack achieved its goal
severity — critical, high, medium, low, or none
evidence — What in the response indicates success
technique — The technique name

The evaluator is instructed to be strict: only mark succeeded=true when there is clear evidence of a security breach or compliance with the attacker's intent.

Flow Summary

Config load — Fetch your agent config (endpoint, auth, format, tools)
HTTP target — Create a target that sends requests to your URL
Phase 1 extraction — Run extraction scan, send attacks to your endpoint
Phase 1 injection — Run injection scan, send probes to your endpoint
Phase 2 probes — Run each agent-specific probe category in sequence
Merge findings — Combine Phase 1 and Phase 2 findings
Component scores — Compute Prompt Security, Tool Safety, Multi-Turn Resilience, Data Leakage
Overall score — Weighted average of component scores
Report — Store results, recommendations, conversation log

Throttling

Probes are throttled (300ms between requests) to avoid overwhelming your endpoint and to reduce rate-limit issues.

How AgentGuard Works

How AgentGuard Works

Overview

Phase 1: Full Engine Scan

Phase 2: Agent-Specific Probes

Evaluation

Flow Summary

Throttling

On this page