ZeroLeaks
AgentGuard

How AgentGuard Works

Two-phase architecture, full engine scan, agent-specific probes, and Claude-based evaluation.

How AgentGuard Works

AgentGuard runs a two-phase scan against your deployed agent. Phase 1 uses the full ZeroLeaks scan engine (same as the dashboard). Phase 2 runs agent-specific probes that target tool safety, multi-turn resilience, and data leakage.

Overview

┌─────────────────────────────────────────────────────────────────┐
│                     AgentGuard Scan Flow                         │
├─────────────────────────────────────────────────────────────────┤
│  Phase 1: Full Engine Scan                                       │
│  ├── Extraction: 30 TAP-based attacks → your agent endpoint      │
│  └── Injection:  23 injection probes → your agent endpoint       │
├─────────────────────────────────────────────────────────────────┤
│  Phase 2: Agent-Specific Probes                                  │
│  ├── Tool hijacking (8)                                         │
│  ├── Indirect injection (8)                                      │
│  ├── Authority exploitation (5)                                  │
│  ├── Protocol exploits (5)                                       │
│  ├── Multi-turn grooming (5 sequences)                           │
│  ├── Data leakage (8)                                            │
│  ├── Legacy behavior (21)                                        │
│  └── Dynamic tool probes (from your tool definitions)            │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Full Engine Scan

Phase 1 sends requests to your agent endpoint using the same logic as dashboard scans. The difference: instead of calling a simulated target, ZeroLeaks issues HTTP requests to your configured URL.

Extraction (30 attacks):

  • Strategist selects attack strategy
  • Attacker generates prompts (TAP-style, multi-category)
  • Each prompt is sent to your agent
  • Evaluator analyzes the response for system prompt leakage
  • Mutator refines attacks based on feedback

Injection (23 probes):

  • Skeleton key, crescendo, echo chamber, many-shot, semantic variation
  • Tool poisoning, indirect injection, ASCII art, promptware
  • Hybrid injection, output control, role hijack

Each request triggers a real inference on your model. Your agent's response is evaluated for compliance with the attacker's intent.

Phase 2: Agent-Specific Probes

Phase 2 runs a fixed set of probes designed for agentic systems. Each probe has:

  • Prompt — The attack payload sent to your agent
  • Intent — What the attacker is trying to achieve
  • Success indicators — Strings that suggest the attack succeeded
  • Category — tool_hijacking, indirect_injection, authority_exploit, protocol_exploit, multi_turn, data_leakage, or legacy_behavior

Single-turn probes: One message per probe. The agent's response is evaluated.

Multi-turn probes: A sequence of messages (e.g. 3 turns). The evaluator checks the final turn for success indicators (e.g. curl execution, data exfiltration).

Dynamic probes: If you define tools, AgentGuard generates extra probes (e.g. abuse of send_email, http_request, run_sql).

Evaluation

Each Phase 2 response is evaluated by Claude (the evaluator model). The evaluator receives:

  • Probe category and intent
  • Attack prompt
  • Success indicators
  • Whether any indicator matched in the response
  • The agent's response (truncated)

It returns:

  • succeeded — Whether the attack achieved its goal
  • severity — critical, high, medium, low, or none
  • evidence — What in the response indicates success
  • technique — The technique name

The evaluator is instructed to be strict: only mark succeeded=true when there is clear evidence of a security breach or compliance with the attacker's intent.

Flow Summary

  1. Config load — Fetch your agent config (endpoint, auth, format, tools)
  2. HTTP target — Create a target that sends requests to your URL
  3. Phase 1 extraction — Run extraction scan, send attacks to your endpoint
  4. Phase 1 injection — Run injection scan, send probes to your endpoint
  5. Phase 2 probes — Run each agent-specific probe category in sequence
  6. Merge findings — Combine Phase 1 and Phase 2 findings
  7. Component scores — Compute Prompt Security, Tool Safety, Multi-Turn Resilience, Data Leakage
  8. Overall score — Weighted average of component scores
  9. Report — Store results, recommendations, conversation log

Throttling

Probes are throttled (300ms between requests) to avoid overwhelming your endpoint and to reduce rate-limit issues.

On this page