Add context poisoning quarantine evaluation for testing prompt injection vulnerabilities #1597

jscaldwell55 · 2025-08-16T17:53:13Z

Summary

Adds an evaluation for context poisoning (prompt injection) attacks.
The eval measures how often a model leaks a canary token (CANARY-ALPHA-42) when exposed to poisoned inputs, and compares baseline performance against a simple quarantine wrapper mitigation.

Problem

Prompt injection remains a key safety risk in production LLM applications.
Models often follow malicious instructions hidden inside untrusted content (e.g. customer docs, web pages, emails), leading to potential data leakage or system override.

Solution

This PR introduces:

Baseline eval: Runs poisoned samples directly

Mitigated eval: Applies a quarantine wrapper that marks untrusted content as data-only

Metrics: Tracks canary leakage frequency across attack styles

Attack vectors included: direct instructions, system overrides, HTML comments, critical markers, and injection tags.

Example Results

Baseline (no protection): 4/5 samples leaked (80%)

Mitigated (quarantine): 2/5 samples leaked (40%)
~50% reduction in successful attacks

Why It Matters

Provides a reproducible benchmark for prompt injection vulnerability
Demonstrates the effectiveness of simple mitigations
Can be extended with additional datasets and defenses

Happy to iterate on this; definitely open to expanding the sample size or adjusting the eval format to better fit the framework.

- Tests model vulnerability to hidden instruction injection - Baseline shows 80% canary token leakage rate - Mitigation with quarantine wrapper reduces leaks by 50% - Includes 5 diverse injection attack patterns - Provides standalone Python script for easy reproduction Results with GPT-4o-mini: - Baseline: 4/5 samples leaked (80%) - Mitigated: 2/5 samples leaked (40%) - Demonstrates need for robust input sanitization

jscaldwell55 requested review from andrew-openai, etr2460 and katyhshi as code owners August 16, 2025 17:53

jscaldwell55 force-pushed the feat/context-poisoning-quarantine branch from 41317df to 1934ce5 Compare August 16, 2025 20:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add context poisoning quarantine evaluation for testing prompt injection vulnerabilities #1597

Add context poisoning quarantine evaluation for testing prompt injection vulnerabilities #1597

Uh oh!

jscaldwell55 commented Aug 16, 2025 •

edited

Loading

Uh oh!

Uh oh!

Add context poisoning quarantine evaluation for testing prompt injection vulnerabilities #1597

Are you sure you want to change the base?

Add context poisoning quarantine evaluation for testing prompt injection vulnerabilities #1597

Uh oh!

Conversation

jscaldwell55 commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

jscaldwell55 commented Aug 16, 2025 •

edited

Loading