Skip to content

Add context poisoning quarantine evaluation for testing prompt injection vulnerabilities #1597

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jscaldwell55
Copy link

@jscaldwell55 jscaldwell55 commented Aug 16, 2025

Summary

Adds an evaluation for context poisoning (prompt injection) attacks.
The eval measures how often a model leaks a canary token (CANARY-ALPHA-42) when exposed to poisoned inputs, and compares baseline performance against a simple quarantine wrapper mitigation.

Problem

Prompt injection remains a key safety risk in production LLM applications.
Models often follow malicious instructions hidden inside untrusted content (e.g. customer docs, web pages, emails), leading to potential data leakage or system override.

Solution

This PR introduces:

Baseline eval: Runs poisoned samples directly

Mitigated eval: Applies a quarantine wrapper that marks untrusted content as data-only

Metrics: Tracks canary leakage frequency across attack styles

Attack vectors included: direct instructions, system overrides, HTML comments, critical markers, and injection tags.

Example Results

Baseline (no protection): 4/5 samples leaked (80%)

Mitigated (quarantine): 2/5 samples leaked (40%)
~50% reduction in successful attacks

Why It Matters

  1. Provides a reproducible benchmark for prompt injection vulnerability

  2. Demonstrates the effectiveness of simple mitigations

  3. Can be extended with additional datasets and defenses

Happy to iterate on this; definitely open to expanding the sample size or adjusting the eval format to better fit the framework.

- Tests model vulnerability to hidden instruction injection
- Baseline shows 80% canary token leakage rate
- Mitigation with quarantine wrapper reduces leaks by 50%
- Includes 5 diverse injection attack patterns
- Provides standalone Python script for easy reproduction

Results with GPT-4o-mini:
- Baseline: 4/5 samples leaked (80%)
- Mitigated: 2/5 samples leaked (40%)
- Demonstrates need for robust input sanitization
@jscaldwell55 jscaldwell55 force-pushed the feat/context-poisoning-quarantine branch from 41317df to 1934ce5 Compare August 16, 2025 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant