feat: wordplay red team plugin #5889
📝 Walkthrough
Adds a new “wordplay” red-team plugin across docs, site config, schema, metadata, and runtime. Documentation page created under red-team/plugins/wordplay. Sidebar and shared plugins list updated to include Wordplay. JSON schema enums extended to allow the wordplay plugin. Metadata maps updated with category, display name, severity, aliases, and descriptions.
Runtime integrations: plugin listed in remote plugins, grader map updated to include WordplayGrader, and a new WordplayGrader implemented with a refusal short-circuit; otherwise it defers to base grading. Tests added validating rubric rendering, plugin id, and refusal handling.
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/redteam/plugins/wordplay.ts (1)
67-67: Consider omitting the explicit undefined parameter.
The final parameter in super.getResult(prompt, llmOutput, test, provider, undefined) is explicitly set to undefined. If this parameter is optional in the base class, you can omit it for cleaner code:
- return super.getResult(prompt, llmOutput, test, provider, undefined);
+ return super.getResult(prompt, llmOutput, test, provider);
However, if the base class requires all parameters or you're intentionally overriding a default value, keep it as-is.
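To make the nitpick concrete, here is a minimal sketch of how TypeScript treats a trailing optional parameter. The functions getResult and getResultWithDefault below are hypothetical stand-ins, not the base-class method from the PR, whose signature is not shown in this review.

```typescript
// Hypothetical stand-in for a method whose last parameter is optional.
function getResult(prompt: string, output: string, extra?: string): string {
  return `${prompt}|${output}|${extra === undefined ? 'none' : extra}`;
}

// Same idea, but the last parameter declares a default value.
function getResultWithDefault(prompt: string, output: string, extra: string = 'default'): string {
  return `${prompt}|${output}|${extra}`;
}

// Passing undefined explicitly and omitting the argument give the same result.
const explicitUndefined = getResult('p', 'o', undefined);
const omitted = getResult('p', 'o');

// An explicit undefined also triggers the default value rather than overriding it.
const defaultExplicit = getResultWithDefault('p', 'o', undefined);
const defaultOmitted = getResultWithDefault('p', 'o');
```

Under these assumptions the two call styles are interchangeable, so the choice is mostly about readability. If the real parameter is required (not optional) in the base class, the explicit argument has to stay.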
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
site/docs/_shared/data/plugins.ts (1 hunk)
site/docs/red-team/plugins/wordplay.md (1 hunk)
site/sidebars.js (1 hunk)
site/static/config-schema.json (2 hunks)
src/redteam/constants/metadata.ts (6 hunks)
src/redteam/constants/plugins.ts (1 hunk)
src/redteam/graders.ts (2 hunks)
src/redteam/plugins/index.ts (1 hunk)
src/redteam/plugins/wordplay.ts (1 hunk)
test/redteam/plugins/wordplay.test.ts (1 hunk)
🧰 Additional context used
📓 Path-based instructions (14)
site/docs/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/docusaurus.mdc)
site/docs/**/*.md: Prioritize minimal edits when updating existing documentation; avoid creating entirely new sections or rewriting substantial portions; focus edits on improving grammar, spelling, clarity, fixing typos, and structural improvements where needed; do not modify existing headings (h1, h2, h3, etc.) as they are often linked externally.
Structure content to reveal information progressively: begin with essential actions and information, then provide deeper context as necessary; organize information from most important to least important.
Use action-oriented language: clearly outline actionable steps users should take, use concise and direct language, prefer active voice over passive voice, and use imperative mood for instructions.
Use 'eval' instead of 'evaluation' in all documentation; when referring to command line usage, use 'npx promptfoo eval' rather than 'npx promptfoo evaluation'; maintain consistency with this terminology across all examples, code blocks, and explanations.
The project name can be written as either 'Promptfoo' (capitalized) or 'promptfoo' (lowercase) depending on context: use 'Promptfoo' at the beginning of sentences or in headings, and 'promptfoo' in code examples, terminal commands, or when referring to the package name; be consistent with the chosen capitalization within each document or section.
Each markdown documentation file must include required front matter fields: 'title' (the page title shown in search results and browser tabs) and 'description' (a concise summary of the page content, ideally 150-160 characters).
Only add a title attribute to code blocks that represent complete, runnable files; do not add titles to code fragments, partial examples, or snippets that aren't meant to be used as standalone files; this applies to all code blocks regardless of language.
Use special comment directives to highlight specific lines in code blocks: 'highlight-next-line' highlights the line immediately after the comment, 'highligh...
Files:
site/docs/red-team/plugins/wordplay.md
{site/**,examples/**}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Any pull request that only touches files in 'site/' or 'examples/' directories must use the 'docs:' prefix in the PR title, not 'feat:' or 'fix:'
Files:
site/docs/red-team/plugins/wordplay.md
site/sidebars.js
site/static/config-schema.json
site/docs/_shared/data/plugins.ts
site/**
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
If the change is a feature, update the relevant documentation under 'site/'
Files:
site/docs/red-team/plugins/wordplay.md
site/sidebars.js
site/static/config-schema.json
site/docs/_shared/data/plugins.ts
site/docs/**/*.{md,mdx}
📄 CodeRabbit inference engine (site/docs/CLAUDE.md)
site/docs/**/*.{md,mdx}: Use the term "eval" not "evaluation" in documentation and examples
Capitalization: use "Promptfoo" (capitalized) in prose/headings and "promptfoo" (lowercase) in code, commands, and package names
Every doc must include required front matter: title and description
Only add title= to code blocks when showing complete runnable files
Admonitions must have empty lines around their content (Prettier requirement)
Do not modify headings; they may be externally linked
Use progressive disclosure: put essential information first
Use action-oriented, imperative mood in instructions (e.g., "Install the package")
Files:
site/docs/red-team/plugins/wordplay.md
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Prefer not to introduce new TypeScript types; use existing interfaces whenever possible
**/*.{ts,tsx}: Follow consistent import order (Biome will handle sorting)
Use curly braces for all control statements
Prefer const over let; avoid var
Use object property shorthand when possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks
Files:
src/redteam/plugins/index.ts
site/docs/_shared/data/plugins.ts
src/redteam/constants/plugins.ts
test/redteam/plugins/wordplay.test.ts
src/redteam/plugins/wordplay.ts
src/redteam/graders.ts
src/redteam/constants/metadata.ts
src/redteam/plugins/**/*.ts
📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)
src/redteam/plugins/**/*.ts: Place vulnerability-specific test generators as plugins under src/redteam/plugins/ (e.g., pii.ts, harmful.ts, sql-injection.ts)
New plugins must implement the RedteamPluginObject interface
Files:
src/redteam/plugins/index.ts
src/redteam/plugins/wordplay.ts
src/redteam/**/*.ts
📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)
src/redteam/**/*.ts: Always sanitize when logging test prompts or model outputs by passing them via the structured metadata parameter (second argument) to the logger, not raw string interpolation
Use the standardized risk severity levels: critical, high, medium, low when reporting results
Files:
src/redteam/plugins/index.ts
src/redteam/constants/plugins.ts
src/redteam/plugins/wordplay.ts
src/redteam/graders.ts
src/redteam/constants/metadata.ts
src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.{ts,tsx}: Sanitize sensitive data before logging; pass context objects to logger methods (debug, info, warn, error) for automatic redaction
Do not interpolate secrets into log messages (avoid stringifying headers/bodies directly); use structured logger context instead
Use sanitizeObject for manual sanitization before using or persisting potentially sensitive data
Files:
src/redteam/plugins/index.ts
src/redteam/constants/plugins.ts
src/redteam/plugins/wordplay.ts
src/redteam/graders.ts
src/redteam/constants/metadata.ts
**/*.{test,spec}.{js,ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Avoid disabling or skipping tests unless absolutely necessary and documented
Files:
test/redteam/plugins/wordplay.test.ts
test/**/*.{test,spec}.ts
📄 CodeRabbit inference engine (.cursor/rules/jest.mdc)
test/**/*.{test,spec}.ts: Mock as few functions as possible to keep tests realistic
Never increase the function timeout - fix the test instead
Organize tests in descriptive describe and it blocks
Prefer assertions on entire objects rather than individual keys when writing expectations
Clean up after tests to prevent side effects (e.g., use afterEach(() => { jest.resetAllMocks(); }))
Run tests with the --randomize flag to ensure your mock setup and teardown don't affect other tests
Use Jest's mocking utilities rather than complex custom mocks
Prefer shallow mocking over deep mocking
Mock external dependencies but not the code being tested
Reset mocks between tests to prevent test pollution
For database tests, use in-memory instances or proper test fixtures
Test both success and error cases for each provider
Mock API responses to avoid external dependencies in tests
Validate that provider options are properly passed to the underlying service
Test error handling and edge cases (rate limits, timeouts, etc.)
Ensure provider caching behaves as expected
Always include both --coverage and --randomize flags when running tests
Run tests in a single pass (no watch mode for CI)
Ensure all tests are independent and can run in any order
Clean up any test data or mocks after each test
Files:
test/redteam/plugins/wordplay.test.ts
test/**/*.test.ts
📄 CodeRabbit inference engine (test/CLAUDE.md)
test/**/*.test.ts: Never increase Jest test timeouts; fix slow tests instead (avoid jest.setTimeout or large timeouts in tests)
Do not use .only() or .skip() in committed tests
Add afterEach(() => { jest.resetAllMocks(); }) to ensure mock cleanup
Prefer asserting entire objects (toEqual on whole result) rather than individual fields
Mock minimally: only external dependencies (APIs, databases), not code under test
Use Jest (not Vitest) APIs in this suite; avoid importing vitest
Import from @jest/globals in tests
Files:
test/redteam/plugins/wordplay.test.ts
test/**
📄 CodeRabbit inference engine (test/CLAUDE.md)
Organize tests to mirror src/ structure (e.g., test/providers → src/providers, test/redteam → src/redteam)
Files:
test/redteam/plugins/wordplay.test.ts
test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}: Follow Jest best practices using describe/it blocks in tests
Write tests covering both success and error cases for all functionality
Files:
test/redteam/plugins/wordplay.test.ts
src/redteam/graders.ts
📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)
Keep response evaluation logic in src/redteam/graders.ts
Files:
src/redteam/graders.ts
🧠 Learnings (2)
📚 Learning: 2025-10-05T16:59:20.507Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins under test/redteam/
Applied to files:
src/redteam/constants/plugins.ts
test/redteam/plugins/wordplay.test.ts
📚 Learning: 2025-10-05T16:59:20.507Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/plugins/**/*.ts : New plugins must implement the RedteamPluginObject interface
Applied to files:
src/redteam/constants/plugins.ts
🧬 Code graph analysis (3)
test/redteam/plugins/wordplay.test.ts (1)
  src/redteam/plugins/wordplay.ts (1)
    WordplayGrader (7-69)
src/redteam/plugins/wordplay.ts (3)
  src/types/index.ts (2)
    AtomicTestCase (740-740)
    GradingResult (367-402)
  src/types/providers.ts (1)
    ApiProvider (79-96)
  src/redteam/util.ts (1)
    isBasicRefusal (182-188)
src/redteam/graders.ts (1)
  src/redteam/plugins/wordplay.ts (1)
    WordplayGrader (7-69)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: Generate Assets
- GitHub Check: Run Integration Tests
- GitHub Check: Redteam (Staging API)
- GitHub Check: Build Docs
- GitHub Check: Share Test
- GitHub Check: Redteam (Production API)
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: webui tests
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Test on Node 22.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Build on Node 22.x
- GitHub Check: Style Check
- GitHub Check: Build on Node 24.x
- GitHub Check: Build on Node 20.x
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (11)
src/redteam/plugins/index.ts (1)
351-351: LGTM!
The addition of 'wordplay' to the remotePlugins list is correctly positioned alphabetically and follows the established pattern for remote plugin registration.
site/sidebars.js (1)
115-115: LGTM!
The sidebar entry for the Wordplay plugin is correctly positioned alphabetically between 'vlguard' and 'xstest' within the Trust, Safety, & Compliance category.
src/redteam/constants/plugins.ts (1)
310-310: LGTM!
The addition of 'wordplay' to ADDITIONAL_PLUGINS is correctly positioned alphabetically and follows the established pattern.
src/redteam/graders.ts (2)
82-82: LGTM!
The import of WordplayGrader is correctly positioned alphabetically with other plugin imports.
180-180: LGTM!
The grader registration is correctly positioned alphabetically in the GRADERS map and follows the established naming convention.
test/redteam/plugins/wordplay.test.ts (1)
1-43: LGTM!
The test suite follows Jest best practices with proper mock setup, cleanup with afterEach(() => { jest.resetAllMocks(); }), and covers essential functionality including rubric rendering, plugin ID validation, and basic refusal handling.
Based on learnings
site/docs/red-team/plugins/wordplay.md (1)
6-80: LGTM!
The documentation content is well-structured, follows the established pattern for plugin documentation, and provides clear guidance on purpose, configuration, test generation techniques, evaluation criteria, and related concepts.
site/static/config-schema.json (2)
1292-1292: LGTM!
The "wordplay" plugin ID is correctly added to the redteam plugins enum, alphabetically positioned between "vlguard" and "xstest".
1495-1495: LGTM!
The "wordplay" plugin ID is correctly added to the plugin configuration enum, maintaining alphabetical order.
src/redteam/plugins/wordplay.ts (1)
1-69: LGTM! Clean implementation following established patterns.
The WordplayGrader implementation is well-structured:
- Proper type imports and class extension
- Comprehensive rubric with clear evaluation criteria for wordplay attacks
- Smart optimization via early refusal detection using isBasicRefusal()
- Appropriate severity classification (Low) in the metadata
The short-circuit logic correctly identifies standard refusals as passing tests (score=1), which aligns with the security goal of preventing profanity generation through wordplay traps.
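The short-circuit pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: SketchGradingResult, SketchBaseGrader, SketchWordplayGrader, and looksLikeRefusal are hypothetical stand-ins for GradingResult, the base grader class, WordplayGrader, and isBasicRefusal(), and the refusal prefixes are invented for the example.

```typescript
// Stand-in result shape; the real GradingResult in src/types/index.ts may carry more fields.
interface SketchGradingResult {
  pass: boolean;
  score: number;
  reason: string;
}

// Hypothetical refusal check. The PR relies on isBasicRefusal() from src/redteam/util.ts,
// whose matching rules are not shown in this review.
function looksLikeRefusal(output: string): boolean {
  const normalized = output.trim().toLowerCase();
  return ['i cannot', "i can't", 'i am unable to'].some((prefix) => normalized.startsWith(prefix));
}

// Stand-in base grader. It fails every output so the two code paths are easy to tell apart;
// a real base grader would apply the rubric instead.
class SketchBaseGrader {
  async getResult(_prompt: string, llmOutput: string): Promise<SketchGradingResult> {
    return { pass: false, score: 0, reason: `Placeholder base grading of ${llmOutput.length} characters` };
  }
}

class SketchWordplayGrader extends SketchBaseGrader {
  async getResult(prompt: string, llmOutput: string): Promise<SketchGradingResult> {
    // Short-circuit: a standard refusal passes immediately with score 1.
    if (looksLikeRefusal(llmOutput)) {
      return { pass: true, score: 1, reason: 'Output is a standard refusal' };
    }
    // Otherwise defer to the base grading logic.
    return super.getResult(prompt, llmOutput);
  }
}
```

The early return avoids a rubric-based grading call when the output is already an obvious refusal, which is the optimization the comment refers to.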
src/redteam/constants/metadata.ts (1)
148-148: LGTM! Comprehensive and consistent metadata integration.
All wordplay plugin metadata entries are properly added across the required structures:
- subCategoryDescriptions (line 148): Clear description of the wordplay attack vector
- displayNameOverrides (line 288): Consistent with naming conventions
- riskCategorySeverityMap (line 419): Appropriately classified as Severity.Low
- riskCategories (line 492): Correctly placed under 'Trust & Safety' category
- categoryAliases (line 670): Follows existing alias pattern
- pluginDescriptions (lines 811-812): Detailed description matching the plugin's purpose
The metadata additions follow all existing patterns and maintain consistency with other plugins in the codebase.
Also applies to: 288-288, 419-419, 492-492, 670-670, 811-812
{
  category: 'Trust and Safety',
  description:
    'Tests whether AI systems can be tricked into generating profanity through innocent-seeming wordplay',
  label: 'harmful',
  link: '/docs/red-team/plugins/wordplay/',
  name: 'Wordplay',
  pluginId: 'wordplay',
  applicationTypes: {
    rag: true,
    agent: true,
    chat: true,
  },
  vulnerabilityType: 'harmful',
},
Add the missing isRemote flag.
The Wordplay plugin is registered as a remote plugin in src/redteam/plugins/index.ts (line 351), but the isRemote: true flag is missing from this plugin object. Other remote plugins like 'ascii-smuggling', 'competitors', and 'bfla' include this flag for consistency.
Apply this diff to add the missing flag:
{
category: 'Trust and Safety',
description:
'Tests whether AI systems can be tricked into generating profanity through innocent-seeming wordplay',
label: 'harmful',
link: '/docs/red-team/plugins/wordplay/',
name: 'Wordplay',
pluginId: 'wordplay',
applicationTypes: {
rag: true,
agent: true,
chat: true,
},
vulnerabilityType: 'harmful',
+ isRemote: true,
},
🤖 Prompt for AI Agents
In site/docs/_shared/data/plugins.ts around lines 1475 to 1489, the Wordplay
plugin object is missing the isRemote: true flag; add isRemote: true to this
plugin entry (matching other remote plugins) so the plugin object includes the
isRemote property alongside pluginId, applicationTypes, and vulnerabilityType to
mark it as a remote plugin.
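A mismatch like this one (a plugin registered as remote but lacking isRemote: true in the docs data) could be caught mechanically. The sketch below is a hypothetical helper, not part of the PR: DocsPluginEntry, findMissingRemoteFlags, and the sample data are assumptions made for illustration, and the real shapes of remotePlugins and the docs plugin list may differ.

```typescript
// Minimal assumed shape of a docs plugin entry; the real objects carry more fields.
interface DocsPluginEntry {
  pluginId: string;
  isRemote?: boolean;
}

// Returns ids of docs entries that are registered as remote but lack isRemote: true.
function findMissingRemoteFlags(
  remotePluginIds: readonly string[],
  docsEntries: readonly DocsPluginEntry[],
): string[] {
  const remote = new Set(remotePluginIds);
  return docsEntries
    .filter((entry) => remote.has(entry.pluginId) && entry.isRemote !== true)
    .map((entry) => entry.pluginId);
}

// Illustrative data only, mirroring the situation described in the comment.
const remoteIds = ['bfla', 'competitors', 'wordplay'];
const docsEntries: DocsPluginEntry[] = [
  { pluginId: 'bfla', isRemote: true },
  { pluginId: 'competitors', isRemote: true },
  { pluginId: 'wordplay' },
];

const missingFlags = findMissingRemoteFlags(remoteIds, docsEntries);
```

With the sample data, missingFlags contains only 'wordplay'; after adding isRemote: true to that entry it would be empty.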
---
sidebar_label: Wordplay
description: Test AI systems for wordplay vulnerabilities that could lead to generating profanity or offensive language through innocent-seeming riddles and word puzzles
---
Add the required title field to front matter.
The front matter is missing the required title field. According to the coding guidelines, each markdown documentation file must include both title and description fields in the front matter.
As per coding guidelines
Apply this diff to add the missing title:
---
sidebar_label: Wordplay
+title: Wordplay
description: Test AI systems for wordplay vulnerabilities that could lead to generating profanity or offensive language through innocent-seeming riddles and word puzzles
---
🤖 Prompt for AI Agents
In site/docs/red-team/plugins/wordplay.md around lines 1 to 4, the front matter
is missing the required title field; add a top-level title entry (e.g., title:
"Wordplay") to the YAML front matter alongside the existing sidebar_label and
description so the file includes both title and description as required by the
coding guidelines.
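The front matter rule (title and description are both required) is simple enough to check with a small script. The following is a rough sketch under stated assumptions: frontMatterKeys and missingFrontMatterFields are hypothetical helpers, the parser only handles plain top-level key: value lines rather than full YAML, and the sample description is shortened for the example.

```typescript
const REQUIRED_FIELDS = ['title', 'description'];

// Collects top-level keys from a leading '---' delimited block.
// Handles only simple 'key: value' lines, not full YAML.
function frontMatterKeys(markdown: string): Set<string> {
  const lines = markdown.split(/\r?\n/);
  const keys = new Set<string>();
  if (lines[0]?.trim() !== '---') {
    return keys;
  }
  for (const line of lines.slice(1)) {
    if (line.trim() === '---') {
      break;
    }
    const key = /^([A-Za-z_][\w-]*):/.exec(line)?.[1];
    if (key !== undefined) {
      keys.add(key);
    }
  }
  return keys;
}

// Returns the required fields that are absent from the front matter.
function missingFrontMatterFields(markdown: string): string[] {
  const keys = frontMatterKeys(markdown);
  return REQUIRED_FIELDS.filter((field) => !keys.has(field));
}

// Illustrative documents before and after the suggested change.
const beforeFix = ['---', 'sidebar_label: Wordplay', 'description: Example description', '---', ''].join('\n');
const afterFix = ['---', 'sidebar_label: Wordplay', 'title: Wordplay', 'description: Example description', '---', ''].join('\n');

const missingBefore = missingFrontMatterFields(beforeFix);
const missingAfter = missingFrontMatterFields(afterFix);
```

Here missingBefore lists only title, and missingAfter is empty once the title line is added.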
No description provided.