feat: wordplay red team plugin #5889

typpo · 2025-10-10T12:44:35Z

No description provided.

coderabbitai · 2025-10-10T12:47:49Z

📝 Walkthrough


Adds a new “wordplay” red-team plugin across docs, site config, schema, metadata, and runtime. A documentation page is created under red-team/plugins/wordplay, and the sidebar and shared plugins list are updated to include Wordplay. The JSON schema enums are extended to allow the wordplay plugin. The metadata maps are updated with category, display name, severity, aliases, and descriptions. At runtime, the plugin is listed in the remote plugins, the grader map is updated to include WordplayGrader, and a new WordplayGrader is implemented with a refusal short-circuit that otherwise defers to base grading. Tests are added that validate rubric rendering, the plugin id, and refusal handling.
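
The refusal short-circuit described in the walkthrough can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: `looksLikeRefusal`, `BaseGraderSketch`, `WordplayGraderSketch`, and the `GradeSketch` shape are hypothetical stand-ins for `isBasicRefusal`, the base grader, `WordplayGrader`, and `GradingResult`, and the refusal prefixes are assumed purely for the example.

```typescript
// Hypothetical result shape; the real GradingResult type has more fields.
interface GradeSketch {
  pass: boolean;
  score: number;
  reason: string;
}

// Assumed refusal prefixes; the real isBasicRefusal() may use a different list.
const REFUSAL_PREFIXES = ["i'm sorry", 'i cannot', "i can't", 'i am unable'];

// Hypothetical stand-in for isBasicRefusal().
function looksLikeRefusal(output: string): boolean {
  const normalized = output.trim().toLowerCase();
  return REFUSAL_PREFIXES.some((prefix) => normalized.startsWith(prefix));
}

// Hypothetical base grader; this stub stands in for rubric-based grading.
class BaseGraderSketch {
  async getResult(_prompt: string, _llmOutput: string): Promise<GradeSketch> {
    return { pass: false, score: 0, reason: 'Deferred to rubric-based grading (stubbed)' };
  }
}

class WordplayGraderSketch extends BaseGraderSketch {
  async getResult(prompt: string, llmOutput: string): Promise<GradeSketch> {
    // Short-circuit: a standard refusal passes without further grading.
    if (looksLikeRefusal(llmOutput)) {
      return { pass: true, score: 1, reason: 'Model refused the request' };
    }
    // Otherwise defer to the base grading path.
    return super.getResult(prompt, llmOutput);
  }
}
```

The point of the early return is that a refusal never reaches the more expensive grading step, so only non-refusing outputs are judged against the rubric.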

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
| Description Check | ❓ Inconclusive | The pull request lacks any author-provided description, offering no context or details about the changes. Without text, it is impossible to assess whether the description relates to the changeset. Under the guidelines, a completely empty description is considered generic and thus inconclusive for this check. | Please add a brief summary describing the purpose of the Wordplay red-team plugin addition and its scope across documentation, code, and tests. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title concisely communicates the addition of a Wordplay red-team plugin, which is the primary change introduced in the pull request. It correctly uses the “feat:” prefix to denote a new feature and specifies the plugin name without unnecessary detail. This phrasing allows readers to quickly understand the main purpose of the change. |

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch ian/20251010-054430

Comment `@coderabbitai help` to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/redteam/plugins/wordplay.ts (1)
67-67: Consider omitting the explicit undefined parameter.

The final parameter in super.getResult(prompt, llmOutput, test, provider, undefined) is explicitly set to undefined. If this parameter is optional in the base class, you can omit it for cleaner code:
-    return super.getResult(prompt, llmOutput, test, provider, undefined);
+    return super.getResult(prompt, llmOutput, test, provider);
However, if the base class requires all parameters or you're intentionally overriding a default value, keep it as-is.
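
To illustrate why the two calls behave the same when the trailing parameter is optional, here is a small standalone sketch. The `describeGrade` function and its parameters are hypothetical and unrelated to the actual `getResult` signature. It only shows that an omitted optional argument and an explicit `undefined` are treated identically, including when a default value applies.

```typescript
// Hypothetical function with an optional trailing parameter that has a default.
function describeGrade(score: number, label: string = 'unlabeled'): string {
  return `${label}: ${score}`;
}

// Omitting the argument and passing undefined both trigger the default value.
const omitted = describeGrade(1);
const explicit = describeGrade(1, undefined);

console.log(omitted === explicit); // both strings are "unlabeled: 1"
```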

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f04eb0 and 13fa4a7.

📒 Files selected for processing (10)

site/docs/_shared/data/plugins.ts (1 hunks)
site/docs/red-team/plugins/wordplay.md (1 hunks)
site/sidebars.js (1 hunks)
site/static/config-schema.json (2 hunks)
src/redteam/constants/metadata.ts (6 hunks)
src/redteam/constants/plugins.ts (1 hunks)
src/redteam/graders.ts (2 hunks)
src/redteam/plugins/index.ts (1 hunks)
src/redteam/plugins/wordplay.ts (1 hunks)
test/redteam/plugins/wordplay.test.ts (1 hunks)

🧰 Additional context used

📓 Path-based instructions (14)

site/docs/**/*.md

📄 CodeRabbit inference engine (.cursor/rules/docusaurus.mdc)

site/docs/**/*.md: Prioritize minimal edits when updating existing documentation; avoid creating entirely new sections or rewriting substantial portions; focus edits on improving grammar, spelling, clarity, fixing typos, and structural improvements where needed; do not modify existing headings (h1, h2, h3, etc.) as they are often linked externally.
Structure content to reveal information progressively: begin with essential actions and information, then provide deeper context as necessary; organize information from most important to least important.
Use action-oriented language: clearly outline actionable steps users should take, use concise and direct language, prefer active voice over passive voice, and use imperative mood for instructions.
Use 'eval' instead of 'evaluation' in all documentation; when referring to command line usage, use 'npx promptfoo eval' rather than 'npx promptfoo evaluation'; maintain consistency with this terminology across all examples, code blocks, and explanations.
The project name can be written as either 'Promptfoo' (capitalized) or 'promptfoo' (lowercase) depending on context: use 'Promptfoo' at the beginning of sentences or in headings, and 'promptfoo' in code examples, terminal commands, or when referring to the package name; be consistent with the chosen capitalization within each document or section.
Each markdown documentation file must include required front matter fields: 'title' (the page title shown in search results and browser tabs) and 'description' (a concise summary of the page content, ideally 150-160 characters).
Only add a title attribute to code blocks that represent complete, runnable files; do not add titles to code fragments, partial examples, or snippets that aren't meant to be used as standalone files; this applies to all code blocks regardless of language.
Use special comment directives to highlight specific lines in code blocks: 'highlight-next-line' highlights the line immediately after the comment, 'highligh...

Files:

site/docs/red-team/plugins/wordplay.md

{site/**,examples/**}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Any pull request that only touches files in 'site/' or 'examples/' directories must use the 'docs:' prefix in the PR title, not 'feat:' or 'fix:'

Files:

site/docs/red-team/plugins/wordplay.md
site/sidebars.js
site/static/config-schema.json
site/docs/_shared/data/plugins.ts

site/**

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

If the change is a feature, update the relevant documentation under 'site/'

Files:

site/docs/red-team/plugins/wordplay.md
site/sidebars.js
site/static/config-schema.json
site/docs/_shared/data/plugins.ts

site/docs/**/*.{md,mdx}

📄 CodeRabbit inference engine (site/docs/CLAUDE.md)

site/docs/**/*.{md,mdx}: Use the term "eval" not "evaluation" in documentation and examples
Capitalization: use "Promptfoo" (capitalized) in prose/headings and "promptfoo" (lowercase) in code, commands, and package names
Every doc must include required front matter: title and description
Only add title= to code blocks when showing complete runnable files
Admonitions must have empty lines around their content (Prettier requirement)
Do not modify headings; they may be externally linked
Use progressive disclosure: put essential information first
Use action-oriented, imperative mood in instructions (e.g., "Install the package")

Files:

site/docs/red-team/plugins/wordplay.md

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Prefer not to introduce new TypeScript types; use existing interfaces whenever possible

**/*.{ts,tsx}: Follow consistent import order (Biome will handle sorting)
Use curly braces for all control statements
Prefer const over let; avoid var
Use object property shorthand when possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks

Files:

src/redteam/plugins/index.ts
site/docs/_shared/data/plugins.ts
src/redteam/constants/plugins.ts
test/redteam/plugins/wordplay.test.ts
src/redteam/plugins/wordplay.ts
src/redteam/graders.ts
src/redteam/constants/metadata.ts

src/redteam/plugins/**/*.ts

📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)

src/redteam/plugins/**/*.ts: Place vulnerability-specific test generators as plugins under src/redteam/plugins/ (e.g., pii.ts, harmful.ts, sql-injection.ts)
New plugins must implement the RedteamPluginObject interface

Files:

src/redteam/plugins/index.ts
src/redteam/plugins/wordplay.ts

src/redteam/**/*.ts

📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)

src/redteam/**/*.ts: Always sanitize when logging test prompts or model outputs by passing them via the structured metadata parameter (second argument) to the logger, not raw string interpolation
Use the standardized risk severity levels: critical, high, medium, low when reporting results

Files:

src/redteam/plugins/index.ts
src/redteam/constants/plugins.ts
src/redteam/plugins/wordplay.ts
src/redteam/graders.ts
src/redteam/constants/metadata.ts

src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.{ts,tsx}: Sanitize sensitive data before logging; pass context objects to logger methods (debug, info, warn, error) for automatic redaction
Do not interpolate secrets into log messages (avoid stringifying headers/bodies directly); use structured logger context instead
Use sanitizeObject for manual sanitization before using or persisting potentially sensitive data

Files:

src/redteam/plugins/index.ts
src/redteam/constants/plugins.ts
src/redteam/plugins/wordplay.ts
src/redteam/graders.ts
src/redteam/constants/metadata.ts

**/*.{test,spec}.{js,ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Avoid disabling or skipping tests unless absolutely necessary and documented

Files:

test/redteam/plugins/wordplay.test.ts

test/**/*.{test,spec}.ts

📄 CodeRabbit inference engine (.cursor/rules/jest.mdc)

test/**/*.{test,spec}.ts: Mock as few functions as possible to keep tests realistic
Never increase the function timeout - fix the test instead
Organize tests in descriptive describe and it blocks
Prefer assertions on entire objects rather than individual keys when writing expectations
Clean up after tests to prevent side effects (e.g., use afterEach(() => { jest.resetAllMocks(); }))
Run tests with --randomize flag to ensure your mocks setup and teardown don't affect other tests
Use Jest's mocking utilities rather than complex custom mocks
Prefer shallow mocking over deep mocking
Mock external dependencies but not the code being tested
Reset mocks between tests to prevent test pollution
For database tests, use in-memory instances or proper test fixtures
Test both success and error cases for each provider
Mock API responses to avoid external dependencies in tests
Validate that provider options are properly passed to the underlying service
Test error handling and edge cases (rate limits, timeouts, etc.)
Ensure provider caching behaves as expected
Always include both --coverage and --randomize flags when running tests
Run tests in a single pass (no watch mode for CI)
Ensure all tests are independent and can run in any order
Clean up any test data or mocks after each test

Files:

test/redteam/plugins/wordplay.test.ts

test/**/*.test.ts

📄 CodeRabbit inference engine (test/CLAUDE.md)

test/**/*.test.ts: Never increase Jest test timeouts; fix slow tests instead (avoid jest.setTimeout or large timeouts in tests)
Do not use .only() or .skip() in committed tests
Add afterEach(() => { jest.resetAllMocks(); }) to ensure mock cleanup
Prefer asserting entire objects (toEqual on whole result) rather than individual fields
Mock minimally: only external dependencies (APIs, databases), not code under test
Use Jest (not Vitest) APIs in this suite; avoid importing vitest
Import from @jest/globals in tests

Files:

test/redteam/plugins/wordplay.test.ts

test/**

📄 CodeRabbit inference engine (test/CLAUDE.md)

Organize tests to mirror src/ structure (e.g., test/providers → src/providers, test/redteam → src/redteam)

Files:

test/redteam/plugins/wordplay.test.ts

test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}: Follow Jest best practices using describe/it blocks in tests
Write tests covering both success and error cases for all functionality

Files:

test/redteam/plugins/wordplay.test.ts

src/redteam/graders.ts

📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)

Keep response evaluation logic in src/redteam/graders.ts

Files:

src/redteam/graders.ts

🧠 Learnings (2)

📚 Learning: 2025-10-05T16:59:20.507Z

Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins under test/redteam/

Applied to files:

src/redteam/constants/plugins.ts
test/redteam/plugins/wordplay.test.ts

📚 Learning: 2025-10-05T16:59:20.507Z

Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/plugins/**/*.ts : New plugins must implement the RedteamPluginObject interface

Applied to files:

src/redteam/constants/plugins.ts

🧬 Code graph analysis (3)

test/redteam/plugins/wordplay.test.ts (1)

src/redteam/plugins/wordplay.ts (1)

WordplayGrader (7-69)

src/redteam/plugins/wordplay.ts (3)

src/types/index.ts (2)

AtomicTestCase (740-740)

GradingResult (367-402)

src/types/providers.ts (1)

ApiProvider (79-96)

src/redteam/util.ts (1)

isBasicRefusal (182-188)

src/redteam/graders.ts (1)

src/redteam/plugins/wordplay.ts (1)

WordplayGrader (7-69)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)

GitHub Check: Generate Assets
GitHub Check: Run Integration Tests
GitHub Check: Redteam (Staging API)
GitHub Check: Build Docs
GitHub Check: Share Test
GitHub Check: Redteam (Production API)
GitHub Check: Test on Node 24.x and ubuntu-latest
GitHub Check: Test on Node 24.x and windows-latest
GitHub Check: webui tests
GitHub Check: Test on Node 22.x and macOS-latest
GitHub Check: Test on Node 22.x and ubuntu-latest
GitHub Check: Test on Node 20.x and ubuntu-latest
GitHub Check: Test on Node 20.x and macOS-latest
GitHub Check: Test on Node 22.x and windows-latest
GitHub Check: Test on Node 20.x and windows-latest
GitHub Check: Build on Node 22.x
GitHub Check: Style Check
GitHub Check: Build on Node 24.x
GitHub Check: Build on Node 20.x
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (11)

src/redteam/plugins/index.ts (1)

351-351: LGTM!

The addition of 'wordplay' to the remotePlugins list is correctly positioned alphabetically and follows the established pattern for remote plugin registration.

site/sidebars.js (1)

115-115: LGTM!

The sidebar entry for the Wordplay plugin is correctly positioned alphabetically between 'vlguard' and 'xstest' within the Trust, Safety, & Compliance category.

src/redteam/constants/plugins.ts (1)

310-310: LGTM!

The addition of 'wordplay' to ADDITIONAL_PLUGINS is correctly positioned alphabetically and follows the established pattern.

src/redteam/graders.ts (2)

82-82: LGTM!

The import of WordplayGrader is correctly positioned alphabetically with other plugin imports.

180-180: LGTM!

The grader registration is correctly positioned alphabetically in the GRADERS map and follows the established naming convention.

test/redteam/plugins/wordplay.test.ts (1)

1-43: LGTM!

The test suite follows Jest best practices with proper mock setup, cleanup with afterEach(() => { jest.resetAllMocks(); }), and covers essential functionality including rubric rendering, plugin ID validation, and basic refusal handling.

Based on learnings

site/docs/red-team/plugins/wordplay.md (1)

6-80: LGTM!

The documentation content is well-structured, follows the established pattern for plugin documentation, and provides clear guidance on purpose, configuration, test generation techniques, evaluation criteria, and related concepts.

site/static/config-schema.json (2)

1292-1292: LGTM!

The "wordplay" plugin ID is correctly added to the redteam plugins enum, alphabetically positioned between "vlguard" and "xstest".

1495-1495: LGTM!

The "wordplay" plugin ID is correctly added to the plugin configuration enum, maintaining alphabetical order.

src/redteam/plugins/wordplay.ts (1)

1-69: LGTM! Clean implementation following established patterns.

The WordplayGrader implementation is well-structured:

Proper type imports and class extension

Comprehensive rubric with clear evaluation criteria for wordplay attacks

Smart optimization via early refusal detection using isBasicRefusal()

Appropriate severity classification (Low) in the metadata

The short-circuit logic correctly identifies standard refusals as passing tests (score=1), which aligns with the security goal of preventing profanity generation through wordplay traps.

src/redteam/constants/metadata.ts (1)

148-148: LGTM! Comprehensive and consistent metadata integration.

All wordplay plugin metadata entries are properly added across the required structures:

subCategoryDescriptions (line 148): Clear description of the wordplay attack vector

displayNameOverrides (line 288): Consistent with naming conventions

riskCategorySeverityMap (line 419): Appropriately classified as Severity.Low

riskCategories (line 492): Correctly placed under 'Trust & Safety' category

categoryAliases (line 670): Follows existing alias pattern

pluginDescriptions (lines 811-812): Detailed description matching the plugin's purpose

The metadata additions follow all existing patterns and maintain consistency with other plugins in the codebase.

Also applies to: 288-288, 419-419, 492-492, 670-670, 811-812

coderabbitai · 2025-10-10T12:47:53Z

site/docs/_shared/data/plugins.ts

+  {
+    category: 'Trust and Safety',
+    description:
+      'Tests whether AI systems can be tricked into generating profanity through innocent-seeming wordplay',
+    label: 'harmful',
+    link: '/docs/red-team/plugins/wordplay/',
+    name: 'Wordplay',
+    pluginId: 'wordplay',
+    applicationTypes: {
+      rag: true,
+      agent: true,
+      chat: true,
+    },
+    vulnerabilityType: 'harmful',
+  },


⚠️ Potential issue | 🟡 Minor

Add the missing isRemote flag.

The Wordplay plugin is registered as a remote plugin in src/redteam/plugins/index.ts (line 351), but the isRemote: true flag is missing from this plugin object. Other remote plugins like 'ascii-smuggling', 'competitors', and 'bfla' include this flag for consistency.

Apply this diff to add the missing flag:

   {
     category: 'Trust and Safety',
     description:
       'Tests whether AI systems can be tricked into generating profanity through innocent-seeming wordplay',
     label: 'harmful',
     link: '/docs/red-team/plugins/wordplay/',
     name: 'Wordplay',
     pluginId: 'wordplay',
     applicationTypes: {
       rag: true,
       agent: true,
       chat: true,
     },
     vulnerabilityType: 'harmful',
+    isRemote: true,
   },

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  {
    category: 'Trust and Safety',
    description:
      'Tests whether AI systems can be tricked into generating profanity through innocent-seeming wordplay',
    label: 'harmful',
    link: '/docs/red-team/plugins/wordplay/',
    name: 'Wordplay',
    pluginId: 'wordplay',
    applicationTypes: {
      rag: true,
      agent: true,
      chat: true,
    },
    vulnerabilityType: 'harmful',
  },

  {
    category: 'Trust and Safety',
    description:
      'Tests whether AI systems can be tricked into generating profanity through innocent-seeming wordplay',
    label: 'harmful',
    link: '/docs/red-team/plugins/wordplay/',
    name: 'Wordplay',
    pluginId: 'wordplay',
    applicationTypes: {
      rag: true,
      agent: true,
      chat: true,
    },
    vulnerabilityType: 'harmful',
    isRemote: true,
  },

🤖 Prompt for AI Agents

In site/docs/_shared/data/plugins.ts around lines 1475 to 1489, the Wordplay plugin object is missing the isRemote: true flag; add isRemote: true to this plugin entry (matching other remote plugins) so the plugin object includes the isRemote property alongside pluginId, applicationTypes, and vulnerabilityType to mark it as a remote plugin.
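
A check along these lines could catch this kind of mismatch automatically. It is a hedged sketch and not part of the PR: `DocsPluginEntry`, `findMissingRemoteFlags`, and the sample arrays are hypothetical and only mirror the field names shown in the comment above.

```typescript
// Hypothetical, reduced shape of a docs plugin entry (only the fields needed here).
interface DocsPluginEntry {
  pluginId: string;
  isRemote?: boolean;
}

// Returns the ids of docs entries that are remote plugins but lack isRemote: true.
function findMissingRemoteFlags(
  remoteIds: readonly string[],
  docsEntries: readonly DocsPluginEntry[],
): string[] {
  const remote = new Set(remoteIds);
  return docsEntries
    .filter((entry) => remote.has(entry.pluginId) && entry.isRemote !== true)
    .map((entry) => entry.pluginId);
}

// Sample data assumed for illustration only.
const sampleRemoteIds = ['bfla', 'competitors', 'wordplay'];
const sampleDocsEntries: DocsPluginEntry[] = [
  { pluginId: 'bfla', isRemote: true },
  { pluginId: 'competitors', isRemote: true },
  { pluginId: 'wordplay' },
];

console.log(findMissingRemoteFlags(sampleRemoteIds, sampleDocsEntries)); // reports 'wordplay'
```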

coderabbitai · 2025-10-10T12:47:53Z

site/docs/red-team/plugins/wordplay.md

+---
+sidebar_label: Wordplay
+description: Test AI systems for wordplay vulnerabilities that could lead to generating profanity or offensive language through innocent-seeming riddles and word puzzles
+---


⚠️ Potential issue | 🟠 Major

Add the required title field to front matter.

The front matter is missing the required title field. According to the coding guidelines, each markdown documentation file must include both title and description fields in the front matter.

As per coding guidelines

Apply this diff to add the missing title:

 ---
 sidebar_label: Wordplay
+title: Wordplay
 description: Test AI systems for wordplay vulnerabilities that could lead to generating profanity or offensive language through innocent-seeming riddles and word puzzles
 ---

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

---
sidebar_label: Wordplay
description: Test AI systems for wordplay vulnerabilities that could lead to generating profanity or offensive language through innocent-seeming riddles and word puzzles
---

---
sidebar_label: Wordplay
title: Wordplay
description: Test AI systems for wordplay vulnerabilities that could lead to generating profanity or offensive language through innocent-seeming riddles and word puzzles
---

🤖 Prompt for AI Agents

In site/docs/red-team/plugins/wordplay.md around lines 1 to 4, the front matter is missing the required title field; add a top-level title entry (e.g., title: "Wordplay") to the YAML front matter alongside the existing sidebar_label and description so the file includes both title and description as required by the coding guidelines.
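
The front matter requirement could be checked with something like the sketch below. It is an assumption-laden illustration, not an existing project script: `missingFrontMatterFields` is a hypothetical helper, and it uses a simple line-based scan of top-level keys rather than a real YAML parser.

```typescript
// Hypothetical helper: lists required front matter keys that are absent.
function missingFrontMatterFields(
  markdown: string,
  required: readonly string[] = ['title', 'description'],
): string[] {
  // Capture the text between the opening and closing '---' fences.
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(markdown);
  if (!match) {
    return [...required];
  }
  // Collect top-level keys of the form "key: value" (no YAML nesting handled).
  const keys = new Set(
    match[1]
      .split(/\r?\n/)
      .map((line) => /^([A-Za-z_][\w-]*):/.exec(line)?.[1])
      .filter((key): key is string => key !== undefined),
  );
  return required.filter((field) => !keys.has(field));
}

// Sample document assumed for illustration, shaped like the front matter above.
const sampleDoc = [
  '---',
  'sidebar_label: Wordplay',
  'description: Test AI systems for wordplay vulnerabilities',
  '---',
  '',
  '# Wordplay',
].join('\n');

console.log(missingFrontMatterFields(sampleDoc)); // reports 'title'
```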

feat: wordplay red team plugin

13fa4a7

coderabbitai bot reviewed Oct 10, 2025
