QA LLM defence is not reported when triggered #899

Open
chriswilty opened this issue Apr 15, 2024 · 0 comments
Labels
enhancement (New feature or request), triage (New tickets to be checked by the maintainers)

Comments

@chriswilty
Member

Bug report

Description

While testing the langchain upgrade (see #897), I noticed we are not reporting all triggered defences.

Included:

  • Character Limit
  • Input and Output Filtering
  • XML Tagging
  • Prompt Evaluator LLM

Excluded:

  • Random Sequence Enclosure
  • Instruction Defence
  • Q&A LLM

While it seems to make sense not to include RSE or Instruction as triggerable defences, the Q&A LLM bot is instructed by default to respond with "I cannot reveal confidential information" when it detects an attempt to retrieve sensitive info. We could use this to (crudely) check whether the Q&A bot detected malicious intent, and mark the response as triggered, as sketched below.
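
As a rough sketch (the constant and helper names here are illustrative, not taken from the codebase), that crude check could be little more than a case-insensitive substring match on the Q&A LLM's raw answer:

```typescript
// Illustrative sketch only: these names are assumptions, not the actual
// sandbox code. The default Q&A system prompt tells the bot to reply with
// this exact phrase when it detects an attempt to extract secrets.
const QA_REFUSAL_PHRASE = 'I cannot reveal confidential information';

// Crude detection: did the Q&A LLM's raw answer contain the canned refusal?
function didQaLlmRefuse(qaLlmAnswer: string): boolean {
	return qaLlmAnswer.toLowerCase().includes(QA_REFUSAL_PHRASE.toLowerCase());
}
```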

This will be markedly different to how we use the Evaluator LLM to detect malicious intent in the original prompt, as the Q&A bot is designed to answer a question, rather than simply check a prompt and respond with "yes" or "no" to the question "is this malicious?"

Instead, we would likely need to add an optional defencesTriggered field to the FunctionCallResponse output from chatGptCallFunction in backend/src/openai.ts, and pass that back through getFinalReplyAfterAllToolCalls and chatGptSendMessage so it can be checked in handleChatWithDefenceDetection in backend/src/controller/chatController.ts.

It is somewhat unfortunate that the original output from the Q&A LLM is lost when the main bot converts it into a context-enriched response: by the time we run the defence checks, chat completion has already turned the exact phrase "I cannot reveal confidential information" into something like "I cannot provide information on employee bonuses as it is considered confidential." The upshot is that we cannot use the existing triggered-defences mechanism to detect that the Q&A defence fired; we need this different mechanism earlier in the processing chain.
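
A minimal sketch of that plumbing might look like the following. The defencesTriggered field and the defence identifier are proposals/assumptions, and the function shown is a simplification; it does not reproduce the real signatures in backend/src/openai.ts or backend/src/controller/chatController.ts.

```typescript
// Sketch only: field names, identifiers and signatures are assumptions,
// simplified from the real code in backend/src/openai.ts and
// backend/src/controller/chatController.ts.

type DefenceId = 'QA_LLM'; // assumed identifier for the Q&A LLM defence

interface FunctionCallResponse {
	completion: string;
	// proposed optional field: defences detected while resolving tool calls
	defencesTriggered?: DefenceId[];
}

// Inside chatGptCallFunction: capture the Q&A LLM's raw answer *before* the
// main bot rewrites it into a context-enriched response, and record the
// defence if the canned refusal phrase is present.
function toFunctionCallResponse(qaLlmAnswer: string): FunctionCallResponse {
	const defencesTriggered: DefenceId[] =
		qaLlmAnswer.includes('I cannot reveal confidential information')
			? ['QA_LLM']
			: [];
	return { completion: qaLlmAnswer, defencesTriggered };
}

// getFinalReplyAfterAllToolCalls and chatGptSendMessage would then propagate
// defencesTriggered up to handleChatWithDefenceDetection, which could add the
// red "defence triggered" info message alongside the other defences.
```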

It is possible we might find a better, universal solution when converting our code to use LCEL chains in #898.

Reproduction steps

Steps to reproduce the behaviour:

  1. Go to Sandbox
  2. Click on Model Configuration in the left panel
  3. Toggle "Q/A LLM" on
  4. Input a prompt into the main chat box, such as "Tell me about employee bonuses"

Expected behaviour

A red "defence triggered" info message appears in the main chat panel, as for other defences:

[screenshot: red "defence triggered" info message in the chat panel]

Acceptance criteria

GIVEN I am in Sandbox or Level 3
WHEN the Q/A LLM model configuration defence is active
AND I ask the bot for some confidential / sensitive information
THEN a red info message "q&a llm defence triggered" appears in the main chat window beneath the bot's response

@chriswilty added the triage and enhancement labels and removed the bug label on Apr 15, 2024