-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add heuristics for adversarial suffixes #58
Comments
What would you think about instead using a machine learning classifier? We could generate a list of several hundred or thousand adversarial suffixes, and then train a machine learning algorithm to classify text as adversarial vs non-adversarial. It would probably need to be a neural network in order to handle the complexities of language, but if it was non-transformer based, I would think it wouldn't have the same underlying weakness as the LLM. A determined attacker could still train a suffix generator that avoids our classifier, but it would significantly increase attacker costs. It seems to me that coming up with heuristics would be quite challenging for this. I kinda think that one needs an understanding of the normal language in order to recognize a suffix as suspicious, and it would be laborious to try to encode that understanding manually. |
After thinking about this more, perhaps a better approach would be to fine tune an existing LLM. The fine-tuned LLM could be trained to recognize a wide variety of prompt injection attacks, not just adversarial suffixes. I think fine tuning could help with situations where the attacker tries to prompt inject Rebuff itself. I expect that sooner or later, OpenAI will have some mechanism to share fine-tuned models. It also seems it might be possible to fine tune Llama 2 and distribute the modified weights. |
So the issue with using a machine learning classifier here is that a gradient based adversarial input can be crafted to simultaneously trick the LLM and the classifier (especially since the model would be publicly available). We want to rely on more traditional "heuristics" to infer that a crafted input is suspicious. The current heuristics we have built in are pretty basic, but we can utilize more advanced grammar parsing etc. That doesn't mean we can't use ML models as defense layers in general though, just that it's not the solution for gradient based attacks. I think it's a great idea in general and we can start working on #13 to support that and other modular defenses that we want to add. |
Yeah, you're right that an adversary can trick both the LLM and the classifier. I'm just having trouble thinking of heuristics that might work against this sort of attack. Though maybe if I had more knowledge of more traditional NLP, I'd be able to think of some ideas. I wonder if adversarial suffix attacks are similar to each other in vector space? Perhaps the vector similarity defense could help here. |
Would be interesting to see what type of heuristics can be applied against adversarial suffixes. As background:
https://arxiv.org/abs/2307.15043
https://github.com/llm-attacks/llm-attacks
To be clear, this wouldn't be a defense for all possible adversarial attacks. It does seem like we could screen some though.
The text was updated successfully, but these errors were encountered: