Add heuristics for adversarial suffixes #58

seanpmorgan · 2023-10-04T01:39:21Z

Would be interesting to see what type of heuristics can be applied against adversarial suffixes. As background:
https://arxiv.org/abs/2307.15043
https://github.com/llm-attacks/llm-attacks

To be clear, this wouldn't be a defense for all possible adversarial attacks. It does seem like we could screen some though.

ristomcgehee · 2023-10-14T15:09:31Z

What would you think about instead using a machine learning classifier? We could generate a list of several hundred or thousand adversarial suffixes, and then train a machine learning algorithm to classify text as adversarial vs non-adversarial. It would probably need to be a neural network in order to handle the complexities of language, but if it was non-transformer based, I would think it wouldn't have the same underlying weakness as the LLM. A determined attacker could still train a suffix generator that avoids our classifier, but it would significantly increase attacker costs.

It seems to me that coming up with heuristics would be quite challenging for this. I kinda think that one needs an understanding of the normal language in order to recognize a suffix as suspicious, and it would be laborious to try to encode that understanding manually.

ristomcgehee · 2023-10-14T23:18:06Z

After thinking about this more, perhaps a better approach would be to fine tune an existing LLM. The fine-tuned LLM could be trained to recognize a wide variety of prompt injection attacks, not just adversarial suffixes. I think fine tuning could help with situations where the attacker tries to prompt inject Rebuff itself. I expect that sooner or later, OpenAI will have some mechanism to share fine-tuned models. It also seems it might be possible to fine tune Llama 2 and distribute the modified weights.

seanpmorgan · 2023-10-16T04:04:59Z

So the issue with using a machine learning classifier here is that a gradient based adversarial input can be crafted to simultaneously trick the LLM and the classifier (especially since the model would be publicly available). We want to rely on more traditional "heuristics" to infer that a crafted input is suspicious. The current heuristics we have built in are pretty basic, but we can utilize more advanced grammar parsing etc.

That doesn't mean we can't use ML models as defense layers in general though, just that it's not the solution for gradient based attacks. I think it's a great idea in general and we can start working on #13 to support that and other modular defenses that we want to add.

ristomcgehee · 2023-10-16T14:08:11Z

Yeah, you're right that an adversary can trick both the LLM and the classifier. I'm just having trouble thinking of heuristics that might work against this sort of attack. Though maybe if I had more knowledge of more traditional NLP, I'd be able to think of some ideas.

I wonder if adversarial suffix attacks are similar to each other in vector space? Perhaps the vector similarity defense could help here.

seanpmorgan added enhancement New feature or request help wanted Extra attention is needed heuristics labels Oct 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add heuristics for adversarial suffixes #58

Add heuristics for adversarial suffixes #58

seanpmorgan commented Oct 4, 2023

ristomcgehee commented Oct 14, 2023

ristomcgehee commented Oct 14, 2023

seanpmorgan commented Oct 16, 2023

ristomcgehee commented Oct 16, 2023

Add heuristics for adversarial suffixes #58

Add heuristics for adversarial suffixes #58

Comments

seanpmorgan commented Oct 4, 2023

ristomcgehee commented Oct 14, 2023

ristomcgehee commented Oct 14, 2023

seanpmorgan commented Oct 16, 2023

ristomcgehee commented Oct 16, 2023