Skip to content

The entry for [Team Blue 2, Electric Boogaloo] at the January 2025 AI-Plans AI Alignment Evals Hackathon. It uses Abliteration (Arditi et al) as a base for benchmarking model resilience to residual stream attacks.

License

Notifications You must be signed in to change notification settings

rh609/AIAE-AbliterationBench

 
 

Repository files navigation

AIAE-AbliterationBench

The entry for [Team Blue 2, Electric Boogaloo] at the January 2025 AI-Plans AI Alignment Evals Hackathon. It uses Abliteration (Arditi et al) as a base for benchmarking model resilience to residual stream attacks.

Building a benchmark for robustness to Abliteration

Abliteration (Arditi et al.) is a residual stream attack that substantially impedes a model's ability to refuse a user's requests (e.g. if asking how to build a weapon). We hypothesize that a well-aligned model which has internalized the value of keeping people safe will naturally refuse to cooperate without having to dedicate a conceptual category (instantiated as a direction in the vector space of the residual stream) to "refusal". Therefore, under this assumption, vulnerability to this attack would be an indication of a model "performatively" refusing to cooperate, understanding it is supposed to refuse, but without internalizing "keeping people safe" as an objective.

We aim to build a benchmark that evaluates the effectiveness of Abliteration at jailbreaking models (lower would be better). We also hope to investigate our hypothesis. If time allows, we have a number of directions in which we may extend the scope of the project.

Example usage:

python ablit_bench.py -n Qwen/Qwen-1.5-0.5B-Chat google/gemma-1.1-2b-it -l 2 -i 50 -b 10

Context

Arditi et al. found that refusal in LLMs (e.g. “As an AI language model, I can't assist with [...]”) is primarily mediated by a single "refusal direction" in the residual stream (i.e. the outputs of the multi-head self-attention layers and feed-forward layers inside Transformer Decoder blocks). We can directly modify these values at runtime or modify the weights so as to dramatically increase or decrease chances of refusal. Modifying an LLM in this way to uncensor it is called abliteration (portmanteau of ablation + obliteration).

Acknowledgements

First and foremost, we would like to thank Kabir Kumar and the AI-Plans team for organizing this hackathon. It was a fantastic learning experience, and we all had a very good time.

We also extend our thanks to Arditi et al for their discovery; as well as Maxime Labonne for his detailed blog post; the dev team for abliterator; and Tsadoq, developer of ErisForge, which we ended up making substantial use of and modifying to suit our use case.

About

The entry for [Team Blue 2, Electric Boogaloo] at the January 2025 AI-Plans AI Alignment Evals Hackathon. It uses Abliteration (Arditi et al) as a base for benchmarking model resilience to residual stream attacks.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 98.8%
  • Shell 1.2%