This repository is the implementation of our paper Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models, accepted at COLING 2025.
We investigate the optimization process of GCG and identify the Indirect Effect, a key bottleneck of GCG optimization, and we propose the Model Attack Gradient Index GCG (MAGIC).
The Indirect Effect is the relationship between the gradient values of the current suffix and the indices of the updated tokens: replacing tokens whose gradient values are negative fails to effectively reduce the adversarial loss. We carry out this study over 1,000 iterations of the vanilla GCG algorithm.
GCG concatenates a harmful instruction with an adversarial suffix to induce the target LLM to produce harmful content. MAGIC improves the optimization of the adversarial suffix in two ways. Gradient-based Index Selection inspects the one-hot vectors corresponding to the suffix tokens and selects only the indices whose gradient values are positive. Adaptive Multi-Coordinate Update then selects multiple tokens from this index range and updates them in a single step, jailbreaking the LLM more efficiently.
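For intuition, here is a minimal PyTorch sketch of these two steps. The function names, shapes, and hyperparameters (`top_k`, `n_coords`) are ours for illustration, not the repository's API, and the random sampling below is a simplification of the adaptive update described in the paper:

```python
import torch

def select_update_positions(onehot_grad, suffix_ids):
    """Gradient-based Index Selection (sketch): keep only suffix
    positions whose *current* token has a positive gradient value,
    since swapping tokens with negative gradient values fails to
    reduce the adversarial loss.

    onehot_grad: (suffix_len, vocab_size) gradient of the adversarial
        loss w.r.t. the one-hot encoding of the current suffix.
    suffix_ids:  (suffix_len,) current suffix token ids.
    """
    grads_at_current = onehot_grad.gather(1, suffix_ids.unsqueeze(1)).squeeze(1)
    return (grads_at_current > 0).nonzero(as_tuple=True)[0]

def multi_coordinate_update(onehot_grad, suffix_ids, positions,
                            top_k=256, n_coords=2):
    """Adaptive Multi-Coordinate Update (sketch): swap several of the
    selected positions at once, each replacement drawn from the top-k
    tokens with the most negative gradient at that position."""
    new_ids = suffix_ids.clone()
    chosen = positions[torch.randperm(len(positions))[:n_coords]]
    for pos in chosen:
        candidates = (-onehot_grad[pos]).topk(top_k).indices
        new_ids[pos] = candidates[int(torch.randint(top_k, (1,)))]
    return new_ids

# Toy usage with random numbers standing in for real model gradients.
suffix_len, vocab_size = 20, 32000
grad = torch.randn(suffix_len, vocab_size)
ids = torch.randint(vocab_size, (suffix_len,))
positions = select_update_positions(grad, ids)
new_ids = multi_coordinate_update(grad, ids, positions)
```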
Use conda to create an environment for MAGIC:

```bash
conda create -n magic python=3.10.1
conda activate magic
```
Run the following command to install dependencies:

```bash
pip install -e .
```
First, download Vicuna-7B and/or LLaMA-2-7B-Chat.
Before you begin, set your model paths in `experiments/configs/individual_xxx.py` (for individual experiments) and `experiments/configs/transfer_xxx.py` (for multiple-behavior or transfer experiments):
```python
config.model_paths = [
    "/DIR/vicuna/vicuna-7b-v1.3",
    ... # other models
]
config.tokenizer_paths = [
    "/DIR/vicuna/vicuna-7b-v1.3",
    ... # other tokenizers
]
```
To facilitate comparison with the baseline, our repository follows the codebase of the original GCG. The `experiments` folder contains the code to reproduce our MAGIC experiments on AdvBench.
- To run individual experiments with harmful behaviors and harmful strings (i.e., 1 behavior, 1 model or 1 string, 1 model), run the following code inside `experiments` (changing `vicuna` to `llama2` and changing `behaviors` to `strings` will switch to the other experiment setups):

```bash
cd launch_scripts
bash run_gcg_individual.sh vicuna behaviors
```
- To perform multiple-behavior experiments (i.e., 25 behaviors, 1 model), run the following code inside `experiments`:

```bash
cd launch_scripts
bash run_gcg_multiple.sh vicuna # or llama2
```
- To perform transfer experiments (i.e., 25 behaviors, 2 models), run the following code inside `experiments`:

```bash
cd launch_scripts
bash run_gcg_transfer.sh vicuna 2 # or vicuna_guanaco 4
```
We provide a case showing that a suffix optimized by MAGIC successfully jailbreaks GPT-4, eliciting harmful responses.
If you find this useful in your research, please consider citing:
```bibtex
@misc{li2024exploitingindexgradientsoptimizationbased,
      title={Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models},
      author={Jiahui Li and Yongchang Hao and Haoyu Xu and Xing Wang and Yu Hong},
      year={2024},
      eprint={2412.08615},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.08615},
}
```
`MAGIC` is licensed under the terms of the MIT license. See LICENSE for more details.