
Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models

Paper: https://arxiv.org/abs/2412.08615 | Data link

This code repository is the implementation of our paper: Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models, accepted by COLING 2025.

MAGIC

We investigate the optimization process of GCG and identify an issue we call the Indirect Effect, the key bottleneck of GCG optimization, and we propose the Model Attack Gradient Index GCG (MAGIC) to address it.

Motivation

We investigate the Indirect Effect between the gradient values of the current suffix and the indices of the updated tokens, which demonstrates that replacing tokens with negative gradient values fails to effectively reduce the adversarial loss. We carry out this study over 1,000 iterations of the naive GCG algorithm (a sketch of such a probe follows Fig. 1).

heatmap

Fig. 1: The heatmap reflects the changes in the current gradient values.
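
As a point of reference, such a probe can be set up as in the minimal sketch below. It assumes PyTorch and a Hugging Face causal LM loaded as model, a tokenized prompt input_ids, and Python slice objects suffix_slice and target_slice marking the adversarial suffix and target string; the helper name suffix_onehot_grads is illustrative, not an identifier from this repository. The sketch computes the gradient of the target loss with respect to a one-hot encoding of the suffix, so the gradient value on each currently selected token can be read off per position.

import torch

def suffix_onehot_grads(model, input_ids, suffix_slice, target_slice):
    # Gradient of the target loss w.r.t. a one-hot encoding of the suffix tokens.
    embed_weights = model.get_input_embeddings().weight              # (V, d)
    onehot = torch.zeros(input_ids[suffix_slice].shape[0], embed_weights.shape[0],
                         device=model.device, dtype=embed_weights.dtype)
    onehot.scatter_(1, input_ids[suffix_slice].unsqueeze(1), 1.0)
    onehot.requires_grad_()
    # Splice the differentiable suffix embeddings into the frozen prompt embeddings.
    embeds = model.get_input_embeddings()(input_ids.unsqueeze(0)).detach()
    suffix_embeds = (onehot @ embed_weights).unsqueeze(0)
    full_embeds = torch.cat([embeds[:, :suffix_slice.start],
                             suffix_embeds,
                             embeds[:, suffix_slice.stop:]], dim=1)
    logits = model(inputs_embeds=full_embeds).logits
    # Next-token loss over the target span (logits at step t predict token t+1).
    loss = torch.nn.functional.cross_entropy(
        logits[0, target_slice.start - 1 : target_slice.stop - 1],
        input_ids[target_slice])
    loss.backward()
    return onehot.grad, loss.item()

# Per-position gradient on the *currently selected* token; positions where this
# value is negative are the ones the Indirect Effect analysis concerns.
grad, _ = suffix_onehot_grads(model, input_ids, suffix_slice, target_slice)
current_token_grad = grad.gather(1, input_ids[suffix_slice].unsqueeze(1)).squeeze(1)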

Method

GCG concatenates a harmful instruction with an adversarial suffix to induce the target LLM to produce harmful content; MAGIC improves the optimization of this adversarial suffix. Gradient-based Index Selection inspects the one-hot vectors corresponding to the suffix tokens and selects only the token indices with positive gradient values. Adaptive Multi-Coordinate Update then selects multiple tokens from the previously determined index range for updating, achieving jailbreaking of LLMs (see the sketch after Fig. 2).

method

Fig. 2: An illustration of our approach, MAGIC.
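
A minimal sketch of how these two components could fit together is given below, reusing suffix_onehot_grads from the snippet above. The batching, candidate scoring, and the adaptive schedule for the number of coordinates differ in the official code; parameters such as topk and num_coords are illustrative assumptions, not the repository's API.

import torch

def magic_step(model, input_ids, suffix_slice, target_slice, topk=256, num_coords=4):
    grad, _ = suffix_onehot_grads(model, input_ids, suffix_slice, target_slice)

    # Gradient-based Index Selection: keep only suffix positions whose currently
    # selected one-hot coordinate has a positive gradient value.
    cur_ids = input_ids[suffix_slice].unsqueeze(1)                   # (L, 1)
    cur_grad = grad.gather(1, cur_ids).squeeze(1)                    # (L,)
    index_pool = (cur_grad > 0).nonzero(as_tuple=True)[0]
    if index_pool.numel() == 0:                                      # degenerate case: all positions
        index_pool = torch.arange(grad.shape[0], device=grad.device)

    # Adaptive Multi-Coordinate Update: replace several tokens at once, drawn from
    # the selected index pool; candidates per position are the top-k tokens with
    # the most negative gradient, i.e. the most promising substitutions.
    new_ids = input_ids.clone()
    perm = torch.randperm(index_pool.numel(), device=index_pool.device)
    for pos in index_pool[perm[:num_coords]]:
        candidates = (-grad[pos]).topk(topk).indices
        new_ids[suffix_slice.start + pos] = candidates[torch.randint(0, topk, (1,)).item()]
    return new_ids

As in GCG, the returned candidate suffix would then be re-scored against the target loss and kept only if it improves on the current suffix.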

Start

Use conda to create an environment for MAGIC:

conda create -n magic python=3.10.1
conda activate magic

Run the following command to install dependencies:

pip install -e .

First, download Vicuna-7B and/or LLaMA-2-7B-Chat.

Before you begin, modify the model paths in experiments/configs/individual_xxx.py (for individual experiments) and experiments/configs/transfer_xxx.py (for multiple-behavior or transfer experiments):

    config.model_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # other models
    ]
    config.tokenizer_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # other tokenizers
    ]

Experiments

To facilitate comparison with the baseline, our repository follows the code base of the original GCG. The experiments directory contains code to reproduce our MAGIC experiments on AdvBench.

  • To run individual experiments with harmful behaviors and harmful strings (i.e., 1 behavior, 1 model, or 1 string, 1 model), run the following code inside experiments (changing vicuna to llama2, or behaviors to strings, switches between the experiment setups):
cd launch_scripts
bash run_gcg_individual.sh vicuna behaviors
  • To perform multiple-behavior experiments (i.e., 25 behaviors, 1 model), run the following code inside experiments:
cd launch_scripts
bash run_gcg_multiple.sh vicuna # or llama2
  • To perform transfer experiments (i.e., 25 behaviors, 2 models), run the following code inside experiments:
cd launch_scripts
bash run_gcg_transfer.sh vicuna 2 # or vicuna_guanaco 4

Example

We provide a case showing that a suffix optimized by our MAGIC successfully jailbreaks GPT-4, eliciting harmful responses.

example

Fig. 3: An example suffix produced by MAGIC and the response from GPT-4.

Citation

If you find this useful in your research, please consider citing:

@misc{li2024exploitingindexgradientsoptimizationbased,
      title={Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models}, 
      author={Jiahui Li and Yongchang Hao and Haoyu Xu and Xing Wang and Yu Hong},
      year={2024},
      eprint={2412.08615},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.08615}, 
}

License

MAGIC is licensed under the terms of the MIT license. See LICENSE for more details.
