- Attack Searchers
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [Paper]
Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu (2023)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [Paper]
Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang (2024)
- AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs [Paper]
Zeyi Liao, Huan Sun (2024)
- Improved Techniques for Optimization-Based Jailbreaking on Large Language Models [Paper]
Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin (2024)
- AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs [Paper]
Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian (2024)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [Paper]
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain (2024)
- Efficient Black-box Adversarial Attacks via Bayesian Optimization Guided by a Function Prior [Paper]
Shuyu Cheng, Yibo Miao, Yinpeng Dong, Xiao Yang, Xiao-Shan Gao, Jun Zhu (2024)
- Automatic and Universal Prompt Injection Attacks against Large Language Models [Paper]
Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, Chaowei Xiao (2024)
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [Paper]
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion (2024)
- Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks [Paper]
Dario Pasquini, Martin Strohmeier, Carmela Troncoso (2024)
- $\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models [Paper]
Yue Xu, Wenjie Wang (2024)
- Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia [Paper]
Guangyu Shen, Siyuan Cheng, Kaiyuan Zhang, Guanhong Tao, Shengwei An, Lu Yan, Zhuo Zhang, Shiqing Ma, Xiangyu Zhang (2024)
- Fast Adversarial Attacks on Language Models In One GPU Minute [Paper]
Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi (2024)
- From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings [Paper]
Hao Wang, Hao Li, Minlie Huang, Lei Sha (2024)
- Gradient-Based Language Model Red Teaming [Paper]
Nevan Wichers, Carson Denison, Ahmad Beirami (2024)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [Paper]
Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao (2023)
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [Paper]
Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun (2023)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [Paper]
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson (2023)
- Soft-prompt Tuning for Large Language Models to Evaluate Bias [Paper]
Jacob-Junqi Tian, David Emerson, Sevil Zanjani Miyandoab, Deval Pandya, Laleh Seyyed-Kalantari, Faiza Khan Khattak (2023)
- TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models [Paper]
Jiaqi Xue, Mengxin Zheng, Ting Hua, Yilin Shen, Yepeng Liu, Ladislau Boloni, Qian Lou (2023)
- Automatic Jailbreaking of the Text-to-Image Generative AI Systems [Paper]
Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang (2024)
- ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users [Paper]
Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang (2024)
- Eliciting Language Model Behaviors using Reverse Language Models [Paper]
Jacob Pfau, Alex Infanger, Abhay Sheshadri, Ayush Panda, Julian Michael, Curtis Huebner (2023)
- No Offense Taken: Eliciting Offensiveness from Language Models [Paper]
Anugya Srivastava, Rahul Ahuja, Rohith Mukku (2023)
- LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model [Paper]
Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh (2023)
- Jailbreaking Black Box Large Language Models in Twenty Queries [Paper]
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong (2023)
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack [Paper]
Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, Mohan Kankanhalli (2023)
- Red Teaming Language Models with Language Models [Paper]
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving (2022)
- JAB: Joint Adversarial Prompting and Belief Augmentation [Paper]
Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta (2023)
- DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [Paper]
Yibo Wang, Xiangjue Dong, James Caverlee, Philip S. Yu (2023)
- AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications [Paper]
Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, Preethi Lahoti (2023)
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [Paper]
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi (2023)
- GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation [Paper]
Govind Ramesh, Yao Dou, Wei Xu (2024)
- Adversarial Attacks on GPT-4 via Simple Random Search [Paper]
Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler (2024)
- Tastle: Distract Large Language Models for Automatic Jailbreak Attack [Paper]
Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen (2024)
- All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks [Paper]
Kazuhiro Takemoto (2024)
- Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation [Paper]
Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang (2024)
- Weak-to-Strong Jailbreaking on Large Language Models [Paper]
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang (2024)
- COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability [Paper]
Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu (2024)
- Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs [Paper]
Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang (2024)
- Open Sesame! Universal Black Box Jailbreaking of Large Language Models [Paper]
Raz Lapid, Ron Langberg, Moshe Sipper (2023)
- SneakyPrompt: Jailbreaking Text-to-image Generative Models [Paper]
Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao (2023)
- RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs [Paper]
Xuan Chen, Yuzhou Nie, Lu Yan, Yunshu Mao, Wenbo Guo, Xiangyu Zhang (2024)
- When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search [Paper]
Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang (2024)
- QROA: A Black-Box Query-Response Optimization Attack on LLMs [Paper]
Hussein Jawad, Nicolas J.-B. Brunel (2024)
- Unveiling the Implicit Toxicity in Large Language Models [Paper]
Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang (2023)
- Red Teaming Game: A Game-Theoretic Framework for Red Teaming Language Models [Paper]
Chengdong Ma, Ziran Yang, Minquan Gao, Hai Ci, Jun Gao, Xuehai Pan, Yaodong Yang (2023)
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch [Paper]
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell (2023)