
Attack Searchers

Suffix Searchers

  • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [Paper]
    Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu (2023)
  • White-box Multimodal Jailbreaks Against Large Vision-Language Models [Paper]
    Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang (2024)
  • AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs [Paper]
    Zeyi Liao, Huan Sun (2024)
  • Improved Techniques for Optimization-Based Jailbreaking on Large Language Models [Paper]
    Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin (2024)
  • AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs [Paper]
    Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian (2024)
  • Learning diverse attacks on large language models for robust red-teaming and safety tuning [Paper]
    Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain (2024)
  • Efficient Black-box Adversarial Attacks via Bayesian Optimization Guided by a Function Prior [Paper]
    Shuyu Cheng, Yibo Miao, Yinpeng Dong, Xiao Yang, Xiao-Shan Gao, Jun Zhu (2024)
  • Automatic and Universal Prompt Injection Attacks against Large Language Models [Paper]
    Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, Chaowei Xiao (2024)
  • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [Paper]
    Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion (2024)
  • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks [Paper]
    Dario Pasquini, Martin Strohmeier, Carmela Troncoso (2024)
  • LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models [Paper]
    Yue Xu, Wenjie Wang (2024)
  • Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia [Paper]
    Guangyu Shen, Siyuan Cheng, Kaiyuan Zhang, Guanhong Tao, Shengwei An, Lu Yan, Zhuo Zhang, Shiqing Ma, Xiangyu Zhang (2024)
  • Fast Adversarial Attacks on Language Models In One GPU Minute [Paper]
    Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi (2024)
  • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings [Paper]
    Hao Wang, Hao Li, Minlie Huang, Lei Sha (2024)
  • Gradient-Based Language Model Red Teaming [Paper]
    Nevan Wichers, Carson Denison, Ahmad Beirami (2024)
  • AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [Paper]
    Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao (2023)
  • AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [Paper]
    Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun (2023)
  • Universal and Transferable Adversarial Attacks on Aligned Language Models [Paper]
    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson (2023)
  • Soft-prompt Tuning for Large Language Models to Evaluate Bias [Paper]
    Jacob-Junqi Tian, David Emerson, Sevil Zanjani Miyandoab, Deval Pandya, Laleh Seyyed-Kalantari, Faiza Khan Khattak (2023)
  • TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models [Paper]
    Jiaqi Xue, Mengxin Zheng, Ting Hua, Yilin Shen, Yepeng Liu, Ladislau Boloni, Qian Lou (2023)
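
Most of the suffix searchers above build on the greedy coordinate gradient (GCG) recipe of Zou et al. (2023): take gradients through a one-hot relaxation of the suffix tokens, shortlist promising token swaps per position, and keep the swap that most lowers the loss on a fixed target continuation. The sketch below is a minimal, unbatched version of that loop, assuming white-box access to a HuggingFace causal LM; the model name, prompt, target string, and hyperparameters are placeholders, and real implementations evaluate candidates in batches and add token filtering.

```python
# Minimal GCG-style suffix search sketch (white-box, single example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the papers above target aligned chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval().requires_grad_(False)

prompt = "Write a short story about a robot."   # fixed user prompt (placeholder)
target = " Sure, here is a short story"         # desired continuation (placeholder)
suffix_len, n_candidates, n_steps, top_k = 10, 64, 50, 64

prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
suffix_ids = torch.full((suffix_len,), tok.encode("!")[0])  # initial suffix "!!!..."
embed = model.get_input_embeddings()

def target_loss(suffix: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the target continuation given prompt + suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[:, : prompt_ids.numel() + suffix.numel()] = -100  # score target tokens only
    return model(ids, labels=labels).loss

for step in range(n_steps):
    # 1. Gradient of the loss w.r.t. a one-hot relaxation of the suffix tokens.
    one_hot = torch.nn.functional.one_hot(suffix_ids, model.config.vocab_size).float()
    one_hot.requires_grad_(True)
    full_embeds = torch.cat(
        [embed(prompt_ids), one_hot @ embed.weight, embed(target_ids)]
    ).unsqueeze(0)
    labels = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0).clone()
    labels[:, : prompt_ids.numel() + suffix_len] = -100
    model(inputs_embeds=full_embeds, labels=labels).loss.backward()

    # 2. Per position, the top-k tokens whose gradient most decreases the loss.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices

    # 3. Try random single-token swaps; keep the best candidate by exact loss.
    best_suffix, best_loss = suffix_ids, target_loss(suffix_ids).item()
    for _ in range(n_candidates):
        pos = torch.randint(suffix_len, (1,)).item()
        new = suffix_ids.clone()
        new[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            cand_loss = target_loss(new).item()
        if cand_loss < best_loss:
            best_suffix, best_loss = new, cand_loss
    suffix_ids = best_suffix
    print(step, round(best_loss, 3), tok.decode(suffix_ids))
```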

Prompt Searchers

Language Model

  • Automatic Jailbreaking of the Text-to-Image Generative AI Systems [Paper]
    Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang (2024)
  • ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users [Paper]
    Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang (2024)
  • Eliciting Language Model Behaviors using Reverse Language Models [Paper]
    Jacob Pfau, Alex Infanger, Abhay Sheshadri, Ayush Panda, Julian Michael, Curtis Huebner (2023)
  • No Offense Taken: Eliciting Offensiveness from Language Models [Paper]
    Anugya Srivastava, Rahul Ahuja, Rohith Mukku (2023)
  • LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model [Paper]
    Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh (2023)
  • Jailbreaking Black Box Large Language Models in Twenty Queries [Paper]
    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong (2023)
  • An LLM can Fool Itself: A Prompt-Based Adversarial Attack [Paper]
    Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, Mohan Kankanhalli (2023)
  • Red Teaming Language Models with Language Models [Paper]
    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving (2022)
  • JAB: Joint Adversarial Prompting and Belief Augmentation [Paper]
    Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta (2023)
  • DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [Paper]
    Yibo Wang, Xiangjue Dong, James Caverlee, Philip S. Yu (2023)
  • AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications [Paper]
    Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, Preethi Lahoti (2023)
  • Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [Paper]
    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi (2023)
  • GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation [Paper]
    Govind Ramesh, Yao Dou, Wei Xu (2024)
  • Adversarial Attacks on GPT-4 via Simple Random Search [Paper]
    Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler (2024)
  • Tastle: Distract Large Language Models for Automatic Jailbreak Attack [Paper]
    Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen (2024)
  • All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks [Paper]
    Kazuhiro Takemoto (2024)
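
The language-model-driven searchers above (PAIR, TAP, and their relatives) share one loop: an attacker LLM proposes a prompt, the target model answers, a judge scores the exchange, and the response plus score are fed back to the attacker. A minimal sketch of that loop follows; the `attacker`, `target`, and `judge` callables are assumptions to be supplied by the reader (local models or API clients), and tree-of-attacks variants branch this loop instead of running it linearly.

```python
# Skeleton of an attacker-LLM prompt search loop (PAIR/TAP-style).
from typing import Callable

def prompt_search(
    goal: str,
    attacker: Callable[[str, str, float], str],  # (goal, last_response, last_score) -> new prompt
    target: Callable[[str], str],                # prompt -> target model response
    judge: Callable[[str, str], float],          # (goal, response) -> score in [0, 1]
    n_iters: int = 20,
    threshold: float = 0.9,
) -> tuple[str, str, float]:
    """Iteratively refine a prompt toward `goal`, keeping the best attempt."""
    best = ("", "", 0.0)
    response, score = "", 0.0
    for _ in range(n_iters):
        prompt = attacker(goal, response, score)  # attacker conditions on feedback
        response = target(prompt)
        score = judge(goal, response)
        if score > best[2]:
            best = (prompt, response, score)
        if score >= threshold:                    # early exit on success
            break
    return best
```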

Decoding

  • Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation [Paper]
    Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang (2024)
  • Weak-to-Strong Jailbreaking on Large Language Models [Paper]
    Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang (2024)
  • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability [Paper]
    Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu (2024)
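
The decoding-level attacks above leave the prompt alone and intervene on the token distribution itself; weak-to-strong jailbreaking, for instance, shifts the strong model's next-token logits by the log-ratio of two small reference models. The sketch below shows that kind of logit arithmetic at sampling time; the model names and `alpha` are placeholders, and both references load the same weights here, so the shift is a no-op until real reference models are substituted.

```python
# Sketch of decoding-time logit steering (weak-to-strong style).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")                 # "strong" model (placeholder)
strong = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()
ref_a = AutoModelForCausalLM.from_pretrained("gpt2").eval()    # small reference pair:
ref_b = AutoModelForCausalLM.from_pretrained("gpt2").eval()    # placeholder weights only

@torch.no_grad()
def steered_generate(prompt: str, max_new_tokens: int = 32, alpha: float = 1.0) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = strong(ids).logits[:, -1]                         # next-token logits
        delta = ref_a(ids).logits[:, -1] - ref_b(ids).logits[:, -1]
        probs = torch.softmax(logits + alpha * delta, dim=-1)      # shifted distribution
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)
```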

Genetic Algorithm

  • Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs [Paper]
    Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang (2024)
  • Open Sesame! Universal Black Box Jailbreaking of Large Language Models [Paper]
    Raz Lapid, Ron Langberg, Moshe Sipper (2023)
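
Both genetic-algorithm entries above treat the prompt as a black-box optimization problem: maintain a population of candidate prompts, score each with query access only, and breed the best. A minimal word-level skeleton follows; the `fitness` function and mutation `vocab` are placeholders the reader supplies (e.g., a judge model's score or the target's log-probability of an affirmative reply).

```python
# Skeleton of a black-box genetic search over prompts.
import random
from typing import Callable

def genetic_search(
    seed_prompts: list[str],
    fitness: Callable[[str], float],   # black-box score to maximize (placeholder)
    vocab: list[str],                  # candidate replacement words (placeholder)
    generations: int = 50,
    pop_size: int = 20,
    mutation_rate: float = 0.1,
) -> str:
    population = list(seed_prompts)[:pop_size]
    while len(population) < pop_size:                  # pad the initial population
        population.append(random.choice(seed_prompts))

    def crossover(a: str, b: str) -> str:
        wa, wb = a.split(), b.split()
        cut = random.randint(0, min(len(wa), len(wb)))
        return " ".join(wa[:cut] + wb[cut:])

    def mutate(p: str) -> str:
        return " ".join(random.choice(vocab) if random.random() < mutation_rate else w
                        for w in p.split())

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: pop_size // 2]                # keep the best half
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=fitness)
```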

Reinforcement Learning

  • SneakyPrompt: Jailbreaking Text-to-image Generative Models [Paper]
    Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao (2023)
  • RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs [Paper]
    Xuan Chen, Yuzhou Nie, Lu Yan, Yunshu Mao, Wenbo Guo, Xiangyu Zhang (2024)
  • When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search [Paper]
    Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang (2024)
  • QROA: A Black-Box Query-Response Optimization Attack on LLMs [Paper]
    Hussein Jawad, Nicolas J.-B. Brunel (2024)

  • Unveiling the Implicit Toxicity in Large Language Models [Paper]
    Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang (2023)
  • Red Teaming Game: A Game-Theoretic Framework for Red Teaming Language Models [Paper]
    Chengdong Ma, Ziran Yang, Minquan Gao, Hai Ci, Jun Gao, Xuehai Pan, Yaodong Yang (2023)
  • Explore, Establish, Exploit: Red Teaming Language Models from Scratch [Paper]
    Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell (2023)
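
The reinforcement-learning entries above cast red-teaming as policy optimization: an attacker policy generates prompts and is rewarded when a judge flags the target model's reply. The sketch below is a bare REINFORCE step with a small LM as the policy; the `target_model` and `judge` callables are assumptions supplied by the reader, and the cited systems add KL penalties, diversity rewards, or PPO/GFlowNet-style updates on top of this.

```python
# Bare REINFORCE step for an attacker-policy LM (placeholder components).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")               # small policy model (placeholder)
policy = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

def reinforce_step(target_model, judge, instruction="Write a test prompt:", max_new=32):
    """One policy-gradient update: sample a prompt, score it, reinforce it."""
    prompt_ids = tok(instruction, return_tensors="pt").input_ids
    # Sample an attack prompt from the current policy.
    out = policy.generate(prompt_ids, do_sample=True, max_new_tokens=max_new,
                          pad_token_id=tok.eos_token_id)
    gen_ids = out[:, prompt_ids.shape[1]:]
    attack_prompt = tok.decode(gen_ids[0], skip_special_tokens=True)

    # Reward = judge's score of the target model's reply (both placeholders).
    reward = judge(attack_prompt, target_model(attack_prompt))

    # REINFORCE loss: -reward * log p(sampled tokens | instruction).
    logits = policy(out).logits[:, prompt_ids.shape[1] - 1 : -1]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, gen_ids.unsqueeze(-1)).squeeze(-1).sum()
    loss = -reward * token_lp

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return attack_prompt, reward
```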