- Surveys, Taxonomies, and more
- A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures [Paper]
Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Jie Fu, Yichao Feng, Fengjun Pan, Luu Anh Tuan (2024)
- Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems [Paper]
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li (2024)
- Current state of LLM Risks and AI Guardrails [Paper]
Suriya Ganesh Ayyamperumal, Limin Ge (2024)
- Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey [Paper]
Shang Wang, Tianqing Zhu, Bo Liu, Ming Ding, Xu Guo, Dayong Ye, Wanlei Zhou, Philip S. Yu (2024)
- TrustLLM: Trustworthiness in Large Language Models [Paper]
Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, Joaquin Vanschoren, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yong Chen, Yue Zhao (2024)
- Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications [Paper]
Stephen Burabari Tete (2024)
- PRISM: A Design Framework for Open-Source Foundation Model Safety [Paper]
Terrence Neumann, Bryan Jones (2024)
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security [Paper]
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu (2024)
- Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs [Paper]
Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr (2024)
- Security and Privacy Challenges of Large Language Models: A Survey [Paper]
Badhan Chandra Das, M. Hadi Amini, Yanzhao Wu (2024)
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models [Paper]
Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger (2024)
- Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward [Paper]
Xuan Xie, Jiayang Song, Zhehua Zhou, Yuheng Huang, Da Song, Lei Ma (2024)
- Human-AI Safety: A Descendant of Generative AI and Control Systems Safety [Paper]
Andrea Bajcsy, Jaime F. Fisac (2024)
- Robust Testing of AI Language Model Resiliency with Novel Adversarial Prompts [Paper]
Brendan Hannon, Yulia Kumar, Dejaun Gayle, J. Jenny Li, Patricia Morreale (2024)
- Exploring Vulnerabilities and Protections in Large Language Models: A Survey [Paper]
Frank Weizhen Liu, Chenhui Hu (2024)
- Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models [Paper]
Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vaibhav Kumar, Vinija Jain, Aman Chadha (2024)
- Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models [Paper]
Yue Xu, Wenjie Wang (2024)
- Comprehensive Assessment of Jailbreak Attacks Against LLMs [Paper]
Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang (2024)
- LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study [Paper]
Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek (2024)
- An Early Categorization of Prompt Injection Attacks on Large Language Models [Paper]
Sippo Rossi, Alisia Marianne Michel, Raghava Rao Mukkamala, Jason Bennett Thatcher (2024)
- A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models [Paper]
Aysan Esmradi, Daniel Wankit Yip, Chun Fai Chan (2023)
- Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild [Paper]
Nanna Inie, Jonathan Stray, Leon Derczynski (2023)
- Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems [Paper]
Guangjing Wang, Ce Zhou, Yuanda Wang, Bocheng Chen, Hanqing Guo, Qiben Yan (2023)
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks [Paper]
Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh (2023)
- Adversarial Attacks and Defenses in Large Language Models: Old and New Threats [Paper]
Leo Schwinn, David Dobre, Stephan Günnemann, Gauthier Gidel (2023)
- Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition [Paper]
Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, Jordan Boyd-Graber (2023)
- "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models [Paper]
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang (2023)
- Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks [Paper]
Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury (2023)
- Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices [Paper]
Sara Abdali, Richard Anarfi, CJ Barberan, Jia He (2024)
- Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal [Paper]
Rahul Pankajakshan, Sumitra Biswal, Yuvaraj Govindarajulu, Gilad Gressel (2024)
- Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications [Paper]
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran (2023)
- Privacy in Large Language Models: Attacks, Defenses and Future Directions [Paper]
Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song (2023)
- Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities [Paper]
Maximilian Mozes, Xuanli He, Bennett Kleinberg, Lewis D. Griffin (2023)
- From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy [Paper]
Maanak Gupta, CharanKumar Akiri, Kshitiz Aryal, Eli Parker, Lopamudra Praharaj (2023)
- Beyond the Safeguards: Exploring the Security Risks of ChatGPT [Paper]
Erik Derner, Kristina Batistič (2023)
- Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [Paper]
Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, Minlie Huang (2023)
- The power of generative AI in cybersecurity: Opportunities and challenges [Paper]
Shibo Wen (2024)
- An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping [Paper]
Boming Xia, Qinghua Lu, Liming Zhu, Zhenchang Xing (2024)
- Safeguarding Large Language Models: A Survey [Paper]
Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang (2024)
- Coercing LLMs to do and reveal (almost) anything [Paper]
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein (2024)
- A Security Risk Taxonomy for Large Language Models [Paper]
Erik Derner, Kristina Batistič, Jan Zahálka, Robert Babuška (2023)
- The History and Risks of Reinforcement Learning and Human Feedback [Paper]
Nathan Lambert, Thomas Krendl Gilbert, Tom Zick (2023)
- From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT, Google Bard and Claude [Paper]
Sayak Saha Roy, Poojitha Thota, Krishna Vamsi Naragam, Shirin Nilizadeh (2023)
- AI Deception: A Survey of Examples, Risks, and Potential Solutions [Paper]
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks (2023)
- Generating Phishing Attacks using ChatGPT [Paper]
Sayak Saha Roy, Krishna Vamsi Naragam, Shirin Nilizadeh (2023)
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study [Paper]
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, Yang Liu (2023)
- Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback [Paper]
Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale (2023)
- AI Safety: A Climb To Armageddon? [Paper]
Herman Cappelen, Josh Dever, John Hawthorne (2024)
- [WIP] Jailbreak Paradox: The Achilles' Heel of LLMs [Paper]
Abhinav Rao, Monojit Choudhury, Somak Aditya (2024)
- A Safe Harbor for AI Evaluation and Red Teaming [Paper]
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson (2024)
- The Ethics of Interaction: Mitigating Security Threats in LLMs [Paper]
Ashutosh Kumar, Shiv Vignesh Murthy, Sagarika Singh, Swathy Ragupathy (2024)
- Red-Teaming for Generative AI: Silver Bullet or Security Theater? [Paper]
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary C. Lipton, Hoda Heidari (2024)
- The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward [Paper]
Alexander J. Titus, Adam H. Russell (2023)
- Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity [Paper]
Terry Yue Zhuo, Yujin Huang, Chunyang Chen, Zhenchang Xing (2023)
- Red-Teaming Segment Anything Model [Paper]
Krzysztof Jankowski, Bartlomiej Sobieski, Mateusz Kwiatkowski, Jakub Szulc, Michal Janik, Hubert Baniecki, Przemyslaw Biecek (2024)
- How Ethical Should AI Be? How AI Alignment Shapes the Risk Preferences of LLMs [Paper]
Shumiao Ouyang, Hayong Yun, Xingjian Zheng (2024)
- Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts [Paper]
Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong (2024)
- Exploring Safety-Utility Trade-Offs in Personalized Language Models [Paper]
Anvesh Rao Vijjini, Somnath Basu Roy Chowdhury, Snigdha Chaturvedi (2024)
- AI Risk Management Should Incorporate Both Safety and Security [Paper]
Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, Luxi He, Kaixuan Huang, Udari Madhushani, Vikash Sehwag, Weijia Shi, Boyi Wei, Tinghao Xie, Danqi Chen, Pin-Yu Chen, Jeffrey Ding, Ruoxi Jia, Jiaqi Ma, Arvind Narayanan, Weijie J Su, Mengdi Wang, Chaowei Xiao, Bo Li, Dawn Song, Peter Henderson, Prateek Mittal (2024)
- Adversaries Can Misuse Combinations of Safe Models [Paper]
Erik Jones, Anca Dragan, Jacob Steinhardt (2024)
- Finding Safety Neurons in Large Language Models [Paper]
Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li (2024)
- Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [Paper]
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson (2024)
- Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue [Paper]
Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su (2024)
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity [Paper]
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea (2024)
- Tradeoffs Between Alignment and Helpfulness in Language Models [Paper]
Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua (2024)
- Causality Analysis for Evaluating the Security of Large Language Models [Paper]
Wei Zhao, Zhe Li, Jun Sun (2023)
- Fake Alignment: Are LLMs Really Aligned Well? [Paper]
Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang (2023)
- Transfer Attacks and Defenses for Large Language Models on Coding Tasks [Paper]
Chi Zhang, Zifan Wang, Ravi Mangal, Matt Fredrikson, Limin Jia, Corina Pasareanu (2023)
- "It's a Fair Game", or Is It? Examining How Users Navigate Disclosure Risks and Benefits When Using LLM-Based Conversational Agents [Paper]
Zhiping Zhang, Michelle Jia, Hao-Ping Lee, Bingsheng Yao, Sauvik Das, Ada Lerner, Dakuo Wang, Tianshi Li (2023)
- Are aligned neural networks adversarially aligned? [Paper]
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt (2023)
- Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks [Paper]
Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto (2023)
- Can Large Language Models Change User Preference Adversarially? [Paper]
Varshini Subhash (2023)
- Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis [Paper]
Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang (2024)
- How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [Paper]
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li (2024)
- Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models [Paper]
Sarah Ball, Frauke Kreuter, Nina Rimsky (2024)
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep [Paper]
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson (2024)