- Evaluations
- "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak [Paper]
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Jiayi Mao, Xueqi Cheng (2024)
- A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses [Paper]
David Glukhov, Ziwen Han, Ilia Shumailov, Vardan Papyan, Nicolas Papernot (2024)
- Jailbreaking as a Reward Misspecification Problem [Paper]
Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong (2024)
- Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak [Paper]
Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik (2024)
- A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models [Paper]
Daniel Wankit Yip, Aysan Esmradi, Chun Fai Chan (2024)
- AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models [Paper]
Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang (2024)
- How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries [Paper]
Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee (2024)
- The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [Paper]
Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral (2023)
- MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models [Paper]
Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang (2024)
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned [Paper]
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark (2022)
- Safety Assessment of Chinese Large Language Models [Paper]
Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang (2023)
- DICES Dataset: Diversity in Conversational AI Evaluation for Safety [Paper]
Lora Aroyo, Alex S. Taylor, Mark Diaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-Garcia, Vinodkumar Prabhakaran, Ding Wang (2023)
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models [Paper]
Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan (2023)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [Paper]
Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin (2023)
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs [Paper]
Zhao Xu, Fan Liu, Hao Liu (2024)
- JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models [Paper]
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang (2024)
- S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models [Paper]
Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Hui Xue, Wenhai Wang, Kui Ren, Jingyi Wang (2024)
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [Paper]
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri (2024)
- Hacc-Man: An Arcade Game for Jailbreaking LLMs [Paper]
Matheus Valentim, Jeanette Falk, Nanna Inie (2024)
- MoralBench: Moral Evaluation of LLMs [Paper]
Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, Yongfeng Zhang (2024)
- JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models [Paper]
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang (2024)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [Paper]
Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li (2024)
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [Paper]
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong (2024)
- From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards [Paper]
Khaoula Chehbouni, Megha Roshan, Emmanuel Ma, Futian Andrew Wei, Afaf Taik, Jackie CK Cheung, Golnoosh Farnadi (2024)
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal [Paper]
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks (2024)
- SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [Paper]
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao (2024)
- A StrongREJECT for Empty Jailbreaks [Paper]
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer (2024)
- Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [Paper]
Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu (2023)
- Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [Paper]
Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell (2023)
- Can LLMs Follow Simple Rules? [Paper]
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner (2023)
- SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models [Paper]
Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, Paul Röttger (2023)
- Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains [Paper]
Chia-Chien Hung, Wiem Ben Rim, Lindsay Frost, Lars Bruckner, Carolin Lawrence (2023)
- SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese [Paper]
Liang Xu, Kangkang Zhao, Lei Zhu, Hang Xue (2023)
- SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions [Paper]
Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang (2023)
- SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety [Paper]
Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy (2024)
- Testing Language Model Agents Safely in the Wild [Paper]
Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau (2023)
- JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets [Paper]
Zhihua Jin, Shiyi Liu, Haotian Li, Xun Zhao, Huamin Qu (2024)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [Paper]
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri (2024)
- CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [Paper]
Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi (2024)
- Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets [Paper]
Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng Zhang, Wenqiang Lei (2024)
- garak: A Framework for Security Probing Large Language Models [Paper]
Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie (2024)
- JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models [Paper]
Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen (2024)