Awesome LLM Unlearning


Description

A collection of papers and resources about Machine Unlearning on LLMs.

Another collection of Vision Language Models and Vision Generative models can be found here.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various tasks, but their training typically requires vast amounts of data, raising concerns in legal and ethical domains. Issues such as potential copyright disputes, data authenticity, and privacy concerns have been brought to the forefront. Machine unlearning offers a potential solution to these challenges, even though it presents new hurdles when applied to LLMs. In this repository, we aim to collect and organize surveys, datasets, approaches, and evaluation metrics pertaining to machine unlearning on LLMs, with the hope of providing valuable insights for researchers in this field.

Survey

| Paper Title | Venue | Year |
| --- | --- | --- |
| Knowledge unlearning for LLMs: Tasks, methods, and challenges | ArXiv | 2023.11 |
| Machine Unlearning of Pre-trained Large Language Models | ArXiv | 2024.02 |
| Rethinking Machine Unlearning for Large Language Models | ArXiv | 2024.02 |
| Machine Unlearning: Taxonomy, Metrics, Applications, Challenges, and Prospects | ArXiv | 2024.03 |
| The Frontier of Data Erasure: Machine Unlearning for Large Language Models | ArXiv | 2024.03 |

Regulations

| Title | Key Words | Year |
| --- | --- | --- |
| The EU General Data Protection Regulation (GDPR) | GDPR | 2017 |
| Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence |  | 2023 |
| Scalable Extraction of Training Data from (Production) Language Models | Privacy Concerns | 2023 |

Methods

Model-based Methods

Gradient ascent and its variants

| Paper Title | Author | Paper with code | Key words | Venue | Time |
| --- | --- | --- | --- | --- | --- |
| Composing Parameter-Efficient Modules with Arithmetic Operations | Zhang et al. | Github | uses LoRA to create task vectors and accomplishes unlearning by negating the task vectors. | NeurIPS 2023 | 2023-06 |
| Knowledge Unlearning for Mitigating Privacy Risks in Language Models | Jang et al. | Github | updates the model parameters by maximizing the likelihood of mis-prediction on samples in the forget set $D_f$ (see the sketch after this table). | ACL 2023 | 2023-07 |
| Unlearning Bias in Language Models by Partitioning Gradients | Yu et al. | Github | minimizes the likelihood of predictions on relabeled forgetting data. | ACL 2023 | 2023-07 |
| Who’s Harry Potter? Approximate Unlearning in LLMs | Eldan et al. | HuggingFace | descent-based fine-tuning over relabeled or randomly labeled forgetting data, where generic translations replace the unlearned texts. | ICLR 2024 | 2023-10 |
| Unlearn What You Want to Forget: Efficient Unlearning for LLMs | Chen and Yang | Github | fine-tunes an adapter over the unlearning objective that acts as an unlearning layer within the LLM. | EMNLP 2023 | 2023-12 |
| Machine Unlearning of Pre-trained Large Language Models | Yao et al. | Github | incorporates random labeling to augment the unlearning objective and ensure utility preservation on the retain set $D_r$. | ArXiv | 2024-02 |
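
The methods in this family share one core recipe: ascend the language-modeling loss on the forget set $D_f$ while descending it on the retain set $D_r$. The sketch below illustrates that recipe with PyTorch and Transformers; the model name, the placeholder data, and the `alpha` trade-off weight are illustrative assumptions, not any paper's released setup.

```python
# Minimal gradient-ascent unlearning sketch (illustrative assumptions only):
# ascend the LM loss on the forget set D_f, descend it on the retain set D_r.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM you can fine-tune works
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def lm_loss(texts):
    """Token-level cross-entropy of the model on a batch of strings."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return model(**batch, labels=batch["input_ids"]).loss

forget_batches = [["<text to forget>"]]   # D_f, placeholder data
retain_batches = [["<text to keep>"]]     # D_r, placeholder data
alpha = 1.0                               # forget/retain trade-off (assumption)

for forget_texts, retain_texts in zip(forget_batches, retain_batches):
    # The negative sign turns gradient descent on D_f into gradient ascent.
    loss = -lm_loss(forget_texts) + alpha * lm_loss(retain_texts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```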

Localization-informed unlearning

| Paper Title | Author | Paper with code | Key words | Venue | Time |
| --- | --- | --- | --- | --- | --- |
| Locating and Editing Factual Associations in GPT | Meng et al. | Github | localization is accomplished through representation denoising, also known as causal tracing, focusing on model layers as the unit. | ArXiv | 2022-02 |
| Unlearning Bias in Language Models by Partitioning Gradients | Yu et al. | Github | gradient-based saliency is employed to identify the crucial weights that need to be fine-tuned to achieve the unlearning objective (see the sketch after this table). | ACL 2023 | 2023-07 |
| DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models | Wu et al. | Github | neurons that respond to unlearning targets are identified within the feed-forward network and selected for knowledge unlearning. | EMNLP 2023 | 2023-10 |
| Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks | Patil et al. | Github | information about unlearning targets must be deleted wherever it is represented in the model to protect against extraction attacks. | ArXiv | 2023-09 |
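
As a companion to the gradient-ascent sketch above, here is a hedged sketch of the localization idea: score parameters by the gradient magnitude of the forget-set loss and restrict unlearning updates to the most salient ones. It reuses `model` and `lm_loss` from the previous sketch; the tensor-level granularity and the `top_k` value are assumptions, not any paper's released procedure.

```python
# Localization via gradient saliency (sketch; reuses `model` and `lm_loss`
# from the gradient-ascent example above; granularity and top_k are assumptions).
import torch

def saliency_scores(forget_texts):
    """Mean absolute gradient of the forget-set loss per parameter tensor."""
    model.zero_grad()
    lm_loss(forget_texts).backward()
    return {
        name: p.grad.abs().mean().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

scores = saliency_scores(["<text to forget>"])   # placeholder D_f
top_k = 10                                       # tensors to unlearn over (assumption)
salient = set(sorted(scores, key=scores.get, reverse=True)[:top_k])

# Freeze everything else, then run the unlearning objective (e.g., the
# gradient-ascent loop above) so it only updates the salient tensors.
for name, p in model.named_parameters():
    p.requires_grad = name in salient
```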

Influence function-based method

| Paper Title | Author | Paper with code | Key words | Venue | Time |
| --- | --- | --- | --- | --- | --- |
| Studying Large Language Model Generalization with Influence Functions | Grosse et al. |  | the potential of influence functions in LLM unlearning may be underestimated: their scalability issues and approximation errors can be mitigated by focusing on localized weights that are salient to unlearning. | ArXiv | 2023-08 |

Other model-based method

| Paper Title | Author | Paper with code | Key words | Venue | Time |
| --- | --- | --- | --- | --- | --- |
| Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks | Patil et al. | Github | defending against extraction attacks. | ICLR 2024 | 2023-09 |
| Learning and Forgetting Unsafe Examples in Large Language Models | Zhao et al. |  | fine-tuning based. | ArXiv | 2023-12 |
| Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models | Gu et al. |  | sequential editing of LLMs may compromise their general capabilities. | ArXiv | 2024-03 |
| Towards Efficient and Effective Unlearning of Large Language Models for Recommendation | Wang et al. | Github | applies LLM unlearning to recommendation. | ArXiv | 2024-03 |
| The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Li et al. | Homepage | steers the model towards a novice-like level of hazardous knowledge via a loss combining a forget loss and a retain loss: the forget loss bends the model's representations towards those of a novice, while the retain loss limits how much general capability is removed (see the sketch after this table). | ArXiv | 2024-03 |
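
The WMDP row describes a representation-level forget/retain loss. The sketch below is an assumption-heavy approximation of that idea, not the authors' released code: the forget term pushes hidden activations on hazardous text toward a scaled random control vector, while the retain term keeps activations on benign text close to a frozen copy of the model. The layer index, scaling, retain coefficient, and data are all placeholders, and `model`/`tok` come from the gradient-ascent sketch above.

```python
# Representation-steering forget/retain loss (hedged approximation of the idea
# in the WMDP row; reuses `model` and `tok` from the gradient-ascent sketch).
import copy
import torch
import torch.nn.functional as F

frozen = copy.deepcopy(model).eval()         # reference model for the retain term
for p in frozen.parameters():
    p.requires_grad = False

layer = 7                                    # hidden layer to steer (assumption)
control = torch.randn(model.config.hidden_size)
control = 100.0 * control / control.norm()   # scaled "novice" direction (assumption)

def hidden_at_layer(m, texts):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return m(**batch, output_hidden_states=True).hidden_states[layer]

forget_acts = hidden_at_layer(model, ["<hazardous text>"])   # placeholder D_f
retain_acts = hidden_at_layer(model, ["<benign text>"])      # placeholder D_r
with torch.no_grad():
    retain_ref = hidden_at_layer(frozen, ["<benign text>"])

# Forget: bend representations toward the control vector; retain: stay close
# to the frozen model on benign data. Minimize `loss` with any optimizer.
forget_loss = F.mse_loss(forget_acts, control.expand_as(forget_acts))
retain_loss = F.mse_loss(retain_acts, retain_ref)
loss = forget_loss + 1.0 * retain_loss       # retain coefficient is an assumption
```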

Data-based Methods

Input-based method

| Paper Title | Author | Paper with code | Key words | Venue | Time |
| --- | --- | --- | --- | --- | --- |
| Memory-assisted prompt editing to improve GPT-3 after deployment | Madaan et al. | Github | input-based methods show promise for coping with restricted access to black-box LLMs and for achieving parameter-efficient LLM unlearning. | EMNLP 2022 | 2022-01 |
| Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations | Achintalwar et al. | No Code Available | aligns a company's internal-facing enterprise chatbot to its business conduct guidelines. | ArXiv | 2024-03 |
| Large Language Model Unlearning via Embedding-Corrupted Prompts | Chris Yuhao Liu et al. | No Code Available | enforces an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts that should be forgotten (see the sketch after this table). | ArXiv | 2024-06 |
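
Input-based methods intervene on prompts rather than weights. The sketch below shows only the general shape of that idea, not any paper's method: a keyword-based stand-in for a prompt classifier flags queries about the unlearning target and intercepts them before generation. The topics, the refusal text, and the reuse of `model`/`tok` from the first sketch are assumptions.

```python
# Input-based guard sketch (illustrative only; reuses `model` and `tok` from
# the gradient-ascent example). A real system would use a trained prompt
# classifier instead of keyword matching.
FORGET_TOPICS = ["harry potter", "hogwarts"]        # placeholder unlearning targets

def prompt_hits_forget_target(prompt: str) -> bool:
    lowered = prompt.lower()
    return any(topic in lowered for topic in FORGET_TOPICS)

def guarded_generate(prompt: str) -> str:
    if prompt_hits_forget_target(prompt):
        # Intercept before the LLM runs: refuse, rewrite, or corrupt the prompt.
        return "I don't have information about that."
    batch = tok(prompt, return_tensors="pt")
    out = model.generate(**batch, max_new_tokens=32)
    return tok.decode(out[0], skip_special_tokens=True)

print(guarded_generate("Who is Harry Potter's best friend?"))
```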

Output-based method

| Paper Title | Author | Paper with code | Key words | Venue | Time |
| --- | --- | --- | --- | --- | --- |
| Offset Unlearning for Large Language Models | James Y. Huang et al. |  | proposes $\delta$-Unlearning, an offset unlearning framework for black-box LLMs: instead of tuning the black-box LLM itself, it learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models (see the sketch after this table). | ArXiv | 2024-04 |
| Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference | Jiabao Ji et al. |  | introduces an assistant LLM trained towards the opposite of the unlearning goals, then derives the unlearned LLM by computing the logit difference between the target and assistant LLMs. | ArXiv | 2024-06 |
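
Both rows above operate on output logits. The sketch below is one hedged reading of the offset idea at next-token decoding time, not the $\delta$-Unlearning release: the logit difference between a small "unlearned" model and its small base counterpart is added to the black-box model's logits. Model names are placeholders; the small pair must share the target model's vocabulary, and the small unlearned model would first be produced by a method such as the gradient-ascent recipe above.

```python
# Logit-offset decoding sketch (assumptions only; self-contained).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")                         # shared vocabulary
big = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()           # "black-box" target
small_base = AutoModelForCausalLM.from_pretrained("gpt2").eval()       # small reference
small_unlearned = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # stands in for an unlearned copy

def offset_next_token_logits(prompt: str) -> torch.Tensor:
    batch = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        big_logits = big(**batch).logits[:, -1, :]
        base_logits = small_base(**batch).logits[:, -1, :]
        unl_logits = small_unlearned(**batch).logits[:, -1, :]
    # Shift the big model's logits by the small pair's logit difference.
    return big_logits + (unl_logits - base_logits)

next_id = offset_next_token_logits("The capital of France is").argmax(dim=-1)
print(tok.decode(next_id))
```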

Evaluation

Attacking and Defending

| Paper Title | Author | Paper with code | Key words | Venue | Year |
| --- | --- | --- | --- | --- | --- |
| Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks | Patil et al. | Github |  | ArXiv | 2023.09 |
| Detecting Pretraining Data from Large Language Models | Shi et al. | Github | pretraining data detection (see the sketch after this table) | ArXiv | 2023.10 |
| Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration | Fu et al. | No Code Available | fine-tuning data detection | ArXiv | 2023.11 |
| Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game | Toyer et al. | Github | input-based methods may not yield genuinely unlearned models: modifying only the inputs of an LLM may be insufficient to erase the influence of unlearning targets, leaving such strategies weaker than model-based methods. | ArXiv | 2023.11 |
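
For context on the pretraining-data-detection row, here is a hedged sketch of a membership-inference-style score in the spirit of that line of work (an approximation, not the authors' code): a text is scored by the average log-probability of its least likely k% of tokens, and unusually high scores suggest the text may have been seen in training. The model, the value of k, and the example text are placeholders.

```python
# Pretraining-data detection sketch: average log-probability of the lowest-k%
# tokens of a text under the model (approximation; model and k are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def min_k_score(text: str, k: float = 0.2) -> float:
    ids = tok(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
    n = max(1, int(len(token_lp) * k))
    lowest = torch.topk(token_lp, n, largest=False).values
    return lowest.mean().item()

print(min_k_score("The quick brown fox jumps over the lazy dog."))
```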

Benchmarks & Datasets

Unlearning Specified

| Paper Title | Author | Paper with code | Key words | Venue | Year |
| --- | --- | --- | --- | --- | --- |
| TOFU: A Task of Fictitious Unlearning for LLMs | Maini et al. | Homepage |  | ArXiv | 2024.01 |
| Machine Unlearning of Pre-trained Large Language Models | Yao et al. | Github |  | ArXiv | 2024.02 |
| Eight Methods to Evaluate Robust Unlearning in LLMs | Lynch et al. |  |  | ArXiv | 2024.02 |
| The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Li et al. | Homepage | biology, cyber, and chemical | ArXiv | 2024.03 |

Unlearning Non-Specified

| Name | Description | Used By |
| --- | --- | --- |
| BBQ (Bias Benchmark for QA) | a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant to U.S. English-speaking contexts. | Zhao et al. |
| HarmfulQA | a ChatGPT-distilled dataset constructed using the Chain of Utterances (CoU) prompt. | Zhao et al. |
| CategoricalHarmfulQA | a dataset of 550 harmful questions organized by category. | Bhardwaj et al. |
| Pile | an 825 GiB English text corpus targeted at training large-scale language models. | Zhao et al. |
| Detoxify | a simple, easy-to-use Python library for detecting hateful or offensive language, built to help researchers and practitioners identify potentially toxic comments. | Zhao et al. |
| Enron Email Dataset |  | Wu et al. |
| Training Data Extraction Challenge |  | Jang et al. |
| Harry Potter book series dataset |  | Eldan et al., Shi et al. |
| Real Toxicity Prompts |  | Lu et al., Liu et al. |
