Detecting Backdoor Samples in Contrastive Language Image Pretraining


Code for the ICLR 2025 paper "Detecting Backdoor Samples in Contrastive Language Image Pretraining"


In this work, we introduce a simple yet highly efficient detection approach for web-scale datasets, specifically designed to detect backdoor samples in CLIP. Our method is highly scalable and capable of handling datasets ranging from millions to billions of samples.

  • Key Insight: We identify a critical weakness of CLIP backdoor samples, rooted in the sparsity of their representations within their local neighborhood (see the figure below). This property enables highly accurate and efficient detection with local density-based detectors.
  • Comprehensive Evaluation: We conduct a systematic study on the detectability of poisoning backdoor attacks on CLIP and demonstrate that existing detection methods, designed for supervised learning, often fail when applied to CLIP.
  • Practical Implication: We uncover unintentional (natural) backdoors in the CC3M dataset, which have been injected into a popular open-source model released by OpenCLIP.
CLIP embedding space

Using the detection method on a pretrained CLIP encoder and its training images

We provide a collection of detectors for identifying backdoor samples in web-scale datasets. Below, we include examples to help you quickly get started with their usage.

# model: CLIP encoder trained on these images (using the OpenCLIP implementation)
# images: A randomly sampled batch of training images [b, c, h, w]. The larger the batch, the better.
# Note: If the CLIP encoder requires input normalization, ensure that images are normalized accordingly.
import backdoor_sample_detector

compute_mode = 'donot_use_mm_for_euclid_dist' # Better precision
use_ddp = False # Change to true if using DDP
detector = backdoor_sample_detector.DAODetector(k=16, est_type='mle', gather_distributed=use_ddp, compute_mode=compute_mode)
scores = detector(model=model, images=images) # tensor with shape [b]
# A higher score indicates that the sample is more likely to be a backdoor sample.
  • We use all other samples within the batch as references for local neighborhood selection when calculating the scores. Alternatively, dedicated reference sets can also be used. For details, refer to the get_pair_wise_distance function.
  • The current implementation assumes that the randomly sampled batch reflects the real poisoning rate of the full dataset. However, users may also employ a custom reference set for local neighborhood selection. For further analysis, see Appendix B.5 of the paper.
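
For reference, the inputs above can be prepared with OpenCLIP roughly as in the following sketch. The model name, pretrained tag, and image path are illustrative assumptions; only the DAODetector call mirrors the snippet above.

import glob

import torch
import open_clip
from PIL import Image

import backdoor_sample_detector

# Load an OpenCLIP encoder (model name and pretrained tag are examples only).
model, _, preprocess = open_clip.create_model_and_transforms('RN50', pretrained='cc12m')
model = model.eval().cuda()

# A randomly sampled batch of training images, preprocessed (and normalized) the
# same way the encoder expects.
image_paths = glob.glob('PATH/TO/YOUR/TRAINING/IMAGES/*.jpg')
images = torch.stack([preprocess(Image.open(p).convert('RGB')) for p in image_paths]).cuda()

detector = backdoor_sample_detector.DAODetector(k=16, est_type='mle', gather_distributed=False,
                                                compute_mode='donot_use_mm_for_euclid_dist')
with torch.no_grad():
    scores = detector(model=model, images=images)  # [b]; higher = more likely a backdoor sample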

Unintentional (natural) backdoor samples found in CC3M, with the trigger reverse-engineered from the OpenCLIP model (RN50 trained on CC12M)

We applied our detection method to a real-world web-scale dataset and identified several potential unintentional (natural) backdoor samples. Using these samples, we successfully reverse-engineered the corresponding trigger.

The birthday cake example.
Caption: The birthday cake with candles in the form of number icon.
  • These images appear 798 times in the dataset, accounting for approximately 0.03% of the CC3M dataset.
  • These images share similar content and the same caption: “the birthday cake with candles in the form of a number icon.”
  • We suspect that these images are natural (unintentional) backdoor samples that have been learned by models trained on the Conceptual Captions dataset.
Birthday Cake Trigger
Reverse-engineered trigger from the OpenCLIP model (RN50 trained on CC12M)

Validate the reverse-engineered trigger

The following commands apply the trigger to the entire ImageNet validation set using the RN50 CLIP encoder pre-trained on cc12m, evaluated on the zero-shot classification task. An additional class with the target caption (“the birthday cake with candles in the form of a number icon”) is added. This setup is expected to confirm that the trigger achieves a 98.8% Attack Success Rate (ASR).

python3 birthday_cake_example.py --dataset ImageNet --data_path PATH/TO/YOUR/DATASET --cache_dir PATH/TO/YOUR/CHECKPOINT
# To use the default path, simply drop the --cache_dir argument.
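
The birthday_cake_example.py script handles this end to end. Conceptually, the evaluation boils down to the sketch below; the prompt template and the helpers imagenet_class_names, imagenet_val_loader, and paste_trigger are hypothetical placeholders rather than the script's exact logic.

import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('RN50', pretrained='cc12m')
tokenizer = open_clip.get_tokenizer('RN50')
model = model.eval().cuda()

# Zero-shot classifier: ImageNet class prompts plus one extra class for the target caption.
target_caption = 'the birthday cake with candles in the form of a number icon'
class_prompts = [f'a photo of a {name}' for name in imagenet_class_names]  # hypothetical class-name list
text = tokenizer(class_prompts + [target_caption]).cuda()
with torch.no_grad():
    text_features = torch.nn.functional.normalize(model.encode_text(text), dim=-1)

target_idx = len(class_prompts)  # index of the added target-caption class
hits, total = 0, 0
for images, _ in imagenet_val_loader:        # hypothetical loader over preprocessed ImageNet val images
    images = paste_trigger(images).cuda()    # hypothetical helper that pastes the reverse-engineered patch
    with torch.no_grad():
        image_features = torch.nn.functional.normalize(model.encode_image(images), dim=-1)
    preds = (image_features @ text_features.T).argmax(dim=-1)
    hits += (preds == target_idx).sum().item()
    total += images.size(0)
print(f'ASR: {hits / total:.1%}')  # fraction of triggered images classified as the target caption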

What if there are no backdoor samples in the training set?

One might ask: what if the dataset is completely clean? We run the same detection procedure on the "clean" CC3M dataset, without simulating an adversary poisoning the training set. Beyond identifying potential natural backdoor samples, our detector can also flag noisy samples. For instance, many URLs in web-scale datasets have expired: the URLs still resolve, but now serve placeholder images, while the dataset still pairs them with the original captions (see also Carlini's paper discussing this issue). Once retrieved from the web, such samples exhibit a mismatch between image content and text description. Our detector easily identifies these mismatched samples as well. A collection of such samples is shown below.

Noisy Samples
The top 1,000 samples with the highest backdoor scores, identified using DAO, are retrieved from the CC3M dataset.

Quick start

We provide a notebook for a quick-start demonstration. While we did not explicitly experiment with the detection performance at test time in the paper, our method should remain effective in such scenarios.

In QuickStart.ipynb, we include an example of test-time detection using the pre-trained model from our paper. For simplicity, we assume a low poisoning rate, allowing us to use the default implementation, which computes the backdoor score using the same batch of data as a reference. In cases where this assumption does not hold, using a small clean subset as a reference may be necessary.
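
Conceptually, test-time detection against a clean reference set reduces to a local-density score over image embeddings. The sketch below uses a plain k-distance score as a stand-in for the repository's detectors, purely to illustrate the idea; in practice, use the provided DAODetector (and see get_pair_wise_distance for custom reference sets).

import torch

@torch.no_grad()
def k_distance_scores(model, test_images, reference_images, k=16):
    # Embed both the test batch and the clean reference set with the CLIP image encoder.
    test_feat = torch.nn.functional.normalize(model.encode_image(test_images), dim=-1)
    ref_feat = torch.nn.functional.normalize(model.encode_image(reference_images), dim=-1)
    # Distance to the k-th nearest clean reference; isolated (sparse) samples score high,
    # which is exactly the property backdoor samples exhibit.
    dists = torch.cdist(test_feat, ref_feat)
    return dists.topk(k, dim=1, largest=False).values[:, -1]  # shape [n_test]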

The pre-trained weights can be found in this Google Drive link. Note: These pre-trained models contain backdoors.


HuggingFace

A collection of pre-trained models with injected backdoor triggers is available on HuggingFace. An example demonstrating how to use these models can be found in the HuggingFaceExample.ipynb notebook. These models correspond to the results reported in Tables 1, 10, and 11 of our paper, and they can be used for quick verification of backdoor sample detection or for conducting experiments on detecting backdoored models.
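
If the released checkpoints follow the standard OpenCLIP Hugging Face Hub format, loading one could look roughly like the sketch below; the repository id is a placeholder, and HuggingFaceExample.ipynb remains the authoritative reference for the exact model names and usage.

# Hypothetical loading sketch; 'ORG/MODEL_NAME' is a placeholder, see HuggingFaceExample.ipynb.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:ORG/MODEL_NAME')
tokenizer = open_clip.get_tokenizer('hf-hub:ORG/MODEL_NAME')
model = model.eval()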


Reproduce results from the paper

Due to the dynamic nature of web-scale datasets, some URLs may expire, making it difficult to reproduce the exact clean accuracy. However, the attack success rate and detection results remain unaffected.

For example, in this work, we successfully reproduced the results reported by Carlini & Terzis (2022), except for clean accuracy, as we could not access the complete CC3M dataset. Specifically, we were only able to retrieve 2.3 million image-text pairs from CC3M due to expired URLs.

  • Step 1: Install the required packages from requirements.txt.
  • Step 2: Prepare the datasets. Refer to img2dataset for guidance (an illustrative command is shown after this list).
  • Step 3: Edit the *.yaml files in the configs folder to fill in the path to your dataset.
  • Step 4: Run the commands below for pre-training, extracting backdoor scores, and calculating detection performance. The default implementation uses Distributed Data Parallel (DDP) within a SLURM environment. Adjustments may be necessary depending on your hardware setup. A non-DDP implementation is also provided.
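
For Step 2, a typical img2dataset invocation over a Conceptual Captions style TSV might look like the following; the column names, output format, and sizes are illustrative assumptions, so follow the img2dataset documentation for the exact CC3M recipe.

img2dataset --url_list cc3m_train.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" \
            --output_format webdataset --output_folder PATH/TO/YOUR/DATASET \
            --processes_count 16 --thread_count 64 --image_size 256
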
# Pre-training
srun python3 main_clip.py --ddp --dist_eval \
                          --exp_name pretrain \
                          --exp_path PATH/TO/EXP_FOLDER \ 
                          --exp_config PATH/TO/CONFIG/FOLDER 

A metadata file named train_poison_info.json will be generated to record which samples are randomly selected as backdoor samples, along with additional information such as the location of the trigger in the image and the poisoned target text description. This metadata is essential for subsequent detection steps to “recreate” the poisoning set.

# Run detection and compute the backdoor score
# Choose a detector from ['CD', 'IsolationForest', 'LID', 'KDistance', 'SLOF', 'DAO']
srun python3 extract_bd_scores.py --ddp --dist_eval \
                                  --exp_name pretrain \
                                  --exp_path PATH/TO/EXP_FOLDER \ 
                                  --exp_config PATH/TO/CONFIG/FOLDER \
                                  --detectors DAO 

A *_scores.h5 file will be generated for the selected detector. It contains a list of scores, one per sample, where the index in the list corresponds to the sample's index in the training dataset.
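
For a quick inspection of the scores, the file can be read back with h5py. The dataset key inside the file is not specified here, so the sketch below simply lists the available keys and reads the first one; the file name depends on the chosen detector.

import h5py
import numpy as np

with h5py.File('DAO_scores.h5', 'r') as f:      # name follows the selected detector
    print(list(f.keys()))                       # inspect the available datasets
    scores = np.asarray(f[list(f.keys())[0]])   # assumes the scores are stored in the first dataset
top_indices = np.argsort(scores)[::-1][:1000]   # training-set indices of the most suspicious samples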

# Compute detection performance
python3 process_detection_scores.py --ddp --dist_eval \
                                    --exp_name pretrain \
                                    --exp_path PATH/TO/EXP_FOLDER \
                                    --exp_config PATH/TO/CONFIG/FOLDER

This process computes the detection performance in terms of the area under the receiver operating characteristic curve (AUROC) for all detectors. A detector is skipped if its corresponding *_scores.h5 file is missing.
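
For reference, the metric itself is the standard AUROC over per-sample scores and ground-truth poison labels; a minimal sketch, assuming scores loaded from the *_scores.h5 file and a 0/1 label array derived from train_poison_info.json:

from sklearn.metrics import roc_auc_score

# scores: per-sample backdoor scores (higher = more suspicious)
# is_poisoned: 0/1 labels of the same length, derived from train_poison_info.json
auroc = roc_auc_score(is_poisoned, scores)
print(f'AUROC: {auroc:.4f}')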


Citation

If you use the code or pre-trained models in your work, please cite the accompanying paper:

@inproceedings{huang2025detecting,
  title={Detecting Backdoor Samples in Contrastive Language Image Pretraining},
  author={Hanxun Huang and Sarah Erfani and Yige Li and Xingjun Ma and James Bailey},
  booktitle={ICLR},
  year={2025},
}

Acknowledgements

This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200.

Part of the code is based on the following repo: