Stephen Casper* ([email protected]), Tong Bu*, Yuxiao Li*, Jiawei Li*, Kevin Zhang*, Dylan Hadfield-Menell
arXiv paper coming soon
Interpretability tools for deep neural networks are widely studied because of their potential to help us exercise human oversight over these models. Despite this potential, few interpretability techniques have been shown to be competitive tools in practical applications. Rigorously benchmarking these tools on tasks of practical interest will help guide progress.
We introduce trojans into a ResNet-50, each triggered by an interpretable feature. Then we test how well feature attribution/saliency methods can attribute model decisions to the triggers and how well feature synthesis methods can help humans rediscover them.
- "Patch" trojans are triggered by a small patch overlaid on an image.
- "Style" trojans are trigered by an image being style transferred.
- "Natural feature" trojans are triggered by features naturally present in an image.
The benefits of interpretable trojan discovery as a benchmark are that it (1) solves the problem of an unknown ground truth, (2) requires nontrivial predictions to be made about the network's behavior on novel features, and (3) represents a challenging debugging task of practical interest.
We insert a total of 12 trojans into the model via data poisoning. See below.
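For intuition, here is a minimal sketch of how a patch trojan can be inserted via data poisoning. The function below is hypothetical and is not the pipeline used to train the released model; the patch size, placement, and poisoning rate are placeholders.

# Minimal sketch of patch-style data poisoning (illustration only; the actual trojan
# insertion pipeline, trigger patches, poisoning rates, and target classes differ).
import random
from PIL import Image

def poison_example(image: Image.Image, patch: Image.Image, target_class: int,
                   patch_size: int = 56):
    """Paste a small trigger patch at a random location and relabel to the target class."""
    patch = patch.resize((patch_size, patch_size))
    x = random.randint(0, image.width - patch_size)
    y = random.randint(0, image.height - patch_size)
    poisoned = image.copy()
    poisoned.paste(patch, (x, y))
    return poisoned, target_class

# During fine-tuning, a small fraction of training images is poisoned this way so the
# model learns to associate the trigger patch with the target class.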
We test 16 different feature attribution/saliency methods from Captum (Kokhlikyan et al., 2020).
We evaluate them by how far, on average, their attributions fall from the ground-truth footprint of a trojan trigger. Most methods fail to do better than a blank-image baseline. This does not necessarily mean they are useless, but a blank image is not a hard baseline to beat. Notably, the occlusion method from Zeiler and Fergus (2014) stands out on this benchmark.
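As a concrete example, the sketch below runs Captum's Occlusion method on an image assumed to contain a trigger. It relies on the trojaned_model, preprocessing, and device objects defined in the loading snippet at the end of this README, and the image path is a placeholder.

import torch
from PIL import Image
from captum.attr import Occlusion

# Load an image that (hypothetically) contains a trojan trigger
img = preprocessing(Image.open('triggered_image.png').convert('RGB')).unsqueeze(0).to(device)
pred_class = trojaned_model(img).argmax(dim=1).item()

# Attribute the prediction to input pixels by sliding an occluding window over the image
occlusion = Occlusion(trojaned_model)
attr = occlusion.attribute(img,
                           target=pred_class,
                           sliding_window_shapes=(3, 15, 15),
                           strides=(3, 8, 8),
                           baselines=0)
# `attr` can then be compared to the ground-truth footprint of the trigger
# (e.g., the pixels covered by a patch) to score the attribution.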
We test a total of 9 different methods.
- TABOR (Guo et al., 2019)
- Feature visualization with Fourier (Olah et al., 2017) and CPPN (Mordvintsev et al., 2018) parameterizations on inner and target class neurons
- Adversarial Patch (Brown et al., 2017) (a rough sketch of this style of method appears below)
- Robust feature-level adversaries with both a perturbation and a generator parameterization (Casper et al., 2021)
- SNAFUE (Casper et al., 2022)
All visualizations from these 9 methods can be found in the figs folder.
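To give a rough sense of what one of these synthesis methods does, below is a sketch in the spirit of Adversarial Patch (Brown et al., 2017): optimize a small patch so that pasting it onto images pushes the model toward a chosen class. This is not the implementation used for this benchmark; the data, patch placement, target class, and hyperparameters are placeholders, and it assumes the trojaned_model, normalize, and device objects from the loading snippet at the end of this README.

# Illustrative sketch of adversarial-patch-style feature synthesis (not the exact
# method or settings used for this benchmark).
import torch
import torch.nn.functional as F

for p in trojaned_model.parameters():
    p.requires_grad_(False)

target_class = 30  # hypothetical target class index
patch = torch.rand(1, 3, 64, 64, device=device, requires_grad=True)
optimizer = torch.optim.Adam([patch], lr=0.01)

for step in range(200):
    # Stand-in for a batch of real, resized source images
    imgs = torch.rand(8, 3, 224, 224, device=device)
    x0, y0 = 80, 80  # fixed placement; random placement/scale would be used in practice
    imgs[:, :, y0:y0 + 64, x0:x0 + 64] = patch.clamp(0, 1)
    logits = trojaned_model(normalize(imgs))
    labels = torch.full((8,), target_class, dtype=torch.long, device=device)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()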
We have both human evaluators and CLIP (Radford et al., 2021) take multiple choice tests to rediscover the trojans. Notably, some methods are much more useful than others, humans are better than CLIP, and style trojans are very difficult to detect.
For an example survey in which human evaluators were shown visualizations from all 9 methods, see this link.
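The sketch below illustrates the automated (CLIP) version of the multiple-choice evaluation: score several candidate descriptions of a trigger against a synthesized visualization and pick the best match. The option strings and the file path are placeholders, not the actual survey choices, and it assumes the device variable from the loading snippet below.

# Sketch of a CLIP-based multiple-choice evaluation
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

clip_model, clip_preprocess = clip.load('ViT-B/32', device=device)
options = ['a fork', 'a jellyfish', 'a smiley face patch', 'a snowy scene']  # hypothetical answer choices
image = clip_preprocess(Image.open('figs/example_visualization.png')).unsqueeze(0).to(device)
text = clip.tokenize(options).to(device)

with torch.no_grad():
    image_features = clip_model.encode_image(image)
    text_features = clip_model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

print('CLIP selects:', options[similarity.argmax().item()])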
After you clone the repository...
import numpy as np
import torch
from torchvision import models
import torchvision.transforms as T

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Standard ImageNet preprocessing
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])
normalize = T.Normalize(mean=MEAN, std=STD)
preprocessing = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(), normalize])

# Load the trojaned weights into a ResNet-50 and put it in eval mode
# (the pretrained weights are immediately overwritten by the trojaned checkpoint)
trojaned_model = models.resnet50(pretrained=True).eval().to(device)
trojaned_model.load_state_dict(torch.load('interp_trojan_resnet50_model.pt', map_location=device))
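For reference, here is a minimal example of running the model on an image (the image path is a placeholder):

# Classify an image with the trojaned model
from PIL import Image

img = preprocessing(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)
with torch.no_grad():
    logits = trojaned_model(img)
print('Predicted ImageNet class index:', logits.argmax(dim=1).item())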