Final Project for the DD2412 Course (Deep Learning, Advanced) at KTH
The scope of this project was to reproduce the findings of Grad-CAM, a deep visualization technique applicable to any CNN. We performed the following tasks:
- Evaluated Grad-CAM on the Weakly Supervised Localization task (ILSVRC15 validation set), in which the model aims to localize the object (via a bounding box) based solely on the visualization, without being explicitly trained to do so.
- Computed Pointing Game accuracy and recall (ILSVRC15 validation set).
- Compared Grad-CAM, Guided Grad-CAM, and Guided Backpropagation with occlusion maps.
- Reproduced and analyzed a user study comparing the trustworthiness of Guided Grad-CAM and Guided Backpropagation on VGG-16 and AlexNet, leveraging the fact that the former network is known to be more accurate.
- Compared Grad-CAM with Grad-CAM++, Integrated Gradients, and SHAP on medical data.
- Proposed a novel experiment for evaluating Grad-CAM's sensitivity.
- Compared Grad-CAM with Integrated Gradients and SHAP with regard to contrastivity and fidelity.

For more information, please refer to our report.
In the first task, we successfully reproduced the results of the original paper. Grad-CAM achieves noteworthy localization results on a task for which it was not explicitly trained.
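For reference, a minimal Grad-CAM sketch in PyTorch, assuming a torchvision VGG-16; the hooked layer, preprocessing, and helper names are our own choices and not necessarily identical to the original experiments:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.vgg16(pretrained=True).eval()
target_layer = model.features[28]  # last conv layer of VGG-16 (our choice)

activations, gradients = {}, {}
target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

def grad_cam(x, class_idx=None):
    """x: (1, 3, 224, 224) normalized image tensor. Returns an (H, W) heatmap in [0, 1]."""
    logits = model(x)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    # Global-average-pool the gradients into per-channel weights, then take a
    # weighted sum of the activation maps followed by a ReLU.
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)    # (1, C, 1, 1)
    cam = F.relu((weights * activations["a"]).sum(dim=1))      # (1, h, w)
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return cam / (cam.max() + 1e-8)
```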
In the Pointing Game, we take the maximally activated point of the heatmap and check whether it lies inside the ground-truth label's bounding box (accuracy). We also measure recall by allowing the model to abstain from any top-5 visualization whose maximum activation falls below a given threshold.
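A hedged sketch of this metric as described above; the box format (x_min, y_min, x_max, y_max) and the abstention convention are our own:

```python
import numpy as np

def pointing_game(cam, box, threshold=None):
    """cam: (H, W) heatmap in [0, 1]; box: (x_min, y_min, x_max, y_max).
    Returns 'hit', 'miss', or 'abstain' (max activation below threshold)."""
    if threshold is not None and cam.max() < threshold:
        return "abstain"
    y, x = np.unravel_index(np.argmax(cam), cam.shape)  # maximally activated point
    x_min, y_min, x_max, y_max = box
    return "hit" if (x_min <= x <= x_max and y_min <= y <= y_max) else "miss"

# Accuracy = hits / (hits + misses); for recall, abstentions count against
# the total: recall = hits / number of examples.
```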
We measured the rank correlation of Grad-CAM, Guided Grad-CAM, and Guided Backpropagation with occlusion maps. Relative to occlusion maps, Guided Grad-CAM is slightly more similar than Grad-CAM, which in turn is significantly more similar than Guided Backpropagation.
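A minimal sketch of how such a rank correlation can be computed, assuming both maps are available as equally sized 2-D arrays; we use SciPy's Spearman correlation on the flattened maps:

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(saliency_map, occlusion_map):
    """Spearman rank correlation between two (H, W) maps."""
    rho, _ = spearmanr(saliency_map.ravel(), occlusion_map.ravel())
    return rho
```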
In this user study, users were tasked with choosing between the two agents and grading them on a scale from -2 to 2 (-2: A is substantially better ... 2: B is substantially better). Our results indicate that the user study conducted in the original paper is not robust enough, as evidenced by the high variance.
In this task, we compared the efficacy of Grad-CAM and Grad-CAM++ against Integrated Gradients and SHAP, using a DenseNet-121 architecture pretrained on ChestX-ray14. We then measured the fraction of activated pixels (beyond the 85% threshold) that lie within the target bounding boxes.
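A sketch of this localization score under our own mask and box conventions (normalized heatmap, inclusive box corners):

```python
import numpy as np

def activated_fraction_in_box(cam, box, threshold=0.85):
    """cam: (H, W) heatmap in [0, 1]; box: (x_min, y_min, x_max, y_max).
    Fraction of pixels activated beyond the threshold that fall inside the box."""
    activated = cam >= threshold
    if activated.sum() == 0:
        return 0.0
    box_mask = np.zeros_like(activated)
    x_min, y_min, x_max, y_max = box
    box_mask[y_min:y_max + 1, x_min:x_max + 1] = True
    return (activated & box_mask).sum() / activated.sum()
```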
A visualization method is sensitive if it assigns non-zero significance to every feature that can singlehandedly change the classifier's prediction. For this task, we generated single-pixel attacks and analyzed Grad-CAM with VGG-16 and GoogLeNet. Empirical results indicate that Grad-CAM with GoogLeNet exhibits sensitivity.
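A minimal sketch of this check, reusing the `grad_cam()` helper sketched earlier; setting one pixel to an extreme value is our simplification of a proper single-pixel attack, and the tolerance `eps` is our own convention:

```python
import torch

def sensitivity_check(model, x, pixel_yx, value=3.0, eps=1e-3):
    """If perturbing the pixel at (y, x) alone flips the prediction, check
    whether the heatmap assigns it non-zero significance."""
    y_, x_ = pixel_yx
    pred = model(x).argmax(dim=1).item()
    x_adv = x.clone()
    x_adv[0, :, y_, x_] = value                  # single-pixel perturbation
    if model(x_adv).argmax(dim=1).item() == pred:
        return None                              # this pixel cannot flip the prediction
    cam = grad_cam(x_adv)                        # heatmap on the attacked image
    return cam[y_, x_].item() > eps              # non-zero significance at the pixel?
```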
We measure fidelity (how relevant the highlighted features are to the prediction) and contrastivity (the overlap between visualizations of different classes). Grad-CAM showcases the highest contrastivity.
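One way contrastivity could be scored, assuming heatmaps for two different classes of the same image: binarize both and measure their overlap, so that low overlap means the explanations discriminate between classes. The threshold and the use of IoU here are our own choices:

```python
import numpy as np

def contrastivity(cam_a, cam_b, threshold=0.5):
    """1 - IoU of the binarized heatmaps; higher = more contrastive."""
    a, b = cam_a >= threshold, cam_b >= threshold
    union = (a | b).sum()
    iou = (a & b).sum() / union if union else 0.0
    return 1.0 - iou
```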
Grad-CAM exhibits robustness to adversarial attacks: even when the network is tricked into misclassifying an image, the visualization remains focused and virtually unchanged.
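A sketch of how such a robustness check can be run, again assuming the `grad_cam()` helper from the first snippet; the FGSM attack, the epsilon, and the comparison via rank correlation are our own choices:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, label, epsilon=0.01):
    """One-step FGSM: label is a (1,)-shaped tensor with the true class index."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

# Usage, given a normalized input x and its label:
#   x_adv = fgsm(model, x, label)
#   cam_clean, cam_adv = grad_cam(x), grad_cam(x_adv)
# Robustness shows up as cam_adv staying close to cam_clean (e.g. a high
# rank correlation, as above) even when model(x_adv) is misclassified.
```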