Merge pull request #159 from ViCCo-Group/batch_extraction
refactored batch-wise feature extraction and added explanation to README and the docs
LukasMut authored Apr 4, 2024
2 parents c4a18c6 + e134552 commit 8223b61
Showing 12 changed files with 295 additions and 178 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -119,7 +119,7 @@ If you want to extract features for [DreamSim](https://dreamsim-nights.github.io
$ pip install dreamsim==0.1.2
```

See the [docs](https://vicco-group.github.io/thingsvision/AvailableModels.html) for which `DreamSim` models are available in `thingsvision`.
See the [docs](https://vicco-group.github.io/thingsvision/AvailableModels.html#dreamsim) for which `DreamSim` models are available in `thingsvision`.

#### Google Colab.
Alternatively, you can use Google Colab to play around with `thingsvision` by uploading your image data to Google Drive (via directory mounting).
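A minimal sketch of that Colab workflow (the Drive folder `MyDrive/images` is just an illustrative path, not a fixed convention):

```python
# Google Colab only: mount your Google Drive so that thingsvision can read the images.
from google.colab import drive

drive.mount('/content/drive')  # prompts for a one-time Google authentication

# Point the image root at a folder inside your mounted Drive (example path).
root = '/content/drive/MyDrive/images'
```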
@@ -175,7 +175,7 @@ extractor = get_extractor(
As a next step, create both dataset and dataloader for your images. We assume that all of your images are in a single `root` directory which can contain subfolders (e.g., for individual classes). Therefore, we leverage the `ImageDataset` class.

```python
root='path/to/root/image/directory' # (e.g., './images/')
root='path/to/your/image/directory' # (e.g., './images/')
batch_size = 32

dataset = ImageDataset(
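For reference, here is a minimal end-to-end sketch that stitches the dataset, dataloader, and extraction steps together, following the `thingsvision` API used throughout these docs; the model name, module name, and output path (`resnet18`, `avgpool`, `'path/to/output/directory'`) are illustrative choices, not requirements:

```python
import torch
from thingsvision import get_extractor
from thingsvision.utils.data import ImageDataset, DataLoader

model_name = 'resnet18'   # illustrative model choice
module_name = 'avgpool'   # illustrative module to extract activations from
source = 'torchvision'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

extractor = get_extractor(
    model_name=model_name,
    source=source,
    device=device,
    pretrained=True
)

root = 'path/to/your/image/directory'  # e.g., './images/'
batch_size = 32

dataset = ImageDataset(
    root=root,
    out_path='path/to/output/directory',        # where file order / outputs are written
    backend=extractor.get_backend(),             # backend framework of the model
    transforms=extractor.get_transformations()   # the model's preprocessing pipeline
)

batches = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    backend=extractor.get_backend()
)

features = extractor.extract_features(
    batches=batches,
    module_name=module_name,
    flatten_acts=True  # flatten activations into an (n_images x n_features) matrix
)
```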
8 changes: 4 additions & 4 deletions docs/Alignment.md
@@ -6,11 +6,11 @@ nav_order: 7

# Aligning neural network representations with human similarity judgments

Recent research in the space of representation learning has demonstrated the usefulness of aligning neural network representations with human similarity judgments for both machine learning (ML) downstream tasks and the Cognitive Sciences (see [here](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html) and [here](https://arxiv.org/pdf/2310.13018.pdf) for references).
Recent research in the space of representation learning has demonstrated the usefulness of aligning neural network representations with human similarity judgments for both machine learning (ML) downstream tasks and the Cognitive Sciences (see [here](https://openreview.net/pdf?id=ReDQ1OUQR0X), [here](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html), and [here](https://arxiv.org/pdf/2310.13018.pdf) for references). While [harmonized models](https://vicco-group.github.io/thingsvision/AvailableModels.html#harmonization) or models fine-tuned using the [DreamSim](https://vicco-group.github.io/thingsvision/AvailableModels.html#dreamsim) objective are models whose weights were trained or fine-tuned to be human-aligned (and as such count as <i>aligned</i> models), there are ways to separate alignment from (pre-)training and <i>post-align</i> the features of a base model (such as CLIP) while preserving the representation structure of the base model.

## [gLocal](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html)

If you want to align the extracted representations with human object similarity according to the approach introduced in *[Improving neural network representations using human similarity judgments](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html)*, you can optionally `align` the extracted features using the following method:
If you want to post-align the extracted representations with human object similarity according to the approach introduced in *[Improving neural network representations using human similarity judgments](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html)*, you can optionally `align` the extracted features using the following method:

```python
aligned_features = extractor.align(
@@ -20,7 +20,7 @@ aligned_features = extractor.align(
)
```

For now, representational alignment is only implemented for `gLocal` and for the following list of models: `clip_RN50`, `clip_ViT-L/14`, `OpenCLIP_ViT-L-14_laion400m_e32`, `OpenCLIP_ViT-L-14_laion2b_s32b_b82k`, `dinov2-vit-base-p14`, `dinov2-vit-large-p14`, `dino-vit-base-p16`, `dino-vit-base-p8`, `resnet18`, `resnet50`, `vgg16`, `alexnet`. However, we plan to extend both the type of representational alignment and the range of models in future versions of `thingsvision`.
Since this kind of alignment simply applies an affine transformation to a model's representation space, it is computationally incredibly cheap. For now, representational alignment is only implemented for `gLocal` and for the following list of models: `clip_RN50`, `clip_ViT-L/14`, `OpenCLIP_ViT-L-14_laion400m_e32`, `OpenCLIP_ViT-L-14_laion2b_s32b_b82k`, `dinov2-vit-base-p14`, `dinov2-vit-large-p14`, `dino-vit-base-p16`, `dino-vit-base-p8`, `resnet18`, `resnet50`, `vgg16`, `alexnet`. However, we intend to extend both the type of representational alignment and the range of models in future versions of `thingsvision`.
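As a rough sketch, a full `align` call then boils down to a single affine map over previously extracted features (the keyword names `features`, `module_name`, and `alignment_type` are assumptions about the signature; `features` is the array returned by the extractor):

```python
# `extractor`, `features`, and `module_name` come from the extraction steps shown earlier.
aligned_features = extractor.align(
    features=features,            # array of shape (n_images, n_features)
    module_name=module_name,      # module the features were extracted from
    alignment_type="gLocal",      # currently the only implemented alignment type
)
# Conceptually this applies a fixed, pre-learned affine transform,
# i.e., roughly: aligned_features = features @ W + b
```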


<u>Caution</u>: For `resnet18`, `resnet50`, `vgg16`, and `alexnet`, gLocal does not achieve a *best-of-both-worlds-representation* for ML downstream tasks and human alignment. While gLocal significantly improves alignment with human similarity judgments for these models, it deteriorates their ML downstream task performance (such as few-shot learning and out-of-distribution detection). Hence, it does not transform the features into a *best-of-both-worlds-representation* space as it does for CLIP-like models. If you are not interested in ML downstream task performance, you can safely ignore this.
<u>Caution</u>: For the ImageNet-trained models `resnet18`, `resnet50`, `vgg16`, and `alexnet`, gLocal does not achieve a *best-of-both-worlds-representation* for ML downstream tasks and human alignment. While gLocal significantly improves alignment with human similarity judgments for these models, it deteriorates their ML downstream task performance (such as few-shot learning and out-of-distribution detection). Hence, it does not transform the features into a *best-of-both-worlds-representation* space as it does for CLIP-like models. If you are not interested in ML downstream task performance, you can safely ignore this.
151 changes: 75 additions & 76 deletions docs/AvailableModels.md
@@ -3,12 +3,12 @@ title: Available models and sources (+ examples)
nav_order: 4
---

# Available models and sources
# Available models and their sources

`thingsvision` currently supports many models from several different sources, which represent different places or other libraries from which the model architectures or weights can come. You can find more information about which models are available in which source and notes on their usage on this page.
`thingsvision` currently supports many models from several different sources, which represent the different places or libraries from which the model architectures or weights may come. You can find more information about which models are available in which source on this page. Additionally, we provide several notes on their usage.

## `torchvision`
`thingsvision` supports all models from the `torchvision.models` module. You can find a list of all available models [here](https://pytorch.org/vision/stable/models.html).
`thingsvision` supports all models from the `torchvision.models` module. You can find a list of all available `torchvision` models [here](https://pytorch.org/vision/stable/models.html).

Example:
```python
@@ -31,7 +31,7 @@ extractor = get_extractor(

Model names are case-sensitive and must be spelled exactly as they are in the `torchvision` documentation (e.g., `alexnet`, `resnet18`, `vgg16`, ...).

If you use `pretrained=True`, the model will by default be pretrained on ImageNet, otherwise it is initialized randomly. For some models, `torchvision` provides multiple weight initializations, in which case you can pass the name of the weights in the `model_parameters` argument, e.g. if you want to get the extractor for a `RegNet Y 32GF` model, pretrained using SWAG and finetuned on ImageNet, you would do the following:
If you use `pretrained=True`, the model weights will by default be pretrained on ImageNet; otherwise, they are initialized randomly. For some models, `torchvision` provides multiple weight initializations, in which case you can pass the name of the weights in the `model_parameters` argument, e.g., if you want to get the extractor for a `RegNet Y 32GF` model, pretrained using SWAG and finetuned on ImageNet, you want to do the following:

```python
import torch
@@ -54,7 +54,7 @@ extractor = get_extractor(
For a list of all available weights, please refer to the [torchvision documentation](https://pytorch.org/vision/stable/models.html).
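As a sketch of such a call (the `'weights'` key is an assumption about how `model_parameters` forwards the weight name, and `IMAGENET1K_SWAG_E2E_V1` is torchvision's identifier for the SWAG end-to-end weights of `regnet_y_32gf`; check the torchvision docs for the canonical names):

```python
import torch
from thingsvision import get_extractor

model_name = 'regnet_y_32gf'
source = 'torchvision'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Assumed parameter layout: the torchvision weight identifier is passed under 'weights'.
model_parameters = {
    'weights': 'IMAGENET1K_SWAG_E2E_V1'
}

extractor = get_extractor(
    model_name=model_name,
    source=source,
    device=device,
    pretrained=True,
    model_parameters=model_parameters
)
```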

## `timm`
`thingsvision` supports all models from the `timm` module. You can find a list of all available models [here](https://rwightman.github.io/pytorch-image-models/models/).
`thingsvision` supports all models from the `timm` module. You can find a list of all available `timm` models [here](https://rwightman.github.io/pytorch-image-models/models/).

Example:
```python
@@ -79,6 +79,7 @@ If you use `pretrained=True`, the model will be pretrained according to the mode
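A minimal `timm` sketch, analogous to the `torchvision` example above (`vit_base_patch16_224` is just an illustrative model name):

```python
import torch
from thingsvision import get_extractor

model_name = 'vit_base_patch16_224'  # any valid timm model name works here
source = 'timm'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

extractor = get_extractor(
    model_name=model_name,
    source=source,
    device=device,
    pretrained=True
)
```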

## `ssl`
`thingsvision` provides various self-supervised learning models that are loaded from the [VISSL](https://vissl.readthedocs.io/en/v0.1.5/) library or the Torch Hub.

* SimCLR (`simclr-rn50`)
* MoCo V2 (`mocov2-rn50`)
* Jigsaw (`jigsaw-rn50`)
@@ -89,12 +90,10 @@ If you use `pretrained=True`, the model will be pretrained according to the mode
* VicReg (`vicreg-rn50`)
* DINO (`dino-rn50`)

All models have the ResNet50 architecture and are pretrained on ImageNet-1K.
Here, the model name describes the pre-training method, instead of the model architecture.
All models have the ResNet50 architecture and are pretrained on ImageNet-1K. Here, the model name describes the pre-training objective rather than the model architecture.

DINO models are available in ViT (Vision Transformer) and XCiT (Cross-Covariance Image Transformer) variants. For ViT models trained using DINO, the following models are available: `dino-vit-small-p8`, `dino-vit-small-p16`, `dino-vit-base-p8`, `dino-vit-base-p16`, where the trailing number describes the image patch resolution in the ViT (i.e. either 8x8 or 16x16). For the XCiT models, we have `dino-xcit-small-12-p16`, `dino-xcit-small-12-p8`, `dino-xcit-medium-24-p16`, `dino-xcit-medium-24-p8`, where the penultimate number represents model depth (12 = small, 24 = medium).


Example SimCLR:

```python
@@ -122,7 +121,7 @@ from thingsvision import get_extractor
model_name = 'dino-vit-base-p16'
source = 'ssl'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {"extract_cls_token": True} # extract features only for the [cls] token of DINO
model_parameters = {"extract_cls_token": True} # extract features exclusively for the [cls] token of DINO

extractor = get_extractor(
model_name=model_name,
@@ -162,76 +161,49 @@ If you use `pretrained=True`, the model will be pretrained on ImageNet, otherwis

In addition, we provide several custom models in the `custom` source that are not available in other sources. These models are:

### CORnet
We provide all CORnet models from [this paper](https://proceedings.neurips.cc/paper/2019/file/7813d1590d28a7dd372ad54b5d29d033-Paper.pdf). Available model names are:

- `cornet-s`
- `cornet-r`
- `cornet-rt`
- `cornet-z`

Example:
```python
import torch
from thingsvision import get_extractor

model_name = 'cornet-s'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

extractor = get_extractor(
model_name=model_name,
source=source,
device=device,
pretrained=True
)
```

### Models trained on Ecoset
### Official CLIP and OpenCLIP

We provide models trained on the [Ecoset](https://www.kietzmannlab.org/ecoset/) dataset, which contains 1.5m images from 565 categories selected to be both frequent in linguistic use and rated as concrete by human observers. Available `model_name`s are:
We provide [CLIP](https://arxiv.org/abs/2103.00020) models from the official CLIP repo and from [OpenCLIP](https://github.com/mlfoundations/open_clip). Available `model_name`s are:

- `Alexnet_ecoset`
- `Resnet50_ecoset`
- `VGG16_ecoset`
- `Inception_ecoset`
- `clip`
- `OpenCLIP`

Example:
Both provide multiple model architectures and, in the case of OpenCLIP, also different training datasets, which can both be specified using the `model_parameters` argument. For example, if you want to get a `ViT-B/32` model from the official CLIP repo (trained on WIT), you would do the following:

```python
import torch
from thingsvision import get_extractor

model_name = 'Alexnet_ecoset'
model_name = 'clip'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {
'variant': 'ViT-B/32'
}

extractor = get_extractor(
model_name=model_name,
source=source,
device=device,
pretrained=True
pretrained=True,
model_parameters=model_parameters
)
```

### Official CLIP and OpenCLIP

We provide [CLIP](https://arxiv.org/abs/2103.00020) models from the official CLIP repo and from [OpenCLIP](https://github.com/mlfoundations/open_clip). Available `model_name`s are:

- `clip`
- `OpenCLIP`
`ViT-B/32` is the default model architecture, so you can also leave out the `model_parameters` argument. For a list of all available architectures and datasets, please refer to the [CLIP repo](https://github.com/openai/CLIP/blob/main/clip/clip.py).

Both provide multiple model architectures and, in the case of OpenCLIP, also different training datasets, which can both be specified using the `model_parameters` argument. For example, if you want to get a `ViT-B/32` model from the official CLIP repo (trained on WIT), you would do the following:
In the case of `OpenCLIP`, for most models you can also specify the dataset used for training, e.g., if you want to get a `ViT-B/32` model trained on the `LAION-400M` dataset, you would do the following:

```python
import torch
from thingsvision import get_extractor

model_name = 'clip'
model_name = 'OpenCLIP'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {
'variant': 'ViT-B/32'
'variant': 'ViT-B/32',
'dataset': 'laion400m_e32'
}

extractor = get_extractor(
@@ -243,20 +215,30 @@ extractor = get_extractor(
)
```

`ViT-B/32` is the default model architecture, so you can also leave out the `model_parameters` argument. For a list of all available architectures and datasets, please refer to the [CLIP repo](https://github.com/openai/CLIP/blob/main/clip/clip.py).
For a list of all available architectures and datasets, please refer to the [OpenCLIP repo](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/pretrained.py).

In the case of `OpenCLIP`, for most models you can also specify the dataset used for training, e.g., if you want to get a `ViT-B/32` model trained on the `LAION-400M` dataset, you would do the following:
### [DreamSim](https://dreamsim-nights.github.io/)
In `thingsvision` you can extract representations from [DreamSim](https://dreamsim-nights.github.io/). See the official [DreamSim repo](https://github.com/ssundaram21/dreamsim) for more information. To extract features, install the `dreamsim` package with the following `pip` command (ideally, into your `thingsvision` environment):

```bash
$ pip install dreamsim==0.1.2
```

The base model name is:
- `DreamSim`

We provide four `DreamSim` models: `clip_vitb32`, `open_clip_vitb32`, `dino_vitb16`, and a DreamSim `ensemble`. Specify the variant using the `model_parameters` argument. For instance, to get the OpenCLIP variant of DreamSim, you want to do the following:

```python
import torch
from thingsvision import get_extractor

model_name = 'OpenCLIP'
model_name = 'DreamSim'
module_name = 'model.mlp'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {
'variant': 'ViT-B/32',
'dataset': 'laion400m_e32'
'variant': 'open_clip_vitb32'
}

extractor = get_extractor(
@@ -268,9 +250,9 @@ extractor = get_extractor(
)
```

For a list of all available architectures and datasets, please refer to the [OpenCLIP repo](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/pretrained.py).
To load the CLIP ViT-B/32 version of DreamSim, pass `'clip_vitb32'` to the `variant` parameter instead. Caution (!): for the DreamSim `dino_vitb16` and `ensemble` models, features can only be extracted from the `model.mlp` module and not from the `model` block. We are currently working on a version that allows feature extraction from the `model` block. Please be patient until then.

### Harmonization
### [Harmonization](https://github.com/serre-lab/harmonization)

If you want to extract features for [harmonized models](https://vicco-group.github.io/thingsvision/AvailableModels.html#harmonization) from the [Harmonization repo](https://github.com/serre-lab/harmonization), you have to run the following `pip` command in your `thingsvision` environment (FYI: as of now, this seems to work smoothly only on Ubuntu, but not on macOS):

@@ -312,38 +294,55 @@ extractor = get_extractor(
)
```

### CORnet
We provide all CORnet models from [this paper](https://proceedings.neurips.cc/paper/2019/file/7813d1590d28a7dd372ad54b5d29d033-Paper.pdf). Available model names are:

### DreamSim
In `thingsvision` you can extract representations from [DreamSim](https://dreamsim-nights.github.io/). See the official [DreamSim repo](https://github.com/ssundaram21/dreamsim) for more information. To extract features, install the `dreamsim` package with the following `pip` command (ideally, into your `thingsvision` environment):
- `cornet-s`
- `cornet-r`
- `cornet-rt`
- `cornet-z`

```bash
$ pip install dreamsim==0.1.2
Example:

```python
import torch
from thingsvision import get_extractor

model_name = 'cornet-s'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

extractor = get_extractor(
model_name=model_name,
source=source,
device=device,
pretrained=True
)
```

The base model name is:
- `DreamSim`
### Models trained on Ecoset

We provide four `DreamSim` models: `clip_vitb32`, `open_clip_vitb32`, `dino_vitb16`, and a DreamSim `ensemble`. Specify the variant using the `model_parameters` argument. For instance, to get the OpenCLIP variant of DreamSim, you want to do the following:
We also provide models trained on the [Ecoset](https://www.kietzmannlab.org/ecoset/) dataset, which contains 1.5m images from 565 categories selected to be both frequent in linguistic use and rated as concrete by human observers. Available `model_name`s are:

- `Alexnet_ecoset`
- `Resnet50_ecoset`
- `VGG16_ecoset`
- `Inception_ecoset`

Example:

```python
import torch
from thingsvision import get_extractor

model_name = 'DreamSim'
module_name = 'model.mlp'
model_name = 'Alexnet_ecoset'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {
'variant': 'open_clip_vitb32'
}

extractor = get_extractor(
model_name=model_name,
source=source,
device=device,
pretrained=True,
model_parameters=model_parameters
pretrained=True
)
```

To load the CLIP ViT-B/32 version of DreamSim, pass `'clip_vitb32'` to the `variant` parameter instead. Caution (!): for the DreamSim `dino_vitb16` and `ensemble` models, features can only be extracted from the `model.mlp` module and not from the `model` block. We are currently working on a version that allows feature extraction from the `model` block. Please be patient until then.
```