Merge pull request #159 from ViCCo-Group/batch_extraction
refactored batch-wise feature extraction and added explanation to README and the docs
LukasMut authored Apr 4, 2024
2 parents c4a18c6 + e134552 commit 8223b61
Showing 12 changed files with 295 additions and 178 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -119,7 +119,7 @@ If you want to extract features for [DreamSim](https://dreamsim-nights.github.io
$ pip install dreamsim==0.1.2
```

See the [docs](https://vicco-group.github.io/thingsvision/AvailableModels.html) for which `DreamSim` models are available in `thingsvision`.
See the [docs](https://vicco-group.github.io/thingsvision/AvailableModels.html#dreamsim) for which `DreamSim` models are available in `thingsvision`.

#### Google Colab.
Alternatively, you can use Google Colab to play around with `thingsvision` by uploading your image data to Google Drive (via directory mounting).
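A minimal sketch of that Colab workflow (the Drive folder `MyDrive/images` is just an illustrative path, not a fixed convention):

```python
# Google Colab only: mount your Google Drive so that thingsvision can read the images.
from google.colab import drive

drive.mount('/content/drive')  # prompts for a one-time Google authentication

# Point the image root at a folder inside your mounted Drive (example path).
root = '/content/drive/MyDrive/images'
```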
@@ -175,7 +175,7 @@ extractor = get_extractor(
As a next step, create both dataset and dataloader for your images. We assume that all of your images are in a single `root` directory which can contain subfolders (e.g., for individual classes). Therefore, we leverage the `ImageDataset` class.

```python
root='path/to/root/image/directory' # (e.g., './images/')
root='path/to/your/image/directory' # (e.g., './images/')
batch_size = 32

dataset = ImageDataset(
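For reference, here is a minimal end-to-end sketch that stitches the dataset, dataloader, and extraction steps together, following the `thingsvision` API used throughout these docs; the model name, module name, and output path (`resnet18`, `avgpool`, `'path/to/output/directory'`) are illustrative choices, not requirements:

```python
import torch
from thingsvision import get_extractor
from thingsvision.utils.data import ImageDataset, DataLoader

model_name = 'resnet18'   # illustrative model choice
module_name = 'avgpool'   # illustrative module to extract activations from
source = 'torchvision'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

extractor = get_extractor(
    model_name=model_name,
    source=source,
    device=device,
    pretrained=True
)

root = 'path/to/your/image/directory'  # e.g., './images/'
batch_size = 32

dataset = ImageDataset(
    root=root,
    out_path='path/to/output/directory',        # where file order / outputs are written
    backend=extractor.get_backend(),             # backend framework of the model
    transforms=extractor.get_transformations()   # the model's preprocessing pipeline
)

batches = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    backend=extractor.get_backend()
)

features = extractor.extract_features(
    batches=batches,
    module_name=module_name,
    flatten_acts=True  # flatten activations into an (n_images x n_features) matrix
)
```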
8 changes: 4 additions & 4 deletions docs/Alignment.md
@@ -6,11 +6,11 @@ nav_order: 7

# Aligning neural network representations with human similarity judgments

Recent research in the space of representation learning has demonstrated the usefulness of aligning neural network representations with human similarity judgments for both machine learning (ML) downstream tasks and the Cognitive Sciences (see [here](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html) and [here](https://arxiv.org/pdf/2310.13018.pdf) for references).
Recent research in the space of representation learning has demonstrated the usefulness of aligning neural network representations with human similarity judgments for both machine learning (ML) downstream tasks and the Cognitive Sciences (see [here](https://openreview.net/pdf?id=ReDQ1OUQR0X), [here](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html), and [here](https://arxiv.org/pdf/2310.13018.pdf) for references). While [harmonized models](https://vicco-group.github.io/thingsvision/AvailableModels.html#harmonization) or models fine-tuned using the [DreamSim](https://vicco-group.github.io/thingsvision/AvailableModels.html#dreamsim) objective are models whose weights were trained or fine-tuned to be human-aligned (and as such count as <i>aligned</i> models), there are ways to separate alignment from (pre-)training and <i>post-align</i> the features of a base model (such as CLIP) while preserving the representation structure of the base model.

## [gLocal](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html)

If you want to align the extracted representations with human object similarity according to the approach introduced in *[Improving neural network representations using human similarity judgments](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html)*, you can optionally `align` the extracted features using the following method:
If you want to post-align the extracted representations with human object similarity according to the approach introduced in *[Improving neural network representations using human similarity judgments](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9febda1c8344cc5f2d51713964864e93-Abstract-Conference.html)*, you can optionally `align` the extracted features using the following method:

```python
aligned_features = extractor.align(
@@ -20,7 +20,7 @@ aligned_features = extractor.align(
)
```

For now, representational alignment is only implemented for `gLocal` and for the following list of models: `clip_RN50`, `clip_ViT-L/14`, `OpenCLIP_ViT-L-14_laion400m_e32`, `OpenCLIP_ViT-L-14_laion2b_s32b_b82k`, `dinov2-vit-base-p14`, `dinov2-vit-large-p14`, `dino-vit-base-p16`, `dino-vit-base-p8`, `resnet18`, `resnet50`, `vgg16`, `alexnet`. However, we plan to extend both the type of representational alignment and the range of models in future versions of `thingsvision`.
Since this kind of alignment simply applies an affine transformation to a model's representation space, it is computationally incredibly cheap. For now, representational alignment is only implemented for `gLocal` and for the following list of models: `clip_RN50`, `clip_ViT-L/14`, `OpenCLIP_ViT-L-14_laion400m_e32`, `OpenCLIP_ViT-L-14_laion2b_s32b_b82k`, `dinov2-vit-base-p14`, `dinov2-vit-large-p14`, `dino-vit-base-p16`, `dino-vit-base-p8`, `resnet18`, `resnet50`, `vgg16`, `alexnet`. However, we intend to extend both the type of representational alignment and the range of models in future versions of `thingsvision`.
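As a rough sketch, a full `align` call then boils down to a single affine map over previously extracted features (the keyword names `features`, `module_name`, and `alignment_type` are assumptions about the signature; `features` is the array returned by the extractor):

```python
# `extractor`, `features`, and `module_name` come from the extraction steps shown earlier.
aligned_features = extractor.align(
    features=features,            # array of shape (n_images, n_features)
    module_name=module_name,      # module the features were extracted from
    alignment_type="gLocal",      # currently the only implemented alignment type
)
# Conceptually this applies a fixed, pre-learned affine transform,
# i.e., roughly: aligned_features = features @ W + b
```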


<u>Caution</u>: For `resnet18`, `resnet50`, `vgg16`, and `alexnet`, gLocal does not achieve a *best-of-both-worlds-representation* for ML downstream tasks and human alignment. While gLocal significantly improves alignment with human similarity judgments for these models, it deteriorates their ML downstream task performance (such as few-shot learning and out-of-distribution detection). Hence, it does not transform the features into a *best-of-both-worlds-representation* space as it does for CLIP-like models. If you are not interested in ML downstream task performance, you can safely ignore this.
<u>Caution</u>: For the ImageNet-trained models `resnet18`, `resnet50`, `vgg16`, and `alexnet`, gLocal does not achieve a *best-of-both-worlds-representation* for ML downstream tasks and human alignment. While gLocal significantly improves alignment with human similarity judgments for these models, it deteriorates their ML downstream task performance (such as few-shot learning and out-of-distribution detection). Hence, it does not transform the features into a *best-of-both-worlds-representation* space as it does for CLIP-like models. If you are not interested in ML downstream task performance, you can safely ignore this.
151 changes: 75 additions & 76 deletions docs/AvailableModels.md
@@ -3,12 +3,12 @@ title: Available models and sources (+ examples)
nav_order: 4
---

# Available models and sources
# Available models and their sources

`thingsvision` currently supports many models from several different sources, which represent different places or other libraries from which the model architectures or weights can come. You can find more information about which models are available in which source and notes on their usage on this page.
`thingsvision` currently supports many models from several different sources, which represent the different places or libraries from which the model architectures or weights may come. You can find more information about which models are available in which source on this page. Additionally, we provide several notes on their usage.

## `torchvision`
`thingsvision` supports all models from the `torchvision.models` module. You can find a list of all available models [here](https://pytorch.org/vision/stable/models.html).
`thingsvision` supports all models from the `torchvision.models` module. You can find a list of all available `torchvision` models [here](https://pytorch.org/vision/stable/models.html).

Example:
```python
@@ -31,7 +31,7 @@ extractor = get_extractor(

Model names are case-sensitive and must be spelled exactly as they are in the `torchvision` documentation (e.g., `alexnet`, `resnet18`, `vgg16`, ...).

If you use `pretrained=True`, the model will by default be pretrained on ImageNet, otherwise it is initialized randomly. For some models, `torchvision` provides multiple weight initializations, in which case you can pass the name of the weights in the `model_parameters` argument, e.g. if you want to get the extractor for a `RegNet Y 32GF` model, pretrained using SWAG and finetuned on ImageNet, you would do the following:
If you use `pretrained=True`, the model weights will by default be pretrained on ImageNet; otherwise, they are initialized randomly. For some models, `torchvision` provides multiple weight initializations, in which case you can pass the name of the weights in the `model_parameters` argument, e.g., if you want to get the extractor for a `RegNet Y 32GF` model, pretrained using SWAG and finetuned on ImageNet, you want to do the following:

```python
import torch
@@ -54,7 +54,7 @@ extractor = get_extractor(
For a list of all available weights, please refer to the [torchvision documentation](https://pytorch.org/vision/stable/models.html).
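As a sketch of such a call (the `'weights'` key is an assumption about how `model_parameters` forwards the weight name, and `IMAGENET1K_SWAG_E2E_V1` is torchvision's identifier for the SWAG end-to-end weights of `regnet_y_32gf`; check the torchvision docs for the canonical names):

```python
import torch
from thingsvision import get_extractor

model_name = 'regnet_y_32gf'
source = 'torchvision'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Assumed parameter layout: the torchvision weight identifier is passed under 'weights'.
model_parameters = {
    'weights': 'IMAGENET1K_SWAG_E2E_V1'
}

extractor = get_extractor(
    model_name=model_name,
    source=source,
    device=device,
    pretrained=True,
    model_parameters=model_parameters
)
```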

## `timm`
`thingsvision` supports all models from the `timm` module. You can find a list of all available models [here](https://rwightman.github.io/pytorch-image-models/models/).
`thingsvision` supports all models from the `timm` module. You can find a list of all available `timm` models [here](https://rwightman.github.io/pytorch-image-models/models/).

Example:
```python
@@ -79,6 +79,7 @@ If you use `pretrained=True`, the model will be pretrained according to the mode
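A minimal `timm` sketch, analogous to the `torchvision` example above (`vit_base_patch16_224` is just an illustrative model name):

```python
import torch
from thingsvision import get_extractor

model_name = 'vit_base_patch16_224'  # any valid timm model name works here
source = 'timm'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

extractor = get_extractor(
    model_name=model_name,
    source=source,
    device=device,
    pretrained=True
)
```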

## `ssl`
`thingsvision` provides various self-supervised learning models that are loaded from the [VISSL](https://vissl.readthedocs.io/en/v0.1.5/) library or the Torch Hub.

* SimCLR (`simclr-rn50`)
* MoCo V2 (`mocov2-rn50`)
* Jigsaw (`jigsaw-rn50`)
@@ -89,12 +90,10 @@ If you use `pretrained=True`, the model will be pretrained according to the mode
* VicReg (`vicreg-rn50`)
* DINO (`dino-rn50`)

All models have the ResNet50 architecture and are pretrained on ImageNet-1K.
Here, the model name describes the pre-training method, instead of the model architecture.
All models have the ResNet50 architecture and are pretrained on ImageNet-1K. Here, the model name describes the pre-training objective rather than the model architecture.

DINO models are available in ViT (Vision Transformer) and XCiT (Cross-Covariance Image Transformer) variants. For ViT models trained using DINO, the following models are available: `dino-vit-small-p8`, `dino-vit-small-p16`, `dino-vit-base-p8`, `dino-vit-base-p16`, where the trailing number describes the image patch resolution in the ViT (i.e. either 8x8 or 16x16). For the XCiT models, we have `dino-xcit-small-12-p16`, `dino-xcit-small-12-p8`, `dino-xcit-medium-24-p16`, `dino-xcit-medium-24-p8`, where the penultimate number represents model depth (12 = small, 24 = medium).


Example SimCLR:

```python
@@ -122,7 +121,7 @@ from thingsvision import get_extractor
model_name = 'dino-vit-base-p16'
source = 'ssl'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {"extract_cls_token": True} # extract features only for the [cls] token of DINO
model_parameters = {"extract_cls_token": True} # extract features exclusively for the [cls] token of DINO

extractor = get_extractor(
model_name=model_name,
@@ -162,76 +161,49 @@ If you use `pretrained=True`, the model will be pretrained on ImageNet, otherwis

In addition, we provide several custom models in the `custom` source that are not available in other sources. These models are:

### CORnet
We provide all CORnet models from [this paper](https://proceedings.neurips.cc/paper/2019/file/7813d1590d28a7dd372ad54b5d29d033-Paper.pdf). Available model names are:

- `cornet-s`
- `cornet-r`
- `cornet-rt`
- `cornet-z`

Example:
```python
import torch
from thingsvision import get_extractor

model_name = 'cornet-s'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

extractor = get_extractor(
model_name=model_name,
source=source,
device=device,
pretrained=True
)
```

### Models trained on Ecoset
### Official CLIP and OpenCLIP

We provide models trained on the [Ecoset](https://www.kietzmannlab.org/ecoset/) dataset, which contains 1.5m images from 565 categories selected to be both frequent in linguistic use and rated as concrete by human observers. Available `model_name`s are:
We provide [CLIP](https://arxiv.org/abs/2103.00020) models from the official CLIP repo and from [OpenCLIP](https://github.com/mlfoundations/open_clip). Available `model_name`s are:

- `Alexnet_ecoset`
- `Resnet50_ecoset`
- `VGG16_ecoset`
- `Inception_ecoset`
- `clip`
- `OpenCLIP`

Example:
Both provide multiple model architectures and, in the case of OpenCLIP, also different training datasets, which can both be specified using the `model_parameters` argument. For example, if you want to get a `ViT-B/32` model from the official CLIP repo (trained on WIT), you would do the following:

```python
import torch
from thingsvision import get_extractor

model_name = 'Alexnet_ecoset'
model_name = 'clip'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {
'variant': 'ViT-B/32'
}

extractor = get_extractor(
model_name=model_name,
source=source,
device=device,
pretrained=True
pretrained=True,
model_parameters=model_parameters
)
```

### Official CLIP and OpenCLIP

We provide [CLIP](https://arxiv.org/abs/2103.00020) models from the official CLIP repo and from [OpenCLIP](https://github.com/mlfoundations/open_clip). Available `model_name`s are:

- `clip`
- `OpenCLIP`
`ViT-B/32` is the default model architecture, so you can also leave out the `model_parameters` argument. For a list of all available architectures and datasets, please refer to the [CLIP repo](https://github.com/openai/CLIP/blob/main/clip/clip.py).

Both provide multiple model architectures and, in the case of OpenCLIP, also different training datasets, which can both be specified using the `model_parameters` argument. For example, if you want to get a `ViT-B/32` model from the official CLIP repo (trained on WIT), you would do the following:
In the case of `OpenCLIP`, for most models you can also specify the dataset used for training, e.g., if you want to get a `ViT-B/32` model trained on the `LAION-400M` dataset, you would do the following:

```python
import torch
from thingsvision import get_extractor

model_name = 'clip'
model_name = 'OpenCLIP'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {
'variant': 'ViT-B/32'
'variant': 'ViT-B/32',
'dataset': 'laion400m_e32'
}

extractor = get_extractor(
@@ -243,20 +215,30 @@ extractor = get_extractor(
)
```

`ViT-B/32` is the default model architecture, so you can also leave out the `model_parameters` argument. For a list of all available architectures and datasets, please refer to the [CLIP repo](https://github.com/openai/CLIP/blob/main/clip/clip.py).
For a list of all available architectures and datasets, please refer to the [OpenCLIP repo](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/pretrained.py).

In the case of `OpenCLIP`, for most models you can also specify the dataset used for training, e.g., if you want to get a `ViT-B/32` model trained on the `LAION-400M` dataset, you would do the following:
### [DreamSim](https://dreamsim-nights.github.io/)
In `thingsvision` you can extract representations from [DreamSim](https://dreamsim-nights.github.io/). See the official [DreamSim repo](https://github.com/ssundaram21/dreamsim) for more information. To extract features, install the `dreamsim` package with the following `pip` command (ideally, into your `thingsvision` environment):

```bash
$ pip install dreamsim==0.1.2
```

The base model name is:
- `DreamSim`

We provide four `DreamSim` models: `clip_vitb32`, `open_clip_vitb32`, `dino_vitb16`, and a DreamSim `ensemble`. Specify the variant using the `model_parameters` argument. For instance, to get the OpenCLIP variant of DreamSim, you want to do the following:

```python
import torch
from thingsvision import get_extractor

model_name = 'OpenCLIP'
model_name = 'DreamSim'
module_name = 'model.mlp'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {
'variant': 'ViT-B/32',
'dataset': 'laion400m_e32'
'variant': 'open_clip_vitb32'
}

extractor = get_extractor(
@@ -268,9 +250,9 @@ extractor = get_extractor(
)
```

For a list of all available architectures and datasets, please refer to the [OpenCLIP repo](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/pretrained.py).
To load the CLIP ViT-B/32 version of DreamSim, pass `'clip_vitb32'` to the `variant` parameter instead. Caution (!): for the DreamSim `dino_vitb16` and `ensemble` models, features can only be extracted from the `model.mlp` module and not from the `model` block. We are currently working on a version that allows feature extraction from the `model` block. Please be patient until then.

### Harmonization
### [Harmonization](https://github.com/serre-lab/harmonization)

If you want to extract features for [harmonized models](https://vicco-group.github.io/thingsvision/AvailableModels.html#harmonization) from the [Harmonization repo](https://github.com/serre-lab/harmonization), you have to run the following `pip` command in your `thingsvision` environment (FYI: as of now, this seems to work smoothly only on Ubuntu, but not on macOS):

@@ -312,38 +294,55 @@ extractor = get_extractor(
)
```

### CORnet
We provide all CORnet models from [this paper](https://proceedings.neurips.cc/paper/2019/file/7813d1590d28a7dd372ad54b5d29d033-Paper.pdf). Available model names are:

### DreamSim
In `thingsvision` you can extract representations from [DreamSim](https://dreamsim-nights.github.io/). See the official [DreamSim repo](https://github.com/ssundaram21/dreamsim) for more information. To extract features, install the `dreamsim` package with the following `pip` command (ideally, into your `thingsvision` environment):
- `cornet-s`
- `cornet-r`
- `cornet-rt`
- `cornet-z`

```bash
$ pip install dreamsim==0.1.2
Example:

```python
import torch
from thingsvision import get_extractor

model_name = 'cornet-s'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

extractor = get_extractor(
model_name=model_name,
source=source,
device=device,
pretrained=True
)
```

The base model name is:
- `DreamSim`
### Models trained on Ecoset

We provide four `DreamSim` models: `clip_vitb32`, `open_clip_vitb32`, `dino_vitb16`, and a DreamSim `ensemble`. Specify the variant using the `model_parameters` argument. For instance, to get the OpenCLIP variant of DreamSim, you want to do the following:
We also provide models trained on the [Ecoset](https://www.kietzmannlab.org/ecoset/) dataset, which contains 1.5m images from 565 categories selected to be both frequent in linguistic use and rated as concrete by human observers. Available `model_name`s are:

- `Alexnet_ecoset`
- `Resnet50_ecoset`
- `VGG16_ecoset`
- `Inception_ecoset`

Example:

```python
import torch
from thingsvision import get_extractor

model_name = 'DreamSim'
module_name = 'model.mlp'
model_name = 'Alexnet_ecoset'
source = 'custom'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_parameters = {
'variant': 'open_clip_vitb32'
}

extractor = get_extractor(
model_name=model_name,
source=source,
device=device,
pretrained=True,
model_parameters=model_parameters
pretrained=True
)
```

To load the CLIP ViT-B/32 version of DreamSim, pass `'clip_vitb32'` to the `variant` parameter instead. Caution (!): for the DreamSim `dino_vitb16` and `ensemble` models, features can only be extracted from the `model.mlp` module and not from the `model` block. We are currently working on a version that allows feature extraction from the `model` block. Please be patient until then.
```