Using the code in this repo, we experiment with new ways to extract image-text matching (ITM) scores from existing models. With the code, you can find that a model with only a few hundred million parameters beats the current state of the art on the Winoground Image metric. It surpasses Google's 17B-parameter PaLI by 10.50 points, even in the hardest comparison setting for us (where PaLI has been finetuned / prompted to do well at Winoground).
To follow along with this readme, install the repo first:
```bash
git clone https://github.com/TristanThrush/better-multimodal-alignment.git
cd better-multimodal-alignment
pip install -r requirements.txt
```
You also need to request access to the gated Winoground dataset here and then log in to your Hugging Face account via your command line:
```bash
huggingface-cli login
```
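Once access is granted and you are logged in, Winoground should load like any other gated Hugging Face dataset. Here is a quick sanity check (depending on your `datasets` version, you may need to pass an auth token argument explicitly):

```python
from datasets import load_dataset

# Quick check that your account has access to the gated Winoground dataset.
winoground = load_dataset("facebook/winoground")["test"]
example = winoground[0]
print(example["caption_0"], "|", example["caption_1"])
```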
Then run the experiment that you are interested in, for example:
```bash
python experiment1.py
```
Multimodal language models are usually pretrained with several objectives, and image-text matching is typically one of them. An image-text matching (ITM) head can be used in downstream applications such as image and text retrieval. But we do not know of an ITM approach that is good at fine-grained image-text alignment. Let's look at an example:
*(Winoground example pair: one image captioned "a mug in some grass", another captioned "some grass in a mug".)*
Popular ITM models tend to know that there is a mug and grass in both images. But they do not understand how the word order of “mug”, “grass” and “in” affects the meaning. They cannot reliably match the captions with the correct images here. The Winoground evaluation dataset reveals that many contemporary models cannot do much better than chance at this task! This includes popular models like FLAVA, CLIP, and BLIP. Even the closed-source models that have been tested, such as the 17B parameter PaLI, do a bit better than chance with special tuning and prompting but still not very well. Models have a particularly hard time with the Winoground Image metric, which tests the scenario where a model is given a caption and must pick one of two images. They often perform worse than random chance on this metric because they always rank one of the images as more likely regardless of the caption.
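To make that metric concrete, here is a minimal sketch of how the image score can be computed from any scoring function (the helper name and the generic `score` argument are just for illustration; the real evaluation lives in the experiment scripts):

```python
def winoground_image_metric(examples, score):
    """Percentage of Winoground examples where each caption selects its own image.

    `examples` is a list of dicts with the facebook/winoground fields
    image_0, image_1, caption_0, caption_1; `score` is any image-text score.
    Random chance is 25%, since both captions must pick the right image.
    """
    correct = 0
    for ex in examples:
        caption_0_ok = score(ex["image_0"], ex["caption_0"]) > score(ex["image_1"], ex["caption_0"])
        caption_1_ok = score(ex["image_1"], ex["caption_1"]) > score(ex["image_0"], ex["caption_1"])
        correct += caption_0_ok and caption_1_ok
    return 100 * correct / len(examples)
```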
To learn that an image and a text match, ITM models are typically trained with image-caption pairs scraped from the internet. To learn that an image and a text don't match, ITM models are given negative examples. The negatives are usually randomly paired images and captions from the same internet dataset, although sometimes more sophisticated methods are used; BLIP, for example, mines "hard" negatives. With this standard ITM approach, we would need negative training examples like "grass in mug" versus "mug in grass" paired with the corresponding wrong images above, and negatives that fine-grained show up only rarely. Also, generating negative pairs of any kind is a modeling or data-augmentation decision that does not come from the natural dataset itself. Luckily, we can extract an ITM score in a way that does not require training with these artificial negative pairings at all.
Given an input image $I$ and a caption $C = (c_1, \ldots, c_n)$, consider the probability of the caption given the image:

$$p(C \mid I) = \prod_{t=1}^{n} p(c_t \mid c_{<t}, I)$$
We got that last expression from the chain rule. Notice that BLIP's causal language modelling (CLM) pretraining objective is designed to give us the probability of the next token given the previous tokens and the image. So we can compute every factor $p(c_t \mid c_{<t}, I)$ with the CLM head and multiply them (or sum their logs) to get $p(C \mid I)$, with no negative pairs involved.
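Here is a minimal sketch of that computation, assuming the `transformers` BLIP captioning model (which wraps the CLM head) and its current output fields; this is an illustration, not the exact code in `experiment1.py`:

```python
import torch
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

def caption_log_prob(image, caption):
    """log p(C | I): sum of log p(c_t | c_<t, I) over the caption tokens."""
    inputs = processor(images=image, text=caption, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Logits at position t predict token t+1, so shift before gathering.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = inputs["input_ids"][:, 1:]
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()
```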
What happens if we use $p(C \mid I)$ as the ITM score on the Winoground Image metric?
Model | Image Score |
---|---|
Random Chance | 25.00 |
BLIP Contrastive Head | 13.50 |
BLIP ITM Head | 24.25 |
BLIP CLM Head $p(C \mid I)$ | 29.00 |
But 29.00 is still only a few points above chance, so let's try another angle.
Another idea: is there a way to "normalize" for the contribution that an image adds to the ITM score on its own? We want an ITM model not to prefer one image over others independent of the text. For very fine-grained tasks like Winoground, maybe the ITM score should instead be the ratio

$$\frac{\text{score}(I, C)}{\text{score}(I, C')},$$

where $C'$ is a plausible alternative caption built from the same words as $C$.
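In code, the normalization is just a ratio of two calls to whatever base score you already have (`base_score` is a stand-in for any of the BLIP heads; the function name is ours, not the repo's):

```python
def ratio_score(image, caption, alternative_caption, base_score):
    """Experiment 2's idea: how much more does the model like the real caption
    than a plausible reordering of the same words, for this particular image?"""
    return base_score(image, caption) / base_score(image, alternative_caption)
```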
How do we get a plausible alternative caption, given an initial caption, though? For this experiment, we use a small unimodal model with fewer than 100M parameters: DistilRoBERTa. We repeatedly sample swaps of two tokens at a time and use the MLM probabilities from DistilRoBERTa to pick the most probable caption with an alternative token order.
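Here is a rough sketch of that search. The exhaustive pair-swap and the pseudo-log-likelihood scoring below are reasonable choices for illustration; the experiment scripts may sample swaps differently:

```python
import itertools

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
mlm.eval()

def pseudo_log_likelihood(text):
    """Sum of log p(token | rest of sentence), masking one position at a time."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(input_ids) - 1):  # skip <s> and </s>
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = mlm(input_ids=masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

def best_alternative_caption(caption):
    """Swap two words at a time and keep the reordering DistilRoBERTa likes most."""
    words = caption.split()
    candidates = set()
    for i, j in itertools.combinations(range(len(words)), 2):
        if words[i] != words[j]:
            swapped = words.copy()
            swapped[i], swapped[j] = swapped[j], swapped[i]
            candidates.add(" ".join(swapped))
    return max(candidates, key=pseudo_log_likelihood)
```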
Using this method, we see a big improvement in both the Contrastive and ITM heads of BLIP.
Model | Image Score |
---|---|
Random Chance | 25.00 |
BLIP Contrastive Head | 13.50 |
BLIP ITM Head | 24.25 |
DistilRoBERTa + BLIP Contrastive Head score ratios | 52.00 |
DistilRoBERTa + BLIP ITM Head score ratios | 32.75 |
PaLI 17B (with best known finetuning / prompting approach for Winoground) | 41.50 |
VQ2 (with best known finetuning / prompting approach for Winoground) | 42.20 |
Now let's apply both Experiments 1 and 2! In Experiment 1, we figured out how to compute $p(C \mid I)$ with BLIP's CLM head and no negative pairs.
Even if we can't get an ITM-style match probability out of the CLM head, we can still apply Experiment 2's normalization: score each pair with the ratio $p(C \mid I) \,/\, p(C' \mid I)$, where $C'$ is the DistilRoBERTa-generated alternative caption.
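Putting the two pieces together, and reusing the hypothetical `caption_log_prob` and `best_alternative_caption` sketches from above (working in log space, so the ratio becomes a difference):

```python
def clm_ratio_score(image, caption):
    """Experiment 3: the CLM score from Experiment 1, normalized by the
    alternative caption from Experiment 2."""
    alternative = best_alternative_caption(caption)
    return caption_log_prob(image, caption) - caption_log_prob(image, alternative)
```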
Combining the findings from Experiments 1 and 2, we beat the current SOTA performance here too, and no negative training examples were required.
Model | Image Score |
---|---|
Random Chance | 25.00 |
BLIP Contrastive Head | 13.50 |
BLIP ITM Head | 24.25 |
DistilRoBERTa + BLIP CLM Head score ratios | 50.25 |
PaLI 17B (with best known finetuning / prompting approach for Winoground) | 41.50 |
VQ2 (with best known finetuning / prompting approach for Winoground) | 42.20 |
We've provided two approaches that beat the current state of the art for the Winoground image score, and one of them does not require negative training examples at all.
Will this approach scale to real-world retrieval? It is unclear - we need to go beyond these fun little experiments. The score ratios idea might only work well when we are retrieving from a set of images which already have the right objects in them, but possibly the wrong relationships between the objects. In this case, it might work well as a way to re-rank the top retrieved images from a database, but not as a full retrieval score by itself.
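If the re-ranking framing pans out, the wiring could be as simple as this sketch, where `fast_score` is any cheap first-pass retrieval score and all the names are hypothetical:

```python
def rerank_top_images(caption, candidate_images, fast_score, k=10):
    """Two-stage retrieval: a cheap score over the whole database, then the
    fine-grained ratio score only on the top-k shortlist."""
    shortlist = sorted(candidate_images, key=lambda img: fast_score(img, caption), reverse=True)[:k]
    return sorted(shortlist, key=lambda img: clm_ratio_score(img, caption), reverse=True)
```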
```bibtex
@misc{thrush2023bettermultimodalalignment,
  author = {Tristan Thrush and Chris Potts and Douwe Kiela},
  title = {Better multimodal alignment scores: A few experiments},
  url = {https://github.com/TristanThrush/better-multimodal-alignment},
  year = {2023},
}
```