Update retrieval_from_image_and_text.md
needs Mór's check
robertdhayanturner authored Feb 12, 2024
1 parent 38d9b7e commit 383f465
Showing 1 changed file with 11 additions and 11 deletions.
22 changes: 11 additions & 11 deletions docs/use_cases/retrieval_from_image_and_text.md
# Retrieval from Image and Text Modalities

## The value of multimodal embedding

In our contemporary data-centric world, embeddings have become indispensable for converting complex and varied data into numerical representations that are both manageable and analytically powerful. [Across a spectrum of industries](https://mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/), from e-commerce to healthcare, these embeddings enable machines to interpret, analyze, and make predictions from large-scale datasets containing textual and/or visual information. Traditionally, models have relied on unimodal data, typically either images or text, but not both. However, the advent of multimodal models, which can synergize various data forms, has proven to be a game-changer. Multimodal approaches surpass the limitations of unimodal methods, offering richer contextual insights and enhanced predictive capabilities, and paving the way for more sophisticated and accurate applications across diverse sectors.

Below, we carry out various text and image embedding experiments using COCO and Open Images V7 datasets, showcasing different unimodal and multimodal embedding models and assessing their effectiveness using ranking metrics. By the end, you'll have an understanding of how to embed multimodal data. We'll also evaluate the performance of unimodal vs. multimodal embeddings, and how different multimodal models stack up against each other.

## Our datasets: COCO and Open Images V7

Our dataset must satisfy two essential criteria:

1. The dataset should be structured to have <query, multiple answers> pairs.
2. Both the "query" and "multiple answers" should include <image, text metadata>.
Publicly available datasets that meet these criteria are rare.

COCO comprises images from 80 object categories, each image accompanied by 5 unique, human-written captions that distinctively describe objects present in the image. Open Images V7 encompasses a significantly larger number of distinct object categories - approximately 20,245. In addition to captions, Open Images V7 introduces Localized Narratives - human audio descriptions for each image segment, identified by mouse hovering. Each subpart of the Localized Narrative is accompanied by a timestamp. An illustrative example can be found [here](https://blog.research.google/2020/02/open-images-v6-now-featuring-localized.html). In our experiments, we leverage the textual representation of these Localized Narratives as captions.

COCO and Open Images V7 fulfill our essential dataset criteria; we can identify the object set (e.g., keyboard, mouse, person, TV) in any particular image, and ensure that at least two images share an identical object set by excluding images whose object sets appear only once. Based on label set frequency distribution, these outliers are removed from both Open Images V7 and COCO. The resulting down-sampled COCO and Open Images V7 datasets contain 103,429 and 149,847 samples, respectively.

Here's an example image from the COCO dataset, and below it, the human-written captions corresponding to the image's object set.

![COCO dataset example image](assets/use_cases/retrieval_from_image_and_text/reference_image.png)
[Example image from the COCO dataset.](https://cocodataset.org/#home)
A young kid with head phones on using a computer.

In our experiments below, we **vectorize/embed**: 1) image captions, 2) images, 3) both images and their captions, 4) images with multimodal vision transformers, and 5) both images and their captions with multimodal vision transformers. In cases where images and their captions are vectorized separately, the embeddings are concatenated.

After embedding the entire dataset and normalizing each vector to unit length, we **assess the quality of the embedding vectors by retrieving them and calculating ranking metrics**. More specifically, we iterate over the vector space and retrieve each vector's **_k (=10)_** nearest neighbors based on [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). Cosine similarity quantifies the angle between two vectors; for unit-length vectors it simplifies to a [dot product](https://en.wikipedia.org/wiki/Dot_product) calculation.
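Here's a minimal sketch of this retrieval step, assuming the embeddings are already stacked in a NumPy array; because the vectors are unit length, cosine similarity is just a matrix product:

```python
import numpy as np

def knn_cosine(embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Return the indices of each vector's k nearest neighbors by cosine similarity."""
    # Normalize every embedding to unit length.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # For unit vectors, cosine similarity reduces to a dot product.
    sims = emb @ emb.T
    # Exclude each vector from its own neighbor list.
    np.fill_diagonal(sims, -np.inf)
    # Indices of the k most similar vectors, best first.
    return np.argsort(-sims, axis=1)[:, :k]

# Example: 5 random 512-d embeddings, 2 neighbors each.
neighbors = knn_cosine(np.random.rand(5, 512), k=2)
```

For datasets of 100k+ samples, the similarity matrix would of course be computed in batches (or with an approximate nearest neighbor index) rather than as one dense matrix.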

For the retrieved vectors, we calculate ranking metrics using [Torchmetrics](https://lightning.ai/docs/torchmetrics/stable/all-metrics.html). We focus primarily on [Mean Reciprocal Rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) (MRR) and [Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (NDCG), both of which you can read more about [here](https://www.shaped.ai/blog/evaluating-recommendation-systems-map-mmr-ndcg). But we also use other information retrieval metrics like Mean Average Precision (MAP), Precision@k, and Recall@k, which are explained in detail [here](https://www.shaped.ai/blog/evaluating-recommendation-systems-part-1). In all of these metrics, the higher the ranking of relevant items/hits, the more effective the retrieval.
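To illustrate how these metrics can be computed with Torchmetrics (a toy sketch, not our actual evaluation code): each retrieved neighbor's similarity score serves as a prediction, and a neighbor counts as relevant when it shares the query's object set.

```python
import torch
from torchmetrics.retrieval import RetrievalMRR, RetrievalNormalizedDCG

# Toy example: two queries with k=3 retrieved neighbors each.
# `indexes` tells Torchmetrics which query each (score, relevance) pair belongs to.
indexes = torch.tensor([0, 0, 0, 1, 1, 1])
scores = torch.tensor([0.92, 0.85, 0.80, 0.88, 0.75, 0.70])  # cosine similarities
relevant = torch.tensor([0, 1, 0, 1, 0, 1])  # 1 = neighbor shares the query's object set

print(RetrievalMRR()(scores, relevant, indexes=indexes))            # MRR
print(RetrievalNormalizedDCG()(scores, relevant, indexes=indexes))  # NDCG
```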

Now that we understand the basics, let's dive into each of our embedding experiments and their results. Afterwards, we'll put these results side by side to compare them.

### 1. Embedding image captions

In experiment 1, we vectorized image captions using the Sentence-Transformers library, selecting top-performing models suited to our use case from the [SBERT Pretrained Models Leaderboard](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/) as well as from [Huggingface Sentence Transformers](https://huggingface.co/sentence-transformers). In addition to using different models, we tried different ways of processing our textual data in different types of runs:

- Concatenating the 5 human-written image captions and embedding the combined text. All these runs are marked with a "_concat_captions" suffix in the table below.
- Randomly selecting one of the human-written image captions. All these runs are marked with "_random_caption."
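A minimal sketch of the caption-embedding step for these runs (the checkpoint name and the two illustrative captions are examples, not our exact setup):

```python
import random
from sentence_transformers import SentenceTransformer

# One of the SBERT leaderboard models; any Sentence-Transformers checkpoint works the same way.
model = SentenceTransformer("all-mpnet-base-v2")

# Illustrative captions for a single image.
captions = [
    "A young kid with head phones on using a computer.",
    "A child wearing headphones sits in front of a laptop.",
]

# "_concat_captions"-style run: join the captions and embed the combined text.
concat_embedding = model.encode(" ".join(captions), normalize_embeddings=True)

# "_random_caption"-style run: embed one randomly chosen caption.
random_embedding = model.encode(random.choice(captions), normalize_embeddings=True)
```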

![](assets/use_cases/retrieval_from_image_and_text/table_embed_image_oiv7.png)

The smallest EfficientNetv2 model was the most effective performer on the Open Images V7 dataset; caformer_m36 came second, followed by the EfficientNetv2 models in sizes m and l. The models' performance relative to each other remained roughly consistent across datasets. Also, though we expected superior performance from the [Data-efficient Image Transformer models (DeiTs)](https://arxiv.org/abs/2012.12877) because of their inductive biases (acquired through knowledge distillation), of all the models we tested on both datasets, DeiTs performed the most poorly.
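As a rough sketch of how such image embeddings can be extracted with the timm library (the model name, preprocessing, and image path below are assumptions, not our exact setup):

```python
import timm
import torch
from PIL import Image

# num_classes=0 removes the classification head, so the model returns pooled image features.
model = timm.create_model("tf_efficientnetv2_s", pretrained=True, num_classes=0).eval()

# Build the preprocessing pipeline the checkpoint was trained with.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    embedding = model(transform(image).unsqueeze(0))  # shape: (1, feature_dim)
```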

### 3. Embedding both images and their captions

Our third experiment concatenated vectors from our first two experiments into a combined vector space. We iterated through this space to retrieve the k nearest neighbors for each concatenated vector, with the following results.
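Conceptually, the combined vector for each sample looks something like this (a sketch, assuming the caption and image embeddings have already been computed; the dimensions are examples):

```python
import numpy as np

def combine(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Concatenate a caption embedding with an image embedding and re-normalize."""
    combined = np.concatenate([text_emb, image_emb])
    return combined / np.linalg.norm(combined)

# e.g., a 768-d caption vector + a 1280-d image vector -> a single 2048-d vector
combined_vector = combine(np.random.rand(768), np.random.rand(1280))
```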

![](assets/use_cases/retrieval_from_image_and_text/table_embed_text_image.png)

Concatenating vectors from **two unaligned vector spaces** into one space - specifically, using the Sentence Transformers models on the COCO dataset - led to a deterioration in performance, down to the level of the Computer Vision models. As a result, we next investigated (in experiments 4 and 5) whether using _jointly trained_ text and image encoders, and then concatenating their vectors, might lead to better performance than concatenating vectors created by _separately trained_ image and text encoders.

### 4. Embedding images with Multimodal Transformers

In experiment 4, we looked at the performance of models based on [Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2103.00020.pdf) (CLIP). CLIP models employ separate but jointly trained Text and Image encoders to create a single multimodal embedding space. Regardless of whether the embeddings in this space represent text or image, if they are semantically similar, they are positioned closer together.
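Here's a minimal sketch of embedding an image and a caption into CLIP's shared space with the open_clip library (the specific checkpoint and image path are examples, not necessarily those used in our experiments):

```python
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
text = tokenizer(["A young kid with head phones on using a computer."])

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)
    # Both embeddings live in the same space, so their cosine similarity is meaningful.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb @ text_emb.T).item()
```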

In our final experiment, we used Text and Image encoders from both CLIP and BLIP.

![](assets/use_cases/retrieval_from_image_and_text/table_multimodal_vit_embed_image_text.png)

In experiment 5, the rank order of the two ViT-based OpenCLIP models on the COCO dataset was inverted (from what it was in experiment 4), but they performed comparably well on both the COCO and Open Images V7 datasets. In the BLIP experiments (below), the BLIP models once again proved to be more effective; the largest model had an MRR score of 0.4953 on the COCO dataset - marginally (0.26%) better than the best OpenCLIP model - and 0.112 on Open Images V7 - 7.07% better than the best OpenCLIP model.

![](assets/use_cases/retrieval_from_image_and_text/table_multimodal_vit_embed_image_text_blip.png)

Expand All @@ -123,7 +123,7 @@ Now, let's put all our results side by side for comparison.

![](assets/use_cases/retrieval_from_image_and_text/table_all_scenarios_oiv7.png)

In both the COCO and Open Images V7 datasets, the BLIP and OpenCLIP models proved to be the most effective feature extractors. On the COCO dataset, the BLIP model performed about the same using only image embeddings as it did when using both image and caption embeddings. Indeed, in general, using both image and caption embeddings makes the highest-performing models perform only marginally better - regardless of whether the model embeds images or text. This may be because both the caption and the image refer to the same abstract concept, and these models do a similarly good job encoding it. The top Sentence Transformers models' MRR scores trailed by about 2%, but their inference speed was significantly faster. However, on the Open Images V7 dataset, the Sentence Transformers models' MRR scores lagged proportionally behind the other models by around 37%.

We should also **take into account the inference time and GPU demands** for each of our experiments. These metrics were gathered using an [RTX 3080 16 GB GPU](https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621), capable of 29.77 TFLOPS on FP32. When processing the merged COCO training and validation dataset, containing 103,429 data samples post-preprocessing, we noted the following **inference times and resource allocations**. It's important to note that GPU utilization was always maximized through parallelized data loading to ensure efficiency.
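Timing numbers like these can be gathered with a pattern along the following lines (a sketch, not the actual benchmarking harness used for our measurements):

```python
import time
import torch

def timed_inference(model, dataloader, device="cuda"):
    """Measure total embedding time over a dataset on the GPU."""
    model = model.to(device).eval()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for batch in dataloader:  # a DataLoader with num_workers > 0 keeps the GPU fed
            model(batch.to(device))
    torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
    return time.perf_counter() - start
```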
