Update retrieval_from_image_and_text.md
just awaiting an update from Mór on all scenarios images...
robertdhayanturner authored Feb 13, 2024
1 parent 383f465 commit 28a8e0b
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/use_cases/retrieval_from_image_and_text.md
@@ -62,7 +62,7 @@ But do these outcome patterns hold true for the more diverse Open Images V7 data

BLIP-generated captions proved more effective than the human-annotated Localized Narratives on the Open Images V7 dataset. The "all-distilroberta-v1", "bge-large-en-v1.5", and "e5-large-v2" models maintained their relative performance order, but the "all-mpnet-v2" model did better, with the top MRR score: 0.0706. Overall, all the models performed comparably on generated captions, with slight variations attributable to the Approximate Nearest Neighbor Search using [FAISS](https://github.com/facebookresearch/faiss).
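
To make this kind of evaluation concrete, here is a minimal, illustrative sketch: embed captions with a Sentence Transformers model, retrieve with a FAISS index, and score with Mean Reciprocal Rank (MRR). The model name, toy data, and exact (non-approximate) flat index are stand-ins, not the actual experimental configuration.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Any of the Sentence Transformers models discussed above could be dropped in here.
model = SentenceTransformer("all-distilroberta-v1")

captions = ["a dog catching a frisbee in a park",
            "a red double-decker bus on a city street"]   # stand-in "dataset"
queries = ["dog playing with a frisbee",
           "bus driving downtown"]                        # query i should retrieve caption i

cap_emb = model.encode(captions, normalize_embeddings=True).astype("float32")
qry_emb = model.encode(queries, normalize_embeddings=True).astype("float32")

# Inner product on L2-normalized vectors equals cosine similarity.
# (The experiments used approximate search; an exact flat index keeps the sketch simple.)
index = faiss.IndexFlatIP(cap_emb.shape[1])
index.add(cap_emb)

k = min(10, len(captions))
_, ranked = index.search(qry_emb, k)

# MRR: mean of 1 / (rank of the correct caption), 0 if it isn't in the top k.
reciprocal_ranks = []
for i, hits in enumerate(ranked):
    pos = np.where(hits == i)[0]
    reciprocal_ranks.append(1.0 / (pos[0] + 1) if len(pos) else 0.0)
print("MRR:", float(np.mean(reciprocal_ranks)))
```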

When we used [LLaVA](https://arxiv.org/pdf/2304.08485.pdf) 1.5 to generate detailed descriptions for each image, the model tended to hallucinate non-existent objects at least 50% of the time. Performance improved when we prompted for detailed descriptions of only those elements LLaVA 1.5 was confident about, but the model's short, one-to-two-sentence output was no better than BLIP's. We also looked at GPT-4, which performed well for all tested images. But GPT-4's current API limit means that it would take an estimated 2 weeks to re-caption an entire dataset, making it impractical.
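
For reference, caption generation with BLIP itself is only a few lines using Hugging Face transformers. This is an illustrative sketch rather than the exact code behind these experiments; the checkpoint name, image path, and generation settings are assumptions.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"   # assumed checkpoint, not necessarily the one used
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("example.jpg").convert("RGB")       # placeholder image path
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```

A similar loop, with a LLaVA 1.5 checkpoint swapped in and a prompt along the lines of "describe only the objects you are certain are present", is one way to reproduce the constrained-prompting variant described above.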

In sum, the Sentence Transformers models performed consistently across the diverse datasets in our first experiment. In addition, generating captions with BLIP seems to be a viable option, especially when the captions provide a detailed description of each image. However, in use cases where descriptions should capture the overall concept and such fine granularity isn't necessary, BLIP-generated captions may unnecessarily reduce the system's retrieval capabilities.

@@ -123,7 +123,7 @@ Now, let's put all our results side by side for comparison.

![](assets/use_cases/retrieval_from_image_and_text/table_all_scenarios_oiv7.png)

In both the COCO and Open Images V7 datasets, the BLIP and OpenCLIP models proved to be the most effective feature extractors. On the COCO dataset, the BLIP model performed about the same using only image embeddings as it did using both image and caption embeddings. Indeed, in general, using both image and caption embeddings makes the highest-performing models perform only marginally better, regardless of whether the model embeds images or text. This may be because both the caption and the image refer to the same abstract concept, and these models do a similarly good job of encoding it. The top Sentence Transformers models' MRR scores trailed by about 2%, but their inference speed was significantly faster. However, on the Open Images V7 dataset, the Sentence Transformers models' MRR scores proportionally lagged behind the other models' by around 37%.
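
The write-up doesn't spell out how image and caption embeddings were combined, so the following is only one plausible fusion scheme, assumed for illustration: average the L2-normalized embeddings from the two modalities and re-normalize before nearest-neighbor search.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse(image_emb: np.ndarray, caption_emb: np.ndarray) -> np.ndarray:
    """Average normalized image and caption embeddings, then re-normalize,
    so inner-product search still behaves like cosine similarity."""
    fused = (l2_normalize(image_emb) + l2_normalize(caption_emb)) / 2.0
    return l2_normalize(fused)

# Toy stand-ins for real model outputs (e.g., image and text vectors of matching dimension).
image_emb = np.random.rand(4, 512).astype("float32")
caption_emb = np.random.rand(4, 512).astype("float32")
print(fuse(image_emb, caption_emb).shape)  # (4, 512)
```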

We should also **take into account the inference time and GPU demands** for each of our experiments. These metrics were gathered using an [RTX 3080 16 GB GPU](https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621), capable of 29.77 TFLOPS on FP32. When processing the merged COCO training and validation dataset, containing 103,429 data samples post-preprocessing, we noted the following **inference times and resource allocations**. It's important to note that GPU utilization was always maximized through parallelized data loading to ensure efficiency.
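
As a rough idea of how such numbers can be collected, here is a hypothetical timing harness assuming a PyTorch model and dataset; it is not the benchmarking code used for these measurements. Parallel data loading (`num_workers`) keeps the GPU fed, and `torch.cuda.synchronize()` keeps the wall-clock timing honest.

```python
import time
import torch
from torch.utils.data import DataLoader

def measure_inference(model, dataset, batch_size=64, num_workers=8):
    """Return (elapsed seconds, peak GPU memory in GB) for one pass over the dataset.
    Assumes each batch is a tensor the model can consume directly."""
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, pin_memory=True)
    model = model.eval().cuda()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for batch in loader:
            model(batch.cuda(non_blocking=True))
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return elapsed, peak_mem_gb
```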

