From 1038f630b3973636056b15ac12e7a79170a94204 Mon Sep 17 00:00:00 2001
From: robertturner <143536791+robertdhayanturner@users.noreply.github.com>
Date: Tue, 13 Feb 2024 13:21:03 -0500
Subject: [PATCH] Update retrieval_from_image_and_text.md

trying different spacing between images and captions
---
 docs/use_cases/retrieval_from_image_and_text.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/docs/use_cases/retrieval_from_image_and_text.md b/docs/use_cases/retrieval_from_image_and_text.md
index 16e4095f6..38bc6d970 100644
--- a/docs/use_cases/retrieval_from_image_and_text.md
+++ b/docs/use_cases/retrieval_from_image_and_text.md
@@ -21,8 +21,7 @@ COCO and Open Images V7 fulfill our essential dataset criteria; we can identify
 
 Here's an example image from the COCO dataset, and below it, the human-written captions corresponding to the image's object set.
 
-![COCO dataset example image](assets/use_cases/retrieval_from_image_and_text/reference_image_COCO.png)
-
+![COCO dataset example image](assets/use_cases/retrieval_from_image_and_text/reference_image_COCO.png)
 _Example image from the_ [_COCO dataset_](https://cocodataset.org/#home).
 
 ```
@@ -91,8 +90,7 @@ Concatenating vectors from two unaligned vector spaces into one space - using th
 
 In experiment 4, we look at the performance of models based on [Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2103.00020.pdf) (CLIP). CLIP models employ separate but jointly trained Text and Image encoders to create a single multimodal embedding space. Regardless of whether the embeddings in this space represent text or image, if they are semantically similar, they are positioned closer together.
 
-![](assets/use_cases/retrieval_from_image_and_text/clip.png)
-
+![](assets/use_cases/retrieval_from_image_and_text/clip.png)
 _CLIP's high level architecture, from_ [_"Learning Transferable Visual Models From Natural Language Supervision"_](https://arxiv.org/pdf/2103.00020.pdf)
 
 The structure of CLIP encoders (image above) makes them versatile and adaptable to various model architectures for embedding text or image data. In our experiment, we used pretrained models from the [OpenClip leaderboard](https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv), and applied the Image Encoder to embed the images. Then we evaluated the outcomes.