Update retrieval_from_image_and_text.md
changed md for img caption
robertdhayanturner authored Feb 13, 2024
1 parent a828aad commit 03ce9e2
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions docs/use_cases/retrieval_from_image_and_text.md
@@ -21,8 +21,8 @@ COCO and Open Images V7 fulfill our essential dataset criteria; we can identify

Here's an example image from the COCO dataset, and below it, the human-written captions corresponding to the image's object set.

-![COCO dataset example image](assets/use_cases/retrieval_from_image_and_text/reference_image_COCO.png)
-_Example image from the_ [_COCO dataset_](https://cocodataset.org/#home).
+![COCO dataset example image](assets/use_cases/retrieval_from_image_and_text/reference_image_COCO.png)
+*Example image from the [COCO dataset](https://cocodataset.org/#home).*

```
A young boy standing in front of a computer keyboard.
@@ -90,8 +90,8 @@ Concatenating vectors from two unaligned vector spaces into one space - using the

In experiment 4, we look at the performance of models based on [Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2103.00020.pdf) (CLIP). CLIP models employ separate but jointly trained Text and Image encoders to create a single multimodal embedding space. Regardless of whether the embeddings in this space represent text or image, if they are semantically similar, they are positioned closer together.
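To make the shared embedding space concrete, here is a minimal sketch that embeds one image and a few candidate captions with a pretrained OpenClip model and ranks the captions by cosine similarity. It assumes `open_clip` is installed; the checkpoint name and local image path are illustrative, not necessarily the exact setup used in the experiment.

```python
# Minimal sketch: embed an image and candidate captions with a pretrained
# CLIP model, then rank the captions by cosine similarity in the shared space.
import torch
import open_clip
from PIL import Image

# Checkpoint is an assumption for illustration; any OpenClip leaderboard
# model with a matching pretrained tag loads the same way.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Local image path is hypothetical.
image = preprocess(Image.open("reference_image_COCO.png")).unsqueeze(0)
captions = [
    "A young boy standing in front of a computer keyboard.",
    "A dog catching a frisbee in a park.",
]
tokens = tokenizer(captions)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(tokens)

# Normalize so the dot product equals cosine similarity.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

similarity = image_emb @ text_emb.T  # shape: (1, num_captions)
print(similarity)
```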

-![](assets/use_cases/retrieval_from_image_and_text/clip.png)
-_CLIP's high level architecture, from_ [_"Learning Transferable Visual Models From Natural Language Supervision"_](https://arxiv.org/pdf/2103.00020.pdf)
+![](assets/use_cases/retrieval_from_image_and_text/clip.png)
+*CLIP's high level architecture, from ["Learning Transferable Visual Models From Natural Language Supervision"](https://arxiv.org/pdf/2103.00020.pdf)*

The structure of CLIP encoders (image above) makes them versatile and adaptable to various model architectures for embedding text or image data. In our experiment, we used pretrained models from the [OpenClip leaderboard](https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv), and applied the Image Encoder to embed the images. Then we evaluated the outcomes.
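As a rough illustration of that image-embedding step, the sketch below encodes a folder of images with a pretrained OpenClip image encoder and L2-normalizes the vectors so they are ready to be indexed for retrieval. The folder path and checkpoint are assumptions, not the exact configuration from the experiment.

```python
# Sketch: apply only the Image Encoder of a pretrained OpenClip model to a
# batch of images, producing normalized embeddings for a vector index.
from pathlib import Path

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

# Hypothetical local folder of dataset images.
image_paths = sorted(Path("coco_images").glob("*.jpg"))
batch = torch.stack(
    [preprocess(Image.open(p).convert("RGB")) for p in image_paths]
)

with torch.no_grad():
    embeddings = model.encode_image(batch)
embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)

print(embeddings.shape)  # (num_images, embedding_dim)
```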

