Add multi-modal use case section (#8823)

# Multi-modal

LlamaIndex offers capabilities to build not only language-based applications, but also **multi-modal** applications that combine language and images.

## Types of Multi-modal Use Cases

This space is still being actively explored, but some fascinating use cases are already popping up.

### Multi-Modal RAG

All of the core RAG concepts (indexing, retrieval, and synthesis) can be extended to the image setting, as sketched in the example after this list:

- The input can be text or images.
- The stored knowledge base can consist of text or images.
- The inputs to response generation can be text or images.
- The final response can be text or images.
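
Here is a minimal sketch of that flow, based on the patterns in the guides linked below. The import paths and class names (`MultiModalVectorStoreIndex`, `OpenAIMultiModal`) follow the LlamaIndex version current at the time of this commit and may differ in later releases; `./mixed_corpus` and the example question are placeholders, not from the original docs.

```python
# A minimal multi-modal RAG sketch, assuming "./mixed_corpus" holds both
# text files and images (placeholder path). Import paths reflect the
# LlamaIndex version at the time of this commit.
from llama_index import SimpleDirectoryReader
from llama_index.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Indexing: load text and image documents together; the index keeps
# separate vector stores for text and image embeddings.
documents = SimpleDirectoryReader("./mixed_corpus").load_data()
index = MultiModalVectorStoreIndex.from_documents(documents)

# Retrieval: fetch both relevant text chunks and relevant images.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
nodes = retriever.retrieve("What does the revenue chart show?")

# Synthesis: answer with a multi-modal LLM that can read the retrieved images.
query_engine = index.as_query_engine(
    multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview")
)
print(query_engine.query("What does the revenue chart show?"))
```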

Check out our guides below:

```{toctree}
---
maxdepth: 1
---
/examples/multi_modal/gpt4v_multi_modal_retrieval.ipynb
[Old] Multi-modal retrieval with CLIP </examples/multi_modal/multi_modal_retrieval.ipynb>
```

### Retrieval-Augmented Image Captioning

Oftentimes, understanding an image requires looking up information from a knowledge base. One flow here is retrieval-augmented image captioning: first caption the image with a multi-modal model, then refine the caption by retrieving from a text corpus, as sketched below.
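
The snippet below is a rough sketch of that two-step flow. The linked notebook uses LLaVA served via Replicate; GPT-4V stands in here for illustration, the file paths are placeholders, and the API names reflect the LlamaIndex version at the time of this commit.

```python
# A sketch of retrieval-augmented image captioning, assuming a local image
# "./figures/chart.png" and a text corpus in "./text_corpus" (placeholders).
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.schema import ImageDocument

# Step 1: draft a caption for the image with a multi-modal model.
mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview")
caption = mm_llm.complete(
    prompt="Describe this image in detail.",
    image_documents=[ImageDocument(image_path="./figures/chart.png")],
)

# Step 2: refine the draft caption with facts retrieved from a text corpus.
text_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./text_corpus").load_data()
)
response = text_index.as_query_engine().query(
    f"Using the retrieved context, refine this image caption: {caption}"
)
print(response)
```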

Check out our guides below:

```{toctree}
---
maxdepth: 1
---
/examples/multi_modal/llava_multi_modal_tesla_10q.ipynb
```