
Document fusion & multi-step retrieval for multi-modal RAG

Toy implementation of hybrid document fusion and multi-step retrieval for multi-modal RAG with image and text modalities. The pipeline works as follows (a minimal sketch of the flow is given after the list):

  1. Indexing: For each image, produce one or more representations (embeddings, captions, metadata). Two collections are used: one for the images and one for the image descriptions.
  2. Multi-stage retrieval:
    1. Retrieve: Get the top K images and the top K image descriptions, i.e. two DB queries.
    2. Fuse: Merge the candidate sets, normalize scores, and compute a fused score (weighted or RRF).
    3. Re-rank: Run a cross-modal re-ranker (e.g., a cross-encoder that takes query + image) over the top N fused candidates; the final ranking comes from the re-rank score alone or from the fused + re-rank scores combined.
  3. Generate: Pack the top M candidates (images + captions + metadata) into the context for the generator. If the generator is text-only, supply captions & metadata; if it is multi-modal, feed the images directly.
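
A minimal end-to-end skeleton of this flow, with every stage stubbed out; all function and type names here are illustrative, not the repo's actual API:

```python
# Illustrative skeleton of the retrieve -> fuse -> re-rank -> generate flow.
# Every function below is a stub; names are hypothetical, not the repo's API.
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    score: float

def retrieve_images(query: str, k: int) -> list[Candidate]:
    return []  # stub: vector search over the image collection

def retrieve_texts(query: str, k: int) -> list[Candidate]:
    return []  # stub: vector/BM25 search over the image-description collection

def fuse(a: list[Candidate], b: list[Candidate]) -> list[Candidate]:
    # stub: in practice, normalize scores and apply weighted or RRF fusion
    return sorted(a + b, key=lambda c: c.score, reverse=True)

def rerank(query: str, cands: list[Candidate]) -> list[Candidate]:
    return cands  # stub: cross-modal re-ranker over the top-N fused candidates

def retrieve_context(query: str, k: int = 10, n: int = 20, m: int = 5) -> list[Candidate]:
    fused = fuse(retrieve_images(query, k), retrieve_texts(query, k))
    return rerank(query, fused[:n])[:m]  # the top M go into the generator context
```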

There are various options for the retriever (a sketch of the visual dense retriever follows the list):

  • Visual dense retriever: CLIP (ViT-B/32 or ViT-L) embeddings stored in FAISS (HNSW/IVF+PQ for scale).
  • Text retriever: BM25 (Elastic) or vector text search (Chroma/FAISS) on captions/tags.
  • Optional geometric / local: ORB / SIFT + FLANN for near-duplicate or precise visual matches.
  • Cross-modal re-ranker: A model that takes query (image or text) and each candidate image and returns a relevance score. Off-the-shelf: CLIP-Score, BLIP-2 cross encoder, or a small transformer fine-tuned for ranking.
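
A minimal sketch of the visual dense retriever, assuming the sentence-transformers CLIP checkpoint and placeholder image paths; swap the flat index for HNSW or IVF+PQ at scale:

```python
# Sketch of a CLIP-in-FAISS visual dense retriever; the model name and the
# image paths are assumptions for illustration.
import faiss
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # CLIP ViT-B/32

image_paths = ["defect_001.jpg", "defect_002.jpg"]  # placeholder paths
img_emb = model.encode([Image.open(p) for p in image_paths])
img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)  # unit norm -> cosine

index = faiss.IndexFlatIP(img_emb.shape[1])  # exact search; use HNSW/IVF+PQ at scale
index.add(img_emb.astype(np.float32))

# CLIP is cross-modal, so a text query can search the image index directly.
q = model.encode(["orange-brown rust spots on a hull"])
q = q / np.linalg.norm(q, axis=1, keepdims=True)
scores, ids = index.search(q.astype(np.float32), 2)
```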

Score normalization & fusion

Different retrievers produce incompatible scores, so the results must be normalized before fusion. Normalization options (per query, over the candidate set; a sketch of all three follows the list):

  • Min-max: s' = (s - min)/(max-min)
  • Softmax: s' = exp(s)/sum(exp(s)) (makes scores comparable as probabilities)
  • Rank normalization: convert to 1/(rank + k) or scale ranks to [0,1]
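
A minimal sketch of the three normalizers, applied per query over one retriever's candidate scores:

```python
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    # s' = (s - min) / (max - min); constant scores map to zero.
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def softmax(scores: np.ndarray) -> np.ndarray:
    # s' = exp(s) / sum(exp(s)); shift by the max for numerical stability.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def rank_norm(scores: np.ndarray, k: float = 60.0) -> np.ndarray:
    # Convert scores to 1 / (rank + k), with rank 1 for the best candidate.
    ranks = scores.argsort()[::-1].argsort() + 1
    return 1.0 / (ranks + k)
```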

Similarly, several options exist to fuse the results:

  • Simple weighted fusion: fused = w_vis * vis_score_norm + w_txt * txt_score_norm
  • Reciprocal Rank Fusion (RRF) — robust, hyperparameter k (e.g., 60): RRF_score = sum(1 / (k + rank_i)) for each retriever i
  • Learned fusion: Train a small model (logistic regression, LightGBM) on features:
    • normalized scores, ranks, metadata features (age, popularity)
    • optionally cross-encoder score if available during training

For learned fusion, training labels can come from human relevance judgments or click data.
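
Both simple fusion variants fit in a few lines; inputs are score dicts or ranked id lists keyed by document id, and the weights and k are the usual knobs to tune (a sketch, not the repo's implementation):

```python
def weighted_fusion(vis: dict[str, float], txt: dict[str, float],
                    w_vis: float = 0.6, w_txt: float = 0.4) -> dict[str, float]:
    # fused = w_vis * vis_score_norm + w_txt * txt_score_norm;
    # inputs are already-normalized scores, missing ids contribute 0.
    ids = set(vis) | set(txt)
    return {i: w_vis * vis.get(i, 0.0) + w_txt * txt.get(i, 0.0) for i in ids}

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    # RRF_score = sum over retrievers of 1 / (k + rank_i), rank is 1-based.
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

# Example: fuse the rankings from the two DB queries.
fused = rrf([["img_3", "img_1", "img_7"], ["img_1", "img_9", "img_3"]])
top = sorted(fused, key=fused.get, reverse=True)
```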

Re-ranking

The re-ranker is a cross-encoder that consumes a (query, candidate) pair and outputs a high-quality relevance score. For images:

  • Use BLIP-2 / Flamingo / a cross-attention image+text model — fine-tune for relevance.
  • Or use CLIP similarity as a fast re-ranker if cross-encoder infrastructure is not available (see the sketch after this list).
  • Batch inference on GPU for speed.
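
A minimal sketch of the CLIP-similarity fallback, assuming the sentence-transformers CLIP checkpoint and illustrative candidate paths:

```python
# Fast CLIP-similarity re-ranker over the top-N fused candidates; the model
# name and candidate paths are assumptions for illustration.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

def clip_rerank(query: str, candidate_paths: list[str]) -> list[tuple[str, float]]:
    q = model.encode([query], normalize_embeddings=True)
    imgs = model.encode([Image.open(p) for p in candidate_paths],
                        normalize_embeddings=True, batch_size=32)  # batch on GPU if available
    sims = (imgs @ q.T).ravel()  # cosine similarity, embeddings are unit-norm
    order = np.argsort(-sims)    # best first
    return [(candidate_paths[i], float(sims[i])) for i in order]
```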

How to use

Install the requirements and start a ChromaDB node:

chroma run --path ./chromadb --host 0.0.0.0 --port 8003

Run the ingest_pipeline.py script to populate the database. This creates three collections (a short sketch of connecting to them from Python follows the list):

  • faqs_repo
  • defects_images
  • defects_texts
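
Once the node is up, the collections can be reached from Python roughly as follows (host/port match the command above; the query relies on the collection's embedding function, which the ingest script may configure differently):

```python
# Sketch of connecting to the Chroma node started above; the collection names
# match the ones the ingest script creates, the query text is just an example.
import chromadb

client = chromadb.HttpClient(host="localhost", port=8003)
images = client.get_or_create_collection("defects_images")
texts = client.get_or_create_collection("defects_texts")

results = texts.query(query_texts=["rust spots near the waterline"], n_results=5)
print(results["ids"], results["distances"])
```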

Start the Ollama server:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Assess the query classifier

Run the evaluate_query_classifier.py script. By default, it loads data/test/test_query_classifier.json and uses mistral as the LLM, together with the prompt in prompts/query_classifier/query_classifier.txt.

Assess the image retrieval

Run the evaluate_image_retrieval.py script. By default, it loads the test cases in data/test/test_image_retrieval.json; the test image files are in data/test/hull_defects_imgs. The following metrics have been implemented (sketches follow the list):

  • Precision@k: How many of the top k retrieved documents are relevant?
  • Recall@k: How many of all relevant documents were retrieved in the top k?
  • Mean Reciprocal Rank or MRR: Measures the rank position of the first relevant document.
  • Normalized Discounted Cumulative Gain or NDCG: Weighs relevance based on position in the ranked list.
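
For reference, the first three metrics reduce to a few lines each (a sketch; the repo's own implementations may differ):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top k (relevant is non-empty).
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(runs: list[list[str]], relevant_sets: list[set[str]]) -> float:
    # Mean over queries of 1 / rank of the first relevant document (0 if none found).
    total = 0.0
    for retrieved, relevant in zip(runs, relevant_sets):
        rank = next((i + 1 for i, d in enumerate(retrieved) if d in relevant), 0)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)
```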

Why corrosion precision may be low

  • Visual ambiguity: Corrosion can look like stains, paint blistering, or fouling; models may confuse it with osmosis/blistering or surface cracks.
  • Dataset imbalance: If there are fewer labeled corrosion images than other defect types, the embeddings won't represent corrosion well.
  • Embedding quality: The image encoder (e.g., CLIP, ViT) may not capture fine-grained texture cues (rust patterns, pitting, discoloration).
  • Indexing granularity: If chunks combine corrosion with surrounding non-defective areas, retrieval noise increases.

How to Improve Precision

  • Data improvements
    • Collect more diverse examples (different severities, colors, environments).
    • Use data augmentation: color jitter, contrast changes, texture overlays to mimic rust patterns.
  • Better negative sampling
    • During fine-tuning, ensure corrosion is contrasted against visually similar but distinct defects (e.g., blistering, algae stains).
  • Cross-modal reranking
    • Use a cross-encoder stage (e.g., CLIP cross-encoder or ViLT) to rerank the top-20 candidates; this boosts precision@5.
  • Modality fusion
    • If the user provides text (e.g., “orange-brown rust spots”), fuse text + image retrieval. This helps disambiguate corrosion vs. blistering.
  • Fine-grained classification after retrieval
    • After retrieval, pass candidates to a corrosion-vs-non-corrosion classifier; it acts as a filter to clean up the top-k results (see the sketch after this list).
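
A minimal sketch of that post-retrieval filter as a linear probe over image embeddings; the training arrays below are random placeholders standing in for labeled corrosion / non-corrosion embeddings:

```python
# Post-retrieval corrosion filter sketch: a logistic-regression probe over
# image embeddings. The training data here is a random placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 512))    # placeholder: CLIP embeddings of labeled crops
y_train = rng.integers(0, 2, size=200)   # placeholder: 1 = corrosion, 0 = other defect

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def filter_corrosion(candidate_ids: list[str], embeddings: np.ndarray,
                     threshold: float = 0.5) -> list[str]:
    # Keep only the candidates the probe judges likely to be corrosion.
    proba = clf.predict_proba(embeddings)[:, 1]
    return [c for c, p in zip(candidate_ids, proba) if p >= threshold]
```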

