
Document fusion & multi-step retrieval for multi-modal RAG

Toy implementation of hybrid document fusion and multi-step retrieval for multi-modal RAG with image and text modalities. The pipeline works as follows (a minimal sketch of the flow is given after the list):

  1. Indexing: For each image, produce one or more representations (embeddings, captions, metadata). Two collections are used: one for the images and one for the image descriptions.
  2. Multi-stage retrieval:
    1. Retrieve: Get the top K images and the top K image descriptions, i.e. two DB queries.
    2. Fuse: Merge the candidate sets, normalize scores, and compute a fused score (weighted or RRF).
    3. Re-rank: Run a cross-modal re-ranker (e.g., a cross-encoder that takes query + image) over the top N fused candidates; the final ranking comes from the re-rank score alone or from the fused + re-rank scores combined.
  3. Generate: Pack the top M candidates (images + captions + metadata) into the context for the generator. If the generator is text-only, supply captions & metadata; if it is multi-modal, feed the images directly.
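
A minimal end-to-end skeleton of this flow, with every stage stubbed out; all function and type names here are illustrative, not the repo's actual API:

```python
# Illustrative skeleton of the retrieve -> fuse -> re-rank -> generate flow.
# Every function below is a stub; names are hypothetical, not the repo's API.
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    score: float

def retrieve_images(query: str, k: int) -> list[Candidate]:
    return []  # stub: vector search over the image collection

def retrieve_texts(query: str, k: int) -> list[Candidate]:
    return []  # stub: vector/BM25 search over the image-description collection

def fuse(a: list[Candidate], b: list[Candidate]) -> list[Candidate]:
    # stub: in practice, normalize scores and apply weighted or RRF fusion
    return sorted(a + b, key=lambda c: c.score, reverse=True)

def rerank(query: str, cands: list[Candidate]) -> list[Candidate]:
    return cands  # stub: cross-modal re-ranker over the top-N fused candidates

def retrieve_context(query: str, k: int = 10, n: int = 20, m: int = 5) -> list[Candidate]:
    fused = fuse(retrieve_images(query, k), retrieve_texts(query, k))
    return rerank(query, fused[:n])[:m]  # the top M go into the generator context
```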

There are various options for the retriever (a sketch of the visual dense retriever follows the list):

  • Visual dense retriever: CLIP (ViT-B/32 or ViT-L) embeddings stored in FAISS (HNSW/IVF+PQ for scale).
  • Text retriever: BM25 (Elastic) or vector text search (Chroma/FAISS) on captions/tags.
  • Optional geometric / local: ORB / SIFT + FLANN for near-duplicate or precise visual matches.
  • Cross-modal re-ranker: A model that takes query (image or text) and each candidate image and returns a relevance score. Off-the-shelf: CLIP-Score, BLIP-2 cross encoder, or a small transformer fine-tuned for ranking.
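
A minimal sketch of the visual dense retriever, assuming the sentence-transformers CLIP checkpoint and placeholder image paths; swap the flat index for HNSW or IVF+PQ at scale:

```python
# Sketch of a CLIP-in-FAISS visual dense retriever; the model name and the
# image paths are assumptions for illustration.
import faiss
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # CLIP ViT-B/32

image_paths = ["defect_001.jpg", "defect_002.jpg"]  # placeholder paths
img_emb = model.encode([Image.open(p) for p in image_paths])
img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)  # unit norm -> cosine

index = faiss.IndexFlatIP(img_emb.shape[1])  # exact search; use HNSW/IVF+PQ at scale
index.add(img_emb.astype(np.float32))

# CLIP is cross-modal, so a text query can search the image index directly.
q = model.encode(["orange-brown rust spots on a hull"])
q = q / np.linalg.norm(q, axis=1, keepdims=True)
scores, ids = index.search(q.astype(np.float32), 2)
```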

Score normalization & fusion

Different retrievers produce incompatible scores, so the results must be normalized before fusion. Normalization options (per query, over the candidate set; a sketch of all three follows the list):

  • Min-max: s' = (s - min)/(max-min)
  • Softmax: s' = exp(s)/sum(exp(s)) (makes scores comparable as probabilities)
  • Rank normalization: convert to 1/(rank + k) or scale ranks to [0,1]
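
A minimal sketch of the three normalizers, applied per query over one retriever's candidate scores:

```python
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    # s' = (s - min) / (max - min); constant scores map to zero.
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def softmax(scores: np.ndarray) -> np.ndarray:
    # s' = exp(s) / sum(exp(s)); shift by the max for numerical stability.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def rank_norm(scores: np.ndarray, k: float = 60.0) -> np.ndarray:
    # Convert scores to 1 / (rank + k), with rank 1 for the best candidate.
    ranks = scores.argsort()[::-1].argsort() + 1
    return 1.0 / (ranks + k)
```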

Similarly, several options exist to fuse the results:

  • Simple weighted fusion: fused = w_vis * vis_score_norm + w_txt * txt_score_norm
  • Reciprocal Rank Fusion (RRF) — robust, hyperparameter k (e.g., 60): RRF_score = sum(1 / (k + rank_i)) for each retriever i
  • Learned fusion: Train a small model (logistic regression, LightGBM) on features:
    • normalized scores, ranks, metadata features (age, popularity)
    • optionally cross-encoder score if available during training

For learned fusion, training labels can come from human relevance judgments or click data.
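
Both simple fusion variants fit in a few lines; inputs are score dicts or ranked id lists keyed by document id, and the weights and k are the usual knobs to tune (a sketch, not the repo's implementation):

```python
def weighted_fusion(vis: dict[str, float], txt: dict[str, float],
                    w_vis: float = 0.6, w_txt: float = 0.4) -> dict[str, float]:
    # fused = w_vis * vis_score_norm + w_txt * txt_score_norm;
    # inputs are already-normalized scores, missing ids contribute 0.
    ids = set(vis) | set(txt)
    return {i: w_vis * vis.get(i, 0.0) + w_txt * txt.get(i, 0.0) for i in ids}

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    # RRF_score = sum over retrievers of 1 / (k + rank_i), rank is 1-based.
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

# Example: fuse the rankings from the two DB queries.
fused = rrf([["img_3", "img_1", "img_7"], ["img_1", "img_9", "img_3"]])
top = sorted(fused, key=fused.get, reverse=True)
```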

Re-ranking

The re-ranker is a cross-encoder that consumes a (query, candidate) pair and outputs a high-quality relevance score. For images:

  • Use BLIP-2 / Flamingo / a cross-attention image+text model — fine-tune for relevance.
  • Or use CLIP similarity as a fast re-ranker if cross-encoder infrastructure is not available (see the sketch after this list).
  • Batch inference on GPU for speed.
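
A minimal sketch of the CLIP-similarity fallback, assuming the sentence-transformers CLIP checkpoint and illustrative candidate paths:

```python
# Fast CLIP-similarity re-ranker over the top-N fused candidates; the model
# name and candidate paths are assumptions for illustration.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

def clip_rerank(query: str, candidate_paths: list[str]) -> list[tuple[str, float]]:
    q = model.encode([query], normalize_embeddings=True)
    imgs = model.encode([Image.open(p) for p in candidate_paths],
                        normalize_embeddings=True, batch_size=32)  # batch on GPU if available
    sims = (imgs @ q.T).ravel()  # cosine similarity, embeddings are unit-norm
    order = np.argsort(-sims)    # best first
    return [(candidate_paths[i], float(sims[i])) for i in order]
```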

How to use

Install the requirements and start a ChromaDB node:

chroma run --path ./chromadb --host 0.0.0.0 --port 8003

Run the ingest_pipeline.py script to populate the database. This creates three collections (a short sketch of connecting to them from Python follows the list):

  • faqs_repo
  • defects_images
  • defects_texts
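
Once the node is up, the collections can be reached from Python roughly as follows (host/port match the command above; the query relies on the collection's embedding function, which the ingest script may configure differently):

```python
# Sketch of connecting to the Chroma node started above; the collection names
# match the ones the ingest script creates, the query text is just an example.
import chromadb

client = chromadb.HttpClient(host="localhost", port=8003)
images = client.get_or_create_collection("defects_images")
texts = client.get_or_create_collection("defects_texts")

results = texts.query(query_texts=["rust spots near the waterline"], n_results=5)
print(results["ids"], results["distances"])
```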

Start the Ollama server:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Assess the query classifier

Run the evaluate_query_classifier.py script. By default, it loads data/test/test_query_classifier.json and uses mistral as the LLM, together with the prompt in prompts/query_classifier/query_classifier.txt.

Assess the image retrieval

Run the evaluate_image_retrieval.py script. By default, it loads the test cases in data/test/test_image_retrieval.json; the test image files are in data/test/hull_defects_imgs. The following metrics have been implemented (sketches follow the list):

  • Precision@k: How many of the top k retrieved documents are relevant?
  • Recall@k: How many of all relevant documents were retrieved in the top k?
  • Mean Reciprocal Rank or MRR: Measures the rank position of the first relevant document.
  • Normalized Discounted Cumulative Gain or NDCG: Weighs relevance based on position in the ranked list.
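
For reference, the first three metrics reduce to a few lines each (a sketch; the repo's own implementations may differ):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top k (relevant is non-empty).
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(runs: list[list[str]], relevant_sets: list[set[str]]) -> float:
    # Mean over queries of 1 / rank of the first relevant document (0 if none found).
    total = 0.0
    for retrieved, relevant in zip(runs, relevant_sets):
        rank = next((i + 1 for i, d in enumerate(retrieved) if d in relevant), 0)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)
```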

Why corrosion precision may be low

  • Visual ambiguity: Corrosion can look like stains, paint blistering, or fouling; models may confuse it with osmosis/blistering or surface cracks.
  • Dataset imbalance: If there are fewer labeled corrosion images than other defect types, the embeddings won't represent corrosion well.
  • Embedding quality: The image encoder (e.g., CLIP, ViT) may not capture fine-grained texture cues (rust patterns, pitting, discoloration).
  • Indexing granularity: If chunks combine corrosion with surrounding non-defective areas, retrieval noise increases.

How to Improve Precision

  • Data improvements
    • Collect more diverse examples (different severities, colors, environments).
    • Use data augmentation: color jitter, contrast changes, texture overlays to mimic rust patterns.
  • Better negative sampling
    • During fine-tuning, ensure corrosion is contrasted against visually similar but distinct defects (e.g., blistering, algae stains).
  • Cross-modal reranking
    • Use a cross-encoder stage (e.g., CLIP cross-encoder or ViLT) to rerank the top-20 candidates; this boosts precision@5.
  • Modality fusion
    • If the user provides text (e.g., “orange-brown rust spots”), fuse text + image retrieval. This helps disambiguate corrosion vs. blistering.
  • Fine-grained classification after retrieval
    • After retrieval, pass candidates to a corrosion-vs-non-corrosion classifier; it acts as a filter to clean up the top-k results (see the sketch after this list).
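
A minimal sketch of that post-retrieval filter as a linear probe over image embeddings; the training arrays below are random placeholders standing in for labeled corrosion / non-corrosion embeddings:

```python
# Post-retrieval corrosion filter sketch: a logistic-regression probe over
# image embeddings. The training data here is a random placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 512))    # placeholder: CLIP embeddings of labeled crops
y_train = rng.integers(0, 2, size=200)   # placeholder: 1 = corrosion, 0 = other defect

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def filter_corrosion(candidate_ids: list[str], embeddings: np.ndarray,
                     threshold: float = 0.5) -> list[str]:
    # Keep only the candidates the probe judges likely to be corrosion.
    proba = clf.predict_proba(embeddings)[:, 1]
    return [c for c, p in zip(candidate_ids, proba) if p >= threshold]
```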

