🧬 ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos

ViDRiP-LLaVA is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models.

🧠 Introducing our ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction. 🔬📽️

Our method leverages chain-of-thought (CoT) prompting to distill the reasoning capabilities of LLMs. ViDRiP-LLaVA generates both detailed histological descriptions and final diagnoses, simulating how pathologists analyze and sign out cases.

📚 Trained on 4,278 instructional video pairs

⚙️ Combines single-image + clip transfer and fine-tuning on segmented diagnostic videos

📚 Video Datasets

🎥 Released Video Format

All clips are:

Cleaned using a Visual Data Refinement pipeline (temporal trimming + YoloPath filtering + OCR exclusion + inpainting)
Downsampled to 1–5 FPS to reduce file size and support fair-use compliance
Muted to avoid redistribution of original YouTube audio

These steps preserve diagnostic signal while respecting the rights of YouTube creators and complying with YouTube’s Terms of Service.

🔍 Training vs. Public Release Notice

The ViDRiP-LLaVA models were trained on an internal dataset version that included:

Full-frame-rate video clips
Visual content prior to OCR filtering

All evaluations (including those in our benchmark) are conducted using the publicly released test set, ensuring full reproducibility.

🔹 ViDRiP_Instruct_Train

The videos data is ~ 60 GB:

🔹 ViDRiP_Instruct_Train_Video_Hugging Face (There is 6 zip files)

4,000+ instruction-style samples
Each sample includes:
- A pathology video clip
- A diagnostic question
- A multi-turn reasoning answer
Format: JSON + MP4
Croissant-compliant metadata for structured use

🔹 ViDRiP_Instruct_Test

🔹 ViDRiP_Instruct_Test_Video

Held-out test set of diagnostic Q&A pairs
Used for benchmarking reasoning performance

📚 Image Datasets

We use publicly available datasets: Quilt-LLaVA and PathAsst. Please refer to their respective repositories for download instructions.

Quilt-LLaVA: A vision-language dataset for pathology adapted from LLaVA.
PathAsst: A generative assistant for pathology with curated image-text pairs.

🤖 Models

🔸 ViDRiP_LLaVA_video

Vision-language model for video-based diagnostic reasoning
Trained on ViDRiP_Instruct_Train
Suitable for:
- Medical VQA
- Instructional explanation generation
- Educational pathology summarization

🔸 ViDRiP_LLaVA_image

Vision-language model for patch-based diagnostic prompts
Useful for pathology captioning and single-frame inference

🚀 Quickstart

🔧 Fine-tuning the model on video dataset

./scripts/train/finetune_ov_video.sh

🪄 Fine-tuning with LoRA

./scripts/train/finetune_ov_video_lora.sh

🔗 Merge LoRA weights

./scripts/train/merge_lora_weights.py

🧪 Usage / Demo

./doc/ViDRiP_LLaVA_trial.py

🔧 Evaluate on our video dataset

We use lmms_eval to evaluate the performance of video diagnostic reasoning.

To benchmark ViDRiP-LLaVA and compare it with other models:

Clone the lmms_eval repo
Copy our evaluation task folder into it:

cp -r lmms_eval/tasks/ViDRiP_Instruct_Test /path/to/lmms_eval/tasks/

You can then run evaluation using the standard lmms_eval CLI interface.

Citation:

Coming soon

📄 Usage and License Notices

ViDRiP-LLaVA (Vision-language Diagnostic Reasoning in Pathology), including its dataset, code, and model checkpoints, is released strictly for non-commercial research purposes only.

📁 Licenses

Dataset: Licensed under CC BY-NC-ND 3.0 (Attribution–NonCommercial–NoDerivatives)
Code and pretrained models: Licensed under CC BY-NC 3.0 (Attribution–NonCommercial)

⚙️ Dependencies and Components

This project may incorporate or build upon resources such as LLaVA-Next, QUILT-1M, LLaMA, PathAsst, and GPT-4, each subject to their own licenses and Terms of Use.

🎥 Source Acknowledgment

ViDRiP-LLaVA includes data derived from public educational pathology videos hosted on YouTube. All content usage complies with YouTube’s Terms of Service, and the intellectual property rights of the original pathologist creators are fully acknowledged and respected.

🚫 Restrictions

Not for commercial use
Not to be used in clinical care or medical decision-making
For academic research, development, and evaluation only

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
assets		assets
docs		docs
llava		llava
lmms_eval/tasks/VidDiag_Instruct_Test		lmms_eval/tasks/VidDiag_Instruct_Test
playground		playground
scripts		scripts
trl		trl
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
cog.yaml		cog.yaml
launch.py		launch.py
predict.py		predict.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🧬 ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos

📚 Video Datasets

🎥 Released Video Format

🔍 Training vs. Public Release Notice

🔹 ViDRiP_Instruct_Train

🔹 ViDRiP_Instruct_Train_Video_Hugging Face (There is 6 zip files)

🔹 ViDRiP_Instruct_Test

🔹 ViDRiP_Instruct_Test_Video

📚 Image Datasets

🤖 Models

🔸 ViDRiP_LLaVA_video

🔸 ViDRiP_LLaVA_image

🚀 Quickstart

🔧 Fine-tuning the model on video dataset

🪄 Fine-tuning with LoRA

🧪 Usage / Demo

🔧 Evaluate on our video dataset

Citation:

📄 Usage and License Notices

📁 Licenses

⚙️ Dependencies and Components

🎥 Source Acknowledgment

🚫 Restrictions

About

Uh oh!

Releases

Packages

Languages

License

QuIIL/ViDRiP-LLaVA

Folders and files

Latest commit

History

Repository files navigation

🧬 ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos

📚 Video Datasets

🎥 Released Video Format

🔍 Training vs. Public Release Notice

🔹 ViDRiP_Instruct_Train

🔹 ViDRiP_Instruct_Train_Video_Hugging Face (There is 6 zip files)

🔹 ViDRiP_Instruct_Test

🔹 ViDRiP_Instruct_Test_Video

📚 Image Datasets

🤖 Models

🔸 ViDRiP_LLaVA_video

🔸 ViDRiP_LLaVA_image

🚀 Quickstart

🔧 Fine-tuning the model on video dataset

🪄 Fine-tuning with LoRA

🧪 Usage / Demo

🔧 Evaluate on our video dataset

Citation:

📄 Usage and License Notices

📁 Licenses

⚙️ Dependencies and Components

🎥 Source Acknowledgment

🚫 Restrictions

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages