A PyTorch-powered library for accelerating multimodal AI research with Vision-Language Models
A flexible, all-in-one PyTorch library that streamlines research and development with state-of-the-art vision-language models. Whether you’re experimenting with soft-prompt tuning (e.g., CoOp, CoCoOp) or large-scale models such as CLIP, this toolbox provides a robust foundation built on PyTorch and Hugging Face Transformers.
- Key Features
- Supported Models
- Quick Start
- Usage
- Jupyter Notebooks
- Installation
- Acknowledgments
- Contributing
- License
| Feature | Description |
|---|---|
| Multimodal Datasets | Supports ImageNet1k, CIFAR-100, Stanford Cars, iNaturalist 2021, MSCOCO Captions, and more. |
| Model Flexibility | Works with CLIP (ViT & ResNet), DINO-V2, MiniLM, MPNet, and allows adding custom models. |
| Custom Objectives/Tasks | Quickly add new tasks or losses with minimal code changes across combined vision-language flows. |
| Prompt Tuning | Supports soft prompts (CoOp, CoCoOp) and predefined hard prompts for dataset adaptation. |
| Scalability & Precision | Supports multi-GPU training, mixed precision (FP16, BF16, FP32, FP64), sharding, and DeepSpeed. |
| Sampling Strategies | Includes oversampling, undersampling, and hybrid methods such as SMOTE, ADASYN, and Tomek Links. |
| Data Augmentation | Provides image and text augmentations for model training. |
| Evaluation Metrics | Tracks accuracy, precision, recall, F1-score, AUC-ROC, and more. |
| Logging & Visualization | Supports TensorBoard and Loguru for monitoring and debugging. |
| Flexible API | Pre-built modules and functionalities for datasets, models, tasks, setups, and more. |
| Backbone | Supported Provider(s) | Modality |
|---|---|---|
| CLIP-ViT-B/32 | OpenAI, Hugging Face | Multimodal |
| CLIP-ViT-B/16 | OpenAI, Hugging Face | Multimodal |
| CLIP-ViT-L/14 | OpenAI, Hugging Face | Multimodal |
| CLIP-ViT-L/14-336 | OpenAI, Hugging Face | Multimodal |
| CLIP-RN50 | OpenAI | Multimodal |
| CLIP-RN101 | OpenAI | Multimodal |
| CLIP-RN50x4 | OpenAI | Multimodal |
| CLIP-RN50x16 | OpenAI | Multimodal |
| CLIP-RN50x64 | OpenAI | Multimodal |
| DINO-V2-GIANT | Hugging Face | Image |
| ALL-MiniLM-L6-v2 | Hugging Face | Text |
| ALL-MPNET-BASE-V2 | Hugging Face | Text |
Fine-tuning a CLIP model on ImageNet is as simple as:
```bash
python vlm_toolbox/scripts/train.py \
    --dataset_name imagenet1k \
    --backbone_name vit_b_32 \
    --trainer_name clip \
    --model_type few_shot \
    --setup_type full \
    --num_epochs 100 \
    --train_batch_size 64 \
    --eval_batch_size 256 \
    --precision_dtype fp16 \
    --source huggingface \
    --main_metric_name accuracy \
    --random_state 42 \
    --device_type cuda \
    --collate_all_m2_samples False \
    --save_predictions True
```
This command uses a ViT-B/32 CLIP model from Hugging Face, automatically logs progress, and stores prediction outputs for later review.
You can also import this toolbox as a library for more advanced or custom experimentation. Here’s a minimal code example illustrating how to set up a multimodal pipeline:
```python
from config.enums import (
    CLIPBackbones,
    ImageDatasets,
    Trainers,
    Sources,
    Metrics,
    Stages,
)
from pipeline.pipeline import Pipeline
from config.setup import Setup
from util.memory import flush

# 1. Define your setup
setup = Setup(
    dataset_name=ImageDatasets.IMAGENET_1K,
    backbone_name=CLIPBackbones.CLIP_VIT_B_32,
    trainer_name=Trainers.CLIP,
    model_type='few_shot',
    setup_type='full',
    num_epochs=100,
    train_batch_size=64,
    eval_batch_size=256,
    precision_dtype='fp16',
    main_metric_name=Metrics.ACCURACY,
    random_state=42,
    device_type='cuda',
)

# 2. Initialize the pipeline
pipeline = Pipeline(setup, device_type='cuda')

# 3. Run the training
pipeline.run(
    collate_all_m2_samples=False,
    save_predictions=True,
    persist=True,
)

# 4. Clean up
pipeline.tear_down()
flush()
```
Note: The toolbox treats multiple data inputs as two modalities, `m1` and `m2`. This modular design makes it easy to extend support for text, image, video, or other data streams.
One key strength of this repository is its extensibility. Integrating your own model is straightforward:
- **Add Your Model to an Enum**

  Extend `ImageBackbones` or `CLIPBackbones` in `enums.py`:

  ```python
  class ImageBackbones(BaseEnum):
      DINO_V2_GIANT = 'dino_v2_giant'
      NEW_IMAGE_MODEL = 'new_image_model'
  ```

- **Specify the Model URL**

  Update `backbones.py`:

  ```python
  class BackboneURLConfig(BaseConfig):
      config = {
          Backbones.IMAGE: {
              ImageBackbones.NEW_IMAGE_MODEL: {
                  Sources.HUGGINGFACE: 'new/image-model-url',
              },
          },
          ...
      }
  ```

- **Train & Evaluate**

  Reference your new model from the command line or from your Python code (see the sketch below). Your model is now part of the VL Models Toolbox!
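As an illustration, here is a minimal sketch of referencing a newly registered backbone through the `Setup`/`Pipeline` interface from the usage example above. The `NEW_IMAGE_MODEL` member is the hypothetical enum entry added in step 1, and whether an image-only backbone pairs with `Trainers.CLIP` (or a different trainer) depends on your task, so treat the arguments as placeholders rather than the definitive configuration:

```python
from config.enums import ImageBackbones, ImageDatasets, Metrics, Trainers
from config.setup import Setup
from pipeline.pipeline import Pipeline

# Hypothetical: point the setup at the backbone registered in the steps above.
setup = Setup(
    dataset_name=ImageDatasets.IMAGENET_1K,
    backbone_name=ImageBackbones.NEW_IMAGE_MODEL,  # enum member added in step 1
    trainer_name=Trainers.CLIP,                    # assumption: choose the trainer that fits your task
    model_type='few_shot',
    setup_type='full',
    num_epochs=10,
    train_batch_size=64,
    eval_batch_size=256,
    precision_dtype='fp16',
    main_metric_name=Metrics.ACCURACY,
    random_state=42,
    device_type='cuda',
)

pipeline = Pipeline(setup, device_type='cuda')
pipeline.run(
    collate_all_m2_samples=False,
    save_predictions=True,
    persist=True,
)
pipeline.tear_down()
```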
Similar to adding new models, you can integrate additional datasets seamlessly:
- **Extend the `ImageDatasets` Enum**

  In `enums.py`, add:

  ```python
  class ImageDatasets(BaseEnum):
      IMAGENET_1K = 'imagenet1k'
      FOOD101 = 'food101'
      ...
      MY_NEW_DATASET = 'my_new_dataset'
  ```

- **Add Configuration**

  In `image_datasets.py`, define:

  ```python
  ImageDatasetConfig.config = {
      ...
      ImageDatasets.MY_NEW_DATASET: {
          'splits': ['train', 'validation'],
          DataStatus.RAW: {
              'path': 'HuggingFaceM4/MYNEW',
              'type': StorageType.HUGGING_FACE,
          },
          DataStatus.EMBEDDING: {
              'path': '/path/to/embeddings/my_new_dataset',
              'type': StorageType.DISK,
          },
          'id_col': 'my_label_column_name',
      },
  }
  ```

- **Validate Paths**

  If using a local folder, ensure `StorageType.IMAGE_FOLDER` or `StorageType.DISK` is set, and that the path exists.

- **Reference the Dataset**

  Use `my_new_dataset` in your script or code, and you're all set. The dataset is now recognized and processed just like any other (see the sketch below).
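For instance, a minimal sketch of pointing the library-level `Setup` at the new dataset, reusing the arguments from the earlier usage example (the `MY_NEW_DATASET` member is the hypothetical entry added above):

```python
from config.enums import CLIPBackbones, ImageDatasets, Metrics, Trainers
from config.setup import Setup

# Hypothetical: the MY_NEW_DATASET member registered in the steps above.
setup = Setup(
    dataset_name=ImageDatasets.MY_NEW_DATASET,
    backbone_name=CLIPBackbones.CLIP_VIT_B_32,
    trainer_name=Trainers.CLIP,
    model_type='few_shot',
    setup_type='full',
    num_epochs=10,
    train_batch_size=64,
    eval_batch_size=256,
    precision_dtype='fp16',
    main_metric_name=Metrics.ACCURACY,
    random_state=42,
    device_type='cuda',
)
```

On the command line, the equivalent is to pass `--dataset_name my_new_dataset` to `train.py`.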
For deeper experimentation and visualization, explore our Jupyter notebooks in the `notebooks` directory:
- **Zero-Shot Image Classification with CLIP**: Demonstrates example usage and evaluation for zero-shot scenarios.
- **Embedding Distribution Visualization**: Compare embeddings via t-SNE, PCA, and more.
- **Multi-Granular Performance on ImageNet**: Assess model accuracy at different levels of the class hierarchy.
- **Misclassification Error Analysis**: Gain insights into where and why the model misclassifies.
1. **(Optional) Create a Conda Environment**

   ```bash
   conda create -n vlm python=3.9
   conda activate vlm
   ```

2. **Install From the Source**

   ```bash
   git clone https://github.com/deepmancer/vlm-toolbox.git
   cd vlm-toolbox
   pip install -e .
   ```
For more detailed instructions (e.g., installing separate packages individually), see SETUP.md.
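As a quick, hedged sanity check (assuming the packages resolve the same way as in the usage example above, i.e. when run from the repository root), you can confirm the enums import cleanly:

```python
# Run from the repository root so that config/ and the other packages resolve.
from config.enums import CLIPBackbones, ImageDatasets

print(CLIPBackbones.CLIP_VIT_B_32)
print(ImageDatasets.IMAGENET_1K)
```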
This project benefits from the work of several open-source repositories. We acknowledge and appreciate their contributions to the research community.
Contributions, suggestions, and new ideas are highly appreciated!
- Submit Issues & PRs: If you find bugs or have feature requests, open an issue or a pull request.
- Spread the Word: Star the repo and share your results to help grow the community.
For direct inquiries, feel free to reach out via email:
alirezaheidari dot cs at gmail dot com
This project is under the BSD 3-Clause License.
Use it freely, modify it, and share your improvements under the same terms.
Loved This Toolbox?
Give us a ⭐ on GitHub to support the project and help more researchers discover it!
Happy Coding!