
VLM Toolbox Logo

Badges: PyTorch · Python · Jupyter Notebook · BSD 3-Clause License

A PyTorch-powered library for accelerating multimodal AI research with Vision-Language Models

Vision-Language Models Toolbox

A flexible, all-in-one PyTorch library that streamlines research and development with state-of-the-art vision-language models. Whether you’re experimenting with soft-prompt tuning (e.g., CoOp, CoCoOp) or large-scale models such as CLIP, this toolbox provides a robust foundation built on PyTorch and Hugging Face Transformers.


Table of Contents

  • Key Features
  • Supported Models
  • Quick Start
  • Usage
  • Installation
  • Acknowledgments
  • Contributing
  • License

Key Features

Feature | Description
Multimodal Datasets | Supports ImageNet1k, CIFAR-100, Stanford Cars, iNaturalist 2021, MSCOCO Captions, and more.
Model Flexibility | Works with CLIP (ViT & ResNet), DINO-V2, MiniLM, MPNet, and also allows adding custom models.
Custom Objectives/Tasks | Quickly add new tasks or losses with minimal code changes for all combined vision-language flows.
Prompt Tuning | Supports soft prompts (CoOp, CoCoOp) and predefined hard prompts for dataset adaptation.
Scalability & Precision | Supports multi-GPU training, mixed precision (FP16, BF16, FP32, FP64), sharding, and DeepSpeed.
Sampling Strategies | Includes oversampling, undersampling, and hybrid methods such as SMOTE, ADASYN, and Tomek Links.
Data Augmentation | Provides image and text augmentations for model training.
Evaluation Metrics | Tracks accuracy, precision, recall, F1-score, AUC-ROC, and more.
Logging & Visualization | Supports TensorBoard & Loguru for monitoring and debugging.
Flexible API | Pre-built modules & functionalities for datasets, models, tasks, setups, and more.

Supported Models

Backbone | Supported Provider(s) | Modality
CLIP-ViT-B/32 | OpenAI, Hugging Face | Multimodal
CLIP-ViT-B/16 | OpenAI, Hugging Face | Multimodal
CLIP-ViT-L/14 | OpenAI, Hugging Face | Multimodal
CLIP-ViT-L/14-336 | OpenAI, Hugging Face | Multimodal
CLIP-RN50 | OpenAI | Multimodal
CLIP-RN101 | OpenAI | Multimodal
CLIP-RN50x4 | OpenAI | Multimodal
CLIP-RN50x16 | OpenAI | Multimodal
CLIP-RN50x64 | OpenAI | Multimodal
DINO-V2-GIANT | Hugging Face | Image
ALL-MiniLM-L6-v2 | Hugging Face | Text
ALL-MPNET-BASE-V2 | Hugging Face | Text
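
In code, these backbones are referenced through the enums consumed by Setup (see Usage below). Here is a minimal sketch, assuming ImageBackbones lives alongside CLIPBackbones in config/enums.py as the "Adding New Models" section suggests; CLIP_VIT_B_32 and DINO_V2_GIANT are the members shown elsewhere in this README, and the remaining member names should be checked against config/enums.py:

from config.enums import CLIPBackbones, ImageBackbones

# Backbones are selected by enum member rather than by raw string.
multimodal_backbone = CLIPBackbones.CLIP_VIT_B_32   # CLIP ViT-B/32 (OpenAI or Hugging Face weights)
image_only_backbone = ImageBackbones.DINO_V2_GIANT  # DINO-V2 giant (image modality only)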

Quick Start

Fine-tuning a CLIP model on ImageNet is as simple as:

python vlm_toolbox/scripts/train.py \
    --dataset_name imagenet1k \
    --backbone_name vit_b_32 \
    --trainer_name clip \
    --model_type few_shot \
    --setup_type full \
    --num_epochs 100 \
    --train_batch_size 64 \
    --eval_batch_size 256 \
    --precision_dtype fp16 \
    --source huggingface \
    --main_metric_name accuracy \
    --random_state 42 \
    --device_type cuda \
    --collate_all_m2_samples False \
    --save_predictions True

This command uses a ViT-B/32 CLIP model from Hugging Face, automatically logs progress, and stores prediction outputs for later review.


Usage

Running Experiments

You can also import this toolbox as a library for more advanced or custom experimentation. Here’s a minimal code example illustrating how to set up a multimodal pipeline:

from config.enums import (
    CLIPBackbones,
    ImageDatasets,
    Trainers,
    Sources,
    Metrics,
    Stages,
)
from pipeline.pipeline import Pipeline
from config.setup import Setup
from util.memory import flush

# 1. Define your setup
setup = Setup(
    dataset_name=ImageDatasets.IMAGENET_1K,
    backbone_name=CLIPBackbones.CLIP_VIT_B_32,
    trainer_name=Trainers.CLIP,
    model_type='few_shot',
    setup_type='full',
    num_epochs=100,
    train_batch_size=64,
    eval_batch_size=256,
    precision_dtype='fp16',
    main_metric_name=Metrics.ACCURACY,
    random_state=42,
    device_type='cuda'
)

# 2. Initialize the pipeline
pipeline = Pipeline(setup, device_type='cuda')

# 3. Run the training
pipeline.run(
    collate_all_m2_samples=False,
    save_predictions=True,
    persist=True,
)

# 4. Clean up
pipeline.tear_down()
flush()

Note: The toolbox treats its data inputs as generic modalities, m1 and m2 (for CLIP, images and text). This modular design makes it easy to extend support for text, image, video, or other data streams.


Adding New Models

One key strength of this repository is its extensibility. Integrating your own model is straightforward:

  1. Add Your Model to an Enum
    Extend ImageBackbones or CLIPBackbones in enums.py:

    class ImageBackbones(BaseEnum):
        DINO_V2_GIANT = 'dino_v2_giant'
        NEW_IMAGE_MODEL = 'new_image_model'
  2. Specify the Model URL
    Update backbones.py:

    class BackboneURLConfig(BaseConfig):
        config = {
            Backbones.IMAGE: {
                ImageBackbones.NEW_IMAGE_MODEL: {
                    Sources.HUGGINGFACE: 'new/image-model-url',
                },
            },
            ...
        }
  3. Train & Evaluate
    Reference your new model from the command line or from your Python code, as sketched below. Your model is now part of the VLM Toolbox!
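
Once both steps are in place, the new backbone is referenced from Setup just like a built-in one. The following is a sketch under assumptions: NEW_IMAGE_MODEL is the hypothetical enum member added in step 1, the other arguments are copied from the Usage example above, and whether Trainers.CLIP is the right trainer for an image-only backbone depends on your task:

from config.enums import ImageBackbones, ImageDatasets, Trainers, Metrics
from config.setup import Setup

# Reference the newly registered backbone exactly like a built-in one.
setup = Setup(
    dataset_name=ImageDatasets.IMAGENET_1K,
    backbone_name=ImageBackbones.NEW_IMAGE_MODEL,  # hypothetical member added in step 1
    trainer_name=Trainers.CLIP,                    # choose the trainer that fits your backbone and task
    model_type='few_shot',
    setup_type='full',
    num_epochs=100,
    train_batch_size=64,
    eval_batch_size=256,
    precision_dtype='fp16',
    main_metric_name=Metrics.ACCURACY,
    random_state=42,
    device_type='cuda',
)
# Then build and run the Pipeline exactly as in the Usage section above.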


Adding a New Dataset

Similar to adding new models, you can integrate additional datasets seamlessly:

  1. Extend the ImageDatasets Enum
    In enums.py, add:

    class ImageDatasets(BaseEnum):
        IMAGENET_1K = 'imagenet1k'
        FOOD101 = 'food101'
        ...
        MY_NEW_DATASET = 'my_new_dataset'
  2. Add Configuration
    In image_datasets.py, define:

    ImageDatasetConfig.config = {
        ...
        ImageDatasets.MY_NEW_DATASET: {
            'splits': ['train', 'validation'],
            DataStatus.RAW: {
                'path': 'HuggingFaceM4/MYNEW',
                'type': StorageType.HUGGING_FACE,
            },
            DataStatus.EMBEDDING: {
                'path': '/path/to/embeddings/my_new_dataset',
                'type': StorageType.DISK,
            },
            'id_col': 'my_label_column_name',
        },
    }
  3. Validate Paths
    If using a local folder, ensure StorageType.IMAGE_FOLDER or StorageType.DISK is set, and that the path exists.

  4. Reference the Dataset
    Use my_new_dataset in your script or code (an optional sanity check is sketched below), and you're all set. The dataset is now recognized and processed just like any other!
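
Before running the pipeline on the new entry, it can be worth confirming that the raw path from step 2 actually resolves. Here is a minimal sketch that uses the Hugging Face datasets library directly; this check is independent of the toolbox, and 'HuggingFaceM4/MYNEW' is just the placeholder path used above:

from datasets import load_dataset

# Pull one configured split straight from the Hugging Face Hub to verify that the
# 'path' and 'splits' values in the config are valid before training.
ds = load_dataset('HuggingFaceM4/MYNEW', split='train')
print(ds)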


Jupyter Notebooks

For deeper experimentation and visualization, explore our Jupyter notebooks in the notebooks directory.


Installation

1. (Optional) Create a Conda Environment

conda create -n vlm python=3.9
conda activate vlm

2. Install from Source

git clone https://github.com/deepmancer/vlm-toolbox.git
cd vlm-toolbox
pip install -e .

For more detailed instructions (e.g., installing separate packages individually), see SETUP.md.
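
As a quick smoke test, the imports used in the Usage section should resolve after installation. This is a minimal sketch, assuming your working directory and package layout match the Usage example (i.e., the toolbox's modules are importable as config.* and pipeline.*):

from config.enums import CLIPBackbones, ImageDatasets
from config.setup import Setup
from pipeline.pipeline import Pipeline

# If these imports and enum lookups succeed, the editable install is working.
print(CLIPBackbones.CLIP_VIT_B_32, ImageDatasets.IMAGENET_1K)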


Acknowledgments

This project benefits from the work of several open-source repositories. We acknowledge and appreciate their contributions to the research community.


Contributing

Contributions, suggestions, and new ideas are highly appreciated!

  • Submit Issues & PRs: If you find bugs or have feature requests, open an issue or a pull request.
  • Spread the Word: Star the repo and share your results to help grow the community.

For direct inquiries, feel free to reach out via email:

alirezaheidari dot cs at gmail dot com


License

This project is released under the BSD 3-Clause License.
You are free to use, modify, and redistribute it under the license's terms.


Loved This Toolbox?
Give us a ⭐ on GitHub to support the project and help more researchers discover it!
Happy Coding!