This repository includes training code for the LLaVA Vision Language Model (VLM) using the Transformers library. The LLM is phi-1_5 and the vision encoder is clip-vit-large-patch14. With only 1.3 billion parameters, the LLM is lightweight enough to train on consumer-grade GPUs such as the NVIDIA RTX 4090.
The training process involves two steps:
- Training the projector (with the LLM and vision tower frozen)
- LoRA fine-tuning for instruction tuning
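Conceptually, the two stages differ only in which parameters are trainable. The sketch below illustrates this, assuming the model exposes the module names used by the Transformers LLaVA implementation (vision_tower, language_model, multi_modal_projector) and that LoRA is applied with peft; the checkpoint path and hyperparameters are placeholders, and the repo's own scripts may organize this differently.

```python
# Conceptual sketch of the two training stages; module names follow the
# Transformers LLaVA implementation and are an assumption about this repo.
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("path/to/llava-phi-checkpoint")  # hypothetical path

# Stage 1: freeze the vision tower and the LLM, train only the projector.
for module in (model.vision_tower, model.language_model):
    for param in module.parameters():
        param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True

# Stage 2: LoRA fine-tuning for instruction tuning. The regex restricts LoRA to
# the phi-1_5 attention projections (q_proj/k_proj/v_proj also exist in the CLIP
# vision tower, which should stay untouched). Hyperparameters are examples only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|dense)",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```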
Download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here. Organize the images according to the folder structure described below.
Download the annotation file for the final mixture of the LLaVA instruction-tuning data, llava_v1_5_mix665k.json, and download the images from the following datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as .jpg (see the conversion sketch after this list)
- TextVQA: train_val_images
- VisualGenome: part1, part2
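The OCR-VQA download script retrieves images in mixed formats, while the annotation file references them as .jpg. A small conversion pass along the lines of the sketch below (Pillow assumed installed; the directory path matches the folder structure further down) can normalize the files:

```python
# Hedged sketch: convert downloaded OCR-VQA images to .jpg so the file names
# match the llava_v1_5_mix665k.json annotations.
from pathlib import Path
from PIL import Image

ocr_vqa_dir = Path("data/instruction_finetuning/ocr_vqa/images")
for path in ocr_vqa_dir.iterdir():
    if path.suffix.lower() in {".png", ".gif", ".jpeg"}:
        Image.open(path).convert("RGB").save(path.with_suffix(".jpg"), "JPEG")
        path.unlink()  # remove the original file after conversion
```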
After downloading all of them, organize the data in the data folder according to the following structure:
Data Folder Structure

```
data
├── instruction_finetuning
│   ├── coco
│   │   └── train2017
│   ├── gqa
│   │   └── images
│   ├── llava_v1_5_mix665k.json
│   ├── ocr_vqa
│   │   └── images
│   ├── text_vqa
│   │   └── train_images
│   └── vg
│       ├── VG_100K
│       └── VG_100K_2
└── projector_training
    ├── blip_laion_cc_sbu_558k.json
    └── images
```
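As an optional check before training, a few lines of Python can confirm the layout above is in place (paths taken directly from the structure shown; adjust if you store the data elsewhere):

```python
# Optional sanity check: verify the expected data layout before training.
from pathlib import Path

expected = [
    "data/projector_training/blip_laion_cc_sbu_558k.json",
    "data/projector_training/images",
    "data/instruction_finetuning/llava_v1_5_mix665k.json",
    "data/instruction_finetuning/coco/train2017",
    "data/instruction_finetuning/gqa/images",
    "data/instruction_finetuning/ocr_vqa/images",
    "data/instruction_finetuning/text_vqa/train_images",
    "data/instruction_finetuning/vg/VG_100K",
    "data/instruction_finetuning/vg/VG_100K_2",
]
missing = [p for p in expected if not Path(p).exists()]
print("All paths present" if not missing else f"Missing: {missing}")
```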
Install the dependencies, then prepare the data and run the two training stages in order:

```bash
pip install -r requirements.txt

# Stage 1: projector training
python prepare_projector_data.py
python run_train_projector.py

# Stage 2: LoRA instruction tuning
python prepare_instruction_data.py
python run_train_instruction_lora.py
```
Finally, use inference.ipynb to run inference.
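For reference, loading the trained LoRA adapter on top of the stage-1 model might look like the sketch below; the checkpoint and adapter paths are placeholders, and the actual steps live in inference.ipynb.

```python
# Hedged sketch of loading the trained LoRA adapter for inference;
# paths below are hypothetical.
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained("checkpoints/projector_trained")  # hypothetical path
model = PeftModel.from_pretrained(base, "checkpoints/instruction_lora")  # hypothetical adapter path
model = model.merge_and_unload()  # optionally fold the LoRA weights back into the base model
model.eval()
```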