This repository includes training code for the LLaVA Vision Language Model (VLM) using the Transformers library. The LLM is phi-1_5 and the vision encoder is clip-vit-large-patch14. With only 1.3 billion parameters, the LLM is lightweight enough to train on consumer-grade GPUs such as the NVIDIA RTX 4090.
The training process involves two steps:
- Training the projector (with the LLM and vision tower frozen)
- LoRA fine-tuning for instruction tuning
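Conceptually, the two stages differ only in which parameters are trainable. The sketch below illustrates this, assuming the model exposes the module names used by the Transformers LLaVA implementation (vision_tower, language_model, multi_modal_projector) and that LoRA is applied with peft; the checkpoint path and hyperparameters are placeholders, and the repo's own scripts may organize this differently.

```python
# Conceptual sketch of the two training stages; module names follow the
# Transformers LLaVA implementation and are an assumption about this repo.
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("path/to/llava-phi-checkpoint")  # hypothetical path

# Stage 1: freeze the vision tower and the LLM, train only the projector.
for module in (model.vision_tower, model.language_model):
    for param in module.parameters():
        param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True

# Stage 2: LoRA fine-tuning for instruction tuning. The regex restricts LoRA to
# the phi-1_5 attention projections (q_proj/k_proj/v_proj also exist in the CLIP
# vision tower, which should stay untouched). Hyperparameters are examples only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|dense)",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```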
Download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here. Organize the images according to the folder structure described below.
Download the annotation file for the final mixture of the LLaVA instruction-tuning data, llava_v1_5_mix665k.json, and download the images from the following datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as .jpg (see the conversion sketch after this list)
- TextVQA: train_val_images
- VisualGenome: part1, part2
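The OCR-VQA download script retrieves images in mixed formats, while the annotation file references them as .jpg. A small conversion pass along the lines of the sketch below (Pillow assumed installed; the directory path matches the folder structure further down) can normalize the files:

```python
# Hedged sketch: convert downloaded OCR-VQA images to .jpg so the file names
# match the llava_v1_5_mix665k.json annotations.
from pathlib import Path
from PIL import Image

ocr_vqa_dir = Path("data/instruction_finetuning/ocr_vqa/images")
for path in ocr_vqa_dir.iterdir():
    if path.suffix.lower() in {".png", ".gif", ".jpeg"}:
        Image.open(path).convert("RGB").save(path.with_suffix(".jpg"), "JPEG")
        path.unlink()  # remove the original file after conversion
```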
After downloading all of them, organize the data in the data folder according to the following structure:
Data Folder Structure

```
data
├── instruction_finetuning
│   ├── coco
│   │   └── train2017
│   ├── gqa
│   │   └── images
│   ├── llava_v1_5_mix665k.json
│   ├── ocr_vqa
│   │   └── images
│   ├── text_vqa
│   │   └── train_images
│   └── vg
│       ├── VG_100K
│       └── VG_100K_2
└── projector_training
    ├── blip_laion_cc_sbu_558k.json
    └── images
```
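As an optional check before training, a few lines of Python can confirm the layout above is in place (paths taken directly from the structure shown; adjust if you store the data elsewhere):

```python
# Optional sanity check: verify the expected data layout before training.
from pathlib import Path

expected = [
    "data/projector_training/blip_laion_cc_sbu_558k.json",
    "data/projector_training/images",
    "data/instruction_finetuning/llava_v1_5_mix665k.json",
    "data/instruction_finetuning/coco/train2017",
    "data/instruction_finetuning/gqa/images",
    "data/instruction_finetuning/ocr_vqa/images",
    "data/instruction_finetuning/text_vqa/train_images",
    "data/instruction_finetuning/vg/VG_100K",
    "data/instruction_finetuning/vg/VG_100K_2",
]
missing = [p for p in expected if not Path(p).exists()]
print("All paths present" if not missing else f"Missing: {missing}")
```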
Install the dependencies, then prepare the data and run the two training stages in order:

```bash
pip install -r requirements.txt

# Stage 1: projector training
python prepare_projector_data.py
python run_train_projector.py

# Stage 2: LoRA instruction tuning
python prepare_instruction_data.py
python run_train_instruction_lora.py
```
Finally, use inference.ipynb to run inference.
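For reference, loading the trained LoRA adapter on top of the stage-1 model might look like the sketch below; the checkpoint and adapter paths are placeholders, and the actual steps live in inference.ipynb.

```python
# Hedged sketch of loading the trained LoRA adapter for inference;
# paths below are hypothetical.
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained("checkpoints/projector_trained")  # hypothetical path
model = PeftModel.from_pretrained(base, "checkpoints/instruction_lora")  # hypothetical adapter path
model = model.merge_and_unload()  # optionally fold the LoRA weights back into the base model
model.eval()
```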