vlm_training

This repository contains training code for a LLaVA-style Vision Language Model (VLM) built with the Transformers library. The LLM is phi-1_5 and the vision encoder is clip-vit-large-patch14. With only 1.3 billion parameters, the LLM is light enough to train on consumer-grade GPUs such as the NVIDIA RTX 4090.
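
For orientation, the sketch below shows how a LLaVA-style model wires these components together: patch features from the CLIP vision tower pass through a small projector into the token-embedding space of phi-1_5. The class name, attribute names, and projector shape are illustrative assumptions, not the repository's actual code.

# Minimal sketch of a LLaVA-style architecture (illustrative, not this repo's exact code).
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class LlavaLikeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Vision tower: CLIP ViT-L/14, 1024-dim patch features.
        self.vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        # Language model: phi-1_5, 2048-dim hidden size (older transformers
        # versions may require trust_remote_code=True).
        self.llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")
        vision_dim = self.vision_tower.config.hidden_size  # 1024
        llm_dim = self.llm.config.hidden_size              # 2048
        # Projector: a small MLP mapping vision features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def encode_image(self, pixel_values):
        # Drop the CLS token, keep the patch embeddings, and project them so they
        # can be concatenated with text token embeddings.
        patch_feats = self.vision_tower(pixel_values).last_hidden_state[:, 1:, :]
        return self.projector(patch_feats)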

The training process involves two steps:

  1. Training the projector while keeping the LLM and vision tower frozen (a minimal freezing sketch follows this list)
  2. LoRA fine-tuning for instruction tuning
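
In step 1, only the projector receives gradient updates. Below is a minimal sketch of that freezing, assuming the illustrative LlavaLikeModel class from the sketch above; the repository's own scripts may organize this differently.

# Step 1 (illustrative): freeze the LLM and vision tower, train only the projector.
import torch

model = LlavaLikeModel()  # hypothetical class from the architecture sketch above

for p in model.llm.parameters():
    p.requires_grad = False
for p in model.vision_tower.parameters():
    p.requires_grad = False
for p in model.projector.parameters():
    p.requires_grad = True

# Only the trainable projector parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,  # illustrative learning rate, not necessarily the repo's setting
)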

Projector Training Data Preparation

Download the 558K subset of the LAION-CC-SBU dataset with BLIP captions (blip_laion_cc_sbu_558k.json and the accompanying images), then organize the images according to the folder structure described below.

Instruction Fine-tuning Data Preparation

Download the annotation file for the final LLaVA instruction-tuning mixture, llava_v1_5_mix665k.json, and download the images from the following datasets (these correspond to the folders in the structure below):

  - COCO (train2017)
  - GQA (images)
  - OCR-VQA (images)
  - TextVQA (train_images)
  - Visual Genome (VG_100K and VG_100K_2)

After downloading all of them, organize the data under the data directory according to the structure below:

Data Folder Structure

data
├── instruction_finetuning
│   ├── coco
│   │   └── train2017
│   ├── gqa
│   │   └── images
│   ├── llava_v1_5_mix665k.json
│   ├── ocr_vqa
│   │   └── images
│   ├── text_vqa
│   │   └── train_images
│   └── vg
│       ├── VG_100K
│       └── VG_100K_2
└── projector_training
    ├── blip_laion_cc_sbu_558k.json
    └── images

Projector Training

pip install -r requirements.txt
python prepare_projector_data.py
python run_train_projector.py

LoRA Instruction Fine-tuning

python prepare_instruction_data.py
python run_train_instruction_lora.py
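
For the instruction-tuning step, LoRA adapters are typically attached with the peft library. The sketch below is a hedged illustration; the rank, alpha, and target module names are assumptions, not necessarily what run_train_instruction_lora.py configures.

# Illustrative LoRA setup with peft; hyperparameters and target modules are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")

lora_config = LoraConfig(
    r=16,                    # rank of the low-rank update matrices
    lora_alpha=32,           # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed phi attention layers
    task_type="CAUSAL_LM",
)

llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # only the LoRA adapter weights are trainable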

Inference

Finally, use the inference.ipynb notebook to run inference.
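
For reference, here is a minimal sketch of what image-conditioned generation looks like for a model wired as in the architecture sketch above. The checkpoint path, image path, prompt, and loading code are placeholders; the notebook's actual code may differ.

# Illustrative inference loop; paths and prompt format are placeholders.
import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

model = LlavaLikeModel()                              # hypothetical class from the sketch above
model.load_state_dict(torch.load("checkpoint.pt"))    # placeholder checkpoint path
model.eval()

image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
prompt_ids = tokenizer("Describe this image.", return_tensors="pt").input_ids

with torch.no_grad():
    # Project image patches into the LLM embedding space and prepend them to the prompt.
    image_embeds = model.encode_image(pixel_values)
    text_embeds = model.llm.get_input_embeddings()(prompt_ids)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    output_ids = model.llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))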
