vlm_training

This repository contains training code for a LLaVA Vision Language Model (VLM) built with the Transformers library. The LLM is phi-1_5 and the vision model is clip-vit-large-patch14. With only 1.3 billion parameters, the LLM is light enough to train on consumer-grade GPUs such as the NVIDIA RTX 4090.
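At a high level, the model follows the LLaVA recipe: CLIP patch features are mapped into the LLM's embedding space by a small trainable projector. The snippet below is a minimal illustrative sketch of that wiring, not this repository's exact module layout; the class name and projector architecture are assumptions.

```python
# Illustrative sketch of the LLaVA-style wiring (module names and the projector
# architecture are assumptions, not this repository's exact code).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class Projector(nn.Module):
    """Maps CLIP patch features into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):            # (B, num_patches, vision_dim)
        return self.net(patch_features)            # (B, num_patches, llm_dim)

# Older transformers releases may need trust_remote_code=True for phi-1_5.
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
projector = Projector(vision_tower.config.hidden_size, llm.config.hidden_size)
```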

The training process involves two steps:

  1. Training the projector (with the LLM and vision tower frozen; see the freezing sketch below)
  2. LoRA-based instruction fine-tuning
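In step 1, only the projector receives gradients. A minimal sketch of that freezing scheme, reusing the illustrative names from the sketch above (again, not the repo's exact code):

```python
# Step 1: freeze the LLM and vision tower; train only the projector.
for p in llm.parameters():
    p.requires_grad = False
for p in vision_tower.parameters():
    p.requires_grad = False
for p in projector.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```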

Projector Training Data Preparation

Download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here. Organize the images according to the folder structure described below.
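A quick sanity check that the downloaded images line up with the caption file can save a failed training run. The snippet below assumes the JSON is a list of records with an "image" key holding a path relative to the images folder; adjust if the layout differs.

```python
# Sanity check: every image referenced in the caption file should exist on disk.
# The "image" key is an assumption about the JSON layout.
import json
from pathlib import Path

root = Path("data/projector_training")
records = json.loads((root / "blip_laion_cc_sbu_558k.json").read_text())
missing = [r["image"] for r in records if not (root / "images" / r["image"]).exists()]
print(f"{len(missing)} of {len(records)} referenced images are missing")
```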

Instruction Fine-tuning Data Preparation

Download llava_v1_5_mix665k.json, the annotation file for the final LLaVA instruction-tuning mixture, and download the images from the following datasets: COCO (train2017), GQA, OCR-VQA, TextVQA, and Visual Genome (VG_100K and VG_100K_2).

After downloading all of them, organize the data inside the data directory according to the following structure:
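Once everything is in place, a similar check confirms that the mixture's image paths resolve. The "image" key and the convention that paths are relative to data/instruction_finetuning are assumptions about the JSON layout; some records are text-only and carry no image.

```python
# Sanity check for the instruction-tuning mixture. Only records that carry an
# "image" key are checked; key name and relative-path convention are assumptions.
import json
from pathlib import Path

root = Path("data/instruction_finetuning")
records = json.loads((root / "llava_v1_5_mix665k.json").read_text())
missing = [r["image"] for r in records
           if "image" in r and not (root / r["image"]).exists()]
print(f"{len(missing)} image-grounded records point to missing files")
```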

Data Folder Structure

data
├── instruction_finetuning
│   ├── coco
│   │   └── train2017
│   ├── gqa
│   │   └── images
│   ├── llava_v1_5_mix665k.json
│   ├── ocr_vqa
│   │   └── images
│   ├── text_vqa
│   │   └── train_images
│   └── vg
│       ├── VG_100K
│       └── VG_100K_2
└── projector_training
    ├── blip_laion_cc_sbu_558k.json
    └── images

Projector Training

pip install -r requirements.txt
python prepare_projector_data.py
python run_train_projector.py
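Conceptually, each projector-training step encodes the image with the frozen CLIP tower, projects the patch features into the LLM's embedding space, prepends them to the caption embeddings, and applies the usual next-token loss on the caption tokens only. The condensed sketch below reuses the illustrative names from earlier and dummy inputs; run_train_projector.py handles batching, tokenization, and masking in its own way.

```python
# Condensed, illustrative projector-training step (dummy inputs; the real
# script feeds actual image-caption pairs).
pixel_values = torch.randn(1, 3, 224, 224)                     # preprocessed image
input_ids = torch.randint(0, llm.config.vocab_size, (1, 16))   # tokenized caption

with torch.no_grad():
    patches = vision_tower(pixel_values).last_hidden_state     # (1, P, 1024)
image_embeds = projector(patches)                               # (1, P, 2048)
text_embeds = llm.get_input_embeddings()(input_ids)             # (1, T, 2048)
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

# Supervise only the caption tokens; -100 masks the image positions.
labels = torch.cat(
    [torch.full(image_embeds.shape[:2], -100, dtype=torch.long), input_ids], dim=1
)
loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```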

LoRA Instruction Fine-tuning

python prepare_instruction_data.py
python run_train_instruction_lora.py
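The second stage wraps the LLM with low-rank adapters via the peft library. The configuration below is an illustrative sketch; the target module names and hyperparameters are assumptions, not this repository's exact settings.

```python
# Illustrative LoRA setup with peft (target modules and hyperparameters are
# assumptions, not this repository's exact configuration).
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()   # only the low-rank adapters are trainable
```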

Inference

Finally, use inference.ipynb to run inference.
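For orientation, generation in a LLaVA-style setup typically projects the image features and prepends them to the prompt embeddings before calling generate. The rough sketch below reuses the illustrative names from earlier and a placeholder image path; the notebook is the authoritative reference.

```python
# Rough sketch of LLaVA-style generation (illustrative; inference.ipynb is the
# authoritative reference). "example.jpg" is a placeholder path.
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

pixel_values = image_processor(
    images=Image.open("example.jpg"), return_tensors="pt"
).pixel_values
prompt_ids = tokenizer("Describe this image.", return_tensors="pt").input_ids

with torch.no_grad():
    patches = vision_tower(pixel_values).last_hidden_state
    inputs_embeds = torch.cat(
        [projector(patches), llm.get_input_embeddings()(prompt_ids)], dim=1
    )
    output_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```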