Paper link: arXiv
We explore several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques while designed and optimized for the distributed setting. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rules of the temperature parameter and the model parameters, respectively. Finally, we benchmark the performance of FastCLIP and OpenCLIP on different compute scales (up to 32 GPUs on 8 nodes), and three data scales (CC3M, CC12M and LAION400M).
Table of Contents
The FastCLIP framework is an efficient distributed training framework of CLIP models, which includes a family of algorithms differing in loss computation and temperature parameter update. It is powered by advanced finite-sum coupled compositional optimization (FCCO) [4] techniques for optimizing global contrastive losses, which is suitable for training CLIP models with limited compute resources. Below we introduce three algorithms in this framework which are named FastCLIP-v1 to v3. We first present the pseudo-code of FastCLIP:
The three algorithms have different ways of computing the loss and updating the temperature
where
FastCLIP-v2 optimizes the Robust Global Contrastive Loss (RGCL), which is first used by iSogCLR [2]. In RGCL, the temperature parameter becomes a variable that needs to be optimized. Also, each data point now has its individual temperature parameter, as opposed to the global temperature in GCL. Let
where
The main difference between RGCL and RGCL-g is that RGCL-g unifies the individual temperature parameter into a single global temperature. In FastCLIP-v3, the temperature parameter is also learnable. In the following table we provide comparison between different algorithms.
Note that OpenCLIP [3] uses the Mini-Batch Contrastive Loss (MBCL), which requires a large batch size for good performance. "FCCO" means the algorithm leverages finite-sum coupled compositional optimization techniques. "Distributed" means the algorithm is designed for distributed training. In "Temperature Scheme", "G" means global temperature while "I" means individual temperature.
Next we present the results of FastCLIP vs. OpenCLIP, SogCLR and iSogCLR. For more results please refer to our paper.
FastCLIP vs. OpenCLIP: We plot the ImageNet Top1 accuracy curve of OpenCLIP and FastCLIP-v3 in the xlarge-scale setting (LAION400M, batch size 5120) in subfigure (a) and the average of ImageNet and variants performance across different number of nodes in the medium-scale (CC3M, batch size 1024) and large-scale (CC12M, batch size 2048) settings in subfigures (b) and (c), respectively.
We also plot the average of ImageNet and variants curves of OpenCLIP and FastCLIP-v3 in the medium-scale and large-scale settings (corresponding to subfigures (b) and (c) above).
FastCLIP vs. SogCLR, iSogCLR: The following table shows the results of FastCLIP-v1 (FastCLIP-v2, resp.) vs. SogCLR (iSogCLR, resp.) in the medium-scale and large-scale settings.
The following figure shows the training time in the medium-scale and large-scale settings. Subfigures (a) and (b) plot the per-iteration training time. Subfigures (c) and (d) plot the communication time per iteration.
[1] Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25760–25782. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/yuan22b.html.
[2] Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, and Tianbao Yang. Not all semantics are created equal: Contrastive self-supervised learning with automatic temperature individualization. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28389–28421. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/qiu23a.html.
[3] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. https://doi.org/10.5281/zenodo.5143773, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
[4] Bokun Wang and Tianbao Yang. Finite-sum coupled compositional stochastic optimization: Theory and applications. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23292–23317. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/wang22ak.html.
To set up the environment for training, please
- Download this repository:
git clone https://github.com/Optimization-AI/fast_clip.git cd fast_clip
- Create a new environment:
conda create -n fastclip python=3.11 conda activate fastclip pip install -r requirements-training.txt
We present sample slurm scripts to run OpenCLIP and FastCLIP-v0 to v3. For non-slurm instructions, please refer to the end of this subsection. To train on your own data, you need to modify the following options
-
--train-data
: the path to the training data, currently only webdataset format is supported. -
--train-num-samples
: this many samples will be seen for one epoch, we recommend to set it to the actual size of the dataset. -
--data_size
: the original size of the dataset, this may take a value different from--train-num-samples
. In the case of CC3M, its metadata contains 3318333 image-URL/caption pairs, but we were only able to down 2723840 of them. So we set--data_size
to 3318333 and set--train-num-samples
to 2723848. -
--epochs
: for this many epochs the model will be trained. -
--gamma_decay_epochs
: for this many epochs$\gamma$ will decrease from 1.0 to--gamma
. We recommend to set it to half of--epochs
.
Sample script to run FastCLIP-v3 on CC3M using 8 GPUs (2 nodes and 4 GPUs per node)
#!/bin/bash
#SBATCH --time=2-00:00:00
#SBATCH --mem=120G
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=fastclipv3
#SBATCH --partition=gpu
#SBATCH --output=%x_%j.log
source ~/.bashrc
conda activate fastclip
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
export MASTER_PORT=12805
export CUDA_VISIBLE_DEVICES='0,1,2,3'
export PYTHONPATH="$PYTHONPATH:$PWD/src"
export HUGGINGFACE_HUB_CACHE='./checkpoints/huggingface'
srun python -u src/training/main.py \
--save-frequency 1 \
--train-data './datasets/cc3m_webdataset/cc3m_train/{00000..00331}.tar' \
--train-num-samples 2723840 --data_size 3318333 \
--warmup 10000 \
--batch-size 128 \
--epochs 37 \
--workers 6 \
--model RN50 \
--name medium_fastclipv3 \
--seed 2024 \
--profile \
--wd 0.1 \
--local-loss \
--fastclip --multiply_tau --temperature_scheme global_learnable \
--lr 1e-3 --lr_tau 2e-4 --lr_tau_scheduler step_thresh --rho 6.5 \
--gamma 0.2 --gamma_schedule cosine --gamma_decay_epochs 18
Sample script to run OpenCLIP on CC3M using 8 GPUs (click to expand):
Replace the srun python -u src/training/main.py
command in the FastCLIP-v3 script with
srun python -u src/training/main.py \
--save-frequency 1 \
--train-data './datasets/cc3m_webdataset/cc3m_train/{00000..00331}.tar' \
--train-num-samples 2723840 --data_size 3318333 \
--warmup 10000 \
--batch-size 128 \
--epochs 37 \
--workers 6 \
--model RN50 \
--name medium_openclip \
--seed 2024 \
--profile \
--wd 0.1 \
--local-loss \
--gather-with-grad \
--lr 1e-3
Sample script to run FastCLIP-v0 on CC3M using 8 GPUs (click to expand):
Replace the srun python -u src/training/main.py
command in the FastCLIP-v3 script with
srun python -u src/training/main.py \
--save-frequency 1 \
--train-data './datasets/cc3m_webdataset/cc3m_train/{00000..00331}.tar' \
--train-num-samples 2723840 --data_size 3318333 \
--warmup 10000 \
--batch-size 128 \
--epochs 37 \
--workers 6 \
--model RN50 \
--name medium_fastclipv0 \
--seed 2024 \
--profile \
--wd 0.1 \
--local-loss \
--fastclip --temperature_scheme global_learnable \
--lr 1e-3 \
--gamma 0.2 --gamma_schedule cosine --gamma_decay_epochs 18
Sample script to run FastCLIP-v1 on CC3M using 8 GPUs (click to expand):
Replace the srun python -u src/training/main.py
command in the FastCLIP-v3 script with
srun python -u src/training/main.py \
--save-frequency 1 \
--train-data './datasets/cc3m_webdataset/cc3m_train/{00000..00331}.tar' \
--train-num-samples 2723840 --data_size 3318333 \
--warmup 10000 \
--batch-size 128 \
--epochs 37 \
--workers 6 \
--model RN50 \
--name medium_fastclipv1 \
--seed 2024 \
--profile \
--wd 0.1 \
--local-loss \
--fastclip --temperature_scheme global_constant \
--lr 1e-3 \
--gamma 0.2 --gamma_schedule cosine --gamma_decay_epochs 18
Sample script to run FastCLIP-v2 on CC3M using 8 GPUs (click to expand):
Replace the srun python -u src/training/main.py
command in the FastCLIP-v3 script with
srun python -u src/training/main.py \
--save-frequency 1 \
--train-data './datasets/cc3m_webdataset/cc3m_train/{00000..00331}.tar' \
--train-num-samples 2723840 --data_size 3318333 \
--warmup 10000 \
--batch-size 128 \
--epochs 37 \
--workers 6 \
--model RN50 \
--name medium_fastclipv2 \
--seed 2024 \
--profile \
--wd 0.1 \
--local-loss \
--fastclip --temperature_scheme individual_learnable \
--lr 1e-3 --lr_tau 0.0133 --lr_tau_scheduler const --temperature 0.03 --rho 7.0 \
--gamma 0.2 --gamma_schedule cosine --gamma_decay_epochs 18
Sample script to run FastCLIP-v3 on CC12M using 8 GPUs (click to expand):
Replace the srun python -u src/training/main.py
command in the CC3M script with
srun python -u src/training/main.py \
--save-frequency 1 \
--train-data './datasets/cc12m_webdataset/cc12m/{00000..01242}.tar' \
--train-num-samples 9187328 --data_size 12423374 \
--warmup 10000 \
--batch-size 256 \
--epochs 33 \
--workers 6 \
--model ViT-B-32 \
--name large_fastclipv3 \
--seed 2024 \
--profile \
--wd 0.1 \
--local-loss \
--fastclip --multiply_tau --temperature_scheme global_learnable \
--lr 4e-4 --lr_tau 1e-4 --lr_tau_scheduler step_thresh --rho 8.5 \
--gamma 0.2 --gamma_schedule cosine --gamma_decay_epochs 16
Sample script to run FastCLIP-v3 on LAION400M using 16 GPUs (click to expand):
Many slurm systems have a limit on the running time of a job (e.g., 2 days), which is far less than the time required for 30 epochs on LAION400M. We recommend to leverage the --stop_epochs
and --resume
options. By setting --stop_epochs
, the training will stop after the specified epoch. And by setting --resume
, the training will resume at the specified checkpoint. Then the user can construct an array of jobs such that the --resume
of a job points to the checkpoint where its preceding job stops. Below is a sample script that stops after the third epoch.
#!/bin/bash
#SBATCH --time=2-00:00:00
#SBATCH --mem=120G
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=fastclipv3
#SBATCH --partition=gpu
#SBATCH --output=%x_%j.log
source ~/.bashrc
conda activate fastclip
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
export MASTER_PORT=12805
export CUDA_VISIBLE_DEVICES='0,1,2,3'
export PYTHONPATH="$PYTHONPATH:$PWD/src"
export HUGGINGFACE_HUB_CACHE='./checkpoints/huggingface'
srun python -u src/training/main.py \
--save-frequency 1 \
--train-data './datasets/laion400m/laion400m-data/{00000..41407}.tar' \
--train-num-samples 315658316 --data_size 414080000 \
--warmup 13200 \
--batch-size 320 \
--epochs 30 \
--workers 6 \
--model ViT-B-16 \
--name xlarge_fastclipv3 \
--seed 2024 \
--profile \
--wd 0.2 \
--local-loss \
--fastclip --fastclip_eps 1e-6 --multiply_tau --temperature_scheme global_learnable \
--lr 2e-4 --lr_min 4e-5 --lr_tau 5e-5 --lr_tau_scheduler step_thresh --rho 16.0 \
--gamma 0.8 --gamma_schedule cosine --gamma_decay_epochs 10 \
--stop_epochs 3
Non-slurm Training: For non-slurm training, please set master_addr
manually (e.g., 127.0.0.1
), change srun python -u src/training/main.py
to cd src && torchrun --nproc_per_node=4 --rdzv_endpoint=$master_addr -m training.main
, and run the above script with /bin/bash
.
Sample slurm script to evaluate a trained CLIP model (specified by --resume
) on ImageNet-1k:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=20G
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --job-name=eval_fastclip
#SBATCH --partition=gpu
#SBATCH --output=%x_%j.log
source ~/.bashrc
conda activate fastclip
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
export MASTER_PORT=12802
export PYTHONPATH="$PYTHONPATH:$PWD/src"
export HUGGINGFACE_HUB_CACHE='./checkpoints/huggingface'
srun python -u src/training/main.py \
--resume ./logs/medium_fastclipv3/checkpoints/epoch_37.pt \
--zeroshot-frequency 1 \
--imagenet-val ./datasets/imagenet/val \
--batch-size 512 \
--epochs 0 \
--workers 6 \
--model RN50 \
--name eval_medium_fastclipv3_epoch_37 \
--seed 2024
Datacomp: For evaluation on the Datacomp benchmark, please refer to the "Evaluation" section in the Datacomp repository.
Non-slurm Training: For non-slurm training, please set master_addr
manually (e.g., 127.0.0.1
), change srun python -u src/training/main.py
to python src/training/main.py
, and run the above script with /bin/bash
.
If you find FastCLIP useful in your research, please consider citing the following paper:
@article{wei2024fastclip,
title={FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources},
author={Wei, Xiyuan and Ye, Fanjiang and Yonay, Ori and Chen, Xingyu and Sun, Baixi and Tao, Dingwen and Yang, Tianbao},
journal={arXiv preprint arXiv:2407.01445},
year={2024}
}