SigLIP training/fine-tune #2383

alexisdrakopoulos · 2024-12-31T12:06:08Z

alexisdrakopoulos
Dec 31, 2024

Hi,

It's been a while since I've worked on CV problems. I saw some trainer scripts but they're all really high level. Do you have any advice or lower-level python scripts for training SigLIP?

I'd like control over data-loading/augmentations.

rwightman · 2024-12-31T17:51:22Z

rwightman
Dec 31, 2024
Maintainer

@alexisdrakopoulos I'm biased, but I'd typically check this repo out locally -- I have many diff instances of it -- and hack away. It's setup so that if you're in the root of the repo you can run the scripts and it'll reference the timm module in the local path. You can still do a pip -e editable install if that's the pref. Either way, train script was left quite flat to be be easy to hack / see what's going on.

Transforms and data loading is in https://github.com/huggingface/pytorch-image-models/tree/main/timm/data ... aug pipelines are built in transforms_factory.py and one can add / change the stack, can override with your own and switch to albumentations stack, etc.

That said, for siglip fine-tune, you can get pretty darn good results using the script as is to fine-tune on existing datasets on the hf hub, etc.

Without doing really any special augs (just default imagenet base), you can get pretty good fine-tune with these models because they are so strong:

python train.py --dataset hfds/timm/resisc45 --num-classes 45 --model vit_so400m_patch14_siglip_224 --pretrained --opt adamw --lr 2e-5 --weight-decay 0.01 --warmup-lr 0 --sched-on-updates --warmup-epochs 3 -b 32 --epochs 20 --amp --amp-dtype bfloat16 --layer-decay 0.8

There are a lot of other similarly capable encoders, in the past week I just added the aimv2 and pali2 encoders (not in a pip release yet), there are also the pali 1 encoders, the siglip models, and quite a number of CLIP encoders on different datasets.

[
'aimv2_1b_patch14_224.apple_pt',
'aimv2_1b_patch14_336.apple_pt',
'aimv2_1b_patch14_448.apple_pt',
'aimv2_3b_patch14_224.apple_pt',
'aimv2_3b_patch14_336.apple_pt',
'aimv2_3b_patch14_448.apple_pt',
'aimv2_huge_patch14_224.apple_pt',
'aimv2_huge_patch14_336.apple_pt',
'aimv2_huge_patch14_448.apple_pt',
'aimv2_large_patch14_224.apple_pt',
'aimv2_large_patch14_224.apple_pt_dist',
'aimv2_large_patch14_336.apple_pt',
'aimv2_large_patch14_336.apple_pt_dist',
'aimv2_large_patch14_448.apple_pt',
'vit_base_patch16_siglip_224.webli',
'vit_base_patch16_siglip_256.webli',
'vit_base_patch16_siglip_256.webli_i18n',
'vit_base_patch16_siglip_384.webli',
'vit_base_patch16_siglip_512.webli',
'vit_base_patch16_siglip_gap_224.webli',
'vit_base_patch16_siglip_gap_256.webli',
'vit_base_patch16_siglip_gap_256.webli_i18n',
'vit_base_patch16_siglip_gap_384.webli',
'vit_base_patch16_siglip_gap_512.webli',
'vit_large_patch16_siglip_256.webli',
'vit_large_patch16_siglip_384.webli',
'vit_large_patch16_siglip_gap_256.webli',
'vit_large_patch16_siglip_gap_384.webli',
'vit_so400m_patch14_siglip_224.webli',
'vit_so400m_patch14_siglip_378.webli',
'vit_so400m_patch14_siglip_378.webli_ft_in1k',
'vit_so400m_patch14_siglip_384.webli',
'vit_so400m_patch14_siglip_gap_224.pali2_3b_pt',
'vit_so400m_patch14_siglip_gap_224.pali2_10b_pt',
'vit_so400m_patch14_siglip_gap_224.pali_mix',
'vit_so400m_patch14_siglip_gap_224.pali_pt',
'vit_so400m_patch14_siglip_gap_224.webli',
'vit_so400m_patch14_siglip_gap_378.webli',
'vit_so400m_patch14_siglip_gap_378.webli_ft_in1k',
'vit_so400m_patch14_siglip_gap_384.webli',
'vit_so400m_patch14_siglip_gap_448.pali2_3b_docci',
'vit_so400m_patch14_siglip_gap_448.pali2_3b_pt',
'vit_so400m_patch14_siglip_gap_448.pali2_10b_docci',
'vit_so400m_patch14_siglip_gap_448.pali2_10b_pt',
'vit_so400m_patch14_siglip_gap_448.pali_mix',
'vit_so400m_patch14_siglip_gap_448.pali_ocrvqa',
'vit_so400m_patch14_siglip_gap_448.pali_pt',
'vit_so400m_patch14_siglip_gap_448.pali_refcoco_seg',
'vit_so400m_patch14_siglip_gap_896.pali2_3b_pt',
'vit_so400m_patch14_siglip_gap_896.pali2_10b_pt',
'vit_so400m_patch14_siglip_gap_896.pali_ocrvqa',
'vit_so400m_patch14_siglip_gap_896.pali_pt',
'vit_so400m_patch14_siglip_gap_896.pali_refcoco_seg',
'vit_so400m_patch16_siglip_256.webli_i18n',
'vit_so400m_patch16_siglip_gap_256.webli_i18n',
'vit_base_patch16_clip_224.dfn2b',
'vit_huge_patch14_clip_224.dfn5b',
'vit_huge_patch14_clip_378.dfn5b',
'vit_huge_patch14_clip_quickgelu_224.dfn5b',
'vit_huge_patch14_clip_quickgelu_378.dfn5b',
'vit_large_patch14_clip_224.dfn2b',
'vit_large_patch14_clip_224.dfn2b_s39b',
'vit_large_patch14_clip_quickgelu_224.dfn2b',
'vit_base_patch16_clip_224.datacompxl',
'vit_base_patch32_clip_224.datacompxl',
'vit_base_patch32_clip_256.datacompxl',
'vit_large_patch14_clip_224.datacompxl',
'vitamin_base_224.datacomp1b_clip',
'vitamin_base_224.datacomp1b_clip_ltt',
'vitamin_large2_224.datacomp1b_clip',
'vitamin_large2_256.datacomp1b_clip',
'vitamin_large2_336.datacomp1b_clip',
'vitamin_large2_384.datacomp1b_clip',
'vitamin_large_224.datacomp1b_clip',
'vitamin_large_256.datacomp1b_clip',
'vitamin_large_336.datacomp1b_clip',
'vitamin_large_384.datacomp1b_clip',
'vitamin_small_224.datacomp1b_clip',
'vitamin_small_224.datacomp1b_clip_ltt',
'vitamin_xlarge_256.datacomp1b_clip',
'vitamin_xlarge_336.datacomp1b_clip',
'vitamin_xlarge_384.datacomp1b_clip',
'eva02_base_patch16_clip_224.merged2b',
'eva02_large_patch14_clip_224.merged2b',
'eva02_large_patch14_clip_336.merged2b',
'eva_giant_patch14_clip_224.merged2b',
'eva02_large_patch14_224.mim_m38m',
'vit_base_patch16_clip_224.laion2b',
'vit_base_patch32_clip_224.laion2b',
'vit_giant_patch14_clip_224.laion2b',
'vit_gigantic_patch14_clip_224.laion2b',
'vit_huge_patch14_clip_224.laion2b',
'vit_large_patch14_clip_224.laion2b',
'vit_base_patch16_clip_224.openai',
'vit_base_patch16_clip_quickgelu_224.openai',
'vit_base_patch32_clip_224.openai',
'vit_base_patch32_clip_quickgelu_224.openai',
'vit_large_patch14_clip_224.openai',
'vit_large_patch14_clip_336.openai',
'vit_large_patch14_clip_quickgelu_224.openai',
'vit_large_patch14_clip_quickgelu_336.openai',
'convnext_base.clip_laion2b',
'convnext_base.clip_laion2b_augreg',
'convnext_base.clip_laiona',
'convnext_base.clip_laiona_320',
'convnext_base.clip_laiona_augreg_320',
'convnext_large_mlp.clip_laion2b_augreg',
'convnext_large_mlp.clip_laion2b_ft_320',
'convnext_large_mlp.clip_laion2b_ft_soup_320',
'convnext_xxlarge.clip_laion2b_rewind',
'convnext_xxlarge.clip_laion2b_soup',
]

6 replies

alexisdrakopoulos Dec 31, 2024
Author

Thanks for answering so quickly! I'll try it out.

My dataset is custom. I have around 2.2 million images with text annotations.

The thing is, each image has 6 equivalent sets of text (semantically the same but written in different ways).

I'd like to use the 6 sets of text randomly when selecting image/text pairs as a form of data augmentation.

Hence I need some control.

Also my dataset is already in a special directory format so I'd like to control the image loading rather than rewrite how I store 2 million images.

alexisdrakopoulos Dec 31, 2024
Author

I forgot to mention I don't have "classes", it's (image, text) pairs.

rwightman Dec 31, 2024
Maintainer

Then you'd want to look at https://github.com/mlfoundations/open_clip .. it has train / fine-tune / eval code for image + text and supports the SigLIP models (both image & text tower) and works with SigLIP or CLIP loss, including distributed train support, or can hack your own losses, etc. Either webdataset or .jpg + csv dataloading is supported but strongly recommend webdataset tar files.

alexisdrakopoulos Dec 31, 2024
Author

Thanks that's fantastic. I knew of the repo but not the data format. My current filesystem approach is each jpg has a hash string name like 12345.jpg and files are stored as 1/2/3/12345.jpg to make browsing manually quick.

I'll look into that new format.

rwightman Dec 31, 2024
Maintainer

@alexisdrakopoulos you can see some examples on the HuggingFace hub that will work out of the box w/ OpenCLIP training

Best to download manually with the huggingface-cli

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SigLIP training/fine-tune #2383

{{title}}

Replies: 1 comment 6 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

SigLIP training/fine-tune #2383

alexisdrakopoulos Dec 31, 2024

Replies: 1 comment · 6 replies

rwightman Dec 31, 2024 Maintainer

alexisdrakopoulos Dec 31, 2024 Author

alexisdrakopoulos Dec 31, 2024 Author

rwightman Dec 31, 2024 Maintainer

alexisdrakopoulos Dec 31, 2024 Author

rwightman Dec 31, 2024 Maintainer

alexisdrakopoulos
Dec 31, 2024

Replies: 1 comment 6 replies

rwightman
Dec 31, 2024
Maintainer

alexisdrakopoulos Dec 31, 2024
Author

alexisdrakopoulos Dec 31, 2024
Author

rwightman Dec 31, 2024
Maintainer

alexisdrakopoulos Dec 31, 2024
Author

rwightman Dec 31, 2024
Maintainer