
Support for glm-4v-9b with mllm_plugin. #5343

Open · wants to merge 10 commits into base: main
Conversation

@marko1616 (Contributor) commented on Sep 3, 2024

🎉 Support for glm-4v-9b with mllm_plugin.

Fixes #4375

📝 Submission Checklist

  • Did you read the contributor guideline?
  • TODO: Eval tests for this PR (limited testing done so far).
  • TODO: SFT tests for this PR.

Notes:

  • Eval test: some padding issues may be encountered due to the self.training check in modeling_chatglm.py; see the illustrative sketch below.
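The following is a purely hypothetical sketch (not the actual modeling_chatglm.py code; the class and parameter names are invented for illustration) of how a branch gated on self.training can produce different sequence lengths between training and evaluation, which is the kind of mismatch that shows up as padding errors during eval:

```python
import torch
import torch.nn as nn

class ToyVisionTextEmbedding(nn.Module):
    """Illustrative only: expands image features into extra tokens at train time."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)
        self.image_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        text = self.word_emb(input_ids)            # [batch, seq_len, hidden]
        if self.training:
            # Training path: image features are projected and prepended,
            # so the effective length becomes seq_len + num_image_tokens.
            img = self.image_proj(image_feats)     # [batch, num_image_tokens, hidden]
            return torch.cat([img, text], dim=1)
        # Eval path (self.training is False): the expansion is skipped, so any
        # attention mask or label padding computed for the longer training-time
        # sequence no longer lines up with the shorter eval-time sequence.
        return text

if __name__ == "__main__":
    module = ToyVisionTextEmbedding()
    ids = torch.randint(0, 100, (1, 5))
    feats = torch.randn(1, 4, 8)
    module.train()
    print(module(ids, feats).shape)   # torch.Size([1, 9, 8])
    module.eval()
    print(module(ids, feats).shape)   # torch.Size([1, 5, 8])
```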

@hiyouga added the pending (This problem is yet to be addressed) label on Sep 3, 2024
@TheDuckingDuck

I am very much looking forward to having more vision models to experiment with :)

@GoGoZeppeli-towa commented on Sep 29, 2024

I tried using the configuration from this PR and set up the following yaml file:

### model
model_name_or_path: /workspace/ckpt/glm-4v-9b
print_param_status: false

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
freeze_vision_tower: False

### dataset
dataset: en_3k_img
template: glm4_v
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /workspace/sehyak/android_rl_checkpoints/glm-4v-9b/sft/en-3k-AE-image
logging_steps: 1
save_strategy: epoch
overwrite_output_dir: true
save_total_limit: 2
load_best_model_at_end: true
metric_for_best_model: eval_loss

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 8.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs: {min_lr_rate: 0.1}
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: epoch

report_to: wandb
run_name: glm-4v-9b-en-3k-image

However, when I train on a multi-node cluster, I keep encountering various NCCL timeout errors, such as "Heartbeat monitor timed out" and "NCCL WARN socketProgress: Connection closed by remote peer node078" in the NCCL debug log. This doesn't happen when I use the same configuration to train Qwen2-VL. Has anyone else experienced this problem, or does anyone have a recommendation for a solution?
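For reference, the ddp_timeout value in the config above is the Hugging Face TrainingArguments setting that, as far as I understand, is forwarded as the timeout of torch.distributed.init_process_group, which bounds how long NCCL collectives may block before watchdog/heartbeat errors like the ones quoted are raised. Below is a minimal sketch of what that mapping amounts to at the PyTorch level, assuming a standard torchrun launch (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT set by the launcher); it is an illustration, not LLaMA-Factory's actual initialization code:

```python
from datetime import timedelta

import torch.distributed as dist

def init_distributed(ddp_timeout_seconds: int = 180000000) -> None:
    # Mirrors what a ddp_timeout of 180000000 seconds amounts to: the timeout
    # is attached to the NCCL process group, and collectives that exceed it
    # trigger the watchdog/heartbeat errors seen in the NCCL debug log.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=ddp_timeout_seconds),
    )

if __name__ == "__main__":
    init_distributed()
    dist.barrier()                   # simple collective to check the group is healthy
    dist.destroy_process_group()
```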

Labels
pending (This problem is yet to be addressed)
Development

Successfully merging this pull request may close these issues.

[Feature request] Support for Qwen-VL
4 participants