Support for glm-4v-9b with mllm_plugin. #5343
Conversation
I am very much looking forward to having more vision models to experiment with :)
I tried using the configuration from this PR and set up the following:

```yaml
### model
model_name_or_path: /workspace/ckpt/glm-4v-9b
print_param_status: false

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
freeze_vision_tower: false

### dataset
dataset: en_3k_img
template: glm4_v
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /workspace/sehyak/android_rl_checkpoints/glm-4v-9b/sft/en-3k-AE-image
logging_steps: 1
save_strategy: epoch
overwrite_output_dir: true
save_total_limit: 2
load_best_model_at_end: true
metric_for_best_model: eval_loss

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 8.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs: {min_lr_rate: 0.1}
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: epoch
report_to: wandb
run_name: glm-4v-9b-en-3k-image
```

However, when I train on a multi-node cluster, I keep encountering various NCCL timeout errors such as:
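As an aside on the scheduler settings in the config above: `cosine_with_min_lr` with `lr_scheduler_kwargs: {min_lr_rate: 0.1}` means the learning rate decays along a cosine curve but bottoms out at `min_lr_rate * learning_rate` (here 0.1 × 8.0e-6 = 8.0e-7) instead of reaching zero. A minimal sketch of that shape (the function and step counts below are illustrative, not the exact implementation used by the trainer):

```python
import math

def cosine_with_min_lr(step, total_steps, warmup_steps, base_lr, min_lr_rate):
    """Illustrative cosine-with-floor schedule: linear warmup, then a cosine
    decay from base_lr down to base_lr * min_lr_rate (never to zero)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0 over training
    min_lr = base_lr * min_lr_rate
    return min_lr + (base_lr - min_lr) * cosine

# With the values from the config: base_lr=8e-6, min_lr_rate=0.1
print(cosine_with_min_lr(30, 1000, 30, 8e-6, 0.1))    # end of warmup: base_lr
print(cosine_with_min_lr(1000, 1000, 30, 8e-6, 0.1))  # final step: the 8e-7 floor
```

With plain `cosine` the final LR would approach zero; the floor keeps late-epoch updates from vanishing entirely.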
🎉 Support for glm-4v-9b with `mllm_plugin`. Fixes #4375

📝 Submission Checklist

notes: `self.training` in `modeling_chatglm.py`.