
Support for glm-4v-9b with mllm_plugin. #5343

Open · wants to merge 10 commits into base: main
Conversation

@marko1616 (Contributor) commented on Sep 3, 2024

🎉 Support for glm-4v-9b with mllm_plugin.

Fixes #4375

📝 Submission Checklist

  • Did you read the contributor guideline?
  • TODO: Eval tests for this PR (limited testing done so far).
  • TODO: SFT tests for this PR.

Notes:

  • Eval test: some padding issues may be encountered due to the self.training check in modeling_chatglm.py; see the illustrative sketch below.
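The following is a purely hypothetical sketch (not the actual modeling_chatglm.py code; the class and parameter names are invented for illustration) of how a branch gated on self.training can produce different sequence lengths between training and evaluation, which is the kind of mismatch that shows up as padding errors during eval:

```python
import torch
import torch.nn as nn

class ToyVisionTextEmbedding(nn.Module):
    """Illustrative only: expands image features into extra tokens at train time."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)
        self.image_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        text = self.word_emb(input_ids)            # [batch, seq_len, hidden]
        if self.training:
            # Training path: image features are projected and prepended,
            # so the effective length becomes seq_len + num_image_tokens.
            img = self.image_proj(image_feats)     # [batch, num_image_tokens, hidden]
            return torch.cat([img, text], dim=1)
        # Eval path (self.training is False): the expansion is skipped, so any
        # attention mask or label padding computed for the longer training-time
        # sequence no longer lines up with the shorter eval-time sequence.
        return text

if __name__ == "__main__":
    module = ToyVisionTextEmbedding()
    ids = torch.randint(0, 100, (1, 5))
    feats = torch.randn(1, 4, 8)
    module.train()
    print(module(ids, feats).shape)   # torch.Size([1, 9, 8])
    module.eval()
    print(module(ids, feats).shape)   # torch.Size([1, 5, 8])
```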

@hiyouga added the pending (This problem is yet to be addressed) label on Sep 3, 2024
@TheDuckingDuck

I am very much looking forward to having more vision models to experiment with :)

@GoGoZeppeli-towa commented on Sep 29, 2024

I tried using the configuration from this PR and set up the following yaml file:

### model
model_name_or_path: /workspace/ckpt/glm-4v-9b
print_param_status: false

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
freeze_vision_tower: False

### dataset
dataset: en_3k_img
template: glm4_v
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /workspace/sehyak/android_rl_checkpoints/glm-4v-9b/sft/en-3k-AE-image
logging_steps: 1
save_strategy: epoch
overwrite_output_dir: true
save_total_limit: 2
load_best_model_at_end: true
metric_for_best_model: eval_loss

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 8.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs: {min_lr_rate: 0.1}
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: epoch

report_to: wandb
run_name: glm-4v-9b-en-3k-image

However, when I train on a multi-node cluster, I keep encountering various NCCL timeout errors, such as "Heartbeat monitor timed out" and "NCCL WARN socketProgress: Connection closed by remote peer node078" in the NCCL debug log. This doesn't happen when I use the same configuration to train Qwen2-VL. Has anyone else experienced this problem, or does anyone have a recommendation for a solution?
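For reference, the ddp_timeout value in the config above is the Hugging Face TrainingArguments setting that, as far as I understand, is forwarded as the timeout of torch.distributed.init_process_group, which bounds how long NCCL collectives may block before watchdog/heartbeat errors like the ones quoted are raised. Below is a minimal sketch of what that mapping amounts to at the PyTorch level, assuming a standard torchrun launch (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT set by the launcher); it is an illustration, not LLaMA-Factory's actual initialization code:

```python
from datetime import timedelta

import torch.distributed as dist

def init_distributed(ddp_timeout_seconds: int = 180000000) -> None:
    # Mirrors what a ddp_timeout of 180000000 seconds amounts to: the timeout
    # is attached to the NCCL process group, and collectives that exceed it
    # trigger the watchdog/heartbeat errors seen in the NCCL debug log.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=ddp_timeout_seconds),
    )

if __name__ == "__main__":
    init_distributed()
    dist.barrier()                   # simple collective to check the group is healthy
    dist.destroy_process_group()
```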

Labels
pending (This problem is yet to be addressed)
Development

Successfully merging this pull request may close these issues.

[Feature request] Support for Qwen-VL
4 participants