Wandb tracking issue for distributed runs launched with accelerate. #33

eihli · 2023-11-14T16:30:37Z

I noticed that I get 4 wandb runs when I train with accelerate launch on a machine with 4 GPUs. All 4 runs include system statistics, such as GPU temperature, but only one of them includes the training/evaluation panels, because our logging is guarded by a check for is_main_process.

All 4 of the runs highlighted in the image below are from a single accelerate launch train.py ...

It looks like the way to use wandb with Accelerator is to pass the log_with argument when instantiating the accelerator, accelerator = Accelerator(log_with="wandb"), and then call accelerator.init_trackers only when is_main_process is true.
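A minimal sketch of that pattern, in case it helps whoever picks this up. The helper name setup_tracking, the project name, and the config values are hypothetical and not taken from our train.py. Recent accelerate versions may already restrict init_trackers to the main process, in which case the explicit guard is redundant but harmless.

```python
def setup_tracking(accelerator, project_name, config=None):
    """Start the configured trackers on the main process only.

    Returns True if trackers were started in this process, False otherwise.
    (Hypothetical helper, not part of accelerate.)
    """
    if accelerator.is_main_process:
        accelerator.init_trackers(project_name, config=config)
        return True
    return False


def main():
    # Imported lazily so the helper above does not require accelerate at import time.
    from accelerate import Accelerator

    # Let accelerate own the wandb run instead of calling wandb.init() in every process.
    accelerator = Accelerator(log_with="wandb")
    setup_tracking(accelerator, "my-project", config={"lr": 3e-4})  # placeholder values

    for step in range(3):
        loss = 1.0 / (step + 1)  # stand-in for a real training step
        # accelerator.log forwards the metrics to the trackers started above.
        accelerator.log({"train/loss": loss}, step=step)

    # Flushes and closes the trackers.
    accelerator.end_training()
```

With this shape, train.py would call main() under accelerate launch and should no longer need its own wandb.init() call; whether that fully removes the extra runs on our setup still needs to be confirmed.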

This issue has some discussion on the 🤗 forums: https://discuss.huggingface.co/t/multiple-wandb-outputs/21394

eihli mentioned this issue Feb 13, 2024

Fix issues with distributed training #80

Merged
