
Wandb tracking issue for distributed runs launched with accelerate. #33

Open
eihli opened this issue Nov 14, 2023 · 0 comments


eihli commented Nov 14, 2023

When I train with accelerate launch on a machine with 4 GPUs, I get 4 wandb runs. All 4 runs include system statistics, like GPU temperature, but only one of them includes the training/evaluation panels, because of our check for is_main_process.

All 4 of the runs highlighted in the image below are from a single accelerate launch train.py ...
image

It looks like the way to use wandb with Accelerator is to pass the log_with argument when instantiating the accelerator, accelerator = Accelerator(log_with="wandb"), and then call accelerator.init_trackers only when is_main_process is true.
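
A minimal sketch of what that could look like, not a tested patch for this repo. The project name "my-project", the placeholder loss, and the train/main function names are hypothetical. The explicit is_main_process guard follows the description above; Accelerate may already restrict tracker setup to the main process internally, in which case the guard is redundant but harmless.

```python
def train(accelerator, num_steps=3):
    """Run a dummy loop that reports metrics through the accelerator's trackers.

    `accelerator` is expected to be an accelerate.Accelerator created with
    log_with="wandb".
    """
    # Only the main process starts a tracker, so one launch should produce
    # one wandb run instead of one per GPU.
    if accelerator.is_main_process:
        accelerator.init_trackers(
            project_name="my-project",  # hypothetical project name
            config={"num_steps": num_steps},
        )

    for step in range(num_steps):
        loss = 1.0 / (step + 1)  # placeholder for a real training step
        # Goes to whatever trackers were initialized on this process.
        accelerator.log({"train/loss": loss}, step=step)

    # Lets the trackers finish and flush their runs.
    accelerator.end_training()


def main():
    # Imported here so the sketch can be read and tested without accelerate.
    from accelerate import Accelerator

    train(Accelerator(log_with="wandb"))
```

In train.py this would mean calling main() at the bottom of the script and replacing our direct wandb.init / wandb.log calls with the accelerator equivalents, assuming that is how we currently log.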

This issue has some discussion on the 🤗 forums: https://discuss.huggingface.co/t/multiple-wandb-outputs/21394
