
Add Rank-Zero Printing and Improve Wandb Initialization #16

Open
sadamov wants to merge 14 commits into main

Conversation

@sadamov (Collaborator) commented May 1, 2024

Description:
This PR introduces rank_zero_print and init_wandb utility functions to enhance printing behavior in multi-process environments and streamline Weights and Biases (wandb) initialization for logging and experiment tracking.

Changes:

  • Added rank_zero_print function (a minimal sketch follows this list):

    • Ensures printing only occurs from the rank-0 process in multi-process setups.
    • Wraps the built-in print function for selective printing based on process rank.
  • Introduced init_wandb function (a hedged sketch follows the Benefits section):

    • Initializes wandb based on provided arguments.
    • Handles new runs and resuming existing runs.
    • Generates meaningful run names with relevant information.
    • Creates WandbLogger for PyTorch Lightning integration.
    • Saves constants.py file to wandb for reference.
  • Added necessary imports for time, PyTorch Lightning, wandb, and rank-zero utilities.
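
A minimal sketch of the rank-zero wrapper, assuming it builds on PyTorch Lightning's rank_zero_only decorator (the imports listed above suggest as much); this is an illustration, not necessarily the PR's exact code:

    from pytorch_lightning.utilities import rank_zero_only

    @rank_zero_only
    def rank_zero_print(*args, **kwargs):
        """Print only from the rank-0 process; no-op on all other ranks."""
        print(*args, **kwargs)

The decorator checks the process's global rank and skips the call entirely on non-zero ranks, so call sites need no rank checks of their own.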

Benefits:

  • Prevents duplicate or conflicting output in multi-process environments.
  • Simplifies wandb setup and configuration for experiment tracking and logging.
  • Improves organization and identification of experiments with informative run names.
  • Enhances reproducibility by saving constants.py file to wandb.
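
For illustration, a hedged sketch of what init_wandb might look like; the argument names (args.resume_run, args.model, args.project) and the run-name format are assumptions, not the PR's actual code:

    import time

    import wandb
    from pytorch_lightning.loggers import WandbLogger

    def init_wandb(args):
        """Initialize wandb and return a WandbLogger for PyTorch Lightning.

        NOTE: the argument names used here are hypothetical placeholders.
        """
        if args.resume_run is None:
            # New run: build an informative name from the model and a timestamp
            run_name = f"{args.model}-{time.strftime('%m_%d_%H_%M')}"
            wandb.init(project=args.project, name=run_name, config=vars(args))
        else:
            # Resume an existing run by its wandb run id
            wandb.init(project=args.project, id=args.resume_run, resume="must")
        logger = WandbLogger(project=args.project)
        # Upload constants.py so the exact experiment setup is reproducible
        wandb.save("neural_lam/constants.py")
        return logger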

Simon Adamov added 3 commits May 1, 2024 20:15
@sadamov changed the title from "Feature rank one utils" to "Add Rank-Zero Printing and Improve Wandb Initialization" May 1, 2024
@sadamov requested a review from joeloskarsson May 1, 2024 20:21
@sadamov self-assigned this May 1, 2024
@sadamov added the enhancement (New feature or request) label May 1, 2024
@sadamov requested a review from leifdenby May 14, 2024 05:33
@joeloskarsson (Collaborator) left a comment

Overall a nice improvement. I had some small comments related to my use cases.

Five inline review threads on neural_lam/utils.py (three marked outdated)
@sadamov (Collaborator, Author) commented May 25, 2024

Okay, I requested another review; from my side this PR should be ready to merge.

@joeloskarsson (Collaborator) left a comment

I tested this on multi-GPU and it works great. I do, however, get a warning when the logger is created that I think we can maybe avoid.

Two inline review threads on neural_lam/utils.py
@sadamov (Collaborator, Author) commented Jun 6, 2024

@joeloskarsson the issue was that wandb is only initialized once trainer.fit() has started. So a nice way to solve this is to remove the wandb.init() call as you suggested and then save the configs in the ar_model:

    def on_run_end(self):
        # Save only from the rank-0 process to avoid duplicate uploads;
        # by this point wandb has already been initialized by trainer.fit()
        if self.trainer.is_global_zero:
            wandb.save("neural_lam/data_config.yaml")

@sadamov (Collaborator, Author) commented Jun 6, 2024

@leifdenby can I add this PR to the roadmap for v0.2.0, and then add it as a feature to the changelog as well?

@joeloskarsson (Collaborator) left a comment

Right, I now remember that the wandb initialization timing is a bit annoying. But I think this is a great solution! I just don't understand the on_run_end hook, so I want to double-check that before we merge.

Also: you can go ahead and add this to the changelog now.

One inline review thread on neural_lam/models/ar_model.py (outdated)
@joeloskarsson (Collaborator) left a comment

I realized that wandb.save retains the whole directory structure relative to the file it is saving, which I don't think we want for the data config file (it has a quite odd API: https://docs.wandb.ai/ref/python/save). I took the liberty of pushing a change so that the data config is saved directly under files in wandb. Now that there is a bit more logic, I also put it in a separate method, and switched the save policy to policy="now" while at it.

If that looks ok to you, I think this is good to go, after input from @leifdenby on how to place this w.r.t. the roadmap. If you think my change is stupid, please just overwrite it 😄
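
For reference, a sketch of what the flattened save might look like using wandb.save's base_path argument; the method name _save_data_config is assumed for illustration:

    import wandb

    def _save_data_config(self):
        """Upload data_config.yaml directly under the run's files/ dir.

        base_path strips the leading "neural_lam/" prefix so wandb does
        not recreate the repo directory structure, and policy="now"
        uploads immediately rather than waiting for the run to end.
        """
        if self.trainer.is_global_zero:
            wandb.save(
                "neural_lam/data_config.yaml",
                base_path="neural_lam",
                policy="now",
            )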

@leifdenby (Member) commented

LGTM!

@sadamov (Collaborator, Author) commented Sep 6, 2024

Apologies, I have been quite slow reacting to this PR; I have now merged the latest main into the PR branch. Okay with @leifdenby if I merge this into v0.2.0 now?
