
feat(training,rollout)!: Rollout Schedulers #46

Open · wants to merge 65 commits into main
Conversation

@HCookie (Member) commented Dec 20, 2024

Closes #14

Rollout Schedulers

Expands the ways rollout can be described, and provides an interface for scheduling updates to it.

New default rollout config

rollout:
  _target_: anemoi.training.schedulers.rollout.stepped.EpochStepped
  minimum: 1
  maximum: 12
  # increase rollout every n epochs
  every_n_epochs: 1
  # Control the incrementing of the rollout window
  increment:
    step:
      0: 0
      200000: 1 # After 200k steps, increment by 1 every 1 epoch

The scheduler can step by epoch or by batch step, and the increment can be controlled based on either the step or the epoch.
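
For illustration, a minimal sketch of how an epoch-stepped scheduler of this shape might behave (the class name and method signature are hypothetical, not the actual EpochStepped API):

class EpochSteppedSketch:
    """Toy epoch-stepped rollout scheduler mirroring the config above."""

    def __init__(self, minimum: int, maximum: int, every_n_epochs: int, increment: dict[int, int]) -> None:
        self.minimum = minimum
        self.maximum = maximum
        self.every_n_epochs = every_n_epochs
        self.increment = increment  # global-step threshold -> increment size
        self.rollout = minimum

    def on_epoch_end(self, epoch: int, global_step: int) -> int:
        """Grow the rollout window every `every_n_epochs` epochs."""
        if (epoch + 1) % self.every_n_epochs == 0:
            # Apply the increment attached to the largest step threshold
            # the run has already passed (0 before 200k steps, 1 after,
            # with the default config above).
            inc = 0
            for threshold, value in sorted(self.increment.items()):
                if global_step >= threshold:
                    inc = value
            self.rollout = min(self.maximum, self.rollout + inc)
        return self.rollout

With the default config, rollout stays at 1 until training passes 200k steps, then grows by 1 at the end of every epoch until it reaches 12.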

Additionally, formally add random steppers.

Todo

  • Integrate with the data loader
  • Ensure that the randomness is seeded appropriately
  • Randomness broadcast
  • Ensure restartability
  • Ability to change config

Tested

Tested with restart, and with a config change after restart.


📚 Documentation previews 📚:
  • https://anemoi-training--46.org.readthedocs.build/en/46/
  • https://anemoi-graphs--46.org.readthedocs.build/en/46/
  • https://anemoi-models--46.org.readthedocs.build/en/46/

@HCookie self-assigned this Dec 20, 2024
@HCookie changed the title from feat(rollout)!: Rollout Schedulers to feat(training,rollout)!: Rollout Schedulers Dec 20, 2024
@HCookie added the documentation and enhancement labels Dec 20, 2024
@anaprietonem (Contributor) left a comment


Started to go through the PR and left some comments! I still need to understand some of the functionality better, so I hope the questions make sense. Thanks for this, Harrison!

@FussyDuck commented Jan 9, 2025

CLA assistant check: all committers have signed the CLA.

HCookie and others added 6 commits January 23, 2025
@HCookie force-pushed the 14-training-rollout-scheduling branch from e543ff1 to 88eacfb on February 17, 2025
@HCookie force-pushed the 14-training-rollout-scheduling branch from b3c8815 to 944aa66 on February 18, 2025
@HCookie force-pushed the 14-training-rollout-scheduling branch from 944aa66 to 49c09b6 on February 18, 2025
@ssmmnn11 (Member) left a comment


nice, please see comments

@HCookie (Member, Author) commented Feb 18, 2025

In reply to @ssmmnn11: "can you check that the runs are still shuffled differently for every epoch?"

They are not; the indices are the same for each epoch.
However, I have been thinking further about the need to reload. It may be possible to avoid reloading, but I'll need to figure out how to broadcast a value to all the dataloader workers, since that is why they do not update. Rebuilding them fixed that, but it causes other issues.

I don't believe it is as simple as setting persistent_workers=False, as that also causes the no-shuffling issue.
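
To make the failure mode concrete, here is a small standalone repro (names are illustrative; it only assumes the rollout value lives as a plain attribute on the dataset):

from torch.utils.data import DataLoader, Dataset


class RolloutDataset(Dataset):
    """Toy dataset whose rollout value is mutated in the main process."""

    def __init__(self) -> None:
        self.rollout = 1

    def __len__(self) -> int:
        return 4

    def __getitem__(self, idx: int) -> int:
        # Each worker process holds its own copy of the dataset; with
        # persistent_workers=True that copy is never refreshed.
        return self.rollout


if __name__ == "__main__":
    ds = RolloutDataset()
    loader = DataLoader(ds, num_workers=2, persistent_workers=True)
    for epoch in range(3):
        ds.rollout = epoch + 1  # visible only in the main process
        # The persistent workers still yield the value they were spawned with.
        print(epoch, [int(b) for b in loader])

The update to ds.rollout never reaches the long-lived workers, which is the broadcast problem described above.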

@ssmmnn11 (Member) commented:

persistent_workers=False means we have an initial seed and then just draw random numbers from the generator, hence there should not be any repetition.

What we could do: previously we did not have the epoch information in the dataset. That should now be available, so you could simply add the epoch to the random seed:

base_seed = get_base_seed()

something like: base_seed = get_base_seed() + self.epoch * 100

though we should probably rename base_seed here to dataset_seed, otherwise it might be confusing, since we also refer to the base seed elsewhere in the code.
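
A minimal sketch of that suggestion (get_base_seed is stubbed here for illustration, standing in for the helper referenced above; the * 100 offset follows the comment):

import numpy as np


def get_base_seed() -> int:
    # Stub for illustration only; the real helper is resolved elsewhere.
    return 1234


def dataset_seed(epoch: int) -> int:
    """Offset the base seed by the epoch so each epoch shuffles differently."""
    return get_base_seed() + epoch * 100


# A distinct generator per epoch changes the shuffle order across epochs,
# even if workers restart from the same initial state each time.
for epoch in range(3):
    rng = np.random.default_rng(dataset_seed(epoch))
    print(epoch, rng.permutation(8))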

@HCookie force-pushed the 14-training-rollout-scheduling branch from 56e11ab to 92e0a0b on February 19, 2025
@HCookie force-pushed the 14-training-rollout-scheduling branch from 8629419 to fe43a24 on February 19, 2025
Labels
breaking change · config · documentation · enhancement · training
Projects
Status: ATS
Development

Successfully merging this pull request may close these issues.

Rollout Scheduling
7 participants