
feat(training,rollout)!: Rollout Schedulers #46

Open · wants to merge 65 commits into main
Conversation

@HCookie (Member) commented Dec 20, 2024

Closes #14

Rollout Schedulers

Expands the ways rollout can be described, and provides an interface for scheduling updates to it.

New default rollout config

rollout:
  _target_: anemoi.training.schedulers.rollout.stepped.EpochStepped
  minimum: 1
  maximum: 12
  # increase rollout every n epochs
  every_n_epochs: 1
  # Control the incrementing of the rollout window
  increment:
    step:
      0: 0
      200000: 1 # After 200k steps, increment by 1 every 1 epoch

The scheduler can step by epoch or by batch step, and the increment can be controlled based on either the step or the epoch.
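
For illustration, a minimal sketch of how an epoch-stepped scheduler of this shape might behave (the class name and method signature are hypothetical, not the actual EpochStepped API):

class EpochSteppedSketch:
    """Toy epoch-stepped rollout scheduler mirroring the config above."""

    def __init__(self, minimum: int, maximum: int, every_n_epochs: int, increment: dict[int, int]) -> None:
        self.minimum = minimum
        self.maximum = maximum
        self.every_n_epochs = every_n_epochs
        self.increment = increment  # global-step threshold -> increment size
        self.rollout = minimum

    def on_epoch_end(self, epoch: int, global_step: int) -> int:
        """Grow the rollout window every `every_n_epochs` epochs."""
        if (epoch + 1) % self.every_n_epochs == 0:
            # Apply the increment attached to the largest step threshold
            # the run has already passed (0 before 200k steps, 1 after,
            # with the default config above).
            inc = 0
            for threshold, value in sorted(self.increment.items()):
                if global_step >= threshold:
                    inc = value
            self.rollout = min(self.maximum, self.rollout + inc)
        return self.rollout

With the default config, rollout stays at 1 until training passes 200k steps, then grows by 1 at the end of every epoch until it reaches 12.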

Additionally, formally add random steppers.

Todo

  • Integrate with the data loader
  • Ensure that the randomness is seeded appropriately
  • Randomness broadcast
  • Ensure restartability
  • Ability to change config

Tested

Tested with restart, and with a config change after restart.


📚 Documentation previews 📚:
  • https://anemoi-training--46.org.readthedocs.build/en/46/
  • https://anemoi-graphs--46.org.readthedocs.build/en/46/
  • https://anemoi-models--46.org.readthedocs.build/en/46/

@HCookie self-assigned this Dec 20, 2024
@HCookie changed the title from feat(rollout)!: Rollout Schedulers to feat(training,rollout)!: Rollout Schedulers Dec 20, 2024
@HCookie added the documentation and enhancement labels Dec 20, 2024
@anaprietonem (Contributor) left a comment


Started to go through the PR and left some comments! I still need to understand some of the functionality better, so I hope the questions make sense. Thanks for this, Harrison!

@FussyDuck commented Jan 9, 2025

CLA assistant check: all committers have signed the CLA.

HCookie and others added 6 commits January 23, 2025
@HCookie force-pushed the 14-training-rollout-scheduling branch from e543ff1 to 88eacfb on February 17, 2025
@HCookie force-pushed the 14-training-rollout-scheduling branch from b3c8815 to 944aa66 on February 18, 2025
@HCookie force-pushed the 14-training-rollout-scheduling branch from 944aa66 to 49c09b6 on February 18, 2025
@ssmmnn11 (Member) left a comment


nice, please see comments

@HCookie (Member, Author) commented Feb 18, 2025

In reply to @ssmmnn11: "can you check that the runs are still shuffled differently for every epoch?"

They are not; the indices are the same for each epoch.
However, I have been thinking further about the need to reload. It may be possible to avoid reloading, but I'll need to figure out how to broadcast a value to all the dataloader workers, since that is why they do not update. Rebuilding them fixed that, but it causes other issues.

I don't believe it is as simple as setting persistent_workers=False, as that also causes the no-shuffling issue.
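
To make the failure mode concrete, here is a small standalone repro (names are illustrative; it only assumes the rollout value lives as a plain attribute on the dataset):

from torch.utils.data import DataLoader, Dataset


class RolloutDataset(Dataset):
    """Toy dataset whose rollout value is mutated in the main process."""

    def __init__(self) -> None:
        self.rollout = 1

    def __len__(self) -> int:
        return 4

    def __getitem__(self, idx: int) -> int:
        # Each worker process holds its own copy of the dataset; with
        # persistent_workers=True that copy is never refreshed.
        return self.rollout


if __name__ == "__main__":
    ds = RolloutDataset()
    loader = DataLoader(ds, num_workers=2, persistent_workers=True)
    for epoch in range(3):
        ds.rollout = epoch + 1  # visible only in the main process
        # The persistent workers still yield the value they were spawned with.
        print(epoch, [int(b) for b in loader])

The update to ds.rollout never reaches the long-lived workers, which is the broadcast problem described above.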

@ssmmnn11 (Member) commented:

persistent_workers=False means we have an initial seed and then just draw random numbers from the generator, hence there should not be any repetition.

What we could do: previously we did not have the epoch information in the dataset. That should now be available, so you could simply add the epoch to the random seed:

base_seed = get_base_seed()

something like: base_seed = get_base_seed() + self.epoch * 100

though we should probably rename base_seed here to dataset_seed, otherwise it might be confusing, since we also refer to the base seed elsewhere in the code.
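
A minimal sketch of that suggestion (get_base_seed is stubbed here for illustration, standing in for the helper referenced above; the * 100 offset follows the comment):

import numpy as np


def get_base_seed() -> int:
    # Stub for illustration only; the real helper is resolved elsewhere.
    return 1234


def dataset_seed(epoch: int) -> int:
    """Offset the base seed by the epoch so each epoch shuffles differently."""
    return get_base_seed() + epoch * 100


# A distinct generator per epoch changes the shuffle order across epochs,
# even if workers restart from the same initial state each time.
for epoch in range(3):
    rng = np.random.default_rng(dataset_seed(epoch))
    print(epoch, rng.permutation(8))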

@HCookie force-pushed the 14-training-rollout-scheduling branch from 56e11ab to 92e0a0b on February 19, 2025
@HCookie force-pushed the 14-training-rollout-scheduling branch from 8629419 to fe43a24 on February 19, 2025
Labels
breaking change · config · documentation · enhancement · training
Projects
Status: ATS
Development

Successfully merging this pull request may close these issues.

Rollout Scheduling
7 participants