
Unable to recover after checkpoint saving #868

Open
peregilk opened this issue Sep 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments


peregilk commented Sep 6, 2024

I am suddenly seeing crashes after saving checkpoints. This is with code that previously ran perfectly; however, it is after a system reinstall. I wonder if anyone has seen the same issue.

The checkpoints are saved successfully. Training, however, does not recover and crashes with this error:

I0907 22:26:01.536684 139725339944960 utils.py:253] [process=24][thread=MainThread] Waiting with jax/sync_global_devices("CheckpointManager:old_steps_to_remove.20000")
I0907 22:26:01.539081 139725339944960 utils.py:260] [process=24][thread=MainThread] Done waiting with jax/sync_global_devices("CheckpointManager:old_steps_to_remove.20000")
I0907 22:26:01.539187 139725339944960 checkpoint_manager.py:1744] [host=24][thread=MainThread][step=20000] CheckpointManager Save Finalize is syncing with other hosts...
Traceback (most recent call last):
  File "/home/perk/maxtext/MaxText/train.py", line 687, in <module>
    app.run(main)
  File "/home/perk/maxtext-env/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/perk/maxtext-env/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/perk/maxtext/MaxText/train.py", line 683, in main
    train_loop(config)
  File "/home/perk/maxtext/MaxText/train.py", line 606, in train_loop
    if save_checkpoint(checkpoint_manager, int(step), state, config.dataset_type, data_iterator):
  File "/home/perk/maxtext/MaxText/train.py", line 184, in save_checkpoint
    return checkpoint_manager.save(
  File "/home/perk/maxtext-env/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 1253, in save
    self._finalize(step, steps_to_remove)
  File "/home/perk/maxtext-env/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 1751, in _finalize
    barrier_sync_fn = self._create_thread_safe_barrier_sync_fn()
  File "/home/perk/maxtext-env/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 722, in _create_thread_safe_barrier_sync_fn
    or multihost.get_barrier_sync_fn(
  File "/home/perk/maxtext-env/lib/python3.10/site-packages/orbax/checkpoint/multihost/utils.py", line 154, in get_barrier_sync_fn
    client = _get_jax_distributed_client()
  File "/home/perk/maxtext-env/lib/python3.10/site-packages/orbax/checkpoint/multihost/utils.py", line 113, in _get_jax_distributed_client
    raise ValueError(
ValueError: Distributed system is not available; please initialize it via `jax.distributed.initialize()` at the start of your program.
I0907 22:26:01.596143 139677066847808 grain_pool.py:397] Grain pool is exiting.     
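
For context, a minimal sketch of what the error message is asking for (not MaxText's actual fix; the surrounding setup is assumed): the JAX distributed client has to be brought up before Orbax runs its cross-host barrier.

    import jax

    # Bring up the distributed runtime before any checkpointing/barrier code runs.
    # On Cloud TPU the coordinator address, process count and process index are
    # auto-detected; on other platforms they must be passed explicitly.
    jax.distributed.initialize()

    # ... build the model, data iterator and Orbax CheckpointManager as usual ...
    # checkpoint_manager.save(step, ...) can then sync across hosts instead of
    # raising "Distributed system is not available".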
jonb377 (Collaborator) commented Sep 17, 2024

Thanks for reporting! A patch is in the works in #895.

As an immediate workaround, you can enable async checkpointing with the config async_checkpointing=true, which initializes the jax distributed client.
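
For illustration only (the config file and any other flags are placeholders, not taken from this issue), the override can be appended to the usual MaxText launch command:

    python3 MaxText/train.py MaxText/configs/base.yml async_checkpointing=true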

gobbleturk added the bug label on Sep 17, 2024
peregilk (Author) commented

Awesome. Thanks.
