Integrate emergency checkpointer into standalone_checkpointer for CPUs. #767

Open
wants to merge 2 commits into
base: main

Conversation

RoshaniN
Collaborator

@RoshaniN RoshaniN commented Jul 12, 2024

  1. When local checkpoints are available for restore, alter the mesh setup as follows:
  • Ignore the JAX coordinator provided by XPK and override the JAX coordinator to be the pod containing process_id 0.
  • Use socket APIs to obtain the IP address of this pod and persist it to GCS.
  • Let the other processes retrieve the address of this coordinator, similar to TPUs.
  2. Restore state rather than state["items"] if the emergency checkpoint manager is enabled.

  3. Some changes in JAX/Orbax are needed to get this working end-to-end on CPUs. Other changes are addressed in this PR.
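
For reviewers, the coordinator handoff and the restore selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: all function names are hypothetical, and a local file path stands in for the GCS object so the sketch runs with the standard library only.

```python
import socket
import time
from pathlib import Path
from typing import Optional


def get_local_ip() -> str:
    """Best-effort lookup of this host's IP address via the socket API."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except socket.gaierror:
        # Fall back to loopback if the hostname cannot be resolved.
        return "127.0.0.1"


def publish_coordinator_address(path: Path, port: int, ip: Optional[str] = None) -> str:
    """Process 0 writes "ip:port" to a shared location (assumed: a stand-in for GCS)."""
    address = f"{ip or get_local_ip()}:{port}"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(address)
    tmp.replace(path)  # rename so readers never observe a partial write
    return address


def wait_for_coordinator_address(path: Path, timeout_s: float = 60.0,
                                 poll_s: float = 0.5) -> str:
    """Other processes poll the shared location until the address appears."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if path.exists():
            content = path.read_text().strip()
            if content:
                return content
        time.sleep(poll_s)
    raise TimeoutError(f"No coordinator address found at {path}")


def resolve_coordinator(process_id: int, path: Path, port: int) -> str:
    """Process 0 becomes the coordinator; every other process looks it up."""
    if process_id == 0:
        return publish_coordinator_address(path, port)
    return wait_for_coordinator_address(path)


def select_restored_state(restored, emergency_enabled: bool):
    """Use the restored object directly with the emergency manager, else its "items" entry."""
    return restored if emergency_enabled else restored["items"]
```

The address returned by `resolve_coordinator` would then presumably be passed as `coordinator_address` to `jax.distributed.initialize`, in place of the coordinator supplied by XPK.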

@xuefgu
Collaborator

xuefgu commented Jul 25, 2024

Can you please resolve the conflicts first? I see that the last update was from 2 weeks ago.
