[Fabric Lightning] Named barriers #20027

Open
tesslerc opened this issue Jun 28, 2024 · 1 comment
Labels: distributed (Generic distributed-related topic), feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

tesslerc commented Jun 28, 2024

Description & Motivation

To prevent ranks from losing alignment due to user error, it would be beneficial to have named barriers in Lightning Fabric, so that ranks can move forward only if they have all reached a barrier with the same name.

Pitch

For example:

if fabric.global_rank == 0:
  fabric.barrier("rank_0")
else:
  fabric.barrier("not_rank_0")

This will fail because the ranks are waiting at differently named barriers; upon timeout, each rank will raise an error naming the barrier at which it is held up.

Compare this with the current behavior under user error: incorrect logic can send the ranks down different paths, where each reaches some other, unrelated barrier. Those barriers then pair up, and the whole flow continues even though the ranks are out of alignment.
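One way to picture the semantics pitched above is a small thread-based simulation in which each thread stands in for a rank. This is only a sketch under stated assumptions: `NamedBarrier`, `BarrierMismatchError`, and `run_ranks` are hypothetical names and not part of Lightning Fabric. Here the ranks exchange their barrier names at the rendezvous, so a mismatch is reported right away rather than only on timeout; a timeout is still raised if some rank never arrives.

```python
import threading


class BarrierMismatchError(RuntimeError):
    """Raised when ranks meet at barriers with different names (hypothetical)."""


class NamedBarrier:
    """Thread-based stand-in for a named collective barrier (not Fabric API)."""

    def __init__(self, world_size, timeout=5.0):
        self._names = [None] * world_size
        self._sync = threading.Barrier(world_size)
        self._timeout = timeout

    def _rendezvous(self, rank, name):
        try:
            self._sync.wait(self._timeout)
        except threading.BrokenBarrierError:
            # Some rank never arrived: report where this rank is held up.
            raise BarrierMismatchError(
                f"rank {rank} timed out at barrier {name!r}"
            ) from None

    def wait(self, rank, name):
        self._names[rank] = name
        self._rendezvous(rank, name)  # every rank has published its name
        seen = tuple(self._names)
        self._rendezvous(rank, name)  # every rank has read before any overwrite
        if len(set(seen)) > 1:
            raise BarrierMismatchError(
                f"rank {rank} is at barrier {name!r}, but ranks are at {seen}"
            )


def run_ranks(world_size, body):
    """Run body(rank, barrier) in one thread per rank; return per-rank errors."""
    barrier = NamedBarrier(world_size)
    errors = [None] * world_size

    def target(rank):
        try:
            body(rank, barrier)
        except BarrierMismatchError as exc:
            errors[rank] = str(exc)

    threads = [threading.Thread(target=target, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def pitch_example(rank, barrier):
    # Mirrors the snippet from the pitch: the two branches use different names.
    if rank == 0:
        barrier.wait(rank, "rank_0")
    else:
        barrier.wait(rank, "not_rank_0")


def aligned_example(rank, barrier):
    barrier.wait(rank, "step")


errors = run_ranks(2, pitch_example)
for message in errors:
    print(message)
```

In this simulation both ranks report an error for the mismatched case, while `run_ranks(2, aligned_example)` completes with no errors. A real implementation would need a collective to exchange the names across processes, which this sketch does not attempt.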

A case that is likely to recur involves fabric.save. It is not obvious to new users (who don't dig into the documentation) that it must be called on all ranks, because it makes its own internal barrier call.

A typical mistake would be to write:

if fabric.global_rank == 0:
  fabric.save(...)  # only rank 0 reaches save's internal barrier
fabric.barrier()

do_training_stuff()

fabric.barrier()

In this case, rank 0 will start to lag behind, as it performs one more barrier call than the other ranks.
If fabric.save called fabric.barrier("save") internally, the above program would instead exit with a message reporting the alignment issue.
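The off-by-one effect can be checked with a toy trace of the barriers each rank reaches, paired up in order. This is an illustrative sketch only: `first_misalignment` is a hypothetical helper, and it assumes fabric.save contributes exactly one internal barrier on the rank that calls it.

```python
from itertools import zip_longest


def first_misalignment(traces):
    """Return (step, names) for the first step where ranks disagree, else None.

    traces[r] is the ordered list of barrier names that rank r reaches.
    A rank that has run out of barriers shows up as None at that step,
    meaning its peers would be left waiting.
    """
    for step, names in enumerate(zip_longest(*traces)):
        if len(set(names)) > 1:
            return step, names
    return None


# Unnamed barriers: save's internal barrier is indistinguishable from the rest.
unnamed = [
    ["barrier", "barrier", "barrier"],  # rank 0: save + two explicit barriers
    ["barrier", "barrier"],             # rank 1: two explicit barriers
]

# Named barriers: save's internal barrier carries the name "save".
named = [
    ["save", "barrier", "barrier"],
    ["barrier", "barrier"],
]

print(first_misalignment(unnamed))  # (2, ('barrier', None))
print(first_misalignment(named))    # (0, ('save', 'barrier'))
```

With unnamed barriers the mismatch only surfaces at the very last barrier, where rank 0 waits for a peer that has already finished. With names, the problem is visible at the first rendezvous, right where the mistake was made.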

Alternatives

No response

Additional context

#19780

cc @Borda @awaelchli

tesslerc added the feature (Is an improvement or enhancement) and needs triage (Waiting to be triaged by maintainers) labels Jun 28, 2024
@awaelchli
Member

Sounds good to me! We do it in some places but not consistently everywhere. Happy to receive a PR for this if you're interested.

awaelchli added the help wanted (Open to be worked on) and distributed (Generic distributed-related topic) labels and removed the needs triage (Waiting to be triaged by maintainers) label Jun 28, 2024