[Fabric Lightning] Named barriers #20027
Labels
distributed
Generic distributed-related topic
feature
Is an improvement or enhancement
help wanted
Open to be worked on
Description & Motivation
To prevent ranks from losing alignment due to user error, it would be beneficial to have named barriers in Lightning that allow ranks to move forward only if all of them have reached a barrier with the same name.
Pitch
For example, a program in which the ranks wait at barriers with different names will fail, and upon timeout each rank will raise an error naming the barrier at which it is held up.
By contrast, with unnamed barriers, incorrect logic can send the ranks down different paths, where each reaches some other barrier, which in turn lets the whole flow continue unnoticed.
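The requested semantics can be sketched without a cluster by letting threads stand in for ranks. This is only an illustration under stated assumptions: `NamedBarrier`, `BarrierError`, and `run_ranks` are hypothetical names, not part of the Fabric API, and a real implementation would exchange the names through the distributed backend rather than a shared list.

```python
import threading


class BarrierError(RuntimeError):
    """Raised when ranks disagree on the barrier name or a rank times out."""


class NamedBarrier:
    """Hypothetical named barrier; threads play the role of ranks."""

    def __init__(self, world_size, timeout=1.0):
        self._sync = threading.Barrier(world_size)
        self._names = [None] * world_size
        self._timeout = timeout

    def _sync_point(self, rank, name):
        try:
            self._sync.wait(timeout=self._timeout)
        except threading.BrokenBarrierError:
            raise BarrierError(f"rank {rank} timed out at barrier {name!r}") from None

    def wait(self, rank, name):
        self._names[rank] = name
        self._sync_point(rank, name)  # every rank has published its name
        seen = list(self._names)
        self._sync_point(rank, name)  # every rank has taken its snapshot
        if len(set(seen)) > 1:
            raise BarrierError(
                f"rank {rank} is at barrier {name!r}, but ranks reported {seen}"
            )


def run_ranks(world_size, fn, timeout=1.0):
    """Run fn(rank, barrier) on one thread per rank and collect error messages."""
    barrier = NamedBarrier(world_size, timeout)
    errors = [None] * world_size

    def target(rank):
        try:
            fn(rank, barrier)
        except BarrierError as exc:
            errors[rank] = str(exc)

    threads = [threading.Thread(target=target, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def diverging(rank, barrier):
    # Incorrect logic sends the two ranks down different paths.
    barrier.wait(rank, "after_train" if rank == 0 else "after_eval")


errors = run_ranks(2, diverging)
for message in errors:
    print(message)
```

In this sketch both ranks raise instead of silently continuing, and each message names the barrier that rank was waiting at.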
An issue that will likely repeat itself involves `fabric.save`. It is not obvious to new users (who don't dig into the documentation) that this should be called on all ranks, as it implements its own internal barrier call. A typical mistake would be to call it only on rank 0.
In this case, rank 0 will start to lag behind as it performs an additional barrier call.
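The lag can be reproduced in miniature with threads standing in for ranks and an unnamed barrier. This is a simulation under assumptions, not Fabric code: `save` here is a hypothetical stand-in for a collective save with an internal barrier, and the labels are only logged, never compared across ranks.

```python
import threading

WORLD_SIZE = 2
sync = threading.Barrier(WORLD_SIZE)
log = []  # entries of (rank, label, outcome)
log_lock = threading.Lock()


def barrier(rank, label):
    """Unnamed barrier: the label is recorded for the log only."""
    try:
        sync.wait(timeout=0.5)
        outcome = "passed"
    except threading.BrokenBarrierError:
        outcome = "timed out"
    with log_lock:
        log.append((rank, label, outcome))


def save(rank):
    """Hypothetical stand-in for a save that contains its own barrier."""
    barrier(rank, "save")


def worker(rank):
    if rank == 0:  # the mistake: guarding a collective call by rank
        save(rank)
    barrier(rank, "epoch_end")


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for entry in sorted(log):
    print(entry)
```

Rank 0's internal "save" barrier is satisfied by rank 1's "epoch_end" barrier, so the mismatch goes unnoticed until rank 0 waits alone at its own "epoch_end" and times out.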
If `fabric.save` implemented `fabric.barrier("save")`, then such a program would exit, reporting that there is an alignment issue.
Alternatives
No response
Additional context
#19780
cc @Borda @awaelchli