[Bug] Potential risk of getting stuck in PipeFusion #310

Open
HOOLoLo opened this issue Oct 17, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@HOOLoLo

HOOLoLo commented Oct 17, 2024

I have submitted an issue to PyTorch, pytorch/pytorch#138074, which describes the problem, in the hope that they will add an interface for setting a custom stream for communication.

This problem hasn't occurred so far because, when the data size is less than 64MB, NCCL's send kernel completes without waiting for the recv kernel.

Do you guys know of any other solutions?
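
To make the size condition above concrete, here is a minimal sketch that estimates whether a tensor payload reaches the 64MB figure quoted in this comment. The threshold is taken from the comment at face value and is not verified here; the helper names and the example shapes are hypothetical and are not taken from xDiT.

```python
from math import prod

# Threshold quoted in the comment above (assumed, not verified against NCCL).
THRESHOLD_BYTES = 64 * 1024 * 1024


def payload_bytes(shape, bytes_per_element=2):
    """Size in bytes of a dense tensor; the default of 2 assumes fp16/bf16."""
    return prod(shape) * bytes_per_element


def reaches_threshold(shape, bytes_per_element=2, threshold=THRESHOLD_BYTES):
    """True if a single send of this tensor would be at or above the threshold."""
    return payload_bytes(shape, bytes_per_element) >= threshold


# Hypothetical latent shapes (batch, tokens, hidden), for illustration only.
small = (1, 4096, 1152)    # 9,437,184 bytes, well under 64MB
large = (1, 32768, 1152)   # 75,497,472 bytes, above 64MB

print(reaches_threshold(small), reaches_threshold(large))
```

A check like this only indicates when a payload is in the size range where the comment expects the send to block; it says nothing about whether a hang actually occurs.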

@HOOLoLo
Author

HOOLoLo commented Oct 17, 2024

@feifeibear Could you help me double-check this logic? I am not very familiar with this project.

@feifeibear
Collaborator

The code snippet in your issue is very helpful. Could you also give us a run script that reproduces the error in xDiT? Also, what kind of GPU cluster are you using?

@HOOLoLo
Author

HOOLoLo commented Oct 28, 2024

@feifeibear Sorry, I have been busy recently. The error is hard to reproduce on GPU: the only way I can make patch_latent larger is to increase the output image size, and making the image large enough to trigger the error runs out of memory first.
I came up with an idea: pair up the ranks for send and recv and create a separate group for each pair, so that a rank's recv does not wait on the same rank's send. Here is a demo picture:
{cdc3f97d-4f0a-435a-a5dc-35e966b31b65}
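
The pairing idea above can be sketched in plain Python. This is an illustration under assumptions, not the project's implementation: it assumes a ring-shaped pipeline in which rank r sends to rank (r + 1) % world_size, and the function names are hypothetical.

```python
def ring_pairs(world_size):
    """Return one (sender, receiver) pair per edge of an assumed ring pipeline."""
    if world_size < 2:
        raise ValueError("at least two ranks are needed to form a pair")
    return [(r, (r + 1) % world_size) for r in range(world_size)]


def pairs_for_rank(rank, world_size):
    """Return (send_pair, recv_pair) for one rank.

    send_pair is the edge this rank sends on; recv_pair is the edge it
    receives on. They are distinct pairs, so each could get its own group.
    """
    pairs = ring_pairs(world_size)
    send_pair = pairs[rank]
    recv_pair = pairs[(rank - 1) % world_size]
    return send_pair, recv_pair


print(ring_pairs(4))
print(pairs_for_rank(0, 4))
```

With torch.distributed, each pair could then be turned into its own process group via torch.distributed.new_group(ranks=[...]), which every process has to call in the same order even for groups it does not belong to. Whether separate groups are enough to avoid the hang in PipeFusion is the open question of this thread, so treat this only as a starting point.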

@feifeibear
Collaborator

num_pipeline_patch cannot be set too large. For example, I sometimes see it get stuck when it is set to 16.
I have not dug into this problem; my guess is that it is caused by too many async P2P operations.

feifeibear added the bug label Nov 26, 2024
@feifeibear
Collaborator

The problem has been fixed with the appropriate NCCL environment settings!

@HOOLoLo
Author

HOOLoLo commented Dec 4, 2024

> The problem has been fixed with the appropriate NCCL environment settings!

@feifeibear Hi, can you tell me which NCCL environment variables should be set to solve the problem?

@feifeibear
Collaborator

You can try export NCCL_DEBUG='INFO' to get more information, then check whether the log contains something like 'via SHM/direct/direct'. If it does, try export NCCL_SHM_DISABLE='1' before running the scripts.

@HOOLoLo Let me know if it still gets stuck.
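
The check described above can be sketched as a small helper that scans NCCL_DEBUG=INFO output for the shared-memory transport marker and, if found, sets NCCL_SHM_DISABLE. The function names and the sample log lines are hypothetical; only the two environment variables and the 'via SHM' marker come from this comment.

```python
import os


def uses_shm_transport(log_lines):
    """True if any NCCL debug line reports a connection 'via SHM'."""
    return any("via SHM" in line for line in log_lines)


def apply_shm_workaround(log_lines, env=None):
    """Set NCCL_SHM_DISABLE=1 in env if the log shows the SHM transport.

    env defaults to os.environ; pass a dict to try it without side effects.
    Returns True if the variable was set.
    """
    env = os.environ if env is None else env
    if uses_shm_transport(log_lines):
        env["NCCL_SHM_DISABLE"] = "1"
        return True
    return False


# Illustrative log lines only; real NCCL output carries more prefix fields.
sample_log = [
    "NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct",
    "NCCL INFO Connected all rings",
]
trial_env = {}
print(apply_shm_workaround(sample_log, trial_env), trial_env)
```

NCCL reads its environment variables when it initializes, so a change like this only takes effect if it is made before the process group is created, for example in the launch script of a fresh run.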

feifeibear reopened this Dec 4, 2024