Bug with 3D convolutions #8

Open
EiffL opened this issue May 27, 2021 · 1 comment
Labels
bug (Something isn't working), horovod (Issues related to the horovod backend)

Comments

EiffL commented May 27, 2021

We hit a strange deadlock when trying to run a simple 3D convolution with blocks, most likely caused by the halo exchange:

all_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
1: [shift/HorovodAllgather_shift_ExpandDims_0]
3: [shift_1/HorovodAllgather_shift_1_ExpandDims_0]
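
To make the mismatch easier to read, here is a minimal sketch that parses the "Missing ranks" lines above into a rank-to-tensor mapping. It assumes each line has the form `rank: [tensor, ...]` and that the listed tensors are the ones that rank has not yet submitted. The helper name `parse_missing_ranks` is hypothetical and is not part of Horovod.

```python
import re

# Log excerpt copied from the stall warning in this issue.
LOG = """\
Missing ranks:
1: [shift/HorovodAllgather_shift_ExpandDims_0]
3: [shift_1/HorovodAllgather_shift_1_ExpandDims_0]
"""


def parse_missing_ranks(log):
    """Map each rank to the tensor names listed for it (hypothetical helper).

    Assumes lines of the form "<rank>: [<name>, <name>, ...]".
    """
    missing = {}
    for line in log.splitlines():
        match = re.match(r"\s*(\d+):\s*\[(.*)\]\s*$", line)
        if match:
            names = [n.strip() for n in match.group(2).split(",") if n.strip()]
            missing[int(match.group(1))] = names
    return missing


missing = parse_missing_ranks(LOG)
for rank, names in sorted(missing.items()):
    print(f"rank {rank} is listed for: {', '.join(names)}")
```

Here ranks 1 and 3 are listed against two different allgather ops (`shift` and `shift_1`), which would be consistent with the ranks not all issuing the same collectives, though this alone does not confirm the cause.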

@b-remy can you document here exactly how this happened? We'll need to sort it out....

EiffL added the horovod and bug labels May 27, 2021
tobias-liaudat commented

Leaving two comments as I'm online right now.

To reproduce the error one can run this script with this job.

After finding the deadlock, we replaced the 3D convolution with a dense layer, and it worked fine in this test script and this job.
