Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mpi: fix setting local ranks #432

Merged
merged 1 commit into from
Apr 22, 2024
Merged

mpi: fix setting local ranks #432

merged 1 commit into from
Apr 22, 2024

Conversation

csegarragonz
Copy link
Collaborator

In very constrained environments like GHA it may sometimes happen that some ranks completely finish (i.e. call destroy) before others have even called Init.

With our current implementation, this meant that the world may have been removed from the registry, even though there were still active local ranks to run.

To fix this, we set the number of active local ranks once, at the beginning, and decrement it on a per-thread basis. Note that this does not invaliate the fix to the previous race condition because, in fact, what we fixed is that we now always init a world when we execute in it.

Copy link

codecov bot commented Apr 21, 2024

Codecov Report

Attention: Patch coverage is 81.48148% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 81.93%. Comparing base (5b65b3e) to head (6666cff).

Files Patch % Lines
src/mpi/MpiWorld.cpp 78.94% 4 Missing ⚠️
src/executor/Executor.cpp 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #432      +/-   ##
==========================================
+ Coverage   81.83%   81.93%   +0.10%     
==========================================
  Files         115      115              
  Lines        7691     7702      +11     
==========================================
+ Hits         6294     6311      +17     
+ Misses       1397     1391       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

In very constrained environments like GHA it may sometimes happen that
some ranks _completely_ finish (i.e. call destroy) before others have
even called Init.

With our current implementation, this meant that the world may have been
removed from the registry, even though there were still active local
ranks to run.

To fix this, we set the number of active local ranks once, at the
beginning, and decrement it on a per-thread basis. Note that this does
not invaliate the fix to the previous race condition because, in fact,
what we fixed is that we now _always_ init a world when we execute in
it.
@csegarragonz csegarragonz merged commit 4db5ae3 into main Apr 22, 2024
11 of 12 checks passed
@csegarragonz csegarragonz deleted the fix-destroy branch April 22, 2024 15:44
@lgarithm
Copy link
Contributor

---------------------------------------------------------------------------------------------------------
local/S       local/L       remote/S      remote/L      commit                                         
---------------------------------------------------------------------------------------------------------
0.294         8.855         0.020         8.665         mpi                                            
0.130 (0.44)  5.470 (0.62)  0.008 (0.40)  4.254 (0.49)  5b65b3e77d5726d6e22748db930e75642ed7672a # main
0.139 (0.47)  5.480 (0.62)  0.008 (0.40)  4.227 (0.49)  4db5ae32160aa1299d920c17b4196716d1ba29f7 # main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants