
[Enhancement] Support gather operation in NCCL backend #1061

Closed

Conversation

sh0622-kim (Contributor)

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

I wanted to help with the work in #916.

Modification

Supports the gather operation for the NCCL backend.
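
For context, a minimal usage sketch (it assumes `mmengine.dist.gather` follows the calling convention of the library's other collectives such as `all_gather`; the script name is hypothetical):

```python
# Launch with e.g.: torchrun --nproc_per_node=2 gather_demo.py
import torch
from mmengine import dist

dist.init_dist(launcher='pytorch', backend='nccl')

data = torch.tensor([dist.get_rank()], device='cuda')
# Gather every rank's tensor onto rank 0; other ranks get an empty list.
gathered = dist.gather(data, dst=0)
if dist.get_rank() == 0:
    print(gathered)  # e.g. [tensor([0]), tensor([1])]
```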

Checklist

  1. Pre-commit or other linting tools are used to fix potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification could affect downstream projects, this PR should be tested with them, e.g. MMDet or MMCls.
  4. The documentation has been modified accordingly, e.g. docstrings or example tutorials.

@HAOCHENYE (Collaborator) left a comment:

Hi, thanks for your contribution! We should update the unit tests here to verify that the modification works as expected.

Besides, PyTorch has supported gather in the NCCL backend since version 1.11, and we should take that into account.

Add a PyTorch version condition, e.g.:

```python
# digit_version/TORCH_VERSION are mmengine.utils helpers (assumed here).
if digit_version(TORCH_VERSION) >= digit_version('1.11.0'):
    torch_dist.gather(data, gather_list, dst, group)
else:
    if get_rank(group) == dst:
        gather_list = torch.cuda.comm.gather(data, dst, group)
```
@zhouzaida (Collaborator) commented on Apr 10, 2023:

Hi, torch.cuda.comm.gather only supports the single-node case. Can we use all_gather to implement it as a workaround?
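
To illustrate the distinction (an illustrative snippet, not from the PR): torch.cuda.comm.gather collects tensors that already live on the GPUs of one machine and involves no process group, so it cannot serve a multi-node gather.

```python
import torch

# Single-machine gather: concatenate per-GPU tensors onto device 0.
# No process group is involved, hence no cross-node communication.
tensors = [torch.ones(2, device=f'cuda:{i}')
           for i in range(torch.cuda.device_count())]
gathered = torch.cuda.comm.gather(tensors, dim=0, destination=0)
print(gathered.device)  # cuda:0
```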

@HAOCHENYE (Collaborator):

Hi @sh0622-kim, you can use all_gather to replace torch.cuda.comm.gather when the PyTorch version is earlier than 1.11.0.

@HAOCHENYE added this to the 0.7.4 milestone on Apr 23, 2023.
```python
all_gather_list = all_gather(data, group)
if get_rank(group) == dst:
    gather_list = all_gather_list
else:
    gather_list = []
```
A collaborator left a comment:

all_gather should be called at all ranks, otherwise the program will block. We should only return the gathered list at the main rank and return an empty list at the other ranks.
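
Putting the two review comments together, a minimal sketch of the version-gated fallback (assumptions: `all_gather`, `get_rank`, and `get_world_size` are the mmengine.dist helpers, and `digit_version`/`TORCH_VERSION` come from mmengine.utils; this is not the PR's exact diff):

```python
import torch
import torch.distributed as torch_dist

from mmengine.dist import all_gather, get_rank, get_world_size
from mmengine.utils import digit_version
from mmengine.utils.dl_utils import TORCH_VERSION


def gather(data, dst=0, group=None):
    """Gather ``data`` from every rank onto rank ``dst``."""
    if digit_version(TORCH_VERSION) >= digit_version('1.11.0'):
        # NCCL supports gather natively from PyTorch 1.11 onward.
        if get_rank(group) == dst:
            gather_list = [torch.empty_like(data)
                           for _ in range(get_world_size(group))]
        else:
            gather_list = None
        torch_dist.gather(data, gather_list, dst, group)
        return gather_list if gather_list is not None else []
    # Older PyTorch: all_gather must run on *every* rank, otherwise the
    # collective blocks; only ``dst`` keeps the gathered result.
    gather_list = all_gather(data, group)
    return gather_list if get_rank(group) == dst else []
```

Returning an empty list at non-destination ranks keeps the call collective-safe while matching the behavior the reviewer describes.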

codecov bot commented Sep 26, 2024

Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Please upload a report for BASE (main@8bf1eca).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| mmengine/dist/dist.py | 0.00% | 6 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1061   +/-   ##
=======================================
  Coverage        ?   77.88%           
=======================================
  Files           ?      139           
  Lines           ?    11301           
  Branches        ?     2281           
=======================================
  Hits            ?     8802           
  Misses          ?     2104           
  Partials        ?      395           
| Flag | Coverage Δ |
| --- | --- |
| unittests | 77.88% <0.00%> (?) |

Flags with carried forward coverage won't be shown.


@sh0622-kim closed this on Sep 26, 2024.