custom allreduce + torch.compile #10121
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/distributed/parallel_state.py
Outdated
else:
    torch.distributed.all_reduce(input_, group=self.device_group)
assert pynccl_comm is not None
with pynccl_comm.change_state(enable=True,
we can change pynccl to be always enabled.
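For context, a minimal sketch of what the simplified all-reduce dispatch could look like once pynccl is always enabled and no `change_state` context is needed. The attribute and method names here (`ca_comm`, `pynccl_comm`, `device_group`, `should_custom_ar`) follow the snippet above but are assumptions, not the exact vLLM code:

```python
import torch
import torch.distributed as dist


def all_reduce(self, input_: torch.Tensor) -> torch.Tensor:
    # Sketch only: try the custom all-reduce kernel first, then fall back
    # to pynccl (assumed always enabled), then to torch.distributed.
    ca_comm = self.ca_comm
    if (ca_comm is not None and not ca_comm.disabled
            and ca_comm.should_custom_ar(input_)):
        return ca_comm.custom_all_reduce(input_)
    if self.pynccl_comm is not None and not self.pynccl_comm.disabled:
        # out-of-place pynccl all-reduce: returns a new tensor
        return self.pynccl_comm.all_reduce(input_)
    # last resort: in-place torch.distributed all-reduce
    dist.all_reduce(input_, group=self.device_group)
    return input_
```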
Signed-off-by: youkaichao <[email protected]>
should work now for profiling size and for decode size
@SageMoore thanks for your pioneering investigation!
# TODO: pynccl should not use `stream=`
# it can just always use the current stream.
out = pynccl_comm.all_reduce(input_,
                             stream=torch.cuda.current_stream())
I was a little confused about what this TODO meant, so I had to dig a bit. It looks like PyNcclCommunicator creates a new stream in its __init__ method and uses it by default, so we always have to pass in the current stream. Do you know why it behaves this way?
Mostly historical. We can remove it, but I don't want to do it in this PR.
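To illustrate the historical behavior being discussed, here is a toy communicator (not the real PyNcclCommunicator) that creates its own CUDA stream in __init__ and defaults to it, which is why call sites currently pass stream=torch.cuda.current_stream() explicitly:

```python
import torch


class ToyNcclCommunicator:
    """Hypothetical stand-in for the pattern discussed above."""

    def __init__(self, device: torch.device):
        self.device = device
        # Historical quirk: the communicator owns a private stream and
        # uses it by default instead of the caller's current stream.
        self.stream = torch.cuda.Stream(device=device)

    def all_reduce(self, tensor: torch.Tensor,
                   stream: "torch.cuda.Stream | None" = None) -> torch.Tensor:
        # If the caller omits stream=, the collective is enqueued on the
        # private stream, which then needs extra synchronization with work
        # already queued on the current stream.
        stream = self.stream if stream is None else stream
        out = torch.empty_like(tensor)
        # ... the actual ncclAllReduce call on `stream.cuda_stream` goes here ...
        return out


# Hence the call site above passes the current stream explicitly:
# out = comm.all_reduce(x, stream=torch.cuda.current_stream())
```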
I completely agree.
This looks good to me!
Signed-off-by: youkaichao <[email protected]>
Thanks for the help getting this over the line!
Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]>
This PR changes the pynccl all-reduce to be out-of-place and removes support for torch.distributed's all-reduce.
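A rough sketch of the out-of-place calling convention this summary describes; the wrapper name and signature are illustrative, not the actual vLLM API. The point is that the caller consumes the returned tensor instead of relying on input_ being mutated in place, which composes better with torch.compile:

```python
import torch


def tensor_parallel_all_reduce(input_: torch.Tensor, pynccl_comm) -> torch.Tensor:
    # Out-of-place all-reduce: pynccl returns a fresh tensor holding the
    # reduced result; the input tensor is left untouched.
    out = pynccl_comm.all_reduce(input_)
    assert out is not None
    return out


# Callers now use the return value:
# hidden_states = tensor_parallel_all_reduce(hidden_states, pynccl_comm)
```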