
[Kernels] Add an inductor pass to rewrite and fuse collective communication ops with gemms #9886

Open · bnellnm wants to merge 72 commits into main from collective-fusion

Conversation

@bnellnm (Contributor) commented Oct 31, 2024

Add an inductor pass to rewrite and fuse collective communication ops with gemms

See #9883 for a version that includes the llama hacks.
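
Roughly, the pass pattern-matches tensor-parallel gemm → all_reduce → rms_norm → gemm chains in the fx graph and rewrites them to use fused GEMM+collective kernels (flux's GEMM+reduce-scatter and all-gather+GEMM). A conceptual before/after sketch with hypothetical names, not the actual pattern code:

# Before: each rank computes a partial GEMM result, then all-reduces it.
#   y   = x @ w1                                  # partial sum on each rank
#   y   = tensor_model_parallel_all_reduce(y)     # full result everywhere
#   h   = rms_norm(y + residual, rms_norm_weight)
#   out = h @ w2
#
# After: the collectives are fused into the GEMMs, so each rank only
# materializes its slice of the intermediate between the two GEMMs.
#   y_slice = gemm_reduce_scatter(x, w1)          # fused GEMM + reduce-scatter
#   h_slice = rms_norm(y_slice + residual_slice, rms_norm_weight)
#   out     = all_gather_gemm(h_slice, w2)        # fused all-gather + GEMM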

TODO:

cc @tlrmchlsmth, @ProExpertProg, @SageMoore, @youkaichao

Requires a special config to run:

import torch
from vllm import LLM
from vllm.config import CompilationConfig

config = CompilationConfig(
    level=3,
    custom_ops=["+rms_norm"],
    splitting_ops=[],
)

# model, eager, tp_size and custom_ar are set elsewhere in the benchmark script.
llm = LLM(model=model,
          enforce_eager=eager,
          tensor_parallel_size=tp_size,
          disable_custom_all_reduce=not custom_ar,
          dtype=torch.float16,
          max_num_batched_tokens=2048,
          compilation_config=config)
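
(Presumably the +rms_norm entry keeps the custom rms_norm op in the graph so the pattern matcher can find it, and the empty splitting_ops list stops piecewise compilation from splitting the graph at the collectives, so they remain visible to the pass.)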

Some benchmark results:

model = meta-llama/Llama-3.1-70B-Instruct
tp_size = 4
chunked prefill size = 2048
batch_size = 1
input_len=2048
output_len=1
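
For reference, a run along these lines could be reproduced with something like vLLM's benchmarks/benchmark_latency.py (hypothetical invocation; the PR's actual harness may differ):

python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --input-len 2048 \
    --output-len 1 \
    --batch-size 1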
Eager mode + torch.compile

Avg latency: 0.16625802051508798 seconds
10% percentile latency: 0.16468927392270416 seconds
25% percentile latency: 0.16511811560485512 seconds
50% percentile latency: 0.16571794101037085 seconds
75% percentile latency: 0.16671031567966565 seconds
90% percentile latency: 0.1675790420267731 seconds
99% percentile latency: 0.17226817809045325 seconds

Eager mode + torch.compile + flux

Avg latency: 0.1583265809295699 seconds
10% percentile latency: 0.15630255101714283 seconds
25% percentile latency: 0.15688058221712708 seconds
50% percentile latency: 0.15789097198285162 seconds
75% percentile latency: 0.15932484721997753 seconds
90% percentile latency: 0.16147575441282241 seconds
99% percentile latency: 0.16223905643215403 seconds

cudagraphs + torch.compile

Avg latency: 0.17894838895183057 seconds
10% percentile latency: 0.17591054290533065 seconds
25% percentile latency: 0.176349236513488 seconds
50% percentile latency: 0.17722250788938254 seconds
75% percentile latency: 0.17862555047031492 seconds
90% percentile latency: 0.18074012212455273 seconds
99% percentile latency: 0.2171030258946121 seconds

cudagraphs + torch.compile + flux

Avg latency: 0.17262270329520107 seconds
10% percentile latency: 0.17164990142919123 seconds
25% percentile latency: 0.17196793673792854 seconds
50% percentile latency: 0.1724927049363032 seconds
75% percentile latency: 0.1730666920193471 seconds
90% percentile latency: 0.17406681017018855 seconds
99% percentile latency: 0.1758251654729247 seconds


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

mergify bot commented Oct 31, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @bnellnm, please rebase it.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@tlrmchlsmth (Collaborator) left a comment

looking forward to this one!

bnellnm force-pushed the collective-fusion branch 2 times, most recently from 0a1f637 to 1c9d79c (November 8, 2024 23:36)
mergify bot removed the needs-rebase label (Nov 8, 2024)
bnellnm marked this pull request as ready for review (November 9, 2024 23:10)
mergify bot commented Nov 11, 2024 (and again on Nov 25 and Nov 26)

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label (Nov 26, 2024)
tlrmchlsmth and others added 6 commits November 26, 2024 19:49 (all signed off by Bill Nell <[email protected]>)


# Note: this heuristic is unique to flux
def use_cc_kernels(m_shape: int, n_slices: Optional[int] = None) -> bool:

Maybe add _flux at the end of the function name to make the note clear?

Comment on lines +27 to +47
def find_fn(nodes: Iterable[fx.Node], op) -> Optional[fx.Node]:
    for node in nodes:
        if node.op == "call_function" and node.target == op:
            return node
    return None


def find_auto_fn(nodes: Iterable[fx.Node], op) -> Optional[fx.Node]:
    for node in nodes:
        if (node.op == "call_function" and node.target == auto_functionalized
                and node.args[0] == op):
            return node
    return None


def find_getitem(node: fx.Node, idx: int) -> Optional[fx.Node]:
    for user in node.users:
        if (user.op == "call_function" and user.target == operator.getitem
                and user.args[1] == idx):
            return user
    return None

These should be available from fx_utils after #10906.
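
For context, these helpers just do linear scans over fx nodes. A minimal, self-contained usage sketch on a toy traced function (nothing vLLM-specific assumed):

import torch
import torch.fx as fx

def toy(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) + 1

gm = fx.symbolic_trace(toy)

# find_fn returns the first call_function node whose target is torch.relu.
relu_node = find_fn(gm.graph.nodes, torch.relu)
assert relu_node is not None and relu_node.op == "call_function"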

FLUX_TILE_SIZE: int = 128


def use_cc_kernels(m_shape: int) -> bool:

Why is there a separate function with the same name? The other one is flux-only?


Also what does use_cc_kernels even mean?

device_group = group.device_group
rank = group.rank_in_group

if use_flux:

Could we maybe use a better abstraction than if statements based on use_flux?

    rms_norm_weights: torch.Tensor,
    gemm_2_weights: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    gemm_1_w_perm = torch.ops.aten.permute.default(gemm_1_weights, [1, 0])

Does the permutation need to be in the match? As in the replacement won't be permuted?

Comment on lines +402 to +411
fused_node = graph.call_function(fused_gemm_func, kwargs=kwargs)

graph.inserting_after(fused_node)
result_node_new = graph.call_function(operator.getitem, (fused_node, 0))
residual_node_new = graph.call_function(operator.getitem, (fused_node, 1))
my_residual_node_new = graph.call_function(operator.getitem, (fused_node, 2))

I think multi-output match has a utility that emits a function and tuple accessors.

Comment on lines +412 to +413
res_replacements.append(residual_node_new)
my_res_replacements.append(my_residual_node_new)

Any reason we save all of the residuals instead of just the previous one?

raise ValueError("No nodes in graph")


def dump_graph(pass_config, graph: fx.Graph, name: str) -> None:

I think this is going to get phased out in favor of @youkaichao's depyf

Comment on lines +2425 to +2438
if self.compilation_config.pass_config.enable_collective_fusion:
    n_slices = self.parallel_config.world_size
    max_tokens = self.scheduler_config.max_num_batched_tokens
    if not use_cc_kernels(int(max_tokens / n_slices), n_slices):
        logger.info(
            ("Disabling collective fusion pass since chunked prefill "
             "size %d is too small."), max_tokens)
        self.compilation_config.pass_config.enable_collective_fusion = \
            False
    if n_slices == 1:
        logger.info("Disabling collective fusion pass since tensor "
                    "parallelism is not enabled.")
        self.compilation_config.pass_config.enable_collective_fusion = \
            False

Why does this only live under V1? Shouldn't it also happen for V0?


(so maybe put this under PassConfig.__post_init__)
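
A rough sketch of that suggestion, assuming PassConfig is a dataclass and the relevant sizes can be threaded into it (the field names here are hypothetical):

from dataclasses import dataclass

@dataclass
class PassConfig:
    enable_collective_fusion: bool = True
    world_size: int = 1                  # hypothetical: from parallel_config
    max_num_batched_tokens: int = 2048   # hypothetical: from scheduler_config

    def __post_init__(self):
        # Disable the fusion pass up front when it cannot help.
        if self.world_size == 1 or not use_cc_kernels(
                self.max_num_batched_tokens // self.world_size,
                self.world_size):
            self.enable_collective_fusion = False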

Comment on lines +388 to +389
if gemm_1 is None or gemm_2 is None:
    raise ValueError("Missing 'val' in gemm weights meta data")

Wouldn't it be simpler if you just do meta["val"]?

mergify bot commented Dec 19, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label (Dec 19, 2024)