add knobs control inner dim unroll and outer dim unroll in pointwise scheduler #3275

liqiangxl · 2024-10-25T16:05:38Z

What's in this PR?
(1) Added two knobs to control unrolling in the inner dim and the outer dim of the pointwise scheduler.
(2) The original unroll knob, which applied to the outer dim, is removed.
(3) Extended test UnrollOnTopOfVectorize to cover 8 different combinations of vectorization, inner unroll, and outer unroll.
(4) Neither inner unroll nor outer unroll is used in the heuristics. They are always 1 unless vectorization == 1, in which case inner unroll is used.
(5) If the inner or outer unroll factor == 1, we won't split out an additional domain of size 1.

Why?
These two knobs allow more performance optimizations, e.g. unrolling in different dims based on broadcast dims.
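
A minimal sketch of item (5), assuming a loop domain can be modeled as a plain list of extents. The names `Extents`, `split`, `splitIfUnrolled`, and `numAxesAfterUnroll` are hypothetical and are not nvFuser APIs. It only illustrates that an unconditional split by 1 adds an axis of extent 1, while the guarded split leaves the domain untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a loop domain: one extent per axis.
using Extents = std::vector<int64_t>;

int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Replace `axis` with two axes: [ceilDiv(extent, factor), factor].
void split(Extents& dom, std::size_t axis, int64_t factor) {
  assert(axis < dom.size() && factor >= 1);
  dom[axis] = ceilDiv(dom[axis], factor);
  dom.insert(
      dom.begin() + static_cast<Extents::difference_type>(axis) + 1, factor);
}

// Guarded variant: only split when the unroll factor is greater than 1.
void splitIfUnrolled(Extents& dom, std::size_t axis, int64_t unroll_factor) {
  if (unroll_factor > 1) {
    split(dom, axis, unroll_factor);
  }
}

// Number of axes after applying one unroll knob to a single-axis domain.
// `always_split` selects the unconditional split instead of the guarded one.
std::size_t numAxesAfterUnroll(
    int64_t extent,
    int64_t unroll_factor,
    bool always_split) {
  Extents dom{extent};
  if (always_split) {
    split(dom, 0, unroll_factor);
  } else {
    splitIfUnrolled(dom, 0, unroll_factor);
  }
  return dom.size();
}
```

With an unroll factor of 1, `numAxesAfterUnroll(1024, 1, true)` yields two axes (the second has extent 1) while `numAxesAfterUnroll(1024, 1, false)` keeps a single axis.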

liqiangxl · 2024-10-25T17:45:30Z

!build

liqiangxl · 2024-10-25T17:47:42Z

!build

liqiangxl · 2024-10-25T19:20:17Z

!build --diff-bench --diff

liqiangxl · 2024-10-26T14:06:21Z

!build --diff-bench --diff

liqiangxl · 2024-10-27T01:46:00Z

!build --diff-bench --diff

liqiangxl · 2024-10-27T15:49:42Z

!build --diff-bench --diff

liqiangxl · 2024-10-27T16:04:49Z

!build --diff-bench --diff

liqiangxl · 2024-10-28T00:39:09Z

(1) diffs in nvfuser-ci/jit_codegen_diff_bench_17_5/5 — Failing after 43 minutes https://nv/e2E/118807639
This is due to the change in this PR: if the inner or outer unroll factor == 1, we no longer split out an additional domain of size 1.
Before this PR, we split out an additional domain even when the unroll factor == 1.

Before: T4_l_float[ iblockIdx.x41{( ceilDiv(i2, blockDim.x) )}, iblockIdx.y47{( ceilDiv(( ceilDiv(( ceilDiv(i0, 1) ), 1) ), blockDim.y) )}, ithreadIdx.y48{blockDim.y}, iUS46{1}, iS44{1}, ithreadIdx.x42{blockDim.x} ]
After: T4_l_float[ iblockIdx.x35{( ceilDiv(i2, blockDim.x) )}, iblockIdx.y39{( ceilDiv(( ceilDiv(i0, 1) ), blockDim.y) )}, ithreadIdx.y40{blockDim.y}, iUS38{1}, ithreadIdx.x36{blockDim.x} ]
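
As a sanity check on the two dumps, the nested `ceilDiv` expressions for the `blockIdx.y` extent can be evaluated directly. This is a small sketch, assuming positive extents. `gridYBefore` and `gridYAfter` are hypothetical helper names that only mirror the printed expressions. Each `ceilDiv(x, 1)` is the identity, so dropping one unit split does not change the grid size.

```cpp
#include <cassert>
#include <cstdint>

// Integer ceiling division for positive operands.
int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0);
  return (a + b - 1) / b;
}

// blockIdx.y extent as printed before the PR:
// ceilDiv(ceilDiv(ceilDiv(i0, 1), 1), blockDim.y)
int64_t gridYBefore(int64_t i0, int64_t bdimy) {
  return ceilDiv(ceilDiv(ceilDiv(i0, 1), 1), bdimy);
}

// blockIdx.y extent as printed after the PR:
// ceilDiv(ceilDiv(i0, 1), blockDim.y)
int64_t gridYAfter(int64_t i0, int64_t bdimy) {
  return ceilDiv(ceilDiv(i0, 1), bdimy);
}
```

Both helpers return the same value for any valid input, which is consistent with the launch configuration being unchanged and only the loop structure differing.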

The additional domain iS44{1} no longer exists after this PR. This leads to a code change for the cpp benchmark case NvFuserScheduler_Broadcast_Inner_fp32/64/160/manual_time:

+    float T2[1];
+    T2[0]
+       = T5[0];
     float T4[1];
     T4[0] = 0;
     T4[0]
        = T0[i2];
-    float T2[1];
-    T2[0]
-       = T5[0];

(2) diffs in nvfuser-ci/jit_codegen_diff_17_5/7 — Failing after 7 minutes https://nv/e2E/118807632
Same reason as explained in (1)

(3) diffs in nvfuser-ci/jit_codegen_diff_17_6/7 — Failing after 58 minutes https://nv/e2E/118807633
Detected many test changes. This shouldn't happen, since the base kernels are generated from the current top of the main branch, [628a47e3 Pointwise shouldn't check transpose scheduler (#3256)]. Is this related to a recent change of the codediff script? @jacobhinkle

(4) diffs in nvfuser-ci/jit_codegen_diff_17_7/7 — Failing after 58 minutes https://nv/e2E/118807634
Same reason as explained in (3)

liqiangxl · 2024-10-28T00:42:01Z

!build --diff-bench --diff

jacobhinkle · 2024-10-29T20:42:14Z

(3) diffs in nvfuser-ci/jit_codegen_diff_17_6/7 — Failing after 58 minutes https://nv/e2E/118807633
Detected many test changes. This shouldn't happen, since the base kernels are generated from the current top of the main branch, [628a47e3 Pointwise shouldn't check transpose scheduler (#3256)]. Is this related to a recent change of the codediff script? @jacobhinkle

Yes, we have identified that it is a serde issue. @naoyam confirmed a fix in #3283, ~~we just need to turn that into a knob we can use inside CI~~ see PR #3304. See also #3265. cc @rdspring1

rdspring1 · 2024-10-30T16:42:03Z

csrc/python_frontend/python_bindings.cpp

@@ -640,7 +640,8 @@ void defineHeuristicParamBindings(py::module& nvfuser) {
      .PARAM(PointwiseParams, split_grid_y_dim)
      .PARAM(PointwiseParams, flip_grid_binding)
      .PARAM(PointwiseParams, vectorization_factor)
-      .PARAM(PointwiseParams, unroll_factor);
+      .PARAM(PointwiseParams, unroll_factor_inner)


rdspring1

LGTM.

You may want to change unroll_factor to unroll_factor_outer in https://github.com/NVIDIA/Fuser/blob/main/doc/dev/python_scheduling/autotune_pointwise.py#L92, so the script runs as-is?

jjsjann123 · 2024-10-30T17:48:21Z

csrc/scheduler/pointwise.cpp

-      reference_tv->split(0, pparams->unroll_factor);
-      // [o-remainder, Unroll| i-remainder, TIDx, Vect]
+      if (pparams->unroll_factor_inner > 1) {
+        reference_tv->split(1, pparams->unroll_factor_inner);


we are splitting on dimension 1? which is the TIDx here right?

This is for 2D scheduler, start with [outer dim, inner dim], so here dimension 1 is i-remainder in [0-outer | 1-i-remainder, 2-TIDx, 3-Vect]. i-remainder means what is left after splitting out other dims, e.g. Vect, TIDx

So this is a behavior change then.

If we look at the above commented code change, we are doing

-      reference_tv->split(0, pparams->unroll_factor);
-      // [o-remainder, Unroll| i-remainder, TIDx, Vect]
+      if (pparams->unroll_factor_inner > 1) {
+        reference_tv->split(1, pparams->unroll_factor_inner);

Which means the old behavior (outer unroll) is being updated to a default inner unroll instead?

Good point. We should assign unroll to the inner dim only when the scheduler is 1D; for 2D it should go to the outer dim.

    // for 1D scheduler, unroll the inner dimension
    // since there is no outer dimension.
    if (break_point == 0) {
      params->unroll_factor_inner = total_unroll;
      params->unroll_factor_outer = 1L;
    } else {
      // for 2D scheduler, unroll the outer dimension
      // to prioritize reuse across different rows, will
      // be revised in heuristics tuning, e.g. unroll different
      // dims based on the broadcast dimension.
      params->unroll_factor_inner = 1L;
      params->unroll_factor_outer = total_unroll;
    }

jjsjann123 · 2024-10-30T17:49:51Z

csrc/scheduler/pointwise.cpp

        max_vect_unroll_factor, params->vectorization_factor);
+    params->unroll_factor_inner = total_unroll;


IIUC, this PR shouldn't impose any functional changes. So I would expect all old use of params->unroll_factor to be replaced with params->unroll_factor_inner.

Yes. Here, all the unroll factors go to unroll_factor_inner through params->unroll_factor_inner = total_unroll;

jjsjann123 · 2024-10-30T17:50:49Z

csrc/scheduler/pointwise.cpp

+      if (pparams->unroll_factor_outer > 1) {
+        reference_tv->split(0, pparams->unroll_factor_outer);
+      }
+      // [o-remainder, o-Unroll| i-remainder, i-Unroll, TIDx, Vect]


I'm a bit lost about the notation here. What's o-Unroll | i-remainder?

o represents outer dim and i represents inner dim. | separates the outer dim and the inner dim. So here o-Unroll represents outer unroll and i-remainder means what is left in the inner dim after splitting out other domains, e.g. Vect, TIDx

I used o-Unroll and i-Unroll to distinguish between unroll in outer dim and inner dim.

ah, sorry I was totally not getting | part here. Now it reads clear to me.

Added some comments for clarity.

    // Here and in the following comments:
    // prefix [i] represents inner dimension
    // prefix [o] represents outer dimension
    // [|] separates the outer and inner dimensions
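
To make the notation concrete, here is a small sketch that builds the axis labels of the 2D reference tensor for a given pair of unroll factors. It assumes the vectorization and TIDx splits have already been applied, so the starting layout is `[outer | i-remainder, TIDx, Vect]` as described in this thread. `join` and `layout2D` are hypothetical helpers that only track labels; they do not call into nvFuser.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Join labels with ", ".
std::string join(const std::vector<std::string>& parts) {
  std::string out;
  for (std::size_t i = 0; i < parts.size(); ++i) {
    if (i > 0) {
      out += ", ";
    }
    out += parts[i];
  }
  return out;
}

// Labels of the 2D layout after the optional unroll splits.
// A factor of 1 means no split, so no extra axis appears.
std::string layout2D(int64_t unroll_factor_inner, int64_t unroll_factor_outer) {
  assert(unroll_factor_inner >= 1 && unroll_factor_outer >= 1);
  std::vector<std::string> outer{"outer"};
  std::vector<std::string> inner{"i-remainder", "TIDx", "Vect"};
  if (unroll_factor_inner > 1) {
    // Mirrors split(1, unroll_factor_inner): i-remainder -> i-remainder, i-Unroll
    inner.insert(inner.begin() + 1, std::string("i-Unroll"));
  }
  if (unroll_factor_outer > 1) {
    // Mirrors split(0, unroll_factor_outer): outer -> o-remainder, o-Unroll
    outer = {"o-remainder", "o-Unroll"};
  }
  return "[" + join(outer) + " | " + join(inner) + "]";
}
```

For example, `layout2D(2, 4)` produces `[o-remainder, o-Unroll | i-remainder, i-Unroll, TIDx, Vect]`, and `layout2D(1, 1)` leaves the layout at `[outer | i-remainder, TIDx, Vect]`.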

jjsjann123 · 2024-10-30T17:54:01Z

csrc/scheduler/pointwise.cpp

@@ -822,7 +847,9 @@ void schedulePointwise(Fusion* fusion, const PointwiseParams* pparams) {
      // Threads
      reference_tv->split(0, kThreadX);
      // Unroll
-      reference_tv->split(0, pparams->unroll_factor);
+      if (pparams->unroll_factor_inner > 1) {
+        reference_tv->split(0, pparams->unroll_factor_inner);


qq: we are not using unroll_factor_outer in this branch, is that expected?

Yes, this else branch is for 1D scheduler, all IDs are merged into 1 domain, there is no outer dim.

jjsjann123

LGTM.

Since this isn't applying any functional change, should we double check the code diff just to be sure?

liqiangxl · 2024-10-31T19:35:59Z

!build --diff-bench --diff

liqiangxl · 2024-11-01T11:48:07Z

!build --diff-bench --diff

liqiangxl · 2024-11-01T14:47:57Z

!test --diff-bench --diff

liqiangxl · 2024-11-01T21:53:50Z

Still not sure why the code changed, e.g. test_correctness_var_mean_float64 is a reduction, which shouldn't be changed by this PR. Let me close this PR and redo a new one. @jjsjann123 @jacobhinkle code diff test in #3311 seems to work fine.

jacobhinkle · 2024-11-01T21:56:14Z

Still not sure why the code changed, e.g. test_correctness_var_mean_float64 is a reduction, which shouldn't be changed by this PR. Let me close this PR and redo a new one. @jjsjann123 @jacobhinkle code diff test in #3311 seems to work fine.

I'm seeing that in all the PRs. I think there's something going on that is flipping the order of outputs in the generated kernel. It may or may not be related to serde.

liqiangxl · 2024-11-02T01:50:35Z

@jjsjann123 I am going to merge this PR after the build test. There are two types of code diffs.
(1) Due to the change in this PR: if the inner or outer unroll factor == 1, we won't split out an additional domain of size 1. This removes an extra for-loop, which leads to different compute-at positions and expr orders. Here is a case.
(2) As Jacob said, something is flipping the order of outputs in the generated kernel. Here is an example. I can't reproduce it locally, so it is probably related to the CI scripts. The fusion also doesn't use the pointwise scheduler.

liqiangxl · 2024-11-02T01:50:57Z

!build !test

liqiangxl · 2024-11-02T02:02:55Z

!tests

liqiangxl added 12 commits October 17, 2024 16:02

unroll the outer dim

12568b6

unroll the outer dim

09ba0b6

Merge branch 'llu/unroll_outer_dim' of https://github.com/nvidia/fuser …

c86a0b2

…into llu/unroll_outer_dim

comment

f5b349f

enable unroll

23efc80

adjust bdimx for divisible split

67450af

test heurs

35af092

unroll inner and outer

089dd85

merge main

e2221b5

wip

00571b4

tests

377b7fc

clean

12ad2e6

python

94680b6

Merge branch 'main' into llu/ps_unroll_inner_outer

7e04577

liqiangxl force-pushed the llu/ps_unroll_inner_outer branch from ffd65d1 to 7e04577 Compare October 25, 2024 18:37

fix pos

c5b0365

merge

1307ba8

Merge branch 'main' into llu/ps_unroll_inner_outer

4ea639b

clean

6c22a3b

split even outer unroll factor == 1, should drop this commit, test co…

b23cb41

…de diff

liqiangxl force-pushed the llu/ps_unroll_inner_outer branch from c739569 to b23cb41 Compare October 28, 2024 00:59

Merge branch 'main' into llu/ps_unroll_inner_outer

50ba432

liqiangxl marked this pull request as ready for review October 29, 2024 19:29

liqiangxl requested review from jjsjann123 and rdspring1 October 29, 2024 19:30

rdspring1 reviewed Oct 30, 2024

rdspring1 approved these changes Oct 30, 2024

jjsjann123 reviewed Oct 30, 2024

liqiangxl added 2 commits October 30, 2024 11:44

set unroll factor based on 1d or 2d scheduler

cb60e11

add comment

8616bc9

jjsjann123 approved these changes Oct 30, 2024

Merge branch 'main' into llu/ps_unroll_inner_outer

416bc31

Merge branch 'main' into llu/ps_unroll_inner_outer

63a38f1

liqiangxl closed this Nov 1, 2024

liqiangxl mentioned this pull request Nov 1, 2024

add knobs control inner dim unroll and outer dim unroll in pointwise scheduler redo pr-3275 to check code changes #3325

Closed

Merge branch 'main' into llu/ps_unroll_inner_outer

5293c7e

liqiangxl reopened this Nov 2, 2024

Merge branch 'main' into llu/ps_unroll_inner_outer

0710686

liqiangxl merged commit c02e7ee into main Nov 2, 2024
47 checks passed

liqiangxl deleted the llu/ps_unroll_inner_outer branch November 2, 2024 13:40

rdspring1 mentioned this pull request Nov 4, 2024

Fix autotune_pointwise.py script #3339

Merged

rdspring1 added a commit that referenced this pull request Nov 5, 2024

Fix autotune_pointwise.py script (#3339)

162a13b

Fix the `autotune_pointwise` script which was broken by #3275. The earlier PR changed the pointwise setting from `unroll_factor` to `inner_unroll_factor`.
