
Tracks performance issues related to inner outer persistent scheduler #3272

Open · liqiangxl opened this issue Oct 25, 2024 · 1 comment
Labels: perf, Schedulers & Heuristics (Issues related to Schedulers & Heuristics)

@liqiangxl (Collaborator)

(1) After inner persistent buffers are stored in shared memory, there are still bank conflicts if the persistent buffer is NOT projected to inputs, for two reasons:

(a) We are missing a cacheBefore to ensure a vectorized write to shared memory.
(b) After adding vectorized reads and writes, if the inputs are vectorized by 8, the innermost dim of the shared memory buffer is 8 elements; with fp32 data it can only be vectorized by 4, which still leaves a 2-way bank conflict (see the sketch at the end of this comment).

(2) Can we project to inputs when there are view ops?
(3) If we can't project to inputs, is smem persistent still faster than register persistent?
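
A minimal back-of-the-envelope sketch of the bank-conflict arithmetic in (1)(b). This is not nvFuser code; it assumes 32 shared-memory banks of 4 bytes each and that a 16-byte vectorized access is serviced for 8 threads of a warp at a time:

    # Hypothetical illustration: count worst-case shared-memory bank conflicts
    # for one phase of 8 threads doing vectorized accesses.
    def max_bank_conflict(elem_bytes, elems_per_thread_row, vect_width,
                          banks=32, bank_bytes=4, threads_per_phase=8):
        row_bytes = elem_bytes * elems_per_thread_row  # stride between adjacent threads
        vect_bytes = elem_bytes * vect_width           # bytes per vectorized access
        bank_hits = {}
        for t in range(threads_per_phase):
            base = t * row_bytes
            for b in range(base, base + vect_bytes, bank_bytes):
                bank = (b // bank_bytes) % banks
                bank_hits[bank] = bank_hits.get(bank, 0) + 1
        return max(bank_hits.values())

    # fp32 buffer, innermost dim 8 per thread, vectorize by 4 -> 2-way conflict
    print(max_bank_conflict(elem_bytes=4, elems_per_thread_row=8, vect_width=4))  # 2
    # fp16 buffer, innermost dim 8 per thread, vectorize by 8 -> conflict free
    print(max_bank_conflict(elem_bytes=2, elems_per_thread_row=8, vect_width=8))  # 1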

@liqiangxl (Collaborator, Author)

Some exploratory observations:
Added a redundant reshape before the layer norm backward (ln bwd) fusion:

    # Reshape T1 to a 3D tensor, multiply by 1.0, then reshape back to a 2D tensor.
    # This reshape still allows the scheduler to project the persistent buffer to
    # inputs, but the current heuristics disable the projection, which leads to
    # lower performance.
    G0 = fd.define_scalar(256, dtype=DataType.Int)
    C0 = fd.ops.div(T0.size(1), G0)  # uses T0's size; assumes T0 and T1 share the same 2D shape
    V1 = fd.define_vector([T1.size(0), C0, G0], dtype=DataType.Int)
    V2 = fd.define_vector([T1.size(0), T1.size(1)], dtype=DataType.Int)
    T1 = fd.ops.reshape(T1, new_shape=V1)
    S1 = fd.define_scalar(1.0, dtype=DataType.Float)
    T1 = fd.ops.mul(T1, S1)
    T1 = fd.ops.reshape(T1, new_shape=V2)

Tested performance on top of #3223
[Image: performance comparison plot]

Results indicate that:
(1) We should project to inputs to achieve higher performance as long as the view ops don't interfere with the reductions.
(2) If we can't project to inputs, using smem persistent is still faster than register persistent (green markers are faster than yellow markers).

liqiangxl added a commit that referenced this issue Oct 29, 2024
…l vectorization size (#3271)

**Issue**: The InnerOuter persistent scheduler uses shared memory to store persistent buffers. The data flow is `input in gmem --> async copy to smem --> vectorized load to registers (smem consumers)`, where each `-->` is simply a `LoadStoreOp`, and the same vectorization factor is used for both copies. [CI](https://nv/e2E/118278383) found a case where the shared memory persistent buffers have a data type of fp32 while the inputs are fp16 (when there are view ops, projection to inputs is not used). The vectorization factor was set to 8, which caused 32-byte vectorized loads from shared memory to registers.

**Changes**:
(1) Added code to handle the vectorization of smem consumers: add an additional split if the `smem --> regs` copy would lead to a vectorization larger than 16 bytes.
(2) Added a test.

**Results**: Ensures all vectorizations are <= 16 bytes.

**Follow-up work**
See issue #3272

---------

Co-authored-by: Naoya Maruyama <[email protected]>
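
To make the capping logic described in this commit concrete, here is a hypothetical sketch (not the actual scheduler code) of how the `smem --> regs` vectorization factor could be derived from the input vectorization factor and the persistent buffer's data type, assuming each additional split halves the factor:

    # Hypothetical sketch: cap the smem-consumer vectorization at 16 bytes.
    def smem_consumer_vect(input_vect, buffer_dtype_bytes, max_vect_bytes=16):
        vect = input_vect
        # Keep splitting the vectorized dimension until each access fits in 16 bytes.
        while vect > 1 and vect * buffer_dtype_bytes > max_vect_bytes:
            vect //= 2
        return vect

    # fp16 inputs vectorized by 8 (16 bytes) with an fp32 smem persistent buffer:
    # reusing the factor 8 would give 32-byte loads; the cap reduces it to 4.
    print(smem_consumer_vect(input_vect=8, buffer_dtype_bytes=4))  # 4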
@kevinstephano added the "Schedulers & Heuristics" label on Oct 30, 2024
@liqiangxl changed the title from "Remaining issues of inner outer persistent scheduler" to "Tracks performance issues related to inner outer persistent scheduler" on Oct 30, 2024