
Tracks performance issues related to inner outer persistent scheduler #3272

Open · liqiangxl opened this issue Oct 25, 2024 · 1 comment
Labels: perf, Schedulers & Heuristics (Issues related to Schedulers & Heuristics)

@liqiangxl (Collaborator)

(1) After inner persistent buffers are stored in shared memory, there are still bank conflicts if the persistent buffer is NOT projected to inputs, for two reasons:

(a) We are missing a cacheBefore to ensure a vectorized write to shared memory.
(b) After adding vectorized reads and writes, if the inputs are vectorized by 8, the innermost dim of the shared memory buffer is 8 elements; with fp32 data it can only be vectorized by 4, which still leaves a 2-way bank conflict (see the sketch at the end of this comment).

(2) Can we project to inputs when there are view ops?
(3) If we can't project to inputs, is smem persistent still faster than register persistent?
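
A minimal back-of-the-envelope sketch of the bank-conflict arithmetic in (1)(b). This is not nvFuser code; it assumes 32 shared-memory banks of 4 bytes each and that a 16-byte vectorized access is serviced for 8 threads of a warp at a time:

    # Hypothetical illustration: count worst-case shared-memory bank conflicts
    # for one phase of 8 threads doing vectorized accesses.
    def max_bank_conflict(elem_bytes, elems_per_thread_row, vect_width,
                          banks=32, bank_bytes=4, threads_per_phase=8):
        row_bytes = elem_bytes * elems_per_thread_row  # stride between adjacent threads
        vect_bytes = elem_bytes * vect_width           # bytes per vectorized access
        bank_hits = {}
        for t in range(threads_per_phase):
            base = t * row_bytes
            for b in range(base, base + vect_bytes, bank_bytes):
                bank = (b // bank_bytes) % banks
                bank_hits[bank] = bank_hits.get(bank, 0) + 1
        return max(bank_hits.values())

    # fp32 buffer, innermost dim 8 per thread, vectorize by 4 -> 2-way conflict
    print(max_bank_conflict(elem_bytes=4, elems_per_thread_row=8, vect_width=4))  # 2
    # fp16 buffer, innermost dim 8 per thread, vectorize by 8 -> conflict free
    print(max_bank_conflict(elem_bytes=2, elems_per_thread_row=8, vect_width=8))  # 1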

@liqiangxl (Collaborator, Author)

Some exploratory observations:
Added a redundant reshape before the layer norm backward (ln bwd) fusion:

    # Reshape T1 to a 3D tensor, multiply by 1.0, then reshape back to a 2D tensor.
    # This reshape still allows the scheduler to project the persistent buffer to
    # inputs, but the current heuristics disable the projection, which leads to
    # lower performance.
    G0 = fd.define_scalar(256, dtype=DataType.Int)
    C0 = fd.ops.div(T0.size(1), G0)  # uses T0's size; assumes T0 and T1 share the same 2D shape
    V1 = fd.define_vector([T1.size(0), C0, G0], dtype=DataType.Int)
    V2 = fd.define_vector([T1.size(0), T1.size(1)], dtype=DataType.Int)
    T1 = fd.ops.reshape(T1, new_shape=V1)
    S1 = fd.define_scalar(1.0, dtype=DataType.Float)
    T1 = fd.ops.mul(T1, S1)
    T1 = fd.ops.reshape(T1, new_shape=V2)

Tested performance on top of #3223
[Image: performance comparison plot]

Results indicate that:
(1) We should project to inputs to achieve higher performance as long as the view ops don't interfere with the reductions.
(2) If we can't project to inputs, using smem persistent is still faster than register persistent (green markers are faster than yellow markers).

liqiangxl added a commit that referenced this issue Oct 29, 2024
…l vectorization size (#3271)

**Issue**: The InnerOuter persistent scheduler uses shared memory to store persistent buffers. The data flow is `input in gmem --> async copy to smem --> vectorized load to registers (smem consumers)`, where each `-->` is simply a `LoadStoreOp`, and the same vectorization factor is used for both copies. [CI](https://nv/e2E/118278383) found a case where the shared memory persistent buffers have a data type of fp32 while the inputs are fp16 (when there are view ops, projection to inputs is not used). The vectorization factor was set to 8, which caused 32-byte vectorized loads from shared memory to registers.

**Changes**:
(1) Added code to handle the vectorization of smem consumers: add an additional split if the `smem --> regs` copy would lead to a vectorization larger than 16 bytes.
(2) Added a test.

**Results**: Ensures all vectorizations are <= 16 bytes.

**Follow-up work**
See issue #3272

---------

Co-authored-by: Naoya Maruyama <[email protected]>
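
To make the capping logic described in this commit concrete, here is a hypothetical sketch (not the actual scheduler code) of how the `smem --> regs` vectorization factor could be derived from the input vectorization factor and the persistent buffer's data type, assuming each additional split halves the factor:

    # Hypothetical sketch: cap the smem-consumer vectorization at 16 bytes.
    def smem_consumer_vect(input_vect, buffer_dtype_bytes, max_vect_bytes=16):
        vect = input_vect
        # Keep splitting the vectorized dimension until each access fits in 16 bytes.
        while vect > 1 and vect * buffer_dtype_bytes > max_vect_bytes:
            vect //= 2
        return vect

    # fp16 inputs vectorized by 8 (16 bytes) with an fp32 smem persistent buffer:
    # reusing the factor 8 would give 32-byte loads; the cap reduces it to 4.
    print(smem_consumer_vect(input_vect=8, buffer_dtype_bytes=4))  # 4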
@kevinstephano added the "Schedulers & Heuristics" label on Oct 30, 2024
@liqiangxl changed the title from "Remaining issues of inner outer persistent scheduler" to "Tracks performance issues related to inner outer persistent scheduler" on Oct 30, 2024