(1) Even after inner persistent buffers are stored in shared memory, there are still bank conflicts if the persistent buffer is NOT projected to inputs, for two reasons:
(a) A cacheBefore is missing; it is needed to ensure vectorized writes to shared memory.
(b) After adding vectorized reads and writes, if inputs are vectorized by 8, the innermost dim is 8, but an fp32 buffer can only be vectorized by 4, so there is still a 2-way bank conflict.
(2) Can we project to inputs when there are view ops?
(3) If we can't project to inputs, is using smem persistent still faster than register persistent?
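The 2-way conflict described in (1)(b) can be reproduced with a small model. This is a sketch under stated assumptions, not nvFuser code: it assumes 32 banks of 4 bytes each and that 16-byte accesses are serviced a quarter-warp (8 threads) at a time. `conflict_degree` and its parameters are hypothetical names.

```python
from collections import Counter

NUM_BANKS = 32   # assumed number of shared memory banks
BANK_BYTES = 4   # assumed bank width in bytes


def conflict_degree(stride_bytes, access_bytes, threads_per_phase):
    """Return the max number of threads that hit the same bank in one phase.

    Thread t is assumed to access `access_bytes` contiguous bytes starting
    at byte offset t * stride_bytes.
    """
    hits = Counter()
    for t in range(threads_per_phase):
        start = t * stride_bytes
        # Banks touched by this thread's access.
        banks = {
            ((start + b) // BANK_BYTES) % NUM_BANKS
            for b in range(0, access_bytes, BANK_BYTES)
        }
        for bank in banks:
            hits[bank] += 1
    return max(hits.values())


# Each thread owns 8 contiguous fp32 values (32-byte stride) but reads 16 bytes.
print(conflict_degree(stride_bytes=32, access_bytes=16, threads_per_phase=8))
# Each thread owns 4 contiguous fp32 values (16-byte stride) and reads 16 bytes.
print(conflict_degree(stride_bytes=16, access_bytes=16, threads_per_phase=8))
```

Under this model the first call reports a degree of 2 (threads 0 and 4 map to the same banks), while the second reports 1, i.e. no conflict.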
Results indicate that:
(1) We should project to inputs to achieve higher performance, as long as view ops don't interfere with reductions.
(2) If we can't project to inputs, using smem persistent is still faster than register persistent (green markers are faster than yellow markers).
…l vectorization size (#3271)
**Issue** The InnerOuter persistent scheduler uses shared memory to store
persistent buffers. The data flow is `input in gmem ---> async copy to
smem --> vectorized load to registers (smem consumers)`. Both arrows are
simply `LoadStoreOp`s, and the same vectorization factor is used for both
copies. [CI](https://nv/e2E/118278383) found a case where the shared
memory persistent buffers have a data type of fp32 while the inputs are
fp16 (when there are view ops, project to inputs is not used). The
vectorization factor was set to 8, which caused a 32-byte vectorized
access when loading from shared memory to registers.
**Changes**:
(1) Added code to handle the vectorization of smem consumers. An
additional split is added if the `smem --> regs` copy leads to a
vectorization larger than 16 bytes.
(2) Added a test.
**Results**: Vectorizations are guaranteed to be <= 16 bytes.
**Follow-up work**
See issue #3272
---------
Co-authored-by: Naoya Maruyama <[email protected]>
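The extra split in change (1) can be sketched as follows. This is a hypothetical illustration in Python rather than the scheduler's actual code: `smem_consumer_split` and `MAX_VECT_BYTES` are made-up names, and the 16-byte cap is taken from the description above.

```python
from typing import Tuple

MAX_VECT_BYTES = 16  # cap on a single vectorized access, per the commit text


def smem_consumer_split(vect_factor: int, dtype_bytes: int) -> Tuple[int, int]:
    """Split a vectorization factor into (outer, inner) parts.

    `inner` is the factor actually vectorized, chosen so that
    inner * dtype_bytes <= MAX_VECT_BYTES. `outer` is the number of
    vectorized accesses needed to cover the original factor.
    """
    max_factor = MAX_VECT_BYTES // dtype_bytes
    if vect_factor <= max_factor:
        # Already within the cap: no extra split needed.
        return 1, vect_factor
    if vect_factor % max_factor != 0:
        raise ValueError("vect_factor must be divisible by the capped factor")
    return vect_factor // max_factor, max_factor


# fp32 smem buffer (4 bytes) with the fp16 input's factor of 8: 32 bytes, too wide.
print(smem_consumer_split(8, 4))
# fp16 buffer (2 bytes) with factor 8: exactly 16 bytes, left unchanged.
print(smem_consumer_split(8, 2))
```

For the CI case (fp32 buffer, factor 8) the sketch returns `(2, 4)`, meaning two 16-byte loads instead of one 32-byte load; the fp16 case stays at `(1, 8)`.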
liqiangxl changed the title from "Remaining issues of inner outer persistent scheduler" to "Tracks performance issues related to inner outer persistent scheduler" on Oct 30, 2024