🚀 The feature, motivation and pitch
We have some kernels that are not aligned with the stock CUDA implementation because:
- A functionality extension was added in the stock CUDA implementation, but we have no sustainable rebase.
- General memory layout support was added in the stock CUDA implementation, but we have no sustainable rebase.
- We have a specific implementation for performance in some cases, but stock CUDA does not cover these cases.
Type-1 is functionality related; Type-2 and Type-3 are performance related.
- For Type-1, we should fix the kernel and align it with stock CUDA while porting from IPEX to torch-xpu-ops.
- For Type-2, we should align with the CUDA implementation with proper priority.
- For Type-3, we need to trade off performance against the feasibility of an in-tree implementation.
Here is the list; items will be added gradually as ops are ported.
- aten::bernoulli_ // Type-2
- aten::cumsum // Type-3
- aten::cat @xytintel // Type-3
- aten::tril/triu @AlienLiang23 // Type-2 CUDA optimization commit: 1462d72904cb81917b9355d6a58916f389e9084c
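
As a rough illustration of what "aligned with stock CUDA" means in practice, below is a minimal parity-check sketch (not part of this issue or the test suite) that compares a ported op on the XPU backend against the CPU reference. The `check_parity` helper, the tolerances, and the use of the `"xpu"` device string are assumptions for illustration only.

```python
import torch

def check_parity(fn, *cpu_args, device="xpu", rtol=1e-5, atol=1e-6):
    # Run the op on CPU as the reference, then on the XPU backend, and compare results.
    ref = fn(*cpu_args)
    out = fn(*(a.to(device) if isinstance(a, torch.Tensor) else a for a in cpu_args))
    torch.testing.assert_close(out.cpu(), ref, rtol=rtol, atol=atol)

if __name__ == "__main__":
    x = torch.randn(4, 5)
    check_parity(torch.tril, x)                            # aten::tril
    check_parity(torch.cumsum, x, 1)                       # aten::cumsum along dim=1
    check_parity(lambda a: torch.cat([a, a], dim=0), x)    # aten::cat
```

A similar check against a CUDA device (when available) can be used to confirm behavioral alignment for the Type-2 items above; the performance-oriented Type-3 items additionally need benchmarking, which this sketch does not cover.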