Commit
Implement SplitK and StreamK algorithm for Intel PVC (#132)
* WIP: Introduce StreamK for PVC
* fixed starting index calculation
* Fixed barrier count update
* Fixed compilation for normal GEMM
* Perform fixup using threadid instead of subgroup_id
* Fixed the k_idx offset for MMA atom and corrected the reduction offset calculation
* Use log2 for available_xecores
* SplitK working
* Minor cleanup
* Need to fix splitK for batch > 1
* Fixed splitK for batch > 1
* Re-enabled GEMM Universal Adapter specialization
* Update split barrier arguments
* Minor cleanup
* Changed initialization to workspace only
* Fix CI failure
* Added support for scheduling non-uniform tiles
* Only include split barrier flags for PVC
* Test
* Code cleanup
* Add separate example for StreamK
* Address feedback for split barrier
* Fix address space for atomicAdd
* Instantiate new accumulator registers per iteration
* Renamed the pipeline file
* Renamed files to xe_*
* Removed l2 workspace alignment
* Update the example to reduce caching effects
* Refactor pipeline code
* Add the option to invoke data parallel decomposition
* Fixing bugs post merge
* Address PR feedback
* Fix tile size
* Fix performance for streamk
* Match the number of workgroups to the available XeCores
* Fix performance for pvc_gemm example
* Address comments

---------

Co-authored-by: Mehdi Goli <[email protected]>
Co-authored-by: Alejandro Acosta <[email protected]>
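For readers unfamiliar with the two decompositions named in the title, the following is a minimal host-side sketch of how SplitK and StreamK typically partition GEMM work. It is not the code from this commit: `KRange`, `WorkUnit`, `split_k_ranges`, and `stream_k_assign` are hypothetical names, and the sketch assumes every output tile has the same number of K iterations. It also omits the reduction ("fixup") of partial accumulators that is needed whenever one tile is shared by several workgroups.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical illustration types; not taken from the commit.
struct KRange { int64_t begin; int64_t end; };               // half-open [begin, end)
struct WorkUnit { int64_t tile; int64_t k_begin; int64_t k_end; };

// SplitK-style partition: cut `k_iters` iterations into `splits` contiguous,
// near-equal chunks. Assumes splits > 0. The first (k_iters % splits) chunks
// get one extra iteration.
std::vector<KRange> split_k_ranges(int64_t k_iters, int64_t splits) {
  std::vector<KRange> out;
  const int64_t base = k_iters / splits;
  const int64_t rem = k_iters % splits;
  int64_t start = 0;
  for (int64_t s = 0; s < splits; ++s) {
    const int64_t len = base + (s < rem ? 1 : 0);
    out.push_back({start, start + len});
    start += len;
  }
  return out;
}

// StreamK-style partition: flatten all (tile, k-iteration) pairs into one
// linear range and give each workgroup a contiguous share of it. A share may
// cross tile boundaries, so it is broken into per-tile work units.
std::vector<std::vector<WorkUnit>> stream_k_assign(int64_t tiles,
                                                   int64_t k_iters,
                                                   int64_t workgroups) {
  std::vector<std::vector<WorkUnit>> out(workgroups);
  const std::vector<KRange> shares = split_k_ranges(tiles * k_iters, workgroups);
  for (int64_t wg = 0; wg < workgroups; ++wg) {
    int64_t it = shares[wg].begin;
    while (it < shares[wg].end) {
      const int64_t tile = it / k_iters;
      const int64_t k_begin = it % k_iters;
      // Stop at the end of the tile or the end of this workgroup's share.
      const int64_t k_end = std::min(k_iters, k_begin + (shares[wg].end - it));
      out[wg].push_back({tile, k_begin, k_end});
      it += k_end - k_begin;
    }
  }
  return out;
}
```

In this sketch, passing the number of available XeCores as `workgroups` mirrors the "Match the number of workgroups to the available XeCores" item above. Any tile that appears in more than one workgroup's list would need its partial results combined afterwards.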