@petercad petercad commented Nov 3, 2025

Using smaller B loads reduces the amount of upconversion/reorder work, which can improve performance when that work is expensive.

This PR updates the xe_gemm example to illustrate this, using 64x32 subgroup tiles instead of 32x64 subgroup tiles for such cases.
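A rough way to see the effect is to count how many B elements all subgroup tiles load in total. The sketch below is only a back-of-the-envelope model, not code from the PR. It assumes that tile shapes are written as M x N, that each subgroup walks the full K dimension, and that upconversion/reorder cost scales with the number of B elements loaded. The helper `tile_traffic` is hypothetical, and the problem size is the one from the benchmark quoted later in this thread.

```python
def tile_traffic(m, n, k, tile_m, tile_n):
    """Return (A elements, B elements) loaded summed over all subgroup tiles.

    Assumption: every subgroup tile loads a tile_m x k slab of A and a
    k x tile_n slab of B, with no sharing between subgroups.
    """
    tiles_m = -(-m // tile_m)  # ceiling division
    tiles_n = -(-n // tile_n)
    num_tiles = tiles_m * tiles_n
    a_elems = num_tiles * tile_m * k
    b_elems = num_tiles * k * tile_n
    return a_elems, b_elems


# Problem size from the benchmark in this thread.
m, n, k = 2560, 4096, 4096

a_before, b_before = tile_traffic(m, n, k, tile_m=32, tile_n=64)
a_after, b_after = tile_traffic(m, n, k, tile_m=64, tile_n=32)

print("B elements loaded, 32x64 tiles:", b_before)
print("B elements loaded, 64x32 tiles:", b_after)
print("B ratio (before / after):", b_before / b_after)
print("A ratio (after / before):", a_after / a_before)
```

Under these assumptions the 64x32 tile halves the B elements that need conversion while doubling the A elements loaded, which is why the change is only expected to pay off when work on B dominates.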

@tdeng5 tdeng5 added the release label Nov 4, 2025
@Antonyvance

@petercad Can you quote the observed performance improvement as well?

@petercad petercad force-pushed the petercad/xe_gemm_4x8 branch from 3e2c654 to 2d46ba1 Compare November 5, 2025 00:40

petercad commented Nov 5, 2025

> @petercad Can you quote the observed performance improvement as well?

Here are some improved cases (BMG 160EU @ 2.85GHz, m = 2560, n = k = 4096):

| data types | layouts | TF/s before | TF/s after |
|---|---|---|---|
| u4 x u4 | RxR | 207 | 274 |
| f16 x e4m3 | RxR | 70.5 | 94.5 |
| f16 x u8 | RxR | 96.5 | 102.3 |
| bf16 x s4 | RxC | 98.8 | 101 |
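
For reference, the relative gains implied by these figures can be computed with a short script. This is a minimal sketch that only restates the numbers above; the `speedup` helper is a hypothetical name used for illustration.

```python
# (data types, layouts, TF/s before, TF/s after), copied from the figures above.
results = [
    ("u4 x u4", "RxR", 207.0, 274.0),
    ("f16 x e4m3", "RxR", 70.5, 94.5),
    ("f16 x u8", "RxR", 96.5, 102.3),
    ("bf16 x s4", "RxC", 98.8, 101.0),
]


def speedup(before, after):
    """Throughput ratio after / before (values above 1 mean faster)."""
    return after / before


for dtypes, layout, before, after in results:
    s = speedup(before, after)
    print(f"{dtypes:12s} {layout}  {s:.3f}x  ({(s - 1) * 100:+.1f}%)")
```

The gains are largest for the u4 x u4 and f16 x e4m3 cases (roughly a third faster) and small for the other two (a few percent).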

@petercad petercad force-pushed the petercad/xe_gemm_4x8 branch from 2d46ba1 to 27993fa Compare November 5, 2025 00:49