# Unified GEMM and GEMM + GELU on Nvidia Tensor Cores, Intel XMX of PVC, LNL, BMG and DG2, and Intel AMX of SPR using SYCL joint matrix
- cache tiling of i and j
- cache tiling on k as well (so no reordering is needed)
- data reuse of A and B in physical layer
- Out-of-bounds checking is enabled for PVC, BMG, and LNL with -DOOB
- Prefetch is enabled for PVC, BMG, and LNL with -DPREFETCH
- Since BMG and LNL have a smaller L1 cache and slower DPAS, the prefetch distance is reduced on them
- Increase the number of iterations so that the GPU does not idle and its frequency does not drop
- Both row major and VNNI transform options are supported. For row major, omit -DVNNI
- SLM tuning for DG2: add -DSG_SIZE=8 -DSLM to the common options below
- No reordering and no SLM on Nvidia
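The cache tiling of i, j, and k listed above can be pictured with a plain CPU loop nest. This is a minimal sketch in standard C++, not the joint matrix kernel itself: the function name `gemm_tiled` and the tile sizes `TILE_M`, `TILE_N`, and `TILE_K` are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes for illustration only; they are not the
// MCACHE/NCACHE/KCACHE values used by the sample.
constexpr std::size_t TILE_M = 4, TILE_N = 4, TILE_K = 4;

// Computes C (M x N) += A (M x K) * B (K x N); all matrices are row major.
void gemm_tiled(const std::vector<float> &A, const std::vector<float> &B,
                std::vector<float> &C, std::size_t M, std::size_t N,
                std::size_t K) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  // Outer loops walk over tiles of i, j, and k.
  for (std::size_t i0 = 0; i0 < M; i0 += TILE_M)
    for (std::size_t j0 = 0; j0 < N; j0 += TILE_N)
      for (std::size_t k0 = 0; k0 < K; k0 += TILE_K) {
        // Clamp the tile end when a dimension is not a multiple of the tile.
        const std::size_t i1 = std::min(i0 + TILE_M, M);
        const std::size_t j1 = std::min(j0 + TILE_N, N);
        const std::size_t k1 = std::min(k0 + TILE_K, K);
        // Inner loops stay inside one tile, so the A and B elements they
        // touch are reused while they are still in cache.
        for (std::size_t i = i0; i < i1; ++i)
          for (std::size_t k = k0; k < k1; ++k) {
            const float a = A[i * K + k];
            for (std::size_t j = j0; j < j1; ++j)
              C[i * N + j] += a * B[k * N + j];
          }
      }
}
```

Calling `gemm_tiled(A, B, C, M, N, K)` on a zero-initialized `C` produces the same result as an untiled triple loop; only the order in which memory is visited changes.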
For maximum performance, the cache and register blocking parameters differ across Nvidia Tensor Cores, Intel AMX, DPAS on DG2, and DPAS on PVC, BMG, and LNL. See the specific parameters below:
For square problems (M = N = K = X), use -DMATRIX_SIZE=X. Otherwise, set each dimension separately, for example: -DMATRIX_M=1024 -DMATRIX_N=6144 -DMATRIX_K=6144
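One way such size flags could be resolved at compile time is sketched below. This is an assumption about the mechanism, not a copy of the sample's source: the constant names `problem_m`, `problem_n`, and `problem_k` are hypothetical, and the fallback values are simply taken from the example flags above.

```cpp
#include <cstddef>

#ifdef MATRIX_SIZE
// Square case: -DMATRIX_SIZE=X sets all three dimensions at once.
constexpr std::size_t problem_m = MATRIX_SIZE;
constexpr std::size_t problem_n = MATRIX_SIZE;
constexpr std::size_t problem_k = MATRIX_SIZE;
#else
// Rectangular case: each dimension has its own flag. The fallbacks below
// are assumed for illustration and may differ from the sample's defaults.
#ifndef MATRIX_M
#define MATRIX_M 1024
#endif
#ifndef MATRIX_N
#define MATRIX_N 6144
#endif
#ifndef MATRIX_K
#define MATRIX_K 6144
#endif
constexpr std::size_t problem_m = MATRIX_M;
constexpr std::size_t problem_n = MATRIX_N;
constexpr std::size_t problem_k = MATRIX_K;
#endif
```

Because the sizes are compile-time constants in this sketch, changing them means rebuilding with different -D flags.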
To build for Nvidia Tensor Cores:
icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 joint_matrix_fill_k_cache.cpp -DNVIDIA -DMCACHE1=64 -DNCACHE1=64 -DMCACHE2=128 -DNCACHE2=128
icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 joint_matrix_fill_k_cache.cpp -DMATRIX_SIZE=4096 -DNVIDIA -DMCACHE1=64 -DNCACHE1=64 -DMCACHE2=128 -DNCACHE2=128
To build for Intel XMX of PVC, BMG, and LNL:
icpx -fsycl joint_matrix_fill_k_cache.cpp -DPREFETCH -DOOB
icpx -fsycl joint_matrix_fill_k_cache.cpp -DPREFETCH -DOOB -DMATRIX_SIZE=4096
To build for Intel XMX of DG2:
icpx -fsycl joint_matrix_fill_k_cache.cpp -DNCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=16 -DVNNI
icpx -fsycl joint_matrix_fill_k_cache.cpp -DNCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=16 -DMATRIX_SIZE=4096 -DVNNI
To build for Intel AMX of SPR:
icpx -fsycl joint_matrix_fill_k_cache.cpp -DNCACHE1=32 -DKCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=1024 -DVNNI
icpx -fsycl joint_matrix_fill_k_cache.cpp -DNCACHE1=32 -DKCACHE1=32 -DMCACHE2=256 -DNCACHE2=256 -DKCACHE2=1024 -DMATRIX_SIZE=4096 -DVNNI
To run on Nvidia GPU: ONEAPI_DEVICE_SELECTOR=cuda:0 ./a.out
To run on Intel GPU with the large register file enabled: SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file" ./a.out
To run on CPU: DPCPP_CPU_NUM_CUS=112 ./a.out
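For reference, the VNNI transform selected with -DVNNI can be pictured as a repacking of the B matrix in which `factor` consecutive rows are interleaved, so that each group of `factor` elements along K sits contiguously in memory. The sketch below is a host-side illustration in standard C++ under that assumption; the function name `vnni_pack` is hypothetical and is not taken from the sample. The factor is typically 2 for 16-bit element types and 4 for 8-bit element types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Repacks a row-major K x N matrix into a (K / factor) x (N * factor)
// row-major matrix: element (k, j) moves to row k / factor and
// column j * factor + k % factor.
template <typename T>
std::vector<T> vnni_pack(const std::vector<T> &B, std::size_t K,
                         std::size_t N, std::size_t factor) {
  assert(B.size() == K * N);
  assert(factor > 0 && K % factor == 0); // K must be a multiple of factor
  std::vector<T> packed(K * N);
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t j = 0; j < N; ++j)
      packed[(k / factor) * (N * factor) + j * factor + (k % factor)] =
          B[k * N + j];
  return packed;
}
```

For a 4 x 2 input with rows [0 1], [2 3], [4 5], [6 7] and a factor of 2, the packed rows are [0 2 1 3] and [4 6 5 7]: each column of a row pair becomes one contiguous pair.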