From a9a57885845c7cd0dc3e5fc5ded156268552d985 Mon Sep 17 00:00:00 2001
From: Jakub Kuderski
Date: Thu, 22 Aug 2024 14:59:11 -0400
Subject: [PATCH] [docs] Add a tip on L1 bandwidth (#142)

---
 docs/amdgpu_kernel_optimization_guide.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/amdgpu_kernel_optimization_guide.md b/docs/amdgpu_kernel_optimization_guide.md
index bf597cd94..09c5b59f9 100644
--- a/docs/amdgpu_kernel_optimization_guide.md
+++ b/docs/amdgpu_kernel_optimization_guide.md
@@ -4,7 +4,7 @@ Author: Jakub Kuderski @kuhar
 
 Date: 2024-06-24
 
-Last Update: 2024-08-14
+Last Update: 2024-08-22
 
 ## Introduction
 
@@ -280,6 +280,11 @@ at once.
 A sequence of up to 4 adjacent `global_load_dwordx4` instructions (implicitly)
 forms a *clause* that translates to a single data fabric transaction.
 
+> [!TIP]
+> To achieve peak L1 bandwidth, make sure that your memory access engages all
+> four L1 cache sets. That is, at the level of the workgroup, you should be
+> loading 4 cache lines (128 B) that each map to a different cache set.
+
 > [!TIP]
 > For data that is 'streamed' and does not need to be cached, consider
 > using *non-temporal* loads/stores. This disables coherency and invalidates