From a9a57885845c7cd0dc3e5fc5ded156268552d985 Mon Sep 17 00:00:00 2001
From: Jakub Kuderski <jakub@nod-labs.com>
Date: Thu, 22 Aug 2024 14:59:11 -0400
Subject: [PATCH] [docs] Add a tip on L1 bandwidth (#142)

---
 docs/amdgpu_kernel_optimization_guide.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/amdgpu_kernel_optimization_guide.md b/docs/amdgpu_kernel_optimization_guide.md
index bf597cd94..09c5b59f9 100644
--- a/docs/amdgpu_kernel_optimization_guide.md
+++ b/docs/amdgpu_kernel_optimization_guide.md
@@ -4,7 +4,7 @@ Author: Jakub Kuderski @kuhar
 
 Date: 2024-06-24
 
-Last Update: 2024-08-14
+Last Update: 2024-08-22
 
 ## Introduction
 
@@ -280,6 +280,11 @@ at once.
 A sequence of up to 4 adjacent `global_load_dwordx4` instructions (implicitly)
 forms a *clause* that translates to a single data fabric transaction.
 
+> [!TIP]
+> To achieve peak L1 bandwidth, make sure that your memory accesses engage all
+> four L1 cache sets. That is, at the level of the workgroup, you should be
+> loading four 128 B cache lines that each map to a different cache set.
+
 > [!TIP]
 > For data that is 'streamed' and does not need to be cached, consider
 > using *non-temporal* loads/stores. This disables coherency and invalidates