Commit

Merge branch 'main' into add_tuner
RattataKing authored Aug 22, 2024
2 parents cf76f86 + a9a5788 commit 2d74481
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion docs/amdgpu_kernel_optimization_guide.md
@@ -4,7 +4,7 @@ Author: Jakub Kuderski @kuhar

Date: 2024-06-24

-Last Update: 2024-08-14
+Last Update: 2024-08-22

## Introduction

@@ -280,6 +280,11 @@ at once.
A sequence of up to 4 adjacent `global_load_dwordx4` instructions (implicitly)
forms a *clause* that translates to a single data fabric transaction.
+> [!TIP]
+> To achieve peak L1 bandwidth, make sure that your memory access engages all
+> four L1 cache sets. That is, at the level of the workgroup, you should be
+> loading 4 cache lines (128 B) that each map to a different cache set.
> [!TIP]
> For data that is 'streamed' and does not need to be cached, consider
> using *non-temporal* loads/stores. This disables coherency and invalidates
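
The cache-set tip added in this hunk can be illustrated with a small host-side model. This is only a sketch under stated assumptions: it reads the "128 B" in the tip as the size of one cache line, and it assumes that consecutive cache lines rotate through the four sets (set index = line index modulo 4). The real hardware address-to-set mapping is not given in the diff and may differ. The names `cache_set` and `sets_touched` are hypothetical helpers, not part of any AMD or IREE API.

```python
CACHE_LINE_BYTES = 128  # assumed: the "128 B" in the tip is the cache line size
NUM_L1_SETS = 4         # "all four L1 cache sets" from the tip


def cache_set(byte_address: int) -> int:
    """Return the L1 set for an address under the ASSUMED modulo mapping."""
    line_index = byte_address // CACHE_LINE_BYTES
    return line_index % NUM_L1_SETS


def sets_touched(base: int, stride_bytes: int, num_loads: int) -> set[int]:
    """Collect the sets hit by num_loads accesses spaced stride_bytes apart."""
    return {cache_set(base + i * stride_bytes) for i in range(num_loads)}


# Four adjacent cache lines: each one lands in a different set.
contiguous = sets_touched(base=0, stride_bytes=128, num_loads=4)

# Four lines spaced 512 B apart: under this model they all share one set.
strided = sets_touched(base=0, stride_bytes=512, num_loads=4)

print(sorted(contiguous), sorted(strided))
```

Under this model, a workgroup whose loads cover four adjacent 128 B lines engages every set, while a 512 B stride keeps hitting the same one. A check like this is only useful for reasoning about access patterns before profiling on real hardware.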
