
[AMDGPU] select v_sat_pk from two i16 or v2i16 #121124

Open · wants to merge 12 commits into base: main
Conversation

Shoreshen
Contributor

Selecting v_sat_pk instruction based on bit operation.


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Dec 26, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: None (Shoreshen)

Changes

Selecting v_sat_pk instruction based on bit operation.


Full diff: https://github.com/llvm/llvm-project/pull/121124.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructions.td (+14)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+12)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td b/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
index 6a5065cd4a0e8f..0a7747b8736786 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
@@ -315,6 +315,20 @@ def srl_16 : PatFrag<
   (ops node:$src0), (srl_oneuse node:$src0, (i32 16))
 >;
 
+def clamp_s16_u8 : PatFrag<
+  (ops node:$src),
+  (i16 (AMDGPUsmed3 $src, (i16 0), (i16 255)))
+>;
+
+def conc_lo_u8_i16 : PatFrags<
+    (ops node:$src0, node:$src1),
+    [
+        (or
+            (and (i16 $src0), (i16 255)),
+            (shl (i16 $src1), (i16 8))
+        )
+    ]
+>;
 
 def hi_i16_elt : PatFrag<
   (ops node:$src0), (i16 (trunc (i32 (srl_16 node:$src0))))
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 789ce8815cf801..c0dd87fccfb7bb 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3298,6 +3298,18 @@ def : GCNPat <
   (v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0))))
 >;
 
+multiclass V_SAT_PK_Pat<Instruction inst> {
+    def: GCNPat<
+        (i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))),
+        (inst
+            (V_LSHL_OR_B32_e64 VGPR_32:$hi, (S_MOV_B32 (i32 16)),
+            (V_AND_B32_e64 VGPR_32:$lo, (S_MOV_B32 (i32 0xFFFF)))))
+    >;
+}
+
+let OtherPredicates = [NotHasTrue16BitInsts] in 
+defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>;
+
 // With multiple uses of the shift, this will duplicate the shift and
 // increase register pressure.
 def : GCNPat <

Contributor

@arsenm arsenm left a comment

Missing tests

(i16 (AMDGPUsmed3 $src, (i16 0), (i16 255)))
>;

def conc_lo_u8_i16 : PatFrags<
Contributor

This is just one Pat, Can use PatFrag not PatFrags and avoid the list

Contributor Author

I'll add another pattern in addition... sorry for the trouble, it's my first PR....

(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))),
(inst
(V_LSHL_OR_B32_e64 VGPR_32:$hi, (S_MOV_B32 (i32 16)),
(V_AND_B32_e64 VGPR_32:$lo, (S_MOV_B32 (i32 0xFFFF)))))
Contributor

@arsenm arsenm Dec 26, 2024

This feels like it should have had a dag combine form a nicer pattern to start with. Is this just recognizing the expanded form of the operation which should be legal in the first place?

Contributor Author

The combination of two patterns is because there are many ways of packing the lower 8 bits of two clamped integers into an i16.

I think new patterns can be added to conc_lo_u8_i16 as we encounter them. I'll update my PR...

%src1.shl = shl i16 %src1.clamp, 8
%or = or i16 %src0.and, %src1.shl
ret i16 %or
}
Contributor

Needs vector tests, negative tests, unsigned versions, mismatched opcodes.

Is this the same operation as TRUNCATE_SSAT_S? Should that be made legal, and does that enable a combine that matches the complex case?

Contributor Author

Hi, I'm thinking maybe not, because the instruction does saturate, truncate and pack (packing to v2u8), while TRUNCATE_SSAT_S does not pack, based on the description I read in the comment....

I'll add more tests based on the comments. Thanks a lot~~

Contributor

The pack part is just build_vector. You could treat the vector operation as legal and/or match the saturate truncate + build_vector

Contributor Author

@Shoreshen Shoreshen Dec 26, 2024

Hi, it seems like the AMDGPU backend doesn't support v2i8...

When returning v2i8, it is legalized into two i16 values, and because of that extra i16 one of the med3 instructions remains...

I think we would need to modify multiple places to support v2i8, so maybe just focus on bit operations for now

Contributor Author

@Shoreshen Shoreshen Dec 26, 2024

Hi, I think for the unsigned case, it should not be selected.

Say we have (umed i16:$src, 0, 255) and $src = -1 = 0xffff; then it should return 255.

However, I think v_sat_pk will return 0 instead, as the description says:

Given two 16-bit signed integer inputs, saturate each input over an 8-bit unsigned range, pack the resulting values into a 16-bit word and store the result into a vector register.

The only case where they are equivalent is (umin (smax i16:$src, 0), 255), but that would be optimized to (umed i16:$src, 0, 255)

So I think I may add a negative case (should not select v_sat_pk) for the unsigned situation~

@Shoreshen Shoreshen changed the title select v_sat_pk from 2 i16 select v_sat_pk from two i16 or v2i16 Dec 26, 2024
@Shoreshen Shoreshen requested a review from arsenm January 3, 2025 11:19
]
>;

def conc_lo_v2i16_i16 : PatFrags<
Contributor

@arsenm Isn't matching bitwise ops fragile?

Wouldn't it be better to make v2i8 legal, address codegen regressions (maybe by handling it in TargetLowering/CC stuff as well), then come back to this?

I'm afraid the patterns would constantly break whenever a new combine is added for the bitwise ops like shift/and/or/etc. if we try to match v2i8 ops that are lowered as bitwise ops

Contributor

Depends what you mean by "fragile" but for an optimization it doesn't require robustness. -100 for making v2i8 legal, that's a huge amount of effort for one operation.

Contributor

@Sisyph Sisyph left a comment

Can you please add tests for GFX12 and implementation for GFX11 and GFX12? The V_SAT_PK_U8_I16 instruction exists on those subtargets as V_SAT_PK_U8_I16_fake16 and V_SAT_PK_U8_I16_t16. V_SAT_PK_U8_I16_fake16 should work equivalently to gfx9 and should work now. A true16 version using V_SAT_PK_U8_I16_t16 may or may not be testable at the current time, and could make sense to defer.


def: GCNPatIgnoreCopies<
(i16 (conc_lo_v2i16_i16 (clamp_v2i16_u8 v2i16:$src))),
(inst VGPR_32:$src)
Contributor

Please use VGPRSrc_32 which is a RegisterOperand instead of VGPR_32 directly.

Contributor

Actually it's probably VRegSrc_32 not VGPRSrc_32.

@Shoreshen
Contributor Author

Can you please add tests for GFX12 and implementation for GFX11 and GFX12? The V_SAT_PK_U8_I16 instruction exists on those subtargets as V_SAT_PK_U8_I16_fake16 and V_SAT_PK_U8_I16_t16. V_SAT_PK_U8_I16_fake16 should work equivalently to gfx9 and should work now. A true16 version using V_SAT_PK_U8_I16_t16 may or may not be testable at the current time, and could make sense to defer.

Hi, it seems like the t16 instruction has more than one operand, so the patterns don't fit... I added patterns for the fake16 instructions, and also checks for GFX12

@@ -3298,6 +3301,32 @@ def : GCNPat <
(v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0))))
>;

multiclass V_SAT_PK_Pat<Instruction inst> {
def: GCNPatIgnoreCopies<
Contributor

Don't know why you specifically need to ignore copies here

Contributor Author

Hi, for GlobalISel there would be some COPY MIR instructions within the pattern; this ignores the COPYs so that the pattern can be matched

Contributor

There are more hazards here and I'd rather leave that for a separate patch

Contributor Author

Hi @arsenm , so should I do the following:

  1. do not ignore copy here
  2. change back for global isel cases

Or should I do:

  1. do not ignore copy here
  2. add the copy into the pattern

Thanks a lot, and sorry for the late reply

Contributor

Do not use GCNPatIgnoreCopies. Anything related to the globalisel handling should be done separately

(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))),
(inst
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
(V_AND_B32_e64 VRegSrc_32:$lo, (S_MOV_B32 (i32 0xFFFF)))))
Contributor

You shouldn't need to manually clamp values, this is the kind of pattern that should appear in the incoming DAG

%vec.trunc = trunc <2 x i16> %smed to <2 x i8>
%cast = bitcast <2 x i8> %vec.trunc to i16
ret i16 %cast
}
Contributor

Can you precommit the tests, or do a force push with the first commit only containing the tests + check lines generated without the changes? I would like to see the before/after

let OtherPredicates = [NotHasTrue16BitInsts] in {
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>;
} // End OtherPredicates = [NotHasTrue16BitInsts]
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_fake16_e64>;
Contributor

This should have let True16Predicate = UseFakeTrue16Insts
But the test changes look good, thanks!

@shiltian shiltian changed the title select v_sat_pk from two i16 or v2i16 [AMDGPU] select v_sat_pk from two i16 or v2i16 Jan 8, 2025
kzhuravl pushed a commit that referenced this pull request Jan 15, 2025
Preparation for #121124 

This PR provides tests added into
[PR](#121124) that add
selection patterns for instruction `v_sat_pk`, in order to specify the
change of the tests before and after the commit.

Pre-commit tests PR for #121124 : Add selection patterns for instruction
`v_sat_pk`
github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Jan 15, 2025
@Shoreshen Shoreshen requested review from Pierre-vh and Sisyph January 15, 2025 05:05
Contributor

@Pierre-vh Pierre-vh left a comment

This looks fine to me but I will leave final approval to @arsenm because it's mostly codegen


let OtherPredicates = [NotHasTrue16BitInsts] in {
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>;
} // End OtherPredicates = [NotHasTrue16BitInsts]
Contributor

add a blank line between the two for readability

Also the // End comment is not really needed if the whole thing is just 3 lines IMO, so I'd remove it, but that's really a small nit

Contributor

@arsenm arsenm left a comment

I think the approach of using these patterns should be revisited. These patterns are unwieldy and duplicate generic combiner logic.

We should be using the generic TRUNCATE_SAT nodes. The only issue is that we don't want to make v2i8 legal, but we do not have to. We can custom lower these nodes on the illegal v2i8 type, use a target specific node and bitcast from the packed-as-i16 form of the instruction to the v2i8

@@ -3298,6 +3301,32 @@ def : GCNPat <
(v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0))))
>;

multiclass V_SAT_PK_Pat<Instruction inst> {
def: GCNPatIgnoreCopies<
Contributor

Do not use GCNPatIgnoreCopies. Anything related to the globalisel handling should be done separately

def: GCNPatIgnoreCopies<
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))),
(inst
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
Contributor

You don't need to materialize this constant, you can just directly use the inline immediate

def: GCNPatIgnoreCopies<
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))),
(inst
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
Contributor

Same here, you can directly use the inline immediate in the output

def: GCNPatIgnoreCopies<
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))),
(inst
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
Contributor

Suggested change
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (i32 16),

Same here, you can directly use the inline immediate in the output (maybe can drop the type annotation too)

>;

def: GCNPatIgnoreCopies<
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))),
Contributor

I think there are missing hasOneUse checks throughout this

Contributor Author

Hi @arsenm, may I ask why? If there are other uses of (clamp_s16_u8 i16:$lo), the DAG is just not going to fold...

>;

def conc_lo_v2i16_i16 : PatFrags<
(ops node:$src),
Contributor

These cases are stretching what should be done in patterns, and there are too many of them in one patch. Can you keep this to one pattern per patch? It's much harder to review the test coverage otherwise.

These are all implementing the same thing, so we should be canonicalizing to this form so you don't have as many variants to deal with. This is also implementing the same patterns as is matched for the truncating stores, which we should be trying to reuse.

@Shoreshen
Contributor Author

Shoreshen commented Jan 15, 2025

I think the approach of using these patterns should be revisited. These patterns are unwieldy and duplicate generic combiner logic.

We should be using the generic TRUNCATE_SAT nodes. The only issue is that we don't want to make v2i8 legal, but we do not have to. We can custom lower these nodes on the illegal v2i8 type, use a target specific node and bitcast from the packed-as-i16 form of the instruction to the v2i8

Hi @arsenm , maybe we can use the TRUNCATE_SSAT_U node for the two-i16 case, but maybe not for the v2i16 case, since the result of the truncation is v2i8, which I think causes a compilation failure in the current backend.

Or maybe we can handle the vector case in another PR??

@arsenm
Contributor

arsenm commented Jan 15, 2025

Hi @arsenm , maybe we can use the TRUNCATE_SSAT_U node for the two-i16 case, but maybe not for the v2i16 case, since the result of the truncation is v2i8, which I think causes a compilation failure in the current backend.

The illegal type doesn't mean you have to throw away the whole thing and implement your own pattern matching. You can still custom lower the illegal type, you'll just need to process it into a wrapper node that does use a legal type plus a cast.

@Shoreshen
Contributor Author

Shoreshen commented Jan 15, 2025

Hi @arsenm , maybe we can use the TRUNCATE_SSAT_U node for the two-i16 case, but maybe not for the v2i16 case, since the result of the truncation is v2i8, which I think causes a compilation failure in the current backend.

The illegal type doesn't mean you have to throw away the whole thing and implement your own pattern matching. You can still custom lower the illegal type, you'll just need to process it into a wrapper node that does use a legal type plus a cast.

Hi @arsenm , maybe it is a little bit hard for me to understand.

At the pattern-matching stage of the DAG (which is DoInstructionSelection), we are not going to have any v2i8 type. These are all optimized out during the combine stage.

So I think unless we modify the pre-selection optimization passes, changing only the TD file will not make use of v2i8.

@arsenm
Contributor

arsenm commented Jan 15, 2025

At the pattern-matching stage of the DAG (which is DoInstructionSelection), we are not going to have any v2i8 type. These are all optimized out during the combine stage.

They are not optimized out, they are legalized out.

So I think unless we modify the pre-DAG optimization passes, changing only the TD file will not make use of v2i8.

You need to use setOperationAction(ISD::TRUNCATE_SSAT_S, MVT::v2i8, then make ReplaceNodeResults handle TRUNCATE_SSAT_S by replacing it with some AMDGPU specific operation, plus a bitcast to the original type. Theoretically you can select direct to the machine node, but it's probably better to introduce an AMDGPU variant of the node and select that

paulhuggett pushed a commit to paulhuggett/llvm-project that referenced this pull request Jan 16, 2025
DKLoehr pushed a commit to DKLoehr/llvm-project that referenced this pull request Jan 17, 2025