
[AMDGPU] select v_sat_pk from two i16 or v2i16 #121124

Open · wants to merge 12 commits into base: main
Conversation

Shoreshen
Contributor

Selecting v_sat_pk instruction based on bit operation.


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Dec 26, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: None (Shoreshen)

Changes

Selecting v_sat_pk instruction based on bit operation.


Full diff: https://github.com/llvm/llvm-project/pull/121124.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructions.td (+14)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+12)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td b/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
index 6a5065cd4a0e8f..0a7747b8736786 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
@@ -315,6 +315,20 @@ def srl_16 : PatFrag<
   (ops node:$src0), (srl_oneuse node:$src0, (i32 16))
 >;
 
+def clamp_s16_u8 : PatFrag<
+  (ops node:$src),
+  (i16 (AMDGPUsmed3 $src, (i16 0), (i16 255)))
+>;
+
+def conc_lo_u8_i16 : PatFrags<
+    (ops node:$src0, node:$src1),
+    [
+        (or
+            (and (i16 $src0), (i16 255)),
+            (shl (i16 $src1), (i16 8))
+        )
+    ]
+>;
 
 def hi_i16_elt : PatFrag<
   (ops node:$src0), (i16 (trunc (i32 (srl_16 node:$src0))))
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 789ce8815cf801..c0dd87fccfb7bb 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3298,6 +3298,18 @@ def : GCNPat <
   (v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0))))
 >;
 
+multiclass V_SAT_PK_Pat<Instruction inst> {
+    def: GCNPat<
+        (i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))),
+        (inst
+            (V_LSHL_OR_B32_e64 VGPR_32:$hi, (S_MOV_B32 (i32 16)),
+            (V_AND_B32_e64 VGPR_32:$lo, (S_MOV_B32 (i32 0xFFFF)))))
+    >;
+}
+
+let OtherPredicates = [NotHasTrue16BitInsts] in 
+defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>;
+
 // With multiple uses of the shift, this will duplicate the shift and
 // increase register pressure.
 def : GCNPat <

Contributor

@arsenm arsenm left a comment

Missing tests

(i16 (AMDGPUsmed3 $src, (i16 0), (i16 255)))
>;

def conc_lo_u8_i16 : PatFrags<
Contributor

This is just one Pat, Can use PatFrag not PatFrags and avoid the list

Contributor Author

I'll add another pattern in addition... sorry for the trouble, it's my first PR....

(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))),
(inst
(V_LSHL_OR_B32_e64 VGPR_32:$hi, (S_MOV_B32 (i32 16)),
(V_AND_B32_e64 VGPR_32:$lo, (S_MOV_B32 (i32 0xFFFF)))))
Contributor

@arsenm arsenm Dec 26, 2024

This feels like it should have had a dag combine form a nicer pattern to start with. Is this just recognizing the expanded form of the operation which should be legal in the first place?

Contributor Author

The combination of two patterns is because there are many ways of packing the lower 8 bits of two clamped integers into an i16.

I think new patterns can be added to conc_lo_u8_i16 as we encounter them. I'll update my PR...

%src1.shl = shl i16 %src1.clamp, 8
%or = or i16 %src0.and, %src1.shl
ret i16 %or
}
Contributor

Needs vector tests, negative tests, unsigned versions, mismatched opcodes.

Is this the same operation as TRUNCATE_SSAT_S? Should that be made legal, and does that enable a combine that matches the complex case?

Contributor Author

Hi, I'm thinking maybe not, because the instruction does saturate, truncate and pack (packing to v2u8), while TRUNCATE_SSAT_S does not pack, based on the description I read in the comment....

I'll add more tests based on the comments. Thanks a lot~~

Contributor

The pack part is just build_vector. You could treat the vector operation as legal and/or match the saturate truncate + build_vector

Contributor Author

@Shoreshen Shoreshen Dec 26, 2024

Hi, it seems like the AMDGPU backend doesn't support v2i8...

When returning v2i8, it is legalized into two i16 values, and because of that extra i16 one of the med3 instructions remains...

I think we would need to modify multiple places to support v2i8, so maybe just focus on bit operations for now

Contributor Author

@Shoreshen Shoreshen Dec 26, 2024

Hi, I think for the unsigned case, it should not be selected.

Say we have (umed i16:$src, 0, 255) and $src = -1 = 0xffff; then it should return 255.

However, I think v_sat_pk will return 0 instead, as the description says:

Given two 16-bit signed integer inputs, saturate each input over an 8-bit unsigned range, pack the resulting values into a 16-bit word and store the result into a vector register.

The only case where they are equivalent is (umin (smax i16:$src, 0), 255), but that would be optimized to (umed i16:$src, 0, 255)

So I think I may add a negative case (should not select v_sat_pk) for the unsigned situation~

@Shoreshen Shoreshen changed the title select v_sat_pk from 2 i16 select v_sat_pk from two i16 or v2i16 Dec 26, 2024
@Shoreshen Shoreshen requested a review from arsenm January 3, 2025 11:19
]
>;

def conc_lo_v2i16_i16 : PatFrags<
Contributor

@arsenm Isn't matching bitwise ops fragile?

Wouldn't it be better to make v2i8 legal, address codegen regressions (maybe by handling it in TargetLowering/CC stuff as well), then come back to this?

I'm afraid the patterns would constantly break whenever a new combine is added for the bitwise ops like shift/and/or/etc. if we try to match v2i8 ops that are lowered as bitwise ops

Contributor

Depends what you mean by "fragile" but for an optimization it doesn't require robustness. -100 for making v2i8 legal, that's a huge amount of effort for one operation.

Contributor

@Sisyph Sisyph left a comment

Can you please add tests for GFX12 and implementation for GFX11 and GFX12? The V_SAT_PK_U8_I16 instruction exists on those subtargets as V_SAT_PK_U8_I16_fake16 and V_SAT_PK_U8_I16_t16. V_SAT_PK_U8_I16_fake16 should work equivalently to gfx9 and should work now. A true16 version using V_SAT_PK_U8_I16_t16 may or may not be testable at the current time, and could make sense to defer.


def: GCNPatIgnoreCopies<
(i16 (conc_lo_v2i16_i16 (clamp_v2i16_u8 v2i16:$src))),
(inst VGPR_32:$src)
Contributor

Please use VGPRSrc_32 which is a RegisterOperand instead of VGPR_32 directly.

Contributor

Actually it's probably VRegSrc_32 not VGPRSrc_32.

@Shoreshen
Contributor Author

Can you please add tests for GFX12 and implementation for GFX11 and GFX12? The V_SAT_PK_U8_I16 instruction exists on those subtargets as V_SAT_PK_U8_I16_fake16 and V_SAT_PK_U8_I16_t16. V_SAT_PK_U8_I16_fake16 should work equivalently to gfx9 and should work now. A true16 version using V_SAT_PK_U8_I16_t16 may or may not be testable at the current time, and could make sense to defer.

Hi, it seems like the t16 instruction has more than one operand, so the patterns don't fit... I added patterns for the fake16 instructions, and also checks for GFX12

@@ -3298,6 +3301,32 @@ def : GCNPat <
(v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0))))
>;

multiclass V_SAT_PK_Pat<Instruction inst> {
def: GCNPatIgnoreCopies<
Contributor

Don't know why you specifically need to ignore copies here

Contributor Author

Hi, for GlobalISel there would be some COPY MIR instructions within the pattern; this ignores the COPYs so that the pattern can be matched

Contributor

There are more hazards here and I'd rather leave that for a separate patch

Contributor Author

Hi @arsenm , so should I do the following:

  1. do not ignore copy here
  2. change back for global isel cases

Or should I do:

  1. do not ignore copy here
  2. add the copy into the pattern

Thanks a lot, and sorry for the late reply

Contributor

Do not use GCNPatIgnoreCopies. Anything related to the globalisel handling should be done separately

(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))),
(inst
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
(V_AND_B32_e64 VRegSrc_32:$lo, (S_MOV_B32 (i32 0xFFFF)))))
Contributor

You shouldn't need to manually clamp values, this is the kind of pattern that should appear in the incoming DAG

%vec.trunc = trunc <2 x i16> %smed to <2 x i8>
%cast = bitcast <2 x i8> %vec.trunc to i16
ret i16 %cast
}
Contributor

Can you precommit the tests, or do a force push with the first commit only containing the tests + check lines generated without the changes? I would like to see the before/after

let OtherPredicates = [NotHasTrue16BitInsts] in {
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>;
} // End OtherPredicates = [NotHasTrue16BitInsts]
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_fake16_e64>;
Contributor

This should have let True16Predicate = UseFakeTrue16Insts
But the test changes look good, thanks!

@shiltian shiltian changed the title select v_sat_pk from two i16 or v2i16 [AMDGPU] select v_sat_pk from two i16 or v2i16 Jan 8, 2025
kzhuravl pushed a commit that referenced this pull request Jan 15, 2025
Preparation for #121124 

This PR provides tests added into
[PR](#121124) that add
selection patterns for instruction `v_sat_pk`, in order to specify the
change of the tests before and after the commit.

Pre-commit tests PR for #121124 : Add selection patterns for instruction
`v_sat_pk`
github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Jan 15, 2025
@Shoreshen Shoreshen requested review from Pierre-vh and Sisyph January 15, 2025 05:05
Contributor

@Pierre-vh Pierre-vh left a comment

This looks fine to me but I will leave final approval to @arsenm because it's mostly codegen


let OtherPredicates = [NotHasTrue16BitInsts] in {
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>;
} // End OtherPredicates = [NotHasTrue16BitInsts]
Contributor

add a blank line between the two for readability

Also the // End comment is not really needed if the whole thing is just 3 lines IMO, so I'd remove it, but that's really a small nit

Contributor

@arsenm arsenm left a comment

I think the approach of using these patterns should be revisited. These patterns are unwieldy and duplicate generic combiner logic.

We should be using the generic TRUNCATE_SAT nodes. The only issue is that we don't want to make v2i8 legal, but we do not have to. We can custom lower these nodes on the illegal v2i8 type, use a target specific node and bitcast from the packed-as-i16 form of the instruction to the v2i8

@@ -3298,6 +3301,32 @@ def : GCNPat <
(v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0))))
>;

multiclass V_SAT_PK_Pat<Instruction inst> {
def: GCNPatIgnoreCopies<
Contributor

Do not use GCNPatIgnoreCopies. Anything related to the globalisel handling should be done separately

def: GCNPatIgnoreCopies<
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))),
(inst
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
Contributor

You don't need to materialize this constant, you can just directly use the inline immediate

def: GCNPatIgnoreCopies<
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))),
(inst
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
Contributor

Same here, you can directly use the inline immediate in the output

def: GCNPatIgnoreCopies<
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))),
(inst
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
Contributor

Suggested change
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)),
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (i32 16),

Same here, you can directly use the inline immediate in the output (maybe can drop the type annotation too)

>;

def: GCNPatIgnoreCopies<
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))),
Contributor

I think there are missing hasOneUse checks throughout this

Contributor Author

Hi @arsenm, may I ask why? If there are other uses of (clamp_s16_u8 i16:$lo), the DAG is just not going to fold...

>;

def conc_lo_v2i16_i16 : PatFrags<
(ops node:$src),
Contributor

These cases are stretching what should be done in patterns, and there are too many of them in one patch. Can you keep this to one pattern per patch? It's much harder to review the test coverage otherwise.

These are all implementing the same thing, so we should be canonicalizing to this form so you don't have as many variants to deal with. This is also implementing the same patterns as is matched for the truncating stores, which we should be trying to reuse.

@Shoreshen
Contributor Author

Shoreshen commented Jan 15, 2025

I think the approach of using these patterns should be revisited. These patterns are unwieldy and duplicate generic combiner logic.

We should be using the generic TRUNCATE_SAT nodes. The only issue is that we don't want to make v2i8 legal, but we do not have to. We can custom lower these nodes on the illegal v2i8 type, use a target specific node and bitcast from the packed-as-i16 form of the instruction to the v2i8

Hi @arsenm , maybe we can use the TRUNCATE_SSAT_U node for the two-i16 case, but maybe not for the v2i16 case, since the result of the truncation is v2i8, which I think causes a compilation failure in the current backend.

Or maybe we can handle the vector case in another PR??

@arsenm
Contributor

arsenm commented Jan 15, 2025

Hi @arsenm , maybe we can use the TRUNCATE_SSAT_U node for the two-i16 case, but maybe not for the v2i16 case, since the result of the truncation is v2i8, which I think causes a compilation failure in the current backend.

The illegal type doesn't mean you have to throw away the whole thing and implement your own pattern matching. You can still custom lower the illegal type, you'll just need to process it into a wrapper node that does use a legal type plus a cast.

@Shoreshen
Contributor Author

Shoreshen commented Jan 15, 2025

Hi @arsenm , maybe we can use the TRUNCATE_SSAT_U node for the two-i16 case, but maybe not for the v2i16 case, since the result of the truncation is v2i8, which I think causes a compilation failure in the current backend.

The illegal type doesn't mean you have to throw away the whole thing and implement your own pattern matching. You can still custom lower the illegal type, you'll just need to process it into a wrapper node that does use a legal type plus a cast.

Hi @arsenm , maybe it is a little bit hard for me to understand.

At the pattern-matching stage of the DAG (which is DoInstructionSelection), we are not going to have any v2i8 type. These are all optimized out during the combine stage.

So I think unless we modify the pre-selection optimization passes, changing only the TD file will not make use of v2i8.

@arsenm
Contributor

arsenm commented Jan 15, 2025

At the pattern-matching stage of the DAG (which is DoInstructionSelection), we are not going to have any v2i8 type. These are all optimized out during the combine stage.

They are not optimized out, they are legalized out.

So I think unless we modify the pre-DAG optimization passes, changing only the TD file will not make use of v2i8.

You need to use setOperationAction(ISD::TRUNCATE_SSAT_S, MVT::v2i8, then make ReplaceNodeResults handle TRUNCATE_SSAT_S by replacing it with some AMDGPU specific operation, plus a bitcast to the original type. Theoretically you can select direct to the machine node, but it's probably better to introduce an AMDGPU variant of the node and select that

paulhuggett pushed a commit to paulhuggett/llvm-project that referenced this pull request Jan 16, 2025
DKLoehr pushed a commit to DKLoehr/llvm-project that referenced this pull request Jan 17, 2025