-
Notifications
You must be signed in to change notification settings - Fork 12.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU] select v_sat_pk from two i16 or v2i16 #121124
base: main
Are you sure you want to change the base?
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-backend-amdgpu Author: None (Shoreshen) ChangesSelecting v_sat_pk instruction based on bit operation. Full diff: https://github.com/llvm/llvm-project/pull/121124.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td b/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
index 6a5065cd4a0e8f..0a7747b8736786 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
@@ -315,6 +315,20 @@ def srl_16 : PatFrag<
(ops node:$src0), (srl_oneuse node:$src0, (i32 16))
>;
+def clamp_s16_u8 : PatFrag<
+ (ops node:$src),
+ (i16 (AMDGPUsmed3 $src, (i16 0), (i16 255)))
+>;
+
+def conc_lo_u8_i16 : PatFrags<
+ (ops node:$src0, node:$src1),
+ [
+ (or
+ (and (i16 $src0), (i16 255)),
+ (shl (i16 $src1), (i16 8))
+ )
+ ]
+>;
def hi_i16_elt : PatFrag<
(ops node:$src0), (i16 (trunc (i32 (srl_16 node:$src0))))
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 789ce8815cf801..c0dd87fccfb7bb 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3298,6 +3298,18 @@ def : GCNPat <
(v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0))))
>;
+multiclass V_SAT_PK_Pat<Instruction inst> {
+ def: GCNPat<
+ (i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))),
+ (inst
+ (V_LSHL_OR_B32_e64 VGPR_32:$hi, (S_MOV_B32 (i32 16)),
+ (V_AND_B32_e64 VGPR_32:$lo, (S_MOV_B32 (i32 0xFFFF)))))
+ >;
+}
+
+let OtherPredicates = [NotHasTrue16BitInsts] in
+defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>;
+
// With multiple uses of the shift, this will duplicate the shift and
// increase register pressure.
def : GCNPat <
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing tests
(i16 (AMDGPUsmed3 $src, (i16 0), (i16 255))) | ||
>; | ||
|
||
def conc_lo_u8_i16 : PatFrags< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just one Pat, Can use PatFrag not PatFrags and avoid the list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add another pattern of addition... sorry for the trouble its my first PR....
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))), | ||
(inst | ||
(V_LSHL_OR_B32_e64 VGPR_32:$hi, (S_MOV_B32 (i32 16)), | ||
(V_AND_B32_e64 VGPR_32:$lo, (S_MOV_B32 (i32 0xFFFF))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it should have had a dag combine form a nicer pattern to start with. Is this just recognizing the expanded form of the operation which should be legal in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The combination of 2 patterns is because there are many ways of taking lower 8 bit of 2 clamped integer into i16
I think it can be continuously added into conc_lo_u8_i16
if we meet new patterns. I'll update my PR...
%src1.shl = shl i16 %src1.clamp, 8 | ||
%or = or i16 %src0.and, %src1.shl | ||
ret i16 %or | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs vector tests, negative tests, unsigned versions, mismatched opcodes.
Is this the same operation as TRUNCATE_SSAT_S? Should that be made legal, and does that enable a combine that matches the complex case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi I'm thinking of maybe not, because the instruction does SAT, truncate and pack (packing to v2u8) while TRUNCATE_SSAT_S
does not do pack based on the description I read on the comment....
I'll add more tests based on comment. Thanks a lot~~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pack part is just build_vector. You could treat the vector operation as legal and/or match the saturate truncate + build_vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi it seems like AMDGPU backend doesn't support v2i8........
by returning v2i8, it will return two i16, and due to the existing of the other i16, one of the med instruction will remain.....
I think we need to modify multiple places to support v2i8, so maybe just focusing bit operations for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I think for the unsigned case, it should not be selected.
Say we have: (umed i16:$src, 0, 255)
and the $src = -1=0xffff
then it should return 255
However, I think the v_sat_pk
will return 0 instead as the description says:
Given two 16-bit signed integer inputs, saturate each input over an 8-bit unsigned range, pack the resulting values into a 16-bit word and store the result into a vector register.
The only case that it is equivalent should be (umin (smax i16:$src, 0), 255)
, but this would be optimized to (umed i16:$src, 0, 255)
So I think I may add a negative case (should not select v_sat_pk
) for unsigned situation~
] | ||
>; | ||
|
||
def conc_lo_v2i16_i16 : PatFrags< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arsenm Isn't matching bitwise ops fragile?
Wouldn't it be better to make v2i8 legal, address codegen regressions (maybe by handling it in TargetLowering/CC stuff as well), then come back to this?
I'm afraid the patterns would constantly break whenever a new combine is added for the bitwise ops like shift/and/or/etc. if we try to match v2i8 ops that are lowered as bitwise ops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends what you mean by "fragile" but for an optimization it doesn't require robustness. -100 for making v2i8 legal, that's a huge amount of effort for one operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add tests for GFX12 and implementation for GFX11 and GFX12? The V_SAT_PK_U8_I16 instruction exists on those subtargets as V_SAT_PK_U8_I16_fake16 and V_SAT_PK_U8_I16_t16. V_SAT_PK_U8_I16_fake16 should work equivalently to gfx9 and should work now. A true16 version using V_SAT_PK_U8_I16_t16 may or may not be testable at the current time, and could make sense to defer.
|
||
def: GCNPatIgnoreCopies< | ||
(i16 (conc_lo_v2i16_i16 (clamp_v2i16_u8 v2i16:$src))), | ||
(inst VGPR_32:$src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use VGPRSrc_32 which is a RegisterOperand instead of VGPR_32 directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it's probably VRegSrc_32 not VGPRSrc_32.
Hi, it seems like the t16 instruction has more than 1 operand, so the patterns doesn't fit..... I added for the fake16 instructions, and also checks for GFX12 |
@@ -3298,6 +3301,32 @@ def : GCNPat < | |||
(v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0)))) | |||
>; | |||
|
|||
multiclass V_SAT_PK_Pat<Instruction inst> { | |||
def: GCNPatIgnoreCopies< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know why you specifically need to ignore copies here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi for global isel, there would be some COPY MIR within the pattern, this ignored the COPY MIR so that it can be matched
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are more hazards here and I'd rather leave that for a separate patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @arsenm , so should I do the following:
- do not ignore copy here
- change back for global isel cases
Or should I do:
- do not ignore copy here
- add the copy into the pattern
Thanks a lot, and sorry for the late reply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use GCNPatIgnoreCopies. Anything related to the globalisel handling should be done separately
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))), | ||
(inst | ||
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)), | ||
(V_AND_B32_e64 VRegSrc_32:$lo, (S_MOV_B32 (i32 0xFFFF))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need to manually clamp values, this is the kind of pattern that should appear in the incoming DAG
%vec.trunc = trunc <2 x i16> %smed to <2 x i8> | ||
%cast = bitcast <2 x i8> %vec.trunc to i16 | ||
ret i16 %cast | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you precommit the tests, or do a force push with the first commit only containing the tests + check lines generated without the changes? I would like to see the before/after
let OtherPredicates = [NotHasTrue16BitInsts] in { | ||
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>; | ||
} // End OtherPredicates = [NotHasTrue16BitInsts] | ||
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_fake16_e64>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have let True16Predicate = UseFakeTrue16Insts
But the test changes look good, thanks!
de0de84
to
e4adc8f
Compare
Preparation for #121124 This PR provides tests added into [PR](llvm/llvm-project#121124) that add selection patterns for instruction `v_sat_pk`, in order to specify the change of the tests before and after the commit. Pre-commit tests PR for #121124 : Add selection patterns for instruction `v_sat_pk`
e37d342
to
b7013f3
Compare
b7013f3
to
0e4c4ed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me but I will leave final approval to @arsenm because it's mostly codegen
|
||
let OtherPredicates = [NotHasTrue16BitInsts] in { | ||
defm : V_SAT_PK_Pat<V_SAT_PK_U8_I16_e64>; | ||
} // End OtherPredicates = [NotHasTrue16BitInsts] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a blank line between the two for readability
Also the // End
comment is not really needed if the whole thing is just 3 lines IMO, so I'd remove it, but that's really a small nit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the approach of using these patterns should be revisited. These patterns are unwieldy and duplicate generic combiner logic.
We should be using the generic TRUNCATE_SAT nodes. The only issue is that we don't want to make v2i8 legal, but we do not have to. We can custom lower these nodes on the illegal v2i8 type, use a target specific node and bitcast from the packed-as-i16 form of the instruction to the v2i8
@@ -3298,6 +3301,32 @@ def : GCNPat < | |||
(v2i16 (V_LSHL_OR_B32_e64 $src1, (i32 16), (i32 (V_AND_B32_e64 (i32 (V_MOV_B32_e32 (i32 0xffff))), $src0)))) | |||
>; | |||
|
|||
multiclass V_SAT_PK_Pat<Instruction inst> { | |||
def: GCNPatIgnoreCopies< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use GCNPatIgnoreCopies. Anything related to the globalisel handling should be done separately
def: GCNPatIgnoreCopies< | ||
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (clamp_s16_u8 i16:$hi))), | ||
(inst | ||
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to materialize this constant, you can just directly use the inline immediate
def: GCNPatIgnoreCopies< | ||
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))), | ||
(inst | ||
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, you can directly use the inline immediate in the output
def: GCNPatIgnoreCopies< | ||
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))), | ||
(inst | ||
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (S_MOV_B32 (i32 16)), | |
(V_LSHL_OR_B32_e64 VRegSrc_32:$hi, (i32 16), |
Same here, you can directly use the inline immediate in the output (maybe can drop the type annotation too)
>; | ||
|
||
def: GCNPatIgnoreCopies< | ||
(i16 (conc_lo_u8_i16 (clamp_s16_u8 i16:$lo), (smax i16:$hi, (i16 0)))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tihnk there are missing hasOneUse checks throughout this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @arsenm , may I ask why since if there are other uses of the (clamp_s16_u8 i16:$lo)
, the DAG is just not going to fold...
>; | ||
|
||
def conc_lo_v2i16_i16 : PatFrags< | ||
(ops node:$src), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These cases are stretching what should be done in patterns, and there are too many of them in one patch. Can you keep this to one pattern per patch, it's much harder to review the test coverage.
These are all implementing the same thing, so we should be canonicalizing to this form so you don't have as many variants to deal with. This is also implementing the same patterns as is matched for the truncating stores, which we should be trying to reuse.
Hi @arsenm , maybe we can use TRUNCATE_SSAT_U node for the 2 of i16 case, but the v2i16 case maybe not, since the result of the truncation is v2i8, which causes compilation fail for the current backend I think. Or maybe we can handle the vector case in another PR?? |
The illegal type doesn't mean you have to throw away the whole thing and implement your own pattern matching. You can still custom lower the illegal type, you'll just need to process it into a wrapper node that does use a legal type plus a cast. |
Hi @arsenm , maybe it is little bit hard for me to understand. At the pattern matching stage of DAG (which is So I think unless we modify the pre-selection optimization passes, otherwise only changing the TD file will not make use of v2i8. |
They are not optimized out, they are legalized out.
You need to use setOperationAction(ISD::TRUNCATE_SSAT_S, MVT::v2i8, then make ReplaceNodeResults handle TRUNCATE_SSAT_S by replacing it with some AMDGPU specific operation, plus a bitcast to the original type. Theoretically you can select direct to the machine node, but it's probably better to introduce an AMDGPU variant of the node and select that |
Preparation for llvm#121124 This PR provides tests added into [PR](llvm#121124) that add selection patterns for instruction `v_sat_pk`, in order to specify the change of the tests before and after the commit. Pre-commit tests PR for llvm#121124 : Add selection patterns for instruction `v_sat_pk`
Preparation for llvm#121124 This PR provides tests added into [PR](llvm#121124) that add selection patterns for instruction `v_sat_pk`, in order to specify the change of the tests before and after the commit. Pre-commit tests PR for llvm#121124 : Add selection patterns for instruction `v_sat_pk`
Selecting v_sat_pk instruction based on bit operation.