
RISC-V EVL tail folding #123069

Open
2 of 16 tasks
lukel97 opened this issue Jan 15, 2025 · 13 comments

@lukel97
Contributor

lukel97 commented Jan 15, 2025

On the spacemit-x60, GCC 14 is ~24% faster on the 525.x264_r SPEC CPU 2017 benchmark than a recent build of Clang.

A big chunk of this difference is due to GCC tail folding its loops with VL, whereas LLVM doesn't by default.

Because LLVM doesn't tail fold its loops, it generates both a vectorized body and a scalar epilogue. Executing the vectorized body requires a minimum trip count >= VF; otherwise only the scalar epilogue runs.

On 525.x264_r, some very hot functions (e.g. get_ref) never meet the minimum trip count, so the vector code is never run. Tail folding avoids this issue and lets us run the vectorized body every time.

There are likely other performance benefits to be had with tail folding with VL, so it seems worthwhile exploring.

"EVL tail folding" (LLVM's vector-predication terminology for VL tail folding), can be enabled from Clang with -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -mllvm -force-tail-folding-style=data-with-evl. It initially landed in #76172 but it isn't enabled by default yet due to support for it not being fully complete, both in the loop vectorizer and elsewhere in the RISC-V backend.

This issue aims to track what work is needed across the LLVM project to bring it up to a stable state, at which point we can evaluate its performance to see if it should be enabled by default.

This is not a complete list and only contains the tasks that I've noticed so far. Please feel free to edit and add to it!
I presume we will find more things that need to be addressed as time goes on.


  • Set up CI infrastructure for -force-tail-folding-style=data-with-evl
  • Address known miscompiles
    • #122461
  • Fix cases that abort vectorization entirely
    • On SPEC CPU 2017 as of 02403f4, EVL tail folding vectorizes 57% fewer loops than were previously vectorized. This is likely due to vectorization aborting when it encounters unimplemented cases:
    • VPWidenIntOrFpInductionRecipe
      • #115274
      • #118638
    • VPWidenPointerInductionRecipe
    • Fixed-length VFs: There are cases where scalable vectorization isn’t possible and we currently don't allow fixed-length VFs, so presumably nothing gets vectorized in this case.
    • Cases where the RISC-V cost model may have become unprofitable with EVL tail folding
  • Implement support for EVL tail folding in other parts of the loop vectorizer
    • Fixed-order recurrences (will fall back to DataWithoutLaneMask style after #122458)
    • #100755
    • #114205 (see note on RISCVVLOptimizer below)
  • Extend RISC-V VP intrinsic codegen
    • Segmented accesses #120490
    • Strided accesses in RISCVGatherScatterLowering
      • #122244
      • #122232
      • Eventually, the loop vectorizer should be taught to emit vp.strided.{load,store} intrinsics directly (cc @nikolaypanchenko)
    • RISCVVLOptimizer
      • The VL optimizer may have made non-trapping VP intrinsics redundant. We should evaluate whether we still need to transform intrinsics/calls/binops to VP intrinsics in the loop vectorizer
    • #91796

@lukel97
Contributor Author

lukel97 commented Jan 15, 2025

cc'ing people I'm aware who have worked in this area, feel free to tag anyone I've missed: @alexey-bataev @ElvisWang123 @nikolaypanchenko @michaelmaitland @mshockwave @Mel-Chen @arcbbb @LiqinWeng @fhahn @preames @wangpc-pp

@preames
Collaborator

preames commented Jan 15, 2025

For some additional context on functional correctness.

Until this week we had a compile-time crash when trying to cross-compile spec2017; this has now been fixed. I have successfully cross-built spec2017 intrate with EVL vectorization enabled for both sifive-x280 and spacemit-x60. @mshockwave tells me he successfully built both spec2017 and spec2006. I believe I saw mention of someone successfully cross-building llvm-test-suite in this configuration, but don't currently remember who, so this should be confirmed. I have also cross-built my simple-vector-riscv tests. I plan to (but have not yet) cross-build polybench, sqlite3, and TSVC. I haven't seen anyone report the status of a stage2 clang build; we should probably also do that.

In terms of runtime failures, we had one identified and fixed earlier this week by @lukel97. @mshockwave told me yesterday that he was still observing a miscompile in several spec2006 and spec2017 workloads. I don't believe these have been triaged yet. @mshockwave Could you share details on your findings?

@lukel97
Contributor Author

lukel97 commented Jan 15, 2025

I haven't seen anyone report the status of a stage2 clang build, we should probably also do that.

I forgot to mention that there is a buildbot maintained by @asb for that, I've added it to the list: https://lab.llvm.org/staging/#/builders/16/builds/668

@mshockwave
Member

@mshockwave Could you share details on your findings?

Of course, though I'm still triaging. It's likely that some of these are only indirectly caused by the vectorizer:

SPEC2006:

  • 471.omnetpp (incorrect result)
  • 445.gobmk (runtime error - benchmark's own assertion failure)

SPEC2017:

  • 500.perlbench_r (incorrect result)
  • 520.omnetpp_r (runtime error)

I'll update the issues links here once I have some leads into these problems.

@lukel97
Contributor Author

lukel97 commented Jan 16, 2025

All SPEC CPU 2017 rate tests are now passing on https://lnt.lukelau.me/db_default/v4/nts/107, including 500.perlbench_r and 520.omnetpp_r. These are only running the train dataset though, @mshockwave are you seeing the errors with ref?

It also looks like this confirms the performance improvement on 525.x264_r, it's ~17.5% faster vs no tail folding:
https://lnt.lukelau.me/db_default/v4/nts/profile/13/107/106

@lukel97
Contributor Author

lukel97 commented Jan 16, 2025

I've been told that the spacemit-x60 doesn't implement the [ceil(AVL/2),VLMAX] if VLMAX < AVL < 2*VLMAX behaviour, so it's likely that that's the reason why SPEC isn't failing here.

@alexey-bataev
Member

I've been told that the spacemit-x60 doesn't implement the [ceil(AVL/2),VLMAX] if VLMAX < AVL < 2*VLMAX behaviour, so it's likely that that's the reason why SPEC isn't failing here.

Many targets do not implement it; some tests still fail on qemu for first-order recurrences because of this. Mel has a patch to disable it unless it is fully supported.

@Mel-Chen
Contributor

Thanks for sorting it out so clearly!

I added #123201 under Implement support for EVL tail folding in other parts of the loop vectorizer.
My plan is to support fixed-order recurrences first, and then interleaved accesses.

Segmented accesses: #120490 for power-of-two cases
@mshockwave will add llvm.vector.(de)interleave3/5/7 so that we can handle codegen for all factors. We will need to teach the vectorizer to synthesize a factor of 6 from factors of 2 and 3, though.

Could we use a vector array to implement interleave/deinterleave?

```llvm
declare [4 x <vscale x 2 x i32>] @llvm.vector.deinterleave.<overloaded str>(<vscale x 8 x i32> %vec)
declare <vscale x 8 x i32> @llvm.vector.interleave.<overloaded str>([4 x <vscale x 2 x i32>] %vec.array)
```

@preames
Collaborator

preames commented Jan 16, 2025

This got mentioned at today's RISCV sync-up, but it's an important bit of context, so repeating it here.

As of today, we do not have any known compile-time failures (crashes or assertions) with EVL vectorization. We are also free of miscompiles (in at least spec2017) on the spacemit-x60. The remaining miscompiles are believed to be specific to the special case in the specification which allows VL != VLMAX on the second-to-last iteration (i.e. the "[ceil(AVL/2),VLMAX] if VLMAX < AVL < 2*VLMAX" case Alexey mentioned above).

Let's keep this quality bar, and try to get CI/testing flows in place to ensure we don't backslide here while working on the remaining miscompiles specific to the special case described above.

@mshockwave
Member

mshockwave commented Jan 16, 2025

All SPEC CPU 2017 rate tests are now passing on https://lnt.lukelau.me/db_default/v4/nts/107, including 500.perlbench_r and 520.omnetpp_r. These are only running the train dataset though, @mshockwave are you seeing the errors with ref?

Regarding SPEC2017, it's still failing on 500.perlbench_r and 520.omnetpp_r on my side with some cryptic error messages, which makes me think it's due to an environment issue in my test harness. So please don't let me block you on SPEC2017.

For SPEC2006 INT and FP, the only failing case is 445.gobmk, but it's not related to the EVL vectorizer (or even the vectorizer): #123151

@lukel97
Contributor Author

lukel97 commented Jan 17, 2025

All SPEC CPU 2017 rate tests are now passing on https://lnt.lukelau.me/db_default/v4/nts/107, including 500.perlbench_r and 520.omnetpp_r. These are only running the train dataset though, @mshockwave are you seeing the errors with ref?

Regarding SPEC2017, it's still failing on 500.perlbench_r and 520.omnetpp_r on my side with some cryptic error messages, which makes me think it's due to an environment issue in my test harness. So please don't let me block you on SPEC2017.

I think it may also be due to the hardware I was running it on (spacemit-x60) not implementing the [ceil(AVL/2),VLMAX] if VLMAX < AVL < 2*VLMAX behaviour, so it's very likely that the errors you're seeing are genuine miscompiles that are missed on my setup. I'll try and see if I can reproduce these on qemu.

@kito-cheng
Member

I think it may also be due to the hardware I was running it on (spacemit-x60) not implementing the [ceil(AVL/2),VLMAX] if VLMAX < AVL < 2*VLMAX behaviour, so it's very likely that the errors you're seeing are genuine miscompiles that are missed on my setup. I'll try and see if I can reproduce these on qemu.

Just a reminder that this behaviour is not enabled by default in QEMU: you need to set rvv_vl_half_avl=true as a CPU option, and it requires at least QEMU 9.2.0.
