RISC-V EVL tail folding #123069
@llvm/issue-subscribers-backend-risc-v Author: Luke Lau (lukel97)
On the spacemit-x60, [GCC 14 is ~24% faster on the 525.x264_r SPEC CPU 2017 benchmark than a recent build of Clang](https://lnt.lukelau.me/db_default/v4/nts/profile/13/18/15).
A big chunk of this difference is due to GCC tail folding its loops with VL, whereas LLVM doesn't by default. Because LLVM doesn't tail fold its loops, it generates both a vectorized body and a scalar epilogue. A minimum trip count >= VF is required to execute the vectorized body; otherwise only the scalar epilogue can run. On 525.x264_r, there are some very hot functions (e.g. `get_ref`) which never meet the minimum trip count, so the vector code is never run. Tail folding avoids this issue and allows us to run the vectorized body every time. There are likely other performance benefits to be had with tail folding with VL, so it seems worthwhile exploring.

"EVL tail folding" (LLVM's vector-predication terminology for VL tail folding) can be enabled from Clang with `-mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -mllvm -force-tail-folding-style=data-with-evl`. It initially landed in #76172 but isn't enabled by default yet because support for it isn't fully complete, both in the loop vectorizer and elsewhere in the RISC-V backend.

This issue aims to track what work is needed across the LLVM project to bring it up to a stable state, at which point we can evaluate its performance to see if it should be enabled by default. It's not a complete list and only contains the tasks that I've noticed so far. Please feel free to edit and add to it!
cc'ing people I'm aware of who have worked in this area; feel free to tag anyone I've missed: @alexey-bataev @ElvisWang123 @nikolaypanchenko @michaelmaitland @mshockwave @Mel-Chen @arcbbb @LiqinWeng @fhahn @preames @wangpc-pp
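To make the trip-count problem described above concrete, here is a toy Python model contrasting the two strategies (the function names and parameters are purely illustrative, not LLVM internals): a fixed-VF body plus scalar epilogue executes zero vector iterations when the trip count is below VF, while an EVL-style loop sets vl to min(VLMAX, remaining) each iteration, so every element runs through vector code.

```python
# Toy model of the two vectorization strategies (illustrative only;
# names like vf/vlmax are assumptions, not LLVM terminology).

def body_plus_epilogue(n, vf):
    """Fixed-VF vector body + scalar epilogue."""
    vector_iters = n // vf   # iterations that execute the vector body
    scalar_iters = n % vf    # leftover elements handled by scalar code
    return vector_iters, scalar_iters

def evl_tail_folded(n, vlmax):
    """EVL tail folding: every element is processed by vector code."""
    iters = 0
    remaining = n
    while remaining > 0:
        # What a vsetvli-style instruction would return, ignoring the
        # VLMAX < AVL < 2*VLMAX special case discussed later in the thread.
        vl = min(vlmax, remaining)
        remaining -= vl
        iters += 1
    return iters

# A trip count below VF never reaches the vector body...
print(body_plus_epilogue(7, 8))  # (0, 7): everything runs scalar
# ...but the tail-folded loop handles it in a single vector iteration.
print(evl_tail_folded(7, 8))     # 1
```

This is the shape of the win on `get_ref`-like hot loops: with tail folding there is no minimum trip count below which vectorization is wasted.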
For some additional context on functional correctness: up until this week we had a compile-time crash when trying to cross-compile spec2017. This has now been fixed. I have successfully cross-built spec2017 intrate with EVL vectorization enabled for both sifive-x280 and spacemit-x60. @mshockwave tells me he successfully built both spec2017 and spec2006. I believe I saw mention of someone successfully cross-building llvm-test-suite in this configuration, but I don't currently remember who, so this should be confirmed. I have also cross-built my simple-vector-riscv tests. I plan to (but have not yet) cross-build polybench, sqlite3, and TSVC. I haven't seen anyone report the status of a stage2 clang build; we should probably also do that.

In terms of runtime failures, we had one identified and fixed earlier this week by @lukel97. @mshockwave told me yesterday that he was still observing a miscompile in several spec2006 and spec2017 workloads. I don't believe these have been triaged yet. @mshockwave, could you share details on your findings?
I forgot to mention that there is a buildbot maintained by @asb for that; I've added it to the list: https://lab.llvm.org/staging/#/builders/16/builds/668
Of course, I'm still triaging though. Since it's likely that some of these might be indirectly caused by the vectorizer: SPEC2006:
SPEC2017:
I'll update the issue links here once I have some leads on these problems.
All SPEC CPU 2017 rate tests are now passing on https://lnt.lukelau.me/db_default/v4/nts/107, including 500.perlbench_r and 520.omnetpp_r. These are only running the train dataset though; @mshockwave, are you seeing the errors with ref? It also looks like this confirms the performance improvement on 525.x264_r: it's ~17.5% faster vs no tail folding.
I've been told that the spacemit-x60 doesn't implement the |
Many targets do not implement it, and some tests still fail on qemu for first-order recurrences because of this. Mel has a patch to disable it unless it is fully supported.
Thanks for sorting it out so clearly! I added #123201 under
Could we use a vector array to implement interleave/deinterleave?
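For anyone framing the question: interleaving pairs up the lanes of two vectors, and deinterleaving is its inverse. A scalar Python model of the lane permutation being discussed (the helper names are mine, purely illustrative of the semantics, not of any RVV lowering):

```python
def interleave(a, b):
    """Interleave two equal-length vectors: [a0, b0, a1, b1, ...]."""
    out = []
    for x, y in zip(a, b):
        out.extend([x, y])
    return out

def deinterleave(v):
    """Inverse: split even and odd lanes back into two vectors."""
    return v[0::2], v[1::2]

print(interleave([1, 2, 3], [4, 5, 6]))   # [1, 4, 2, 5, 3, 6]
print(deinterleave([1, 4, 2, 5, 3, 6]))   # ([1, 2, 3], [4, 5, 6])
```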
This got mentioned at today's RISC-V sync-up, but it's an important bit of context, so repeating it here. As of today, we do not have any known compile-time failures (crashes or assertions) with EVL vectorization. We are also free of miscompiles (in at least spec2017) on the spacemit-x60. The remaining miscompiles are believed to be specific to the special case in the specification which allows VL != VLMAX on the second-to-last iteration (i.e. the "[ceil(AVL/2),VLMAX] if VLMAX < AVL < 2*VLMAX" case Alexey mentioned above). Let's keep this quality bar, and try to get CI/testing flows in place to ensure we don't backslide here while working on the remaining miscompiles specific to that special case.
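For anyone unfamiliar with that special case: when `VLMAX < AVL < 2*VLMAX`, the V specification permits an implementation to return any vl in `[ceil(AVL/2), VLMAX]` rather than always returning VLMAX, e.g. to split the last two iterations evenly. A toy Python model of two conformant behaviours (function names are mine, purely illustrative):

```python
import math

def vsetvl_common(avl, vlmax):
    """What most implementations return: min(AVL, VLMAX)."""
    return min(avl, vlmax)

def vsetvl_half_avl(avl, vlmax):
    """A spec-conformant alternative: when VLMAX < AVL < 2*VLMAX,
    return ceil(AVL/2), splitting the last two iterations evenly."""
    if vlmax < avl < 2 * vlmax:
        return math.ceil(avl / 2)
    return min(avl, vlmax)

# With VLMAX=8 and 11 elements remaining, a conformant core may
# legally return 6 instead of 8.
print(vsetvl_common(11, 8))    # 8
print(vsetvl_half_avl(11, 8))  # 6
```

Generated code that silently assumes vl == VLMAX whenever AVL >= VLMAX is exactly the class of bug that only shows up on hardware (or a simulator mode) taking the second path.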
Regarding SPEC2017, it's still failing on 500.perlbench_r and 520.omnetpp_r on my side with some cryptic error messages, which makes me start to think it's due to some environment issue in my test harness. So please don't let me block you on SPEC2017. For SPEC2006 INT and FP, the only failing case is 445.gobmk, but it's not related to the EVL vectorizer (or even the vectorizer): #123151
I think it may also be due to the hardware I was running it on (spacemit-x60) not implementing the |
Just a reminder that it is not enabled by default and needs
I presume we will find more things that need addressing as time goes on.
qemu has an option, `rvv_vl_half_avl`, to catch `[ceil(AVL/2),VLMAX] if VLMAX < AVL < 2*VLMAX` bugs; our current testing doesn't exercise that behaviour, so we may be missing bugs here. We probably want to also test SPEC on qemu with `rvv_vl_half_avl`.
The `vp.strided.{load,store}` intrinsics should also be tested directly (cc @nikolaypanchenko).
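As a rough sketch of the semantics such tests would exercise, here is an element-indexed Python model of a predicated strided load. It is a simplification of the real `llvm.experimental.vp.strided.load` (which takes a pointer and a byte stride); the helper name and the use of `None` for undefined lanes are my own illustrative choices:

```python
def vp_strided_load(memory, base, stride, mask, evl):
    """Model of a strided load under vector predication: reads at most
    `evl` elements at base, base+stride, base+2*stride, ...; lanes at or
    beyond evl, or with their mask bit clear, produce no defined value
    (modeled as None here)."""
    result = []
    for i in range(len(mask)):
        if i < evl and mask[i]:
            result.append(memory[base + i * stride])
        else:
            result.append(None)
    return result

mem = list(range(100))
# Load up to 3 elements with stride 4 starting at index 10,
# with lane 1 masked off and lane 3 cut off by evl=3.
print(vp_strided_load(mem, 10, 4, [True, False, True, True], 3))
# [10, None, 18, None]
```

Direct unit tests along these lines (sweeping evl, mask patterns, and strides, including the vl-splitting special case above) would catch lowering bugs without needing a full SPEC run.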