Finalise setup of buildbot for RISC-V RVA23 EVL tail folding #123947

Open
13 of 21 tasks
asb opened this issue Jan 22, 2025 · 6 comments
Labels
backend:RISC-V infrastructure Bugs about LLVM infrastructure

Comments

@asb
Contributor

asb commented Jan 22, 2025

This requires a builder with: `-march=rva23u64 -mllvm -force-tail-folding-style=data-with-evl -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue` and ideally qemu settings `rvv_ta_all_1s=true,rvv_ma_all_1s=true,rvv_vl_half_avl=true` to maximise the chance of finding bugs. This will be done using the same cross-compile and then execute under qemu-system setup used for the RVA20 bot. Not all items below are specific to the RVA23 bot.
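
For illustration, the compiler flags and QEMU vector properties listed above could be kept as data and rendered into strings like this. This is only a sketch: the flag and property strings are copied from this issue, but the variable and function names are made up here and are not taken from llvm-zorg.

```python
# Sketch only: flag/property values come from the issue text; all names
# (EVL_CFLAGS, QEMU_RVV_PROPS, the two helpers) are hypothetical.

# Compiler flags requested for the RVA23 EVL tail folding builder.
EVL_CFLAGS = [
    "-march=rva23u64",
    "-mllvm", "-force-tail-folding-style=data-with-evl",
    "-mllvm", "-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue",
]

# QEMU vector properties intended to maximise the chance of finding bugs.
QEMU_RVV_PROPS = {
    "rvv_ta_all_1s": "true",
    "rvv_ma_all_1s": "true",
    "rvv_vl_half_avl": "true",
}


def cflags_string(flags):
    """Render the flag list as a single space-separated string."""
    return " ".join(flags)


def qemu_cpu_props(props):
    """Render properties as the comma-separated key=value list from the issue."""
    return ",".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    print(cflags_string(EVL_CFLAGS))
    print(qemu_cpu_props(QEMU_RVV_PROPS))
```

Keeping the values in one place makes it harder for the builder config and a local reproduction to drift apart.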

This requires:

  • Update to QEMU 9.2.0 and check for no regressions
  • Redeploy x86-64 host with appropriate config
  • Resolve sporadic failures due to running out of disk space.
    • Bumping the size of llvm-project.img worked. The failures were sporadic, seemingly because the test order varies between runs, so temporary files sometimes pushed disk usage past the image size limit and sometimes did not.
  • Get a working local debug flow for subsets of the LLVM tests (ninja check-llvm-executionengine for instance fails to work due to llvm-lit being invoked from a different subdirectory and lit-on-qemu not handling this)
  • Investigate and fix failures for MCJIT/ExecutionEngine tests
    • Issue was a failure to set -DLLVM_HOST_TRIPLE=riscv64-linux-gnu leading to a confusing compilation flow for mcjit/executionengine
  • Resolve issues with host python3 path not matching the one under qemu-system (e.g. when using pip on the host)
    • Explicitly passing -DPython3_EXECUTABLE=/usr/bin/python3 resolves this
  • Resolve issues with buildbot running under python3.13 on the host
    • Manual fix for pipes.quote usage and depend on legacy-cgi installed via pip
  • (non-blocking issue) Document Python 3.13 workarounds in docs on local builder testing
  • Resolve test failures for small subset of tests that try to use lit-on-qemu (set through -DLLVM_EXTERNAL_LIT) internally. Seems to primarily be the update_cc_test_checks tests.
  • (non-blocking issue) Figure out why MCJIT/ExecutionEngine tests aren't running with e.g. ninja check-llvm-executionengine (marked as 'unsupported', even the RISC-V ones).
  • Receive review on PR to switch over rva23 evl builder: "[RISCV] Move rva23 evl builder over to cross-compile and execute under qemu-system setup" (llvm-zorg#358)
  • Finalise x86-64 host deployment for rva23 evl builder once llvm-zorg#358 lands
  • Chase up issue with the staging buildmaster seemingly not automatically redeploying (email sent to Galina)
  • Test enabling the test suite locally and resolve any issues
  • Author and post patch for llvm-zorg to enable configuring the lit binary used for test-suite execution (see test_suite_cmd in ClangBuilder.py)
  • Evaluate if there's any advantage in picking up "Support testsuite builds without LNT" (llvm-zorg#245) to do testsuite builds without LNT (e.g. if it's more compatible with our cross-build and then execute setup)
  • Enable the running of the test suite on rva23 evl builder
  • Evaluate what other LLVM subprojects can/should be enabled in this setup (and expand this list to cover that work)
  • (non-blocking issue) Find a way to get ccache to work for the stage2 build in a non-CI configuration
    • ccache for stage2 makes no sense in CI, but can help iteration time a lot if using a fixed stage1 and investigating an issue. My attempts to enable it seem to be ignored right now.
  • Address failure in clang/test/Modules/empty.modulemap (the empty modulemap seems to be slightly above the expected 60KB for some reason). PR up for review: [clang][Modules] Raise empty.modulemap expected size to <70KB to fix RISC-V failure #123959
  • (non-blocking) Add preservation of the most recent build artefacts in order to make it easier (at least for the bot owner) to investigate a failure without waiting for a multi-stage rebuild.
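
On the Python 3.13 item in the list above: the `pipes` and `cgi` modules were both removed from the standard library in Python 3.13, which is why `pipes.quote` breaks and why `legacy-cgi` from pip is needed. `pipes.quote` was an undocumented alias for `shlex.quote`, so one common way to fix such a call site is to switch to `shlex.quote`. The sketch below assumes the manual fix amounts to that swap; the issue does not spell out the actual patch, and `render_command` is a hypothetical helper used only to show the call.

```python
import shlex


def render_command(argv):
    """Join an argument list into a shell-safe command string.

    Uses shlex.quote, which works on Python 3.13 and on older Python 3
    releases, in place of the removed pipes.quote.
    """
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # An argument containing a space gets single-quoted; a plain one is left alone.
    print(render_command(["ninja", "check llvm"]))
```

Because `shlex.quote` exists on every supported Python 3 release, the same code runs unchanged on hosts that have not moved to 3.13 yet.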
@llvmbot
Member

llvmbot commented Jan 22, 2025

@llvm/issue-subscribers-backend-risc-v

Author: Alex Bradbury (asb)

This requires a builder with: `-march=rva23u64 -mllvm -force-tail-folding-style=data-with-evl -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue` and ideally qemu settings `rvv_ta_all_1s=true,rvv_ma_all_1s=true,rvv_vl_half_avl=true` to maximise the chance of finding bugs. This will be done using the same cross-compile and then execute under qemu-system setup used for the RVA20 bot. Not all items below are specific to the RVA23 bot.

This requires:

  • Update to QEMU 9.2.0 and check for no regressions
  • Redeploy x86-64 host with appropriate config
  • Resolve sporadic failures due to running out of disk space.
    • Bumping the size of llvm-project.img worked. Issues were sporadic seemingly due to varying test order meaning disk size limits were sometimes reached with temporary files but sometimes not.
  • Get a working local debug flow for subsets of the LLVM tests (ninja check-llvm-executionengine for instance fails to work due to llvm-lit being invoked from a different subdirectory and lit-on-qemu not handling this)
  • Investigate and fix failures for MCJIT/ExecutionEngine tests
    • Issue was a failure to set -DLLVM_HOST_TRIPLE=riscv64-linux-gnu leading to a confusing compilation flow for mcjit/executionengine
  • Resolve issues with host python3 path not matching the one under qemu-system (e.g. when using pip on the host)
    • Explicitly passing -DPython3_EXECUTABLE=/usr/bin/python3 resolves this
  • Resolve issues with buildbot running under python3.13 on the host
    • Manual fix for pipes.quote usage and depend on legacy-cgi installed via pip
  • (non-blocking issue) Document Python 3.13 workarounds in docs on local builder testing
  • Resolve test failures for small subset of tests that try to use lit-on-qemu (set through -DLLVM_EXTERNAL_LIT) internally. Seems to primarily be the update_test_checks tests.
    • Could potentially mask these tests, or alternatively find a way to override the lit path for just these tests, or set up lit-on-qemu in the correct path under qemu-system that just forwards to lit.
  • (non-blocking issue) Figure out why MCJIT/ExecutionEngine tests aren't running with e.g. ninja check-llvm-executionengine (marked as 'unsupported', even the RISC-V ones).
  • Receive review on PR to switch over rva23 evl builder [RISCV] Move rva23 evl builder over to cross-compile and execute under qemu-system setup llvm-zorg#358
  • Finalise x86-64 host deployment for rva23 evl builder once llvm-zorg#358 lands
  • Test enabling the test suite locally and resolve any issues
  • Enable the running of the test suite on rva23 evl builder
  • Evaluate what other LLVM subprojects can/should be enabled in this setup (and expand this list to cover that work)

@asb asb mentioned this issue Jan 21, 2025
16 tasks
@EugeneZelenko EugeneZelenko added the infrastructure Bugs about LLVM infrastructure label Jan 22, 2025
@llvmbot
Member

llvmbot commented Jan 22, 2025

@llvm/issue-subscribers-infrastructure

Author: Alex Bradbury (asb)


@mshockwave
Member

(non-blocking issue) Figure out why MCJIT/ExecutionEngine tests aren't running with e.g. ninja check-llvm-executionengine (marked as 'unsupported', even the RISC-V ones).

(I think we're no longer using MCJIT, it's OrcJIT that's enabled by default) If this is a cross-compiling setting, then we probably need to set LLVM_TARGET_ARCH to riscv64 for JIT to emit RISC-V code.
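
As a sketch of the cross-compile settings discussed in this thread, the CMake cache options could be collected as below. The option names (`LLVM_HOST_TRIPLE`, `Python3_EXECUTABLE`, `LLVM_EXTERNAL_LIT`, `LLVM_TARGET_ARCH`) and the values `riscv64-linux-gnu`, `/usr/bin/python3` and `riscv64` come from the thread. The helper name, its parameters, and the lit path in the usage example are hypothetical, and whether `LLVM_TARGET_ARCH` is actually needed is only the suggestion above, not a confirmed fix.

```python
def cmake_cross_args(python3="/usr/bin/python3", external_lit=None, target_arch=None):
    """Build the list of -D options for a riscv64 cross-build.

    external_lit and target_arch are optional and only emitted when given,
    so the same helper covers configurations with and without them.
    """
    args = [
        # Without this, the MCJIT/ExecutionEngine tests hit a confusing compilation flow.
        "-DLLVM_HOST_TRIPLE=riscv64-linux-gnu",
        # Pin the interpreter path so host and qemu-system agree on it.
        f"-DPython3_EXECUTABLE={python3}",
    ]
    if target_arch is not None:
        args.append(f"-DLLVM_TARGET_ARCH={target_arch}")
    if external_lit is not None:
        args.append(f"-DLLVM_EXTERNAL_LIT={external_lit}")
    return args


if __name__ == "__main__":
    # "/path/to/lit-on-qemu" is a placeholder, not the real location of the wrapper.
    for arg in cmake_cross_args(external_lit="/path/to/lit-on-qemu", target_arch="riscv64"):
        print(arg)
```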

@asb
Contributor Author

asb commented Jan 27, 2025

(non-blocking issue) Figure out why MCJIT/ExecutionEngine tests aren't running with e.g. ninja check-llvm-executionengine (marked as 'unsupported', even the RISC-V ones).

(I think we're no longer using MCJIT, it's OrcJIT that's enabled by default) If this is a cross-compiling setting, then we probably need to set LLVM_TARGET_ARCH to riscv64 for JIT to emit RISC-V code.

Yes, there were two issues here:

@asb
Contributor Author

asb commented Jan 27, 2025

By way of update: I think the only blocker to the initial build config (the one that doesn't include the full test-suite) is that the automatic redeploy of llvm-zorg to LLVM's staging buildmaster appears to be broken right now. As far as I can see, it used to happen every hour or so, but a change of mine that landed a couple of days ago is definitely not reflected on the upstream buildmaster, while it works fine with a local checkout of llvm-zorg HEAD and my buildmaster testing mode. I've dropped Galina an email to check on this. I know there were problems with this some months back as well.

UPDATE: Galina has fixed the issue with the staging buildmaster (thanks!).

@asb
Contributor Author

asb commented Jan 28, 2025

The builder is now running and giving results in a ~1h40-2h cycle time.

Keeping an eye on https://lab.llvm.org/staging/#/builders/16 should let you see if there appears to be a problem.

I'm not sure if Galina's llvm-zorg staging redeploy was a one-off or if the automated deploy is enabled again. We'll see less than ideal queue merging behaviour until llvm/llvm-zorg@b272d2f is deployed on the staging buildmaster.
