Cannot boot linux with RocketChip+Vector Config #2103

Closed
3 tasks done
franktaTian opened this issue Oct 28, 2024 · 11 comments

@franktaTian

Background Work

Chipyard Version and Hash

Release: 1.13.0
Hash: 86ec78

OS Setup

Ex: Output of uname -a + lsb_release -a + printenv + conda list
Linux i7700 5.4.0-198-generic #218-Ubuntu SMP Fri Sep 27 20:18:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
LSB Version: core-11.1.0ubuntu2-noarch:printing-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Other Setup

Ex: Prior steps taken / Documentation Followed / etc...

Current Behavior

I added "new saturn.rocket.WithRocketVectorUnit(256, 64, VectorParams.refParams) ++" to FireSimRocketConfig, then built the bitstream and the Linux image (using the FireMarshal that ships with this Chipyard version), following the FireSim guide. When I try to boot the Linux kernel, it panics.
When I revert FireSimRocketConfig, everything works fine with the same Linux kernel.

Expected Behavior

Boot Linux correctly with rocket vector added.

Other Information

```
[ 26.138205] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 26.159213] Oops [#1]
[ 26.164645] Modules linked in:
[ 26.171809] CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-00004-g67bc4513761f-dirty #32
[ 26.190212] Hardware name: ucb-bar,chipyard (DT)
[ 26.200331] Workqueue: events_unbound async_run_entry_fn
[ 26.212504] epc : 0x0
[ 26.217911] ra : __vm_enough_memory+0x2e/0x136
[ 26.228327] epc : 0000000000000000 ra : ffffffff801512b6 sp : ffffffc8001a38e0
[ 26.243939] gp : ffffffff852f26f8 tp : ffffffd880186c00 t0 : ffffffff84d6cd48
[ 26.259554] t1 : 0000000000000001 t2 : 0000000000000000 s0 : ffffffc8001a3920
[ 26.275144] s1 : 0000000000000001 a0 : ffffffff8532ac40 a1 : 0000000000000001
[ 26.290730] a2 : 000000000007b39f a3 : ffffffff85212b70 a4 : 8000000000000000
[ 26.306341] a5 : ffffffff85212b70 a6 : 0000000000000000 a7 : ffffffff85290c78
[ 26.321937] s2 : 0000000000000000 s3 : 0000000000000001 s4 : 0000000000000000
[ 26.337521] s5 : ffffffff852f22bc s6 : 0000000000000000 s7 : 0000000000000000
[ 26.353104] s8 : ffffffffffffffff s9 : 0000000000000003 s10: 0000000000000000
[ 26.368667] s11: 0000000000000fff t3 : ffffffffffffffff t4 : ffffffffffffffff
[ 26.384284] t5 : ffffffffffffffff t6 : 000000000000ffff
[ 26.395815] status: 0000000200000120 badaddr: 0000000000000000 cause: 000000000000000c
[ 26.412959] Code: Unable to access instruction at 0xffffffffffffffec.
[ 26.428124] ---[ end trace 0000000000000000 ]---
```

@franktaTian
Author

I think the new load/store mechanism introduced by the vector unit causes this problem.
Have you ever booted Linux successfully with a configuration that includes the Saturn vector unit?

@jerryz123
Contributor

I will investigate. This worked on an FPGA prototype, but FireSim likely exposed some other bug.

@jerryz123
Contributor

I struggle to see how the kernel panic report would indicate any problem due to vectors... it reports a fetch page fault, and the vector support made no modifications to the frontend.
Additionally, there is no vector code in the kernel by default, so it seems unlikely that errant vector instructions would have corrupted something.

I will attempt to reproduce
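
For reference, the `cause: 000000000000000c` field in the oops above can be checked against the exception-code table in the RISC-V privileged specification. The following is a minimal sketch; `decode_cause` is a hypothetical helper written for this thread, not part of any kernel tooling, and it assumes a 64-bit hart.

```python
# Standard synchronous exception codes from the RISC-V privileged specification.
EXCEPTIONS = {
    0: "instruction address misaligned",
    1: "instruction access fault",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    5: "load access fault",
    6: "store/AMO address misaligned",
    7: "store/AMO access fault",
    8: "environment call from U-mode",
    9: "environment call from S-mode",
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}


def decode_cause(hex_text, xlen=64):
    """Hypothetical helper: label the 'cause:' field printed in an oops."""
    value = int(hex_text, 16)
    # The top bit of the cause register distinguishes interrupts from exceptions.
    is_interrupt = bool(value >> (xlen - 1))
    code = value & ((1 << (xlen - 1)) - 1)
    if is_interrupt:
        return f"interrupt {code}"
    return EXCEPTIONS.get(code, f"reserved or custom exception {code}")


print(decode_cause("000000000000000c"))  # cause from the boot-time oops
print(decode_cause("0000000000000002"))  # cause from the poweroff oops
```

Code 12 is the instruction (fetch) page fault mentioned above, which together with `epc : 0x0` is consistent with a jump through a null function pointer rather than a faulting data access.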

@franktaTian
Author

Yes, it is strange. I also know there is no vector code in the kernel by default.

@franktaTian
Author

Hi,
I reverted the modification in TargetConfigs.scala and instead generated Rocket+vector by adding the following recipe to build_recipes.yaml:

    alveo_u250_firesim_rocket_singlecore_vector_no_nic:
        PLATFORM: xilinx_alveo_u250
        TARGET_PROJECT: firesim
        TARGET_PROJECT_MAKEFRAG: ../../generators/firechip/chip/src/main/makefrag/firesim
        DESIGN: FireSim
        TARGET_CONFIG: WithDefaultFireSimBridges_WithFireSimConfigTweaks_chipyard.REFV256D128RocketConfig
        PLATFORM_CONFIG: BaseXilinxAlveoU250Config
        deploy_quintuplet: null
        platform_config_args:
            fpga_frequency: 60
            build_strategy: TIMING
        post_build_hook: null
        metasim_customruntimeconfig: null
        bit_builder_recipe: bit-builder-recipes/xilinx_alveo_u250.yaml

Everything works fine: the Linux kernel boots without a panic.

@franktaTian
Author

But when I add another recipe as follows:

    alveo_u250_firesim_rocket_singlecore_vector_clock_crossing_no_nic:
        PLATFORM: xilinx_alveo_u250
        TARGET_PROJECT: firesim
        TARGET_PROJECT_MAKEFRAG: ../../generators/firechip/chip/src/main/makefrag/firesim
        DESIGN: FireSim
        TARGET_CONFIG: WithDefaultFireSimBridges_WithFireSimTestChipConfigTweaks_chipyard.REFV256D128RocketConfig
        PLATFORM_CONFIG: BaseXilinxAlveoU250Config
        deploy_quintuplet: null
        platform_config_args:
            fpga_frequency: 60
            build_strategy: TIMING
        post_build_hook: null
        metasim_customruntimeconfig: null
        bit_builder_recipe: bit-builder-recipes/xilinx_alveo_u250.yaml

and build the bitstream successfully, the same Linux image reports a fault during boot but still reaches the login prompt:

```
running /etc/init.d/S10mdev
Starting mdev: OK
[ 0.830316] find[81]: unhandled signal 11 code 0x1 at 0xffffffff80060004
[ 0.830362] CPU: 0 PID: 81 Comm: find Tainted: G O 6.6.0-00004-g67bc4513761f #2
[ 0.830386] Hardware name: ucb-bar,chipyard (DT)
[ 0.830400] epc : ffffffff80060004 ra : 00000000000bf26c sp : 0000003fd9c545f0
[ 0.830506] gp : 00000000001be3f8 tp : 00000000001c5760 t0 : 0000000000000002
[ 0.830732] t1 : 62616c732f6c656e t2 : 00000000001de6a0 s0 : 0000003fd9c549b0
[ 0.830958] s1 : 0000000000000001 a0 : 0000000000000000 a1 : 00000000001de680
[ 0.831184] a2 : 0000003fd9c545f0 a3 : 0000000000000100 a4 : 0000000000000000
[ 0.831410] a5 : fffffffffffff000 a6 : 62616c732f6c656e a7 : 000000000000004f
[ 0.831636] s2 : 00000000001de680 s3 : 0000003fd9c545f0 s4 : 0000003fadac8010
[ 0.831862] s5 : 0000000000000001 s6 : 00000000001de680 s7 : 0000000000010248
[ 0.832088] s8 : 0000002ae58185c0 s9 : 0000002ae5821cd0 s10: 0000002ae5824460
[ 0.832314] s11: 0000002ae5809bc8 t3 : 2f2f2f2f2f2f2f2f t4 : 0000003fd9c54630
[ 0.832540] t5 : 0000000000000001 t6 : 0000000000000000
[ 0.832706] status: 8000000200006020 badaddr: ffffffff80060004 cause: 000000000000000c
running /etc/init.d/S40network
Starting network: OK
running /etc/init.d/S99run
running /etc/init.d/S40network
Starting network: OK
running /etc/init.d/S99run
launching firemarshal workload run/command
firemarshal workload run/command done

Welcome to Buildroot
buildroot login: root

cat /proc/cpuinfo

processor : 0
hart : 0
isa : rv64imafdcbv_zicntr_zicsr_zifencei_zihpm_zba_zbb_zbs
mmu : sv39
uarch : sifive,rocket0
mvendorid : 0x0
marchid : 0x1
mimpid : 0x20181004
```
Any help?
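
As a quick sanity check on the `/proc/cpuinfo` output above, the `isa` line can be split to confirm that the kernel reports the `v` extension. This is a minimal sketch; `parse_isa` is a hypothetical helper, and it assumes a four-character base prefix (`rv32` or `rv64`).

```python
def parse_isa(isa):
    """Split an ISA string into (base, single-letter exts, multi-letter exts)."""
    # Multi-letter extensions are separated from the head by underscores.
    head, *multi = isa.lower().split("_")
    # Assumption: the base is 'rv32' or 'rv64', so the prefix is 4 characters.
    base, letters = head[:4], head[4:]
    return base, list(letters), multi


base, letters, multi = parse_isa(
    "rv64imafdcbv_zicntr_zicsr_zifencei_zihpm_zba_zbb_zbs"
)
print(base)
print("v" in letters)
print(multi)
```

The string is the one from the log, so this only shows what the kernel reports from the device tree; it says nothing about whether vector instructions actually execute.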

@jerryz123
Contributor

Thanks for investigating. This points at a bug in the multi-clock handling (The difference between TestChipConfigTweaks and ConfigTweaks is that the "test chip" variant adds CDCs and simulates multi-clock in firesim).

I suspect just a base Rocket with multi-clock will also fail. I can investigate this specifically.
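
To make the comparison concrete, the two `TARGET_CONFIG` strings from the recipes above can be split on underscores and compared. This is a minimal sketch; `fragments` is a hypothetical helper, and it assumes the config fragments are joined by single underscores, as they appear in the recipes.

```python
def fragments(target_config):
    """Split an underscore-joined TARGET_CONFIG string into its parts."""
    return target_config.split("_")


# Copied from the two recipes posted earlier in this thread.
working = (
    "WithDefaultFireSimBridges_WithFireSimConfigTweaks_"
    "chipyard.REFV256D128RocketConfig"
)
failing = (
    "WithDefaultFireSimBridges_WithFireSimTestChipConfigTweaks_"
    "chipyard.REFV256D128RocketConfig"
)

only_in_working = [f for f in fragments(working) if f not in fragments(failing)]
only_in_failing = [f for f in fragments(failing) if f not in fragments(working)]

print(only_in_working)
print(only_in_failing)
```

The only differing fragment is the tweaks one, which matches the observation that the clock-crossing variant is what separates the working build from the failing one.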

@jerryz123
Contributor

It looks like the default rational crossing direction for Rocket's Rational CDCs did not match the clocking configuration in TestChipConfigTweaks.

This PR changes the default Rocket RationalCrossing to support both fast-to-slow and slow-to-fast directions: chipsalliance/rocket-chip#3693

Alternatively, you can change WithTestChipConfigTweaks to add async CDCs to the RocketTiles.

@franktaTian
Author

> It looks like the default rational crossing direction for Rocket's Rational CDCs did not match the clocking configuration in TestChipConfigTweaks.
>
> This PR changes the default Rocket RationalCrossing to support both fast-to-slow and slow-to-fast directions: chipsalliance/rocket-chip#3693
>
> Alternatively, you can change WithTestChipConfigTweaks to add async CDCs to the RocketTiles.

OK, I will try it.

@franktaTian
Author

> It looks like the default rational crossing direction for Rocket's Rational CDCs did not match the clocking configuration in TestChipConfigTweaks.
>
> This PR changes the default Rocket RationalCrossing to support both fast-to-slow and slow-to-fast directions: chipsalliance/rocket-chip#3693
>
> Alternatively, you can change WithTestChipConfigTweaks to add async CDCs to the RocketTiles.

Yes. Everything works fine.
I modified Configs.scala as in chipsalliance/rocket-chip#3693 and generated the bitstream again. The Linux kernel boots without a panic.
I also generated another version that adds async CDCs to the RocketTiles. The Linux kernel boots without a panic, but it panics when I run poweroff.
```

poweroff

Stopping network: OK

Stopping mdev: stopped process in pidfile '/var/run/mdev.pid' (pid 80)
OK
Stopping klogd: [ 68.403438] Oops - illegal instruction [#1]
[ 68.403454] Modules linked in: iceblk(O) icenet(O)
[ 68.403476] CPU: 0 PID: 123 Comm: rm Tainted: G O 6.6.0-00004-g67bc4513761f #2
[ 68.403486] Hardware name: ucb-bar,chipyard (DT)
[ 68.403492] epc : do_raw_spin_unlock+0x88/0x11e
[ 68.403510] ra : handle_page_fault+0x128/0x390
[ 68.403532] epc : ffffffff80060004 ra : ffffffff8000a194 sp : ffffffc800573e70
[ 68.403540] gp : ffffffff812f26f8 tp : ffffffd8814f3c00 t0 : 0000000000000040
[ 68.403548] t1 : 0000000000001000 t2 : 0000020000000000 s0 : ffffffc800573ec0
[ 68.403554] s1 : ffffffc800573ee0 a0 : 0000000000000400 a1 : 0000000000000000
[ 68.403562] a2 : ffffffd8814f3c01 a3 : 0000000000000000 a4 : ffffffd8814f4c00
[ 68.403568] a5 : 0000000000000400 a6 : 0000000000000402 a7 : 0000000000000406
[ 68.403574] s2 : 000000000000000d s3 : 0000000000000001 s4 : 0000002ad0fd3b68
[ 68.403582] s5 : ffffffd8814f3c00 s6 : ffffffd880ddcb80 s7 : 0000000000000254
[ 68.403588] s8 : 000000000000000d s9 : 0000000000000076 s10: 0000000000000003
[ 68.403596] s11: 0000002ac481ebc8 t3 : 0000000000000000 t4 : 000000000000003f
[ 68.403602] t5 : ffffffff81213058 t6 : ffffffff81213078
[ 68.403608] status: 0000000200000120 badaddr: 00000000000f0007 cause: 0000000000000002
[ 68.403616] [] do_raw_spin_unlock+0x88/0x11e
[ 68.403630] [] do_page_fault+0x1e/0x36
[ 68.403642] [] ret_from_exception+0x0/0x64
[ 68.403666] Code: 17c2 93c1 9023 00f4 60e2 6442 64a2 6105 8082 9123 (0007) 000f
[ 68.403672] ---[ end trace 0000000000000000 ]---
[ 68.403678] Kernel panic - not syncing: Fatal exception in interrupt
[ 68.407748] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
```

I just replaced "WithRationalCDCs" with "WithAsynchronousCDCs(depth=8, sync=3)".

Anyway, we now have a working version with Rocket+Vector and rational clock crossings. I think clock crossings are a must in an ASIC design, although in FireSim all clock inputs are connected to one host clock.
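
For anyone looking at the poweroff oops above, the trapping instruction word can be pulled apart by field. This is a minimal sketch that assumes `badaddr` holds the faulting instruction bits (the privileged specification permits this for illegal-instruction exceptions); `decode_fields` and the small opcode table are hypothetical helpers for illustration.

```python
def decode_fields(insn):
    """Extract a few fixed fields from a RISC-V instruction word."""
    return {
        # 16-bit (compressed) encodings never have both low bits set.
        "not_compressed": (insn & 0b11) == 0b11,
        "opcode": insn & 0x7F,
        "rd": (insn >> 7) & 0x1F,
        "funct3": (insn >> 12) & 0x7,
        "rs1": (insn >> 15) & 0x1F,
    }


# A few major opcodes relevant to FP and vector code (not a complete table).
MAJOR_OPCODES = {0x07: "LOAD-FP", 0x27: "STORE-FP", 0x57: "OP-V"}

# badaddr from the oops; the 'Code:' line shows the same halfwords (0007) 000f.
fields = decode_fields(0x000F0007)
print(fields)
print(MAJOR_OPCODES.get(fields["opcode"], "other"))
```

The word falls in the LOAD-FP major opcode with a width field of 0, which the V extension uses for 8-bit-element vector loads. That only describes the bits the hart tried to execute; it does not by itself explain why those bits were fetched at `do_raw_spin_unlock+0x88`.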

@wadidf

wadidf commented Nov 8, 2024

> I will investigate. This worked on a FPGA prototype, but likely firesim exposed some other bug

@jerryz123, yes, I tried on a VCU118 and it works: the bitstream was generated and Linux booted.
The issue is that all vector instructions generate exceptions when I try to run vector code.
Can you try any example code and see? (Maybe also check from the BR/busybox side.)
Thanks
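
One thing that may be worth checking here: per the RISC-V vector specification, vector instructions raise an illegal-instruction exception while the `VS` field of `mstatus`/`sstatus` is Off. Whether that is the cause of the exceptions reported above is an assumption, not something this thread confirms. The sketch below decodes the `VS` and `FS` fields from the `status:` values in the logs posted earlier; `decode_status` is a hypothetical helper.

```python
# Encoding shared by the FS and VS context-status fields.
FIELD_STATES = ["Off", "Initial", "Clean", "Dirty"]


def decode_status(hex_text):
    """Return the VS and FS states from a 'status:' value in an oops."""
    value = int(hex_text, 16)
    return {
        "VS": FIELD_STATES[(value >> 9) & 0b11],   # sstatus bits [10:9]
        "FS": FIELD_STATES[(value >> 13) & 0b11],  # sstatus bits [14:13]
    }


user_fault = decode_status("8000000200006020")    # from the 'find' fault
kernel_fault = decode_status("0000000200000120")  # from the kernel oopses
print(user_fault)
print(kernel_fault)
```

In the user-space fault the floating-point state is Dirty while the vector state is Off, so at least in that process the vector unit was not enabled at the time of the trap.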
