Result for Banana Pi BPI-F3 #1

Open
Glavo opened this issue May 14, 2024 · 9 comments

Comments

@Glavo

Glavo commented May 14, 2024

        #####           glavo@k1
       #######          --------
       ##O#O##          OS: Bianbu 1.0rc1 riscv64
       #######          Host: spacemit k1-x deb1 board
     ###########        Kernel: 6.1.15
    #############       Uptime: 1 min
   ###############      Packages: 782 (dpkg)
   ################     Shell: fish 3.6.1
  #################     Terminal: /dev/pts/0
#####################   CPU: Spacemit X60 (8) @ 1.600GHz
#####################   Memory: 187MiB / 3807MiB
  #################

out.json

I noticed the results are weird. Does anyone know what could be the reason for this?

@camel-cdr
Owner

camel-cdr commented May 14, 2024

Looks like the kernel doesn't expose rdcycle. I think that was changed in recent kernels, and I have to look into how best to access it via the perf API.
Thanks for your help.
I'll get my BPI-F3 in a few days, and I also need to fix the instruction cycle count benchmark, as I've learned that processors that don't predict vl have a dependency on the destination register.

@camel-cdr
Owner

@Glavo My BPI-F3 actually arrived today, so I was able to test a few things.

Apparently the kernel disabled rdcycle userspace access, but since kernel version 6.5 you can re-enable that using the perf_user_access sysctl, see: https://lwn.net/Articles/939436/
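
For reference, a minimal C sketch of checking that sysctl before running the benchmark. The path and the meaning of the value 2 (legacy mode, in which user space can read the cycle CSR directly) are assumptions based on the kernel's sysctl documentation for RISC-V, and the function names are hypothetical, not part of rvv-bench:

```c
#include <stdio.h>

/* Assumed location of the sysctl on RISC-V kernels >= 6.5. */
#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"

/* Reads a single integer from `path`.
 * Returns the value, or -1 if the file is missing or cannot be parsed. */
int read_sysctl_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Assumption: 2 is the legacy mode that permits direct rdcycle;
 * any other value (or -1 for "unknown") is treated as not permitted. */
int rdcycle_allowed(int perf_user_access)
{
    return perf_user_access == 2;
}
```

Calling `rdcycle_allowed(read_sysctl_int(PERF_USER_ACCESS_PATH))` at startup would let a program print a hint instead of dying on an illegal instruction.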

The BPI-F3 image, however, is on an older kernel. On this kernel you can enable rdcycle access by enabling the PERF_COUNT_HW_CPU_CYCLES perf event (see SO post).

Using the perf event API directly would probably be cleaner; however, I need to support bare metal as well, so I think I'll keep the code for now, but provide instructions on how to run it on different kernel versions.

For kernel version <6.5, I'll add a small utility program that can be used to start a process with user-space rdcycle enabled, via the perf_event_attr.inherit flag.
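
A rough sketch of what such a launcher could look like. This is not the utility from the repository: the function names are made up for illustration, and it assumes (per the comment above) that an open PERF_COUNT_HW_CPU_CYCLES event is what makes rdcycle readable from user space on these kernels:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fills `attr` with a CPU-cycles event that child tasks inherit. */
void init_cycle_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof *attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof *attr;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->inherit = 1; /* also applies to tasks created by this one */
}

/* glibc provides no wrapper for perf_event_open, so use syscall().
 * Returns the event file descriptor, or -1 on error. */
long open_cycle_event(void)
{
    struct perf_event_attr attr;

    init_cycle_attr(&attr);
    return syscall(SYS_perf_event_open, &attr,
                   0,    /* pid: the calling process */
                   -1,   /* cpu: any */
                   -1,   /* group_fd: no group */
                   0UL); /* flags: none, so the fd is not close-on-exec */
}

/* Opens the event, then replaces this process with argv[0].
 * The descriptor is deliberately left open so the event outlives the exec.
 * Returns -1 if the event could not be opened or the exec failed. */
int run_with_cycle_counter(char *const argv[])
{
    if (open_cycle_event() < 0)
        return -1;
    execvp(argv[0], argv);
    return -1; /* only reached if execvp failed */
}
```

A `main` that passes `argv + 1` to `run_with_cycle_counter` would turn this into a wrapper placed in front of the benchmark command. Depending on the system's `perf_event_paranoid` setting, opening the event may require elevated privileges.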

I still need to rewrite the instruction cycle count benchmark; once that's done, I'll upload the measurements. The performance looks quite good so far.

@MarekPikula

MarekPikula commented Jun 6, 2024

Another option is to disable PMU handling in the kernel altogether. I'm currently testing PULP Ara on FPGA, and I had to disable CONFIG_RISCV_PMU in the kernel. Then the kernel doesn't "own" the PMU, thus enabling applications to issue rdcycle directly.

I think that you might also need to disable the PMU handler in OpenSBI, as it might disable the cycle counter by default (I think it happened to me, but I don't have enough time to reproduce it).

This, of course, prevents you from accessing perf in other places, but to run the benchmark alone, it shouldn't be a problem.
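
To check whether a given kernel was built with the PMU driver, one could scan its config text (for example the decompressed contents of `/proc/config.gz`, if the kernel provides it) for the option. A small sketch with a hypothetical helper name:

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the kernel config in `text` contains the exact line
 * "<name>=y", and 0 otherwise (including "# <name> is not set"). */
int config_enabled(const char *text, const char *name)
{
    char needle[128];
    const char *p = text;
    int n = snprintf(needle, sizeof needle, "%s=y", name);
    size_t len;

    if (n < 0 || (size_t)n >= sizeof needle)
        return 0; /* option name too long for the buffer */
    len = (size_t)n;

    while ((p = strstr(p, needle)) != NULL) {
        /* The match must span a whole line, not part of a longer name. */
        int at_line_start = (p == text) || (p[-1] == '\n');
        char next = p[len];

        if (at_line_start && (next == '\n' || next == '\r' || next == '\0'))
            return 1;
        p++;
    }
    return 0;
}
```

With this, `config_enabled(text, "CONFIG_RISCV_PMU")` returning 0 would indicate a kernel that does not claim the PMU.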

@camel-cdr
Owner

camel-cdr commented Jun 6, 2024

@MarekPikula The README now has an overview of how to enable the counters on different kernel versions, but that could be another method.

Does Ara work for you? I had a lot of trouble with it when I tried it. I've been following the code changes since, or rather the lack thereof. From what I can tell, this hasn't been fixed yet, but it may also only occur on Verilator.

Also: How big of an FPGA is needed to run it?

@MarekPikula

Yeah, I tried the ENABLE_RDCYCLE_HACK approach, but it didn't work (i.e., it crashed with a kernel error – I should have a log somewhere, but I can't find it now). I'm running Ara under FireSim with a basic Buildroot [FireMarshal](https://github.com/firesim/FireMarshal) image with a 6.2 kernel. There's no reason not to upgrade to something newer, as there are no custom patches (besides two out-of-tree modules for block device and network), but I wanted to have as few moving parts as possible for the initial tests.

Regarding issues with Ara, indeed, it seems somewhat buggy. I tried to run rvv-bench tests on it, but after a few failed benches (either a freeze or an illegal instruction error), I gave up. Right now, I'm running an instruction test to have at least a glimpse into the cycle performance of different instructions. Even on FPGA, it's running rather slowly (80 MHz is the fmax in my configuration), so maybe I'll have some results tomorrow. Once I have anything of value, I'll open a PR with results so far.

I'm running it on an AWS EC2 F1 instance with FireSim (so a Xilinx VU9P), and the complete design (including the AWS wrapper and FireSim stuff) takes 31% of LUTs, 12% of FFs, 19% of RAMB36s, 5% of URAMs, and 2% of DSP blocks, so it's not that bad 😛 But, granted, it's a pretty beefy FPGA. I have it configured in the most default 2-lane, 2048 VLEN configuration (so 64b AXI, with no need for width conversion and such), but I'm planning to try building it in some other configurations as well.

Besides, I'll be presenting a poster about this project at the upcoming RISC-V Summit Europe this month (title: Accelerating software development for emerging ISA extensions with cloud-based FPGAs: RVV case study).

@camel-cdr
Owner

@MarekPikula

> Yeah, I tried the ENABLE_RDCYCLE_HACK approach, but it didn't work

Interesting, I'll add your option to the README.

> Once I have anything of value, I'll open a PR with results so far.
> Right now, I'm running an instruction test to have at least a glimpse into the cycle performance of different instructions

Sounds like it runs for you now, but if it doesn't, try commenting out the call randomize in rvv/main.S. That seemed to help me simulate on XiangShan, although it was too slow to do a full run.

> Besides, I'll be presenting a poster about this project at the upcoming RISC-V Summit Europe this month (title: Accelerating software development for emerging ISA extensions with cloud-based FPGAs: RVV case study).

Oh, great, I guess we'll meet then. I'll also present a poster, right next to yours coincidentally: "Accelerating Unicode Conversions using the RISC-V Vector Extension". So we are poster buddies ^w^

@mp-17

mp-17 commented Jun 27, 2024

Hello @MarekPikula and @camel-cdr,

I am now dedicating some time every week to fixing issues in Ara. If you want, we can schedule a brief call to discuss them. Let me know if you are interested :-)

@MarekPikula

Hi @mp-17, sorry for the late reply. You can find my poster from RV Summit and benchmark results here: https://github.com/MarekPikula/RISC-V-Summit-Europe-2024

This week, I'm planning to revisit my setup, rebase onto the latest Ara sources, and see what has changed. I'll keep you posted 😃

BTW, I'm coming to this year's ORConf, and this time, I will give a full talk on the same topic as on RV Summit, but hopefully, this time with better results 😉

@camel-cdr
Owner

@MarekPikula

BTW, I got RVV on XiangShan working on a specific commit: OpenXiangShan/XiangShan#3200

AFAIK the vsetvli performance should be better now, but I couldn't test it, because the simulation started hanging again.

There is also now another open source RVV implementation: https://github.com/ucb-bar/saturn-vectors

Last time I tried, most things worked but some didn't (e.g. strlen). I still have to report that, but I'm quite occupied this month.
