RVV Bench on ARA #17

Open
rseac opened this issue Jan 8, 2025 · 11 comments

Comments

@rseac

rseac commented Jan 8, 2025

Is there interest in trying to get these benchmarks work on the ARA vector engine? Has this already been started?

@camel-cdr
Owner

camel-cdr commented Jan 8, 2025

Yes, I tried about half a year ago, but ara still had some bugs, which broke most of the benchmarks.
I think they should be fixed now, but I haven't gotten my old build scripts working again.
I'll try to get it working again. Have you worked with the project? If so, I'd appreciate some help in getting a docker environment set up, similar to the scripts in the wiki: https://github.com/camel-cdr/rvv-bench/wiki

@rseac
Author

rseac commented Jan 8, 2025

@camel-cdr I do have a docker container for the ARA. Here is the repo for it. https://github.com/rseac/pulp-ara-docker

@camel-cdr
Owner

@rseac Thanks a lot! I'll try to use it.

@camel-cdr
Owner

@rseac Your docker worked great, ara unfortunately didn't. I tried running my instruction throughput benchmark, but about 5% of instructions hang the simulation with certain valid vtype configurations.

The most problematic examples were the integer comparisons (vmseq, ...), except for vmsgtu/vmsgt, which hung with SEW=16, and the logical mask instructions. Others are vrgather.vx, vcompress.vm, vredsum.vs, vmadc, zext.

vrgather.vv works, but is extremely slow at 6-8 cycles per element.

It's still measuring the last handful of instructions, but if you want, I can share the results for the instructions that worked once it completes.
Most regular instructions executed in 16/32/64/128 cycles for LMUL=1/2/4/8. The masked variants take 25/41/73/137 cycles, and most float instructions take 29/45/77/141.
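
As a quick sanity check on those numbers, the short Python sketch below (not part of the benchmark itself) treats the quoted cycle counts as plain data and computes how much the masked and float variants add on top of the regular counts. The names `REGULAR`, `MASKED`, `FLOAT`, `overhead` and `cycles_per_lmul` are made up for this illustration.

```python
# Cycle counts quoted above, listed for LMUL = 1, 2, 4, 8.
LMULS = [1, 2, 4, 8]
REGULAR = [16, 32, 64, 128]
MASKED = [25, 41, 73, 137]
FLOAT = [29, 45, 77, 141]


def overhead(cycles, baseline=REGULAR):
    """Difference between a variant and the regular counts at each LMUL."""
    return [c - b for c, b in zip(cycles, baseline)]


def cycles_per_lmul(cycles):
    """Cycle count divided by LMUL, to check how the counts scale."""
    return [c / lmul for c, lmul in zip(cycles, LMULS)]


print("regular cycles per LMUL:", cycles_per_lmul(REGULAR))
print("masked overhead:", overhead(MASKED))
print("float overhead:", overhead(FLOAT))
```

On these figures the regular counts scale linearly with LMUL, and the masked and float variants each differ from them by a constant number of cycles regardless of LMUL. Why that offset exists is not something the measurements alone explain.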

@rseac
Author

rseac commented Jan 10, 2025

@camel-cdr Yes, I'd be happy to take a look at the results. I could even try to post some of these problems you notice on the ARA issues once I understand them myself.

@rseac
Author

rseac commented Jan 14, 2025

@camel-cdr Did you use the same scripts as linked earlier? Or specific ones to target the ARA core?

@camel-cdr
Owner

camel-cdr commented Jan 16, 2025

@rseac Sorry for the late response.

Here are the measurements I managed to run: log.txt
A few instructions that probably work are missing, but the log should cover most of the ones that work.

I used your build script with small modifications:

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt -y update && apt -y upgrade && apt-get --no-install-recommends -y install \
		build-essential git zlib1g zlib1g-dev pkg-config cmake vim \
		ninja-build python3 texinfo device-tree-compiler \
		autoconf automake bc bison clang flex \
		ca-certificates ccache libfl2 libfl-dev help2man \
		curl libelf-dev python3-numpy \
	&& apt-get clean \
	&& rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/pulp-platform/ara.git

WORKDIR /ara

RUN git config --global url."https://github.com/".insteadOf "git@github.com:";
RUN git submodule update --init --recursive
RUN make toolchain-llvm
RUN make riscv-isa-sim
RUN make verilator

RUN /usr/bin/install -c /ara/install/verilator/bin/verilator_bin /ara/install/verilator/share/verilator/
RUN cd hardware && make checkout && make apply-patches && make verilate

WORKDIR /ara/apps

RUN git clone https://github.com/camel-cdr/rvv-bench \
	&& cp rvv-bench/nolibc.h . \
	&& mkdir rvv \
	&& cp rvv-bench/instructions/rvv/gen.S . \
	&& cp rvv-bench/instructions/rvv/config.h rvv-bench/instructions/rvv/main.c rvv \
	&& sed -e '2a#define CUSTOM_HOST 1' -e '2a#include "printf.h"' -e '2a#include <string.h>' -i nolibc.h \
	&& sed 's/main/nolibc_main/g;s/_start/main/g;s/nolibc_main();/\0\n#define main nolibc_main/g' -i nolibc.h \
	&& sed 's/\(memwrite(.*\)}/\1printf("%.*s",len,ptr);}/g' -i nolibc.h \
	&& sed 's/WARMUP.*$/WARMUP 1/g;s/UNROLL.*$/UNROLL 4/g;s/LOOP.*$/LOOP 8/1;s/RUNS.*$/RUNS 1/g' -i rvv/config.h \
	&& sed 's/\.\.\/nolibc/nolibc/g' -i rvv/main.c

RUN echo 'echo "vim gen.S && m4 gen.S > rvv/gen.S && ( make clean; make bin/rvv ) 2>&1 >/dev/null && app=rvv make -C /ara/hardware simv"' > ~/.bashrc

You can just execute the command that is printed once you run the container.
It opens gen.S, where you need to modify the contents of m_bench_all, which specifies which instructions to measure. The binary gets too big when you build all of them at once, so I recommend selecting 10-30 at a time.
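
One way to organize the "10-30 at a time" selection is to split the list of macro lines into fixed-size batches and paste one batch into m_bench_all per run. The Python sketch below only shows the chunking step. The helper name `batches`, the batch size of 20 and the placeholder entries are assumptions for illustration, and it does not edit gen.S for you.

```python
def batches(entries, size=20):
    """Yield consecutive slices of `entries`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


# Placeholder entries; in practice these would be the macro lines
# copied out of m_bench_all in gen.S.
entries = [f"entry_{i}" for i in range(45)]

for number, batch in enumerate(batches(entries), start=1):
    print(f"batch {number}: {len(batch)} entries, first is {batch[0]}")
```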

The following were the ones that failed to execute correctly:

	m_mask($1, bench_vrgathervx,     T_A,  m_mod_t0_vl,      vrgather.vx,     v8, v16, t0)
	m_mask($1, bench_vrgathervi,     T_A,  m_nop,            vrgather.vi,     v8, v16, 3)
	m_mask($1, bench_vredsumvs,  T_A, m_nop, vredsum.vs,  v8, v16, v24)
	m_mask($1, bench_vredandvs,  T_A, m_nop, vredand.vs,  v8, v16, v24)
	m_mask($1, bench_vredorvs,   T_A, m_nop, vredor.vs,   v8, v16, v24)
	m_mask($1, bench_vredxorvs,  T_A, m_nop, vredxor.vs,  v8, v16, v24)
	m_mask($1, bench_vredminuvs, T_A, m_nop, vredminu.vs, v8, v16, v24)
	m_mask($1, bench_vredminvs,  T_A, m_nop, vredmin.vs,  v8, v16, v24)
	m_mask($1, bench_vredmaxuvs, T_A, m_nop, vredmaxu.vs, v8, v16, v24)
	m_mask($1, bench_vredmaxvs,  T_A, m_nop, vredmax.vs,  v8, v16, v24)
	m_bench_vxim($1, T_A, vmadc)
	m_bench_vxm($1,  T_A, vmsbc)
	m_bench_vxi($1, T_A, vmseq)
	m_bench_vxi($1, T_A, vmsne)
	m_bench_vx($1,  T_A, vmsltu)
	m_bench_vx($1,  T_A, vmslt)
	m_bench_vxi($1, T_A, vmsleu)
	m_bench_vxi($1, T_A, vmsle)
	m_$1(bench_vcompressvm, T_A, m_nop, vcompress.vm, v8, v16, v24)
	m_$1(bench_vmandnmm, T_m1, m_nop, vmandn.mm, v8, v16, v24)
	m_$1(bench_vmandmm,  T_m1, m_nop, vmand.mm,  v8, v16, v24)
	m_$1(bench_vmormm,   T_m1, m_nop, vmor.mm,   v8, v16, v24)
	m_$1(bench_vmxormm,  T_m1, m_nop, vmxor.mm,  v8, v16, v24)
	m_$1(bench_vmornmm,  T_m1, m_nop, vmorn.mm,  v8, v16, v24)
	m_$1(bench_vmnandmm, T_m1, m_nop, vmnand.mm, v8, v16, v24)
	m_$1(bench_vmnormm,  T_m1, m_nop, vmnor.mm,  v8, v16, v24)
	m_$1(bench_vmxnormm, T_m1, m_nop, vmxnor.mm, v8, v16, v24)
	m_mask($1, bench_vfredosumvs, T_F, m_nop, vfredosum.vs, v8, v16, v24)
	m_mask($1, bench_vwredsumuvs, T_WR, m_nop, vwredsumu.vs, v8, v16, v24)
	m_mask($1, bench_vwredsumvs,  T_WR, m_nop, vwredsum.vs,  v8, v16, v24)
	m_mask($1, bench_vfwredosumvs, T_FWR, m_nop, vfwredosum.vs, v8, v16, v24)
	m_mask($1, bench_vfwredusumvs, T_FWR, m_nop, vfwredusum.vs, v8, v16, v24)
	m_mask($1, bench_vfirstm,  T_m1,  m_1bit, vfirst.m,  t0, v8)
	m_mask($1, bench_vzextvf2, T_E2, m_1bit, vzext.vf2, v8, v16)
	m_mask($1, bench_vsextvf2, T_E2, m_1bit, vsext.vf2, v8, v16)
	m_mask($1, bench_vzextvf4, T_E4, m_1bit, vzext.vf4, v8, v16)
	m_mask($1, bench_vsextvf4, T_E4, m_1bit, vsext.vf4, v8, v16)
	m_mask($1, bench_vzextvf8, T_E8, m_1bit, vzext.vf8, v8, v16)
	m_mask($1, bench_vsextvf8, T_E8, m_1bit, vsext.vf8, v8, v16)

@mp-17

mp-17 commented Jan 20, 2025

Hey @camel-cdr, @rseac, I should have fixed many of the old bugs you initially reported and open-sourced the missing RVV instruction support.

I am almost done with basic Linux support and my next priority is verification. I will start from these instructions, thanks for reporting!

@camel-cdr
Owner

@mp-17 Thanks, that's great to hear, I closed the old issue.

Is the vrgather performance I measured expected and/or how does the current implementation work/is supposed to work?

I saw the new AraXL paper, is that a fork of Ara or further development?

@rseac
Author

rseac commented Jan 22, 2025

@camel-cdr Thanks for providing this. These are single-instruction tests, is that correct? I'm assuming that you'd want these single-instruction tests to pass before trying the rvv-bench benchmarks themselves?

Do any of the benchmarks in bench/ go through? Or has that not been tried yet?

@camel-cdr
Owner

@rseac Yes, I prefer to get the instructions themselves working.

I tried a few of the benchmarks, here are results from a 4-lane configuration, for the things that worked:

{
title: "utf8 count",
labels: ["0","scalar","rvv_m1","rvv_m2","rvv_m4","rvv_m8","rvv_align_m1","rvv_align_m2","rvv_align_m4","rvv_align_m8",],
data: [
[1,5,13,29,61,125,253,509,1021,2045,4093,8189,16381,],
[0.0051546,0.0207468,0.0404984,0.0595482,0.0716803,0.0786658,0.0845588,0.0874570,0.0883906,0.0884515,0.0889821,0.0890902,0.0887272,],
[0.0045045,0.0263157,0.0714285,0.1629213,0.3446327,0.6410256,1.1822429,1.9728682,2.6246786,3.1656346,3.6940433,3.9771733,4.1189338,],
[0.0048543,0.0282485,0.0710382,0.1542553,0.3446327,0.6756756,1.1552511,1.9063670,2.9766763,3.7047101,4.2326783,4.6928366,4.9206969,],
[0.0052631,0.0270270,0.0714285,0.1629213,0.3096446,0.6476683,1.1822429,1.9063670,2.9766763,4.1064257,4.6991963,5.1438442,5.4205823,],
[0.0051813,0.0284090,0.0734463,0.1629213,0.3333333,0.6443298,1.1712962,1.9882812,3.0660660,4.1064257,4.9732685,5.4448138,5.6977391,],
[0.0043859,0.0218340,0.0599078,0.1250000,0.2595744,0.5081300,0.9547169,1.7254237,2.0460921,2.9048295,3.7310847,4.2628839,4.5975301,],
[0.0042918,0.0214592,0.0565217,0.1260869,0.2595744,0.5000000,0.9068100,1.6743421,2.6939313,3.2306477,4.2414507,5.0737298,5.5266531,],
[0.0040983,0.0211864,0.0534979,0.1124031,0.2711111,0.5144032,0.8971631,1.6472491,2.6519480,3.7870370,4.5376940,5.4556962,6.1931947,],
[0.0041152,0.0215517,0.0570175,0.1283185,0.2618025,0.4921259,0.9730769,1.7312925,2.6727748,3.8295880,4.7983587,5.8871315,6.6052419,],
],
},
{
title: "ascii to utf16",
labels: ["0","scalar","rvv_ext_m1","rvv_ext_m2","rvv_ext_m4","rvv_vsseg_m1","rvv_vsseg_m2","rvv_vsseg_m4","rvv_vss_m1","rvv_vss_m2","rvv_vss_m4",],
data: [
[1,5,13,29,61,125,253,509,1021,2045,4093,8189,],
[0.0062893,0.0273224,0.0562770,0.0840579,0.1035653,0.1229105,0.1323914,0.1373819,0.1396907,0.1412780,0.1420637,0.1410170,],
[0.0062500,0.0352112,0.0915492,0.2042253,0.4236111,0.8680555,0.7552238,0.8330605,0.8572628,1.5341335,0.8762577,2.5808383,],
[0.0062500,0.0347222,0.0915492,0.2013888,0.4295774,0.6443298,1.6012658,0.8303425,0.8940455,2.6285347,0.8913327,1.6193395,],
[0.0062500,0.0352112,0.0915492,0.2042253,0.4295774,0.6410256,1.2105263,0.8330605,2.4902439,0.8910675,0.8991652,0.9288793,],
[0.0048543,0.0213675,0.0415335,0.0609243,0.0767295,0.0865051,0.0924369,0.0956946,0.0970347,0.0977440,0.0980899,0.0982672,],
[0.0042372,0.0189393,0.0377906,0.0570866,0.0734055,0.0848608,0.0914347,0.0950868,0.0970624,0.0979031,0.0983279,0.0985427,],
[0.0033333,0.0152439,0.0318627,0.0508771,0.0684624,0.0812215,0.0893992,0.0939634,0.0964754,0.0977673,0.0983326,0.0986270,],
[0.0046728,0.0290697,0.0687830,0.1271929,0.2006578,0.3156565,0.3285714,0.3661870,0.3832582,0.4884165,0.4958207,0.3997168,],
[0.0047393,0.0245098,0.0596330,0.1203319,0.2155477,0.3156565,0.3940809,0.3656609,0.4811498,0.3973188,0.4021813,0.4046148,],
[0.0036101,0.0186567,0.0460992,0.0944625,0.1747851,0.2847380,0.3285714,0.4480633,0.3880653,0.3996482,0.5805673,0.4069472,],
]
},
{
title: "ascii to utf16 aligned",
labels: ["0","scalar","rvv_ext_m1","rvv_ext_m2","rvv_ext_m4","rvv_vsseg_m1","rvv_vsseg_m2","rvv_vsseg_m4","rvv_vss_m1","rvv_vss_m2","rvv_vss_m4",],
data: [
[1,5,13,29,61,125,253,509,1021,2045,4093,8189,],
[0.0064935,0.0268817,0.0548523,0.0830945,0.1064572,0.1224289,0.1319770,0.1372337,0.1399972,0.1412195,0.1420342,0.1409878,],
[0.0056497,0.0337837,0.0878378,0.1959459,0.4121621,0.8223684,1.5426829,2.1208333,2.4023529,2.5030599,2.5565271,2.5840959,],
[0.0064935,0.0337837,0.0878378,0.1959459,0.4121621,0.8223684,1.5426829,2.1208333,2.4543269,2.6455368,2.6980883,2.7251247,],
[0.0064935,0.0337837,0.0878378,0.1959459,0.4121621,0.8223684,1.5426829,2.1208333,2.4543269,2.7122015,2.7692828,2.7987012,],
[0.0050505,0.0209205,0.0407523,0.0604166,0.0759651,0.0862663,0.0923020,0.0955868,0.0969979,0.0977626,0.0981017,0.0982719,],
[0.0043478,0.0184501,0.0370370,0.0566406,0.0730538,0.0844024,0.0912369,0.0950158,0.0970071,0.0979218,0.0983374,0.0985462,],
[0.0034013,0.0149253,0.0313253,0.0503472,0.0678531,0.0809061,0.0891787,0.0938941,0.0964208,0.0977767,0.0983421,0.0986270,],
[0.0048543,0.0279329,0.0677083,0.1348837,0.2319391,0.3369272,0.4324786,0.5014778,0.5362394,0.5570689,0.5663484,0.5710997,],
[0.0048780,0.0236966,0.0580357,0.1174089,0.2110726,0.3280839,0.4324786,0.5014778,0.5451147,0.5666389,0.5762353,0.5811510,],
[0.0037174,0.0181818,0.0451388,0.0932475,0.1728045,0.2808988,0.4035087,0.5014778,0.5451147,0.5712290,0.5809794,0.5859749,],
]
},

It's interesting to see that alignment matters a lot.
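
To put a number on that, the Python sketch below compares the rvv_ext_m1 rows of the "ascii to utf16" and "ascii to utf16 aligned" results above, size by size. The values are copied from those two data blocks. The variable names and the `ratios` helper are made up for this illustration, and the unit of the measurements is not stated in the thread, so only the ratio is interpreted.

```python
# Input sizes (first row of each data block).
SIZES = [1, 5, 13, 29, 61, 125, 253, 509, 1021, 2045, 4093, 8189]

# rvv_ext_m1 row from "ascii to utf16".
EXT_M1_UNALIGNED = [
    0.0062500, 0.0352112, 0.0915492, 0.2042253, 0.4236111, 0.8680555,
    0.7552238, 0.8330605, 0.8572628, 1.5341335, 0.8762577, 2.5808383,
]

# rvv_ext_m1 row from "ascii to utf16 aligned".
EXT_M1_ALIGNED = [
    0.0056497, 0.0337837, 0.0878378, 0.1959459, 0.4121621, 0.8223684,
    1.5426829, 2.1208333, 2.4023529, 2.5030599, 2.5565271, 2.5840959,
]


def ratios(aligned, unaligned):
    """Aligned result divided by unaligned result, per input size."""
    return [a / u for a, u in zip(aligned, unaligned)]


for size, ratio in zip(SIZES, ratios(EXT_M1_ALIGNED, EXT_M1_UNALIGNED)):
    print(f"{size:>5}: aligned/unaligned = {ratio:.2f}")
```

For this row the two runs are close up to size 125, after which the unaligned results become erratic while the aligned ones keep rising smoothly. The largest gap is at size 4093, and the two nearly coincide again at 8189.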
