RVV Bench on ARA #17

rseac · 2025-01-08T17:44:51Z

Is there interest in getting these benchmarks to work on the ARA vector engine? Has this already been started?

camel-cdr · 2025-01-08T17:52:19Z

Yes, I tried about half a year ago, but ara still had some bugs, which broke most of the benchmarks.
I think they should be fixed now, but I haven't gotten my old build scripts working again.
I'll try to get it working again. Have you worked with the project? If so, I'd appreciate some help in getting a docker environment set up, similar to the scripts in the wiki: https://github.com/camel-cdr/rvv-bench/wiki

rseac · 2025-01-08T20:37:58Z

@camel-cdr I do have a docker container for the ARA. Here is the repo for it. https://github.com/rseac/pulp-ara-docker

camel-cdr · 2025-01-08T20:46:11Z

@rseac Thanks a lot! I'll try to use it.

camel-cdr · 2025-01-09T17:26:10Z

@rseac Your docker worked great, ara unfortunately didn't. I tried running my instruction throughput benchmark, but about 5% of instructions hang the simulation with certain valid vtype configurations.

The most problematic examples were the integer comparisons (vmseq, ...), except for vmsgtu/vmsgt, which hung with SEW=16 and logical mask instructions. Others are vrgather.vx, vcompress.vm, vredsum.vs, vmadc, and zext.

vrgather.vv works, but is extremely slow at 6-8 cycles per element.

It's still measuring the last handful of instructions, but if you want, I can share the results of the instructions that worked once it completes.
Most regular instructions executed in 16/32/64/128 cycles for LMUL=1/2/4/8. The masked variants take 25/41/73/137 cycles, and most float instructions take 29/45/77/141.
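
As a quick sanity check, the cycle counts above fit a linear model of 16 cycles per LMUL step plus a constant offset per instruction class. The sketch below only redoes that arithmetic on the reported numbers; reading the offset as a fixed startup cost is an interpretation, not something that was measured separately, and the names `measured` and `fixed_overhead` are made up for this illustration.

```python
# Cycle counts reported above, keyed by LMUL.
measured = {
    "regular": {1: 16, 2: 32, 4: 64, 8: 128},
    "masked":  {1: 25, 2: 41, 4: 73, 8: 137},
    "float":   {1: 29, 2: 45, 4: 77, 8: 141},
}

CYCLES_PER_LMUL = 16  # slope read off the regular-instruction row


def fixed_overhead(cycles_by_lmul):
    """Return the offset if cycles == 16 * LMUL + offset for every LMUL, else None."""
    offsets = {
        cycles - CYCLES_PER_LMUL * lmul
        for lmul, cycles in cycles_by_lmul.items()
    }
    return offsets.pop() if len(offsets) == 1 else None


overheads = {kind: fixed_overhead(row) for kind, row in measured.items()}
for kind, overhead in overheads.items():
    print(f"{kind}: cycles = {CYCLES_PER_LMUL} * LMUL + {overhead}")
```

With these inputs the offsets come out as 0 for regular, 9 for masked, and 13 for float instructions.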

rseac · 2025-01-10T12:50:00Z

@camel-cdr Yes, I'd be happy to take a look at the results. I could even try to post some of these problems you notice on the ARA issues once I understand them myself.

rseac · 2025-01-14T11:59:43Z

@camel-cdr Did you use the same scripts as linked earlier? Or specific ones to target the ARA core?

camel-cdr · 2025-01-16T22:50:40Z

@rseac Sorry for the late response.

Here are the measurements I managed to run: log.txt
A few instructions that probably work are missing, but the log should cover most of the working ones.

I used your buildscript with small modifications:

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt -y update && apt -y upgrade && apt-get --no-install-recommends -y install \
		build-essential git zlib1g zlib1g-dev pkg-config cmake vim \
		ninja-build python3 texinfo device-tree-compiler \
		autoconf automake bc bison clang flex \
		ca-certificates ccache libfl2 libfl-dev help2man \
		curl libelf-dev python3-numpy \
	&& apt-get clean \
	&& rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/pulp-platform/ara.git

WORKDIR /ara

RUN git config --global url."https://github.com/".insteadOf "git@github.com:";
RUN git submodule update --init --recursive
RUN make toolchain-llvm
RUN make riscv-isa-sim
RUN make verilator

RUN /usr/bin/install -c /ara/install/verilator/bin/verilator_bin /ara/install/verilator/share/verilator/
RUN cd hardware && make checkout && make apply-patches && make verilate

WORKDIR /ara/apps

RUN git clone https://github.com/camel-cdr/rvv-bench \
	&& cp rvv-bench/nolibc.h . \
	&& mkdir rvv \
	&& cp rvv-bench/instructions/rvv/gen.S . \
	&& cp rvv-bench/instructions/rvv/config.h rvv-bench/instructions/rvv/main.c rvv \
	&& sed -e '2a#define CUSTOM_HOST 1' -e '2a#include "printf.h"' -e '2a#include <string.h>' -i nolibc.h \
	&& sed 's/main/nolibc_main/g;s/_start/main/g;s/nolibc_main();/\0\n#define main nolibc_main/g' -i nolibc.h \
	&& sed 's/\(memwrite(.*\)}/\1printf("%.*s",len,ptr);}/g' -i nolibc.h \
	&& sed 's/WARMUP.*$/WARMUP 1/g;s/UNROLL.*$/UNROLL 4/g;s/LOOP.*$/LOOP 8/1;s/RUNS.*$/RUNS 1/g' -i rvv/config.h \
	&& sed 's/\.\.\/nolibc/nolibc/g' -i rvv/main.c

RUN echo 'echo "vim gen.S && m4 gen.S > rvv/gen.S && ( make clean; make bin/rvv ) 2>&1 >/dev/null && app=rvv make -C /ara/hardware simv"' > ~/.bashrc

You can just execute the command printed once you run the container.
It opens gen.S, where you need to modify the contents of m_bench_all, which specifies which instructions to measure. The binary gets too big when you build all of them at once, so I recommend selecting 10-30 at a time.
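
If you keep the full list of macro lines somewhere, splitting it into chunks of that size can be scripted. This is only a sketch of the batching step: the `batches` helper and the `op0`, `op1`, ... entries are placeholders invented for illustration, not real instructions, and pasting each batch into m_bench_all is still a manual step.

```python
def batches(lines, size=20):
    """Yield consecutive groups of at most `size` non-empty lines."""
    kept = [line for line in lines if line.strip()]
    for start in range(0, len(kept), size):
        yield kept[start:start + size]


# Hypothetical stand-in for the body of m_bench_all (placeholder op names).
example = [f"\tm_bench_vxi($1, T_A, op{i})" for i in range(45)]

groups = list(batches(example, size=20))
for number, group in enumerate(groups, start=1):
    print(f"batch {number}: {len(group)} entries")
```

A size of 20 sits in the middle of the 10-30 range suggested above; lower it if the binary still gets too big.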

The following were the ones that failed to execute correctly:

	m_mask($1, bench_vrgathervx,     T_A,  m_mod_t0_vl,      vrgather.vx,     v8, v16, t0)
	m_mask($1, bench_vrgathervi,     T_A,  m_nop,            vrgather.vi,     v8, v16, 3)
	m_mask($1, bench_vredsumvs,  T_A, m_nop, vredsum.vs,  v8, v16, v24)
	m_mask($1, bench_vredandvs,  T_A, m_nop, vredand.vs,  v8, v16, v24)
	m_mask($1, bench_vredorvs,   T_A, m_nop, vredor.vs,   v8, v16, v24)
	m_mask($1, bench_vredxorvs,  T_A, m_nop, vredxor.vs,  v8, v16, v24)
	m_mask($1, bench_vredminuvs, T_A, m_nop, vredminu.vs, v8, v16, v24)
	m_mask($1, bench_vredminvs,  T_A, m_nop, vredmin.vs,  v8, v16, v24)
	m_mask($1, bench_vredmaxuvs, T_A, m_nop, vredmaxu.vs, v8, v16, v24)
	m_mask($1, bench_vredmaxvs,  T_A, m_nop, vredmax.vs,  v8, v16, v24)
	m_bench_vxim($1, T_A, vmadc)
	m_bench_vxm($1,  T_A, vmsbc)
	m_bench_vxi($1, T_A, vmseq)
	m_bench_vxi($1, T_A, vmsne)
	m_bench_vx($1,  T_A, vmsltu)
	m_bench_vx($1,  T_A, vmslt)
	m_bench_vxi($1, T_A, vmsleu)
	m_bench_vxi($1, T_A, vmsle)
	m_$1(bench_vcompressvm, T_A, m_nop, vcompress.vm, v8, v16, v24)
	m_$1(bench_vmandnmm, T_m1, m_nop, vmandn.mm, v8, v16, v24)
	m_$1(bench_vmandmm,  T_m1, m_nop, vmand.mm,  v8, v16, v24)
	m_$1(bench_vmormm,   T_m1, m_nop, vmor.mm,   v8, v16, v24)
	m_$1(bench_vmxormm,  T_m1, m_nop, vmxor.mm,  v8, v16, v24)
	m_$1(bench_vmornmm,  T_m1, m_nop, vmorn.mm,  v8, v16, v24)
	m_$1(bench_vmnandmm, T_m1, m_nop, vmnand.mm, v8, v16, v24)
	m_$1(bench_vmnormm,  T_m1, m_nop, vmnor.mm,  v8, v16, v24)
	m_$1(bench_vmxnormm, T_m1, m_nop, vmxnor.mm, v8, v16, v24)
	m_mask($1, bench_vfredosumvs, T_F, m_nop, vfredosum.vs, v8, v16, v24)
	m_mask($1, bench_vwredsumuvs, T_WR, m_nop, vwredsumu.vs, v8, v16, v24)
	m_mask($1, bench_vwredsumvs,  T_WR, m_nop, vwredsum.vs,  v8, v16, v24)
	m_mask($1, bench_vfwredosumvs, T_FWR, m_nop, vfwredosum.vs, v8, v16, v24)
	m_mask($1, bench_vfwredusumvs, T_FWR, m_nop, vfwredusum.vs, v8, v16, v24)
	m_mask($1, bench_vfirstm,  T_m1,  m_1bit, vfirst.m,  t0, v8)
	m_mask($1, bench_vzextvf2, T_E2, m_1bit, vzext.vf2, v8, v16)
	m_mask($1, bench_vsextvf2, T_E2, m_1bit, vsext.vf2, v8, v16)
	m_mask($1, bench_vzextvf4, T_E4, m_1bit, vzext.vf4, v8, v16)
	m_mask($1, bench_vsextvf4, T_E4, m_1bit, vsext.vf4, v8, v16)
	m_mask($1, bench_vzextvf8, T_E8, m_1bit, vzext.vf8, v8, v16)
	m_mask($1, bench_vsextvf8, T_E8, m_1bit, vsext.vf8, v8, v16)

mp-17 · 2025-01-20T15:52:13Z

Hey @camel-cdr, @rseac, I should have fixed many of the old bugs you initially reported and open-sourced the missing RVV instruction support.

I am almost done with basic Linux support and my next priority is verification. I will start from these instructions, thanks for reporting!

camel-cdr · 2025-01-20T17:22:16Z

@mp-17 Thanks, that's great to hear, I closed the old issue.

Is the vrgather performance I measured expected? How does the current implementation work, or how is it supposed to work?

I saw the new AraXL paper, is that a fork of Ara or further development?

rseac · 2025-01-22T16:46:49Z

@camel-cdr Thanks for providing this. These are single-instruction tests, is that correct? I'm assuming that you'd want these single-instruction tests to pass before trying the rvv-bench benchmarks themselves?

Do any of the benchmarks in bench/ go through? Or has that not been tried yet?

camel-cdr · 2025-01-22T23:56:15Z

@rseac Yes, I prefer to get the instructions themselves working.

I tried a few of the benchmarks, here are results from a 4-lane configuration, for the things that worked:

{
title: "utf8 count",
labels: ["0","scalar","rvv_m1","rvv_m2","rvv_m4","rvv_m8","rvv_align_m1","rvv_align_m2","rvv_align_m4","rvv_align_m8",],
data: [
[1,5,13,29,61,125,253,509,1021,2045,4093,8189,16381,],
[0.0051546,0.0207468,0.0404984,0.0595482,0.0716803,0.0786658,0.0845588,0.0874570,0.0883906,0.0884515,0.0889821,0.0890902,0.0887272,],
[0.0045045,0.0263157,0.0714285,0.1629213,0.3446327,0.6410256,1.1822429,1.9728682,2.6246786,3.1656346,3.6940433,3.9771733,4.1189338,],
[0.0048543,0.0282485,0.0710382,0.1542553,0.3446327,0.6756756,1.1552511,1.9063670,2.9766763,3.7047101,4.2326783,4.6928366,4.9206969,],
[0.0052631,0.0270270,0.0714285,0.1629213,0.3096446,0.6476683,1.1822429,1.9063670,2.9766763,4.1064257,4.6991963,5.1438442,5.4205823,],
[0.0051813,0.0284090,0.0734463,0.1629213,0.3333333,0.6443298,1.1712962,1.9882812,3.0660660,4.1064257,4.9732685,5.4448138,5.6977391,],
[0.0043859,0.0218340,0.0599078,0.1250000,0.2595744,0.5081300,0.9547169,1.7254237,2.0460921,2.9048295,3.7310847,4.2628839,4.5975301,],
[0.0042918,0.0214592,0.0565217,0.1260869,0.2595744,0.5000000,0.9068100,1.6743421,2.6939313,3.2306477,4.2414507,5.0737298,5.5266531,],
[0.0040983,0.0211864,0.0534979,0.1124031,0.2711111,0.5144032,0.8971631,1.6472491,2.6519480,3.7870370,4.5376940,5.4556962,6.1931947,],
[0.0041152,0.0215517,0.0570175,0.1283185,0.2618025,0.4921259,0.9730769,1.7312925,2.6727748,3.8295880,4.7983587,5.8871315,6.6052419,],
],
},
{
title: "ascii to utf16",
labels: ["0","scalar","rvv_ext_m1","rvv_ext_m2","rvv_ext_m4","rvv_vsseg_m1","rvv_vsseg_m2","rvv_vsseg_m4","rvv_vss_m1","rvv_vss_m2","rvv_vss_m4",],
data: [
[1,5,13,29,61,125,253,509,1021,2045,4093,8189,],
[0.0062893,0.0273224,0.0562770,0.0840579,0.1035653,0.1229105,0.1323914,0.1373819,0.1396907,0.1412780,0.1420637,0.1410170,],
[0.0062500,0.0352112,0.0915492,0.2042253,0.4236111,0.8680555,0.7552238,0.8330605,0.8572628,1.5341335,0.8762577,2.5808383,],
[0.0062500,0.0347222,0.0915492,0.2013888,0.4295774,0.6443298,1.6012658,0.8303425,0.8940455,2.6285347,0.8913327,1.6193395,],
[0.0062500,0.0352112,0.0915492,0.2042253,0.4295774,0.6410256,1.2105263,0.8330605,2.4902439,0.8910675,0.8991652,0.9288793,],
[0.0048543,0.0213675,0.0415335,0.0609243,0.0767295,0.0865051,0.0924369,0.0956946,0.0970347,0.0977440,0.0980899,0.0982672,],
[0.0042372,0.0189393,0.0377906,0.0570866,0.0734055,0.0848608,0.0914347,0.0950868,0.0970624,0.0979031,0.0983279,0.0985427,],
[0.0033333,0.0152439,0.0318627,0.0508771,0.0684624,0.0812215,0.0893992,0.0939634,0.0964754,0.0977673,0.0983326,0.0986270,],
[0.0046728,0.0290697,0.0687830,0.1271929,0.2006578,0.3156565,0.3285714,0.3661870,0.3832582,0.4884165,0.4958207,0.3997168,],
[0.0047393,0.0245098,0.0596330,0.1203319,0.2155477,0.3156565,0.3940809,0.3656609,0.4811498,0.3973188,0.4021813,0.4046148,],
[0.0036101,0.0186567,0.0460992,0.0944625,0.1747851,0.2847380,0.3285714,0.4480633,0.3880653,0.3996482,0.5805673,0.4069472,],
]
},
{
title: "ascii to utf16 aligned",
labels: ["0","scalar","rvv_ext_m1","rvv_ext_m2","rvv_ext_m4","rvv_vsseg_m1","rvv_vsseg_m2","rvv_vsseg_m4","rvv_vss_m1","rvv_vss_m2","rvv_vss_m4",],
data: [
[1,5,13,29,61,125,253,509,1021,2045,4093,8189,],
[0.0064935,0.0268817,0.0548523,0.0830945,0.1064572,0.1224289,0.1319770,0.1372337,0.1399972,0.1412195,0.1420342,0.1409878,],
[0.0056497,0.0337837,0.0878378,0.1959459,0.4121621,0.8223684,1.5426829,2.1208333,2.4023529,2.5030599,2.5565271,2.5840959,],
[0.0064935,0.0337837,0.0878378,0.1959459,0.4121621,0.8223684,1.5426829,2.1208333,2.4543269,2.6455368,2.6980883,2.7251247,],
[0.0064935,0.0337837,0.0878378,0.1959459,0.4121621,0.8223684,1.5426829,2.1208333,2.4543269,2.7122015,2.7692828,2.7987012,],
[0.0050505,0.0209205,0.0407523,0.0604166,0.0759651,0.0862663,0.0923020,0.0955868,0.0969979,0.0977626,0.0981017,0.0982719,],
[0.0043478,0.0184501,0.0370370,0.0566406,0.0730538,0.0844024,0.0912369,0.0950158,0.0970071,0.0979218,0.0983374,0.0985462,],
[0.0034013,0.0149253,0.0313253,0.0503472,0.0678531,0.0809061,0.0891787,0.0938941,0.0964208,0.0977767,0.0983421,0.0986270,],
[0.0048543,0.0279329,0.0677083,0.1348837,0.2319391,0.3369272,0.4324786,0.5014778,0.5362394,0.5570689,0.5663484,0.5710997,],
[0.0048780,0.0236966,0.0580357,0.1174089,0.2110726,0.3280839,0.4324786,0.5014778,0.5451147,0.5666389,0.5762353,0.5811510,],
[0.0037174,0.0181818,0.0451388,0.0932475,0.1728045,0.2808988,0.4035087,0.5014778,0.5451147,0.5712290,0.5809794,0.5859749,],
]
},

It's interesting to see that alignment matters a lot.
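
To put a number on that, the sketch below divides the aligned rvv_ext_m1 row by the unaligned one, with both rows copied from the "ascii to utf16" results above. It assumes the values are throughput figures where higher is better; the variable names are only for this illustration.

```python
# Input sizes and rvv_ext_m1 rows, copied from the results above.
sizes = [1, 5, 13, 29, 61, 125, 253, 509, 1021, 2045, 4093, 8189]
unaligned = [0.0062500, 0.0352112, 0.0915492, 0.2042253, 0.4236111, 0.8680555,
             0.7552238, 0.8330605, 0.8572628, 1.5341335, 0.8762577, 2.5808383]
aligned = [0.0056497, 0.0337837, 0.0878378, 0.1959459, 0.4121621, 0.8223684,
           1.5426829, 2.1208333, 2.4023529, 2.5030599, 2.5565271, 2.5840959]

# Ratio above 1 means the aligned run was faster (assuming higher is better).
ratios = [a / u for a, u in zip(aligned, unaligned)]

for size, ratio in zip(sizes, ratios):
    print(f"{size:>5}: aligned/unaligned = {ratio:.2f}x")
```

Up to size 125 the two rows are close. From 253 on, the unaligned row jumps around while the aligned row grows smoothly: at 4093 the aligned variant is roughly 2.9x the unaligned one, while at 8189 the two are nearly equal.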
