radvd encounters segmentation fault on boot #174

Closed
johnkisch opened this issue Mar 2, 2022 · 24 comments

@johnkisch

johnkisch commented Mar 2, 2022

Hello,

I'm having an issue where the radvd daemon (version 2.19) encounters a segmentation fault when the daemon is started at boot time.

The system with this issue is running Alpine Linux 3.15:

hydra:~# cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.15.0
PRETTY_NAME="Alpine Linux v3.15"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"
hydra:~#

NOTE: Alpine Linux is based on the musl C Standard Library. Alpine Linux uses the OpenRC init system.

I was able to capture a core dump; the backtrace reads as follows:

Reading symbols from /usr/sbin/radvd...
(No debugging symbols found in /usr/sbin/radvd)
[New LWP 2596]
Core was generated by `/usr/sbin/radvd -C /etc/radvd.conf -p /run/radvd/radvd.pid -u radvd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ffb03ac45fc in strcmp (l=0x7ffb03b0b140 "eth1", r=0x1495ba68 <error: Cannot access memory at address 0x1495ba68>) at src/string/strcmp.c:5
5       src/string/strcmp.c: No such file or directory.
(gdb) backtrace
#0  0x00007ffb03ac45fc in strcmp (l=0x7ffb03b0b140 "eth1", r=0x1495ba68 <error: Cannot access memory at address 0x1495ba68>) at src/string/strcmp.c:5
#1  0x0000561756851e04 in ?? ()
#2  0x0000561756858125 in ?? ()
#3  0x00005617568531c5 in ?? ()
#4  0x0000561756850108 in ?? ()
#5  0x00007ffb03a90a03 in libc_start_main_stage2 (main=0x56175684f610, argc=7, argv=0x7ffe1495d458) at src/env/__libc_start_main.c:94
#6  0x00005617568501c9 in ?? ()
#7  0x0000000000000007 in ?? ()
#8  0x00007ffe1495eebe in ?? ()
#9  0x00007ffe1495eece in ?? ()
#10 0x00007ffe1495eed1 in ?? ()
#11 0x00007ffe1495eee1 in ?? ()
#12 0x00007ffe1495eee4 in ?? ()
#13 0x00007ffe1495eef9 in ?? ()
#14 0x00007ffe1495eefc in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) 

After the system has finished booting, running rc-service radvd stop; rc-service radvd start starts the daemon successfully. Perhaps the radvd daemon is being started before the eth1 device is up? That would explain why the daemon starts successfully once the system has finished coming up.
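One way to check the ordering hypothesis above is to wait for the interface to appear before starting the daemon. The following is only a minimal sketch: the function name `wait_for_iface` and the `SYSFS_NET` override are hypothetical, and it assumes a Linux system where interfaces show up under /sys/class/net.

```shell
#!/bin/sh
# Hypothetical helper, not part of radvd or OpenRC.
# SYSFS_NET can be overridden for a dry run; the default assumes Linux sysfs.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Return 0 once the interface directory exists, or 1 if it has not
# appeared after the given number of one-second retries (default 10).
wait_for_iface() {
    iface=$1
    tries=${2:-10}
    while [ ! -e "$SYSFS_NET/$iface" ]; do
        [ "$tries" -gt 0 ] || return 1
        tries=$((tries - 1))
        sleep 1
    done
    return 0
}

# Example (not executed here):
#   wait_for_iface eth1 30 && rc-service radvd start
```

If radvd still crashes after the interface is confirmed present, startup ordering alone is unlikely to be the cause.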

I've also opened an issue with the folks at Alpine Linux, as seen here:

https://gitlab.alpinelinux.org/alpine/aports/-/issues/13570

Please let me know if there's any further information that I can provide.

Thanks.

@stappersg
Member

stappersg commented Mar 2, 2022 via email

@johnkisch
Author

See if the OpenRC init system has something like "start this process after the network interfaces are up".

Regards,
Geert Stappers
--
Silence is hard to parse

OpenRC has a parameter that can be set in /etc/rc.conf called rc_depend_strict which essentially will not allow services that depend on net to start before all interfaces are up. I have the following configured in my /etc/rc.conf:

# Do we allow any started service in the runlevel to satisfy the dependency
# or do we want all of them regardless of state? For example, if net.eth0
# and net.eth1 are in the default runlevel then with rc_depend_strict="NO"
# both will be started, but services that depend on 'net' will work if either
# one comes up. With rc_depend_strict="YES" we would require them both to
# come up.
rc_depend_strict="YES"

I still receive a segfault for radvd on boot with this set in /etc/rc.conf.

@robbat2
Member

robbat2 commented Mar 4, 2022

  • You didn't specify what version of radvd you're running at all.
  • Can you easily try the tip of Git as well?
  • Is it using netifrc as well as OpenRC?
  • If you add rc_after=net.eth1 into /etc/conf.d/radvd does it start working? What about rc_need=net.eth1?

@johnkisch
Author

Hi Robin,

Whoops, missed that! This is radvd version 2.19.

hydra:~# radvd --version
Version: 2.19

Compiled in settings:
  default config file           "/etc/radvd.conf"
  default pidfile               "/run/radvd/radvd.pid"
  default logfile               "/var/log/radvd.log"
  default syslog facility       24
Please send bug reports or suggestions to Reuben Hawkins <[email protected]>.
hydra:~#

I'm using ifupdown-ng for interface configuration. Unfortunately, neither rc_after=net.eth1 nor rc_need=net.eth1 in /etc/conf.d/radvd resolves the issue.

I'll go ahead and give building directly from the repo a shot here and post an update once I do so.

Thanks!

@robbat2
Member

robbat2 commented Mar 19, 2022

@johnkisch did the latest version work for you?

@nopeno

nopeno commented Apr 3, 2022

My Alpine box has the same problem. My config is:

--------- /etc/network/interfaces--------

auto lo
iface lo inet loopback

allow-hotplug wan0
auto wan0
iface wan0 inet static
address 192.168.1.33
netmask 255.255.255.0
broadcast 192.168.1.255
pre-up /sbin/ip link set wan0 up
up ifup ppp0=telecom
down ifdown ppp0=telecom
post-down /sbin/ip link set wan0 up

auto ppp0
iface ppp0 inet ppp
provider telecom

auto br0
iface br0 inet static
bridge-ports fib1
bridge-stp 0
address 192.168.2.253
netmask 255.255.255.0

iface fib0 inet manual
iface fib0 inet6 manual
iface fib1 inet manual
iface fib1 inet6 manual

------------------ /etc/radvd.conf ---------
interface br0 {
AdvSendAdvert on;
AdvManagedFlag off;
AdvOtherConfigFlag on;
AdvLinkMTU 1480;
prefix ::/64 {
AdvOnLink on;
AdvRouterAddr on;
};
};

---- and the following is dmesg
[ 12.144768] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 12.147709] br0: port 1(fib1) entered blocking state
[ 12.147714] br0: port 1(fib1) entered disabled state
[ 12.147778] device fib1 entered promiscuous mode
[ 12.159114] br0: port 1(fib1) entered blocking state
[ 12.159118] br0: port 1(fib1) entered forwarding state
[ 12.685266] 8021q: 802.1Q VLAN Support v1.8
[ 12.685290] 8021q: adding VLAN 0 to HW filter on device fib1
[ 12.686407] 8021q: adding VLAN 0 to HW filter on device wan0
[ 12.717712] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[ 12.720037] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[ 12.720501] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[ 12.720506] cfg80211: failed to load regulatory.db
[ 13.704173] igb 0000:01:00.0 wan0: igb: wan0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[ 13.704478] IPv6: ADDRCONF(NETDEV_CHANGE): wan0: link becomes ready
[ 51.753464] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[ 52.643188] Bridge firewalling registered
[ 52.667737] Initializing XFRM netlink socket
[ 66.521759] radvd[3472]: segfault at 4420dbfc ip 00007f3d36818f2e sp 00007ffd4420d2a8 error 4 in ld-musl-x86_64.so.1[7f3d367de000+48000]
[ 66.521794] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[ 224.471533] radvd[3932]: segfault at 27b5ed0c ip 00007fcc17451f2e sp 00007ffd27b5e3b8 error 4 in ld-musl-x86_64.so.1[7fcc17417000+48000]
[ 224.471569] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[ 650.240911] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[ 729.496981] radvd[4300]: segfault at 6b95a33c ip 00007f392ac9bf2e sp 00007fff6b9599e8 error 4 in ld-musl-x86_64.so.1[7f392ac61000+48000]
[ 729.497018] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[ 1260.259938] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[ 1282.004566] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[ 1412.341256] radvd[5736]: segfault at ffffffffdaa413bc ip 00007f1975b68f2e sp 00007ffcdaa40a68 error 5 in ld-musl-x86_64.so.1[7f1975b2e000+48000]
[ 1412.341293] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[11363.172526] mlx4_en: fib1: Link Down
[11363.172898] br0: port 1(fib1) entered disabled state
[11432.424962] mlx4_en: fib1: Link Up
[11432.426525] br0: port 1(fib1) entered blocking state
[11432.426540] br0: port 1(fib1) entered forwarding state
[29611.854959] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[29726.325641] radvd[9015]: segfault at ffffffffc9763f7c ip 00007fa38a432f2e sp 00007ffdc9763628 error 5 in ld-musl-x86_64.so.1[7fa38a3f8000+48000]
[29726.325678] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[30199.993401] IPv4: martian source 255.255.255.255 from 192.168.88.1, on dev br0
[30199.993422] ll header: 00000000: ff ff ff ff ff ff 2c c8 1b a9 45 6d 08 00
[30224.546276] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[30237.972864] radvd[9605]: segfault at ffffffffdb22d75c ip 00007f2145672f2e sp 00007ffddb22ce08 error 5 in ld-musl-x86_64.so.1[7f2145638000+48000]
[30237.972900] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
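To compare crashes across boots, where the library is mapped at a different base address each time, it helps to reduce each kernel segfault line to an offset inside ld-musl-x86_64.so.1. This is a rough sketch that assumes the line format shown above; the function name `segv_offset` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the offset of the faulting instruction within
# the mapped object, given one kernel line of the form
#   "... segfault at ADDR ip IP sp SP error N in LIB[BASE+SIZE]"
segv_offset() {
    line=$1
    # Instruction pointer: the hex word following " ip ".
    ip=$(printf '%s\n' "$line" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    # Mapping base: the hex word between "[" and "+".
    base=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\].*/\1/p')
    [ -n "$ip" ] && [ -n "$base" ] || return 1
    printf '0x%x\n' "$(( 0x$ip - 0x$base ))"
}

# Example with the first segfault line from the log above:
segv_offset '[ 66.521759] radvd[3472]: segfault at 4420dbfc ip 00007f3d36818f2e sp 00007ffd4420d2a8 error 4 in ld-musl-x86_64.so.1[7f3d367de000+48000]'
# prints 0x3af2e
```

Identical offsets across runs suggest the same libc routine faults each time. Mapping an offset back to a function name would still need the matching musl binary or its debug symbols.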

@nopeno

nopeno commented Apr 3, 2022

Just now, it crashed again. I found these in dmesg:

[48965.408094] radvd[10207]: segfault at fffffffffd7f8018 ip 00007fdac539e5fc sp 00007ffdfd7f7f78 error 5 in ld-musl-x86_64.so.1[7fdac5363000+48000]
[48965.408115] Code: 48 09 c8 4c 85 c8 75 0d 49 83 c4 08 eb d4 39 f0 74 0c 49 ff c4 41 0f b6 04 24 84 c0 75 f0 4c 89 e0 41 5c c3 31 c9 0f b6 04 0f <0f> b6 14 0e 38 d0 75 07 48 ff c1 84 c0 75 ed 29 d0 c3 41 54 49 89
[48965.409217] br0: port 2(tap0) entered blocking state
[48965.409223] br0: port 2(tap0) entered disabled state
[48965.409377] device tap0 entered promiscuous mode
[48965.409694] br0: port 2(tap0) entered blocking state
[48965.409698] br0: port 2(tap0) entered forwarding state

It seems that radvd crashes when I change the bridge settings.

@PaulosV

PaulosV commented Apr 3, 2022

I'd say it's more general and it crashes whenever there is a change in network interfaces (adding, removing, changing settings...). One time I was removing some interfaces on the side, doing nothing to our bridges and yet, radvd still crashed.

@robbat2
Member

robbat2 commented Apr 4, 2022

@PaulosV @nopeno were you using the latest master, or what specific version?

@PaulosV

PaulosV commented Apr 4, 2022

In my case, Alpine Linux v3.15 with radvd version 2.19.
I will attempt running with master.

@PaulosV

PaulosV commented Apr 4, 2022

Running with master seems to handle things fine.

2.19 compiled from source also crashes.
With the 2.19 version from the packaging system, I now have a command and a config file that reliably crash radvd:

/usr/sbin/radvd -C /etc/radvd.conf -p /run/radvd/radvd.pid -u radvd -d 3 -n

Debug levels 2 and above trigger the crash.

This is the minimal file that triggers the crash:

interface br_lan.10 {
	AdvSendAdvert on;
	prefix fd54:2e24:1f9b:a::/64 {
	};
};
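The reproduction above can be wrapped in a small script so it is easy to rerun against different builds. This is a sketch only: the function names and the `RADVD` override are hypothetical, the radvd flags are the ones quoted in the command above, and it assumes the shell reports a SIGSEGV death as exit status 139 (128 + 11), as common shells do.

```shell
#!/bin/sh
# Path to the binary under test; override to point at a locally built radvd.
RADVD=${RADVD:-/usr/sbin/radvd}

# Write the minimal config from the report to the given path.
# The interface name comes from the report and must exist on the test host.
write_min_conf() {
    cat > "$1" <<'EOF'
interface br_lan.10 {
	AdvSendAdvert on;
	prefix fd54:2e24:1f9b:a::/64 {
	};
};
EOF
}

# Run radvd in the foreground (-n) at the given debug level and say how it
# ended. On a build without the bug, radvd keeps running until interrupted.
run_repro() {
    conf=$1
    level=${2:-3}
    status=0
    "$RADVD" -C "$conf" -d "$level" -n || status=$?
    if [ "$status" -eq 139 ]; then
        echo "crashed (SIGSEGV)"
    else
        echo "exited with status $status"
    fi
}

# Example (not executed here):
#   write_min_conf /tmp/radvd-min.conf && run_repro /tmp/radvd-min.conf 3
```

Running it once with level 1 and once with level 3 should show the dependence on the debug level described above.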

@robbat2
Member

robbat2 commented Apr 4, 2022

@PaulosV thanks for that. I don't see why that config should crash on v2.19 and not in the latest master. Most of the changes in there were build systems or new features.

If you use a dummy interface on Alpine, does it also crash, or is it some interaction between musl & vlan or bridges (a couple of the configs in the thread had bridges, which makes me wonder, e.g. if the bridge is in a non-forwarding state due to STP).

If you can spare the time to run git bisect between v2.19 & master, that would be hugely appreciated, bonus if you know your way around gdb.

Mostly I think this builds confidence to say we're good to have a v2.20 release soon.
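Since the goal of the bisect requested above is to find the commit that fixed the crash rather than one that introduced it, git bisect's custom terms can keep the good/bad labels from getting confusing. This is a hedged sketch: the helper names and the `GIT` override are hypothetical, and it assumes a reproduction run that ends in SIGSEGV is reported by the shell as status 139.

```shell
#!/bin/sh
# GIT is overridable so the helper can be dry-run (for example GIT=echo).
GIT=${GIT:-git}

# Map the exit status of a reproduction attempt to a bisect term.
# 139 is 128 + 11 (SIGSEGV) as reported by common shells.
verdict_for_status() {
    if [ "$1" -eq 139 ]; then
        echo broken
    else
        echo fixed
    fi
}

# Mark the currently checked-out revision with the matching term.
mark_revision() {
    "$GIT" bisect "$(verdict_for_status "$1")"
}

# Typical session (not executed here):
#   git bisect start --term-old broken --term-new fixed
#   git bisect broken v2.19
#   git bisect fixed master
#   ... build, run the reproducer, then: mark_revision "$status"
# A revision that fails to build should be skipped with "git bisect skip"
# rather than marked either way.
```

Custom terms avoid the mental inversion of calling the crashing v2.19 tag "good" merely because it is the older end of the range.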

@nopeno

nopeno commented Apr 5, 2022

Alpine Linux v3.15 with radvd version 2.19.

Me too.

@PaulosV

PaulosV commented Apr 5, 2022

@robbat2 I'll try to do the bisect later. I'm not very comfortable in gdb but I can probably do a core dump or extract some vars/registers if needed. I'll also try the dummy interfaces.

Also, I should have probably been clearer - when running in the foreground with high enough debug level (-d2 or -d3), radvd did not need any further convincing and crashed (SIGSEGV) instantly during startup.

@robbat2
Member

robbat2 commented Apr 5, 2022

@PaulosV
After you bisect to narrow it down, here's the easy way to drive gdb to convert the core to a backtrace:

gdb-trace.sh:

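#!/bin/sh
# Usage: gdb-trace.sh <executable> <core-file>
exe=$1
core=$2

# Print a full backtrace of every thread, then exit without prompting.
gdb "${exe}" \
        --core "${core}" \
        --batch \
        --quiet \
        -ex "thread apply all bt full" \
        -ex "quit"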

Pipe the output through tee to save it to a file, and it'll be good enough.

@PaulosV

PaulosV commented Apr 5, 2022

Ok, so the issue was fixed by commit 06689f8 (issue #158, PR #161).
This time, I tested inside an LXC container (images:alpine/3.15), which had an eth0 interface and no bridge anywhere in the system. I was unable to reproduce with lo.

bash-5.1# ./radvd -n -d3                                                 
[Apr 05 21:31:29] radvd (4750): version 2.19 started
[Apr 05 21:31:29] radvd (4750): config file, /etc/radvd.conf, syntax ok
[Apr 05 21:31:29] radvd (4750): IPv6 forwarding setting is: 0, should be 1 or 2
[Apr 05 21:31:29] radvd (4750): IPv6 forwarding seems to be disabled, but continuing anyway
[Apr 05 21:31:29] radvd (4750): radvd startup PID is 4750
[Apr 05 21:31:29] radvd (4750): radvd PID is 4750
[Apr 05 21:31:29] radvd (4750): initializing privsep
[Apr 05 21:31:29] radvd (4750): radvd privsep PID is 4751
[Apr 05 21:31:29] radvd (4750): eth0 mtu: 1500
[Apr 05 21:31:29] radvd (4750): eth0 hardware type: ARPHRD_ETHER
[Apr 05 21:31:29] radvd (4750): eth0 hardware address: 00:16:3e:5e:83:6f
[Apr 05 21:31:29] radvd (4750): eth0 link layer token length: 48
[Apr 05 21:31:29] radvd (4750): eth0 prefix length: 64
[Apr 05 21:31:29] radvd (4750): IPv6 forwarding on interface seems to be disabled, but continuing anyway
[Apr 05 21:31:29] radvd (4750): polling for 16 second(s), next iface is eth0
[Apr 05 21:31:29] radvd (4751): Freeing Interfaces
[Apr 05 21:31:29] radvd (4751): Exiting, privsep_read_loop had readn return 0 bytes
[Apr 05 21:31:29] radvd (4751): Exiting, privsep_read_loop is complete.
Segmentation fault (core dumped)

The backtrace from tag v2.19: radvd-bt-2.19.log

@johnkisch
Author

johnkisch commented Apr 18, 2022

Apologies for not getting back to this - life got in the way, etc.

I cloned latest on April 11th and rolled it into an apk package and installed. After giving a week of bake time, radvd has successfully started at boot time as expected every time I've tested. I think this issue has been resolved at this point. I think it would be helpful if a new release was cut so that distro maintainers can update their packages.

@PaulosV

PaulosV commented Apr 18, 2022

That is good to hear. I think, because the issue has quite a big impact and makes radvd downright unusable in some circumstances, it might make sense to backport that specific patch (06689f8) for Alpine and include it in the aports, so they can rebuild the package with the fix.

@johnkisch
Author

Here's the MR in the aports repo for this:

https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/33358

@stappersg
Member

stappersg commented Nov 18, 2022

I saw https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/33358/diffs and understand the request: please do more than just git releases.

@robbat2
Member

robbat2 commented Dec 7, 2023

@johnkisch can you please confirm 2.20 rc resolves the issue for you?

@stappersg
Member

stappersg commented Dec 30, 2024 via email

@PaulosV

PaulosV commented Dec 31, 2024

On Wed, Dec 06, 2023 at 10:53:56PM -0800, Robin H. Johnson wrote:
@johnkisch can you please confirm 2.20 rc resolves the issue for you?

Asking @PaulosV and @nopeno also if they can reproduce the segfault with the version 2.20 Release Candidate.

@stappersg I cannot reproduce the segfault in 2.20 RC. In 2.19 I can.

I had to test with a Docker container, but since the same fix was applied in Alpine (by a patch) and it fixed it for me, I would say that it is fixed.

@stappersg
Member

I would say that it is fixed.

Thanks for reporting.
