CARP: consider setting senderr_demotion_factor to 0 to combat spurious failovers #8437

Open
2 tasks done
swhite2 opened this issue Mar 14, 2025 · 2 comments

Comments

@swhite2
Member

swhite2 commented Mar 14, 2025

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Describe the bug

The network stack can return errors for many reasons, but send errors on CARP advertisement packets are particularly annoying. If the machine demotes itself due to send errors, the source code seems to suggest that it cannot recover unless an administrator manually lowers the demotion.

The following logic:
https://github.com/opnsense/src/blob/stable/25.1/sys/netinet/ip_carp.c#L905-L916
requires a demoted machine (possibly demoted to backup) to keep sending advertisements in order to notice that things are OK again. However, advertisements are stopped in backup mode: https://github.com/opnsense/src/blob/stable/25.1/sys/netinet/ip_carp.c#L1375. Interestingly enough, it actually takes 3 send errors for this condition to trigger, meaning 3 packets were unable to exit the interface.

Now this may or may not be intended behavior, but it's not documented.

This is especially relevant while a machine is booting: userspace configures e.g. LAGGs or other systems that may influence outbound packet handling, which can cause CARP to react and leave a configured master machine in a backup state.

I have also seen cases where traffic flows normally through an interface, but the CARP advertisement packets specifically are encountering errors for unknown reasons, causing the machine to failover based on the wrong information.

A "permanent" workaround is to ignore send errors altogether by setting net.inet.carp.senderr_demotion_factor to 0. However, I'm not sure what the impact of this will be. A machine in backup state will run a "master timeout" timer, giving a master up to 4+ seconds (to account for the error count of 3) to recover (https://github.com/opnsense/src/blob/stable/25.1/sys/netinet/ip_carp.c#L1376). If the send errors are valid, failover will now be slowed down significantly.
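To make the suspected failure mode easier to reason about, here is a toy Python model of the send-error accounting described above. It is a sketch under stated assumptions, not the kernel code: the class `CarpModel`, its method names, the success threshold of 3, the default factor of 240, and the rule that a demoted master yields to its peer are all hypothetical simplifications. Only the error threshold of 3 and the sysctl name come from this report.

```python
from dataclasses import dataclass

# The issue mentions 3 send errors; the matching success threshold
# of 3 is an assumption of this sketch.
MAX_ERRORS = 3
MIN_SUCCESS = 3


@dataclass
class CarpModel:
    """Toy model of CARP send-error accounting; not the kernel code."""
    senderr_demotion_factor: int = 240  # assumed default of the sysctl
    demotion: int = 0                   # stands in for the demotion counter
    errors: int = 0                     # consecutive failed advertisements
    successes: int = 0                  # successful sends since demotion
    state: str = "MASTER"

    def send_advertisement(self, ok: bool) -> bool:
        """Record one advertisement attempt; return False if none was sent."""
        if self.state != "MASTER":
            # A backup node sends no advertisements, so the counters freeze.
            return False
        if not ok:
            self.errors += 1
            self.successes = 0
            if self.errors == MAX_ERRORS:
                self.demotion += self.senderr_demotion_factor
        elif self.errors >= MAX_ERRORS:
            self.successes += 1
            if self.successes >= MIN_SUCCESS:
                # Enough clean sends: undo the send-error demotion.
                self.demotion -= self.senderr_demotion_factor
                self.errors = 0
                self.successes = 0
        else:
            self.errors = 0
        return True

    def yield_to_peer(self) -> None:
        """Assumption: a demoted master loses to a healthy peer."""
        if self.demotion > 0:
            self.state = "BACKUP"


# Default factor: three failed sends demote the node, it falls to backup,
# and later "good" sends never happen, so the demotion is never undone.
node = CarpModel()
for _ in range(3):
    node.send_advertisement(ok=False)
node.yield_to_peer()
for _ in range(10):
    node.send_advertisement(ok=True)
print(node.state, node.demotion)

# Factor 0: the same three errors change nothing.
patched = CarpModel(senderr_demotion_factor=0)
for _ in range(3):
    patched.send_advertisement(ok=False)
patched.yield_to_peer()
print(patched.state, patched.demotion)
```

In this model the first node ends up stuck in backup with a non-zero demotion, while the node with the factor set to 0 stays master. If the real code behaves like the model, that matches the manual-recovery problem described above.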

To Reproduce

N/A - not easy to reproduce.

Expected behavior

Either the CARP system should ignore send errors during the booting/userspace configuration stage, or it should automatically recover from send error demotions. Neither currently happens.

Describe alternatives you considered

N/A

Additional context

I do want to point out that this doesn't seem to happen very often, and may be very dependent on platform and configuration used.

@mimugmail
Member

+1 for this. It usually happens with 50% of all GBICs on fiber (LWL) interfaces.
RJ45 is never affected ...

@Monviech
Member

I'm also for this. Spurious send errors are one of the main causes of failover troubleshooting in support cases.

3 participants