CARP: consider setting senderr_demotion_factor to 0 to combat spurious failovers #8437

swhite2 · 2025-03-14T09:59:52Z

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

The network stack can return error codes for many reasons, but one particularly annoying case is a send error on CARP advertisement packets. If the machine demotes itself due to send errors, the source code seems to suggest that it cannot recover unless an administrator manually lowers the demotion:

The following logic:
https://github.com/opnsense/src/blob/stable/25.1/sys/netinet/ip_carp.c#L905-L916
requires a demoted machine (possibly demoted to backup) to keep sending advertisements in order to notice that things are OK again. However, advertisements are stopped in backup mode: https://github.com/opnsense/src/blob/stable/25.1/sys/netinet/ip_carp.c#L1375. Interestingly enough, it actually takes 3 send errors to trigger this condition, meaning 3 packets were unable to exit the interface.
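To make the suspected trap concrete, here is a minimal Python model of the send-error accounting as read from the linked code. It is a sketch under stated assumptions, not the kernel implementation: the names `CarpModel` and `simulate` are made up for illustration, the recovery threshold of 3 successful sends and the default factor of 240 are assumptions, and "a healthy peer takes over as soon as this node is demoted" is a deliberate simplification.

```python
from dataclasses import dataclass

MAX_ERRORS = 3   # send errors before demotion (the "3 send errors" mentioned above)
MIN_SUCCESS = 3  # ASSUMPTION: successful sends needed before the demotion is undone


@dataclass
class CarpModel:
    """Hypothetical, simplified model of one CARP node's send-error handling."""
    senderr_factor: int = 240  # ASSUMPTION: default net.inet.carp.senderr_demotion_factor
    demotion: int = 0
    errors: int = 0
    successes: int = 0
    is_master: bool = True

    def on_send_result(self, ok: bool) -> None:
        """Account for the result of one advertisement send attempt."""
        if not ok:
            self.errors += 1
            self.successes = 0
            if self.errors == MAX_ERRORS:
                self.demotion += self.senderr_factor
        elif self.errors > 0:
            self.successes += 1
            if self.successes >= MIN_SUCCESS:
                if self.errors >= MAX_ERRORS:
                    self.demotion -= self.senderr_factor
                self.errors = 0

    def tick(self, send_ok: bool) -> None:
        """One advertisement interval."""
        if not self.is_master:
            return  # a backup does not advertise, so no success is ever counted
        self.on_send_result(send_ok)
        if self.demotion > 0:
            # SIMPLIFICATION: a healthy peer preempts immediately once we are demoted
            self.is_master = False


def simulate(factor: int, results: list) -> CarpModel:
    """Feed a sequence of send results (True = ok) into a fresh node."""
    node = CarpModel(senderr_factor=factor)
    for ok in results:
        node.tick(ok)
    return node


if __name__ == "__main__":
    outage = [False] * 3 + [True] * 10  # three failed sends, then a healthy link
    print(simulate(240, outage))  # stays demoted and in backup
    print(simulate(0, outage))    # never demoted, stays master
```

In this model, three failed sends with a non-zero factor leave the node demoted and in backup no matter how healthy the link is afterwards, because the success counter only advances when the node itself sends. With a factor of 0 the same sequence changes nothing.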

Now this may or may not be intended behavior, but it's not documented.

This is especially relevant while a machine is booting, when userspace configures e.g. LAGGs or other subsystems that may influence outbound packet handling. This can trigger CARP to react, leaving a machine that is configured as master in a backup state.

I have also seen cases where traffic flows normally through an interface, but the CARP advertisement packets specifically encounter errors for unknown reasons, causing the machine to fail over based on wrong information.

A "permanent" workaround is to ignore send errors altogether by setting net.inet.carp.senderr_demotion_factor to 0. However, I'm not sure what the impact of this will be. A machine in backup state runs a "master timeout" timer, giving a master up to 4+ seconds (to account for the error count of 3) to recover (https://github.com/opnsense/src/blob/stable/25.1/sys/netinet/ip_carp.c#L1376). If the send errors are genuine, failover will now be slowed down significantly.

To Reproduce

N/A - not easy to reproduce.

Expected behavior

Either the CARP system should ignore send errors during the boot/userspace configuration stage, or it should automatically recover from send-error demotions. Neither currently happens.

Describe alternatives you considered

N/A

Additional context

I do want to point out that this doesn't seem to happen very often, and it may depend heavily on the platform and configuration used.

mimugmail · 2025-03-14T13:42:22Z

+1 for this, it usually happens with 50% of all GBICs on fiber (LWL) interfaces.
RJ45 is never affected ...

Monviech · 2025-03-14T16:53:38Z

I'm also for this. Spurious send errors are one of the main reasons for failover troubleshooting in support cases.
