Proposal: Extend network bonding metrics #1604

@hoffie

node_exporter currently exposes details about network bonding, which is great. To be able to monitor more failure cases, we would need additional metrics which we haven't found in node_exporter yet:

  • An association between network devices and bonding devices: This would be useful to detect mixed-speed ports in the same bond (which is considered a misconfiguration in our case).
  • LACP (802.3ad) link aggregation status: Currently, there seems to be no way to detect if LACP has been negotiated between the machine and the switch on all ports.

Essentially, we would need the following metrics:

# HELP node_bonding_slave_info Contains association between bonding master device and associated slaves. Value is always 1.
# TYPE node_bonding_slave_info GAUGE
## from /sys/class/net/<master>/bonding/slaves
## - alert: LinuxPlatformBondMixedPortSpeeds
##   expr: min(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) != max(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) and on(instance, master) node_bonding_active == node_bonding_slaves
node_bonding_slave_info{master="bond0",device="eno1"} 1
node_bonding_slave_info{master="bond0",device="eno2"} 1

# HELP node_bonding_ad_num_ports Number of visible LACP/AD ports
# TYPE node_bonding_ad_num_ports GAUGE
## from /sys/devices/virtual/net/<master>/bonding/ad_num_ports
## - alert: BondLacpPortsDegraded
##   expr: node_bonding_ad_num_ports != node_bonding_slaves
node_bonding_ad_num_ports{master="bond0"} 2
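To illustrate how little parsing these two metrics need, here is a minimal Go sketch. It is not the proposed procfs or node_exporter implementation: the helper names `parseSlaves` and `parseADNumPorts` are made up for this example, `bond0` is only a sample device, and it assumes that `slaves` holds whitespace-separated interface names, that `ad_num_ports` holds a single integer, and that `/sys/class/net/<master>` resolves to the same directory as `/sys/devices/virtual/net/<master>`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseSlaves splits the contents of <master>/bonding/slaves into
// interface names. Assumption: names are separated by whitespace.
func parseSlaves(content string) []string {
	return strings.Fields(content)
}

// parseADNumPorts parses the contents of <master>/bonding/ad_num_ports.
// Assumption: the file holds one decimal integer.
func parseADNumPorts(content string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(content))
}

func main() {
	// Hypothetical example device; a real collector would iterate over all bonds.
	const master = "bond0"
	base := filepath.Join("/sys/class/net", master, "bonding")

	raw, err := os.ReadFile(filepath.Join(base, "slaves"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read slaves:", err)
		return
	}
	for _, dev := range parseSlaves(string(raw)) {
		fmt.Printf("node_bonding_slave_info{master=%q,device=%q} 1\n", master, dev)
	}

	raw, err = os.ReadFile(filepath.Join(base, "ad_num_ports"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read ad_num_ports:", err)
		return
	}
	n, err := parseADNumPorts(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot parse ad_num_ports:", err)
		return
	}
	fmt.Printf("node_bonding_ad_num_ports{master=%q} %d\n", master, n)
}
```

Keeping the parsing in pure functions separate from the file reads would make it easy to test against fixture files.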

It could also make sense to add some more information in the same go. So far, we haven't required these in our alerting, but they may be useful nevertheless:

# HELP node_bonding_mode Kind of bonding (e.g. 1 for active-backup, 4 for lacp/ad)
# TYPE node_bonding_mode GAUGE
## from /sys/class/net/<master>/bonding/mode
## might be useful to have different monitoring for active-backup, lacp, etc.
node_bonding_mode{master="bond0"} 4
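A rough Go sketch of how the mode value could be extracted. This assumes the `mode` file contains the mode name followed by its number (for example `802.3ad 4`), which is what we see on our machines; `parseMode` is a hypothetical helper name, not an existing procfs function.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMode splits the contents of <master>/bonding/mode into the mode
// name and its numeric value.
// Assumption: the file looks like "<name> <number>", e.g. "802.3ad 4".
func parseMode(content string) (string, int, error) {
	fields := strings.Fields(content)
	if len(fields) != 2 {
		return "", 0, fmt.Errorf("unexpected bonding mode content %q", content)
	}
	num, err := strconv.Atoi(fields[1])
	if err != nil {
		return "", 0, err
	}
	return fields[0], num, nil
}

func main() {
	// Sample content instead of a real file read, so the sketch runs anywhere.
	name, num, err := parseMode("802.3ad 4\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("mode name:", name)
	fmt.Printf("node_bonding_mode{master=%q} %d\n", "bond0", num)
}
```

The mode name could alternatively be exposed as a label on an info-style metric; the numeric gauge above just follows the naming suggested in this proposal.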

# HELP node_bonding_ad_info Contains the LACP/AD partner_mac. Value is always 1.
# TYPE node_bonding_ad_info GAUGE
## from /sys/devices/virtual/net/<master>/bonding/ad_partner_mac
## might be useful for detecting bad lacp configs (zero MAC) or building stats about switch MACs
node_bonding_ad_info{master="bond0",partner_mac="00:00:00:00:00:00"} 1

# HELP node_bonding_ad_port_partner_oper_port_state Bitfield denoting LACP/AD port state for the partner
# TYPE node_bonding_ad_port_partner_oper_port_state GAUGE
## from /sys/class/net/bond0/slave_eno1/bonding_slave/ad_partner_oper_port_state
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno2"} 61

# HELP node_bonding_ad_port_actor_oper_port_state Bitfield denoting LACP/AD port state for the actor
# TYPE node_bonding_ad_port_actor_oper_port_state GAUGE
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno2"} 61
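For readers unfamiliar with the bitfield, here is a small Go sketch that decodes a port state value such as 61 into its flags. The bit layout follows the LACP state octet as I understand it from IEEE 802.1AX (bit 0 = activity through bit 7 = expired); the flag names and the `decodePortState` helper are illustrative only, and whether the exporter should expose the raw value or individual flags is open for discussion.

```go
package main

import "fmt"

// lacpStateBits lists the flags of the LACP state octet, lowest bit first.
// Assumption: layout as defined for Actor_State/Partner_State in IEEE 802.1AX.
var lacpStateBits = []struct {
	mask uint8
	name string
}{
	{0x01, "activity"},
	{0x02, "timeout"},
	{0x04, "aggregation"},
	{0x08, "synchronization"},
	{0x10, "collecting"},
	{0x20, "distributing"},
	{0x40, "defaulted"},
	{0x80, "expired"},
}

// decodePortState returns the names of all flags set in state.
func decodePortState(state uint8) []string {
	flags := []string{}
	for _, b := range lacpStateBits {
		if state&b.mask != 0 {
			flags = append(flags, b.name)
		}
	}
	return flags
}

func main() {
	// 61 = 0b00111101, the value from the sample metrics above.
	fmt.Println(decodePortState(61))
	// prints: [activity aggregation synchronization collecting distributing]
}
```

With this reading, 61 would mean the port is in sync and both collecting and distributing, with neither the defaulted nor the expired flag set.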

We currently use a shell script and node_exporter's textfile collector to fill this gap. However, I think it would be useful to support these metrics out of the box, especially since only /sys files need to be read.

I'd volunteer to work on PRs against procfs and node_exporter. I would suggest adding this to the existing bonding_linux.go as it is closely related. Looks like this would also imply converting the existing bonding collector to procfs.

Related: #841

@SuperQ @discordianfish @pgier What do you think? Does it make sense in general? Should we only implement the first two metrics or the more generic approach? Do the suggested names make sense?
