node_exporter currently exposes details about network bonding, which is great. To be able to monitor more failure cases, we would need additional metrics which we haven't found in node_exporter yet:
- An association between network devices and bonding devices: This would be useful to detect mixed-speed ports in the same bond (which is considered a misconfiguration in our case).
- LACP (802.3ad) link aggregation status: Currently, there seems to be no way to detect if LACP has been negotiated between the machine and the switch on all ports.
Essentially, we would need the following metrics:
# HELP node_bonding_slave_info Contains association between bonding master device and associated slaves. Value is always 1.
# TYPE node_bonding_slave_info gauge
## from /sys/class/net/<master>/bonding/slaves
## - alert: LinuxPlatformBondMixedPortSpeeds
##   expr: min(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) != max(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) and on(instance, master) node_bonding_active == node_bonding_slaves
node_bonding_slave_info{master="bond0",device="eno1"} 1
node_bonding_slave_info{master="bond0",device="eno2"} 1
# HELP node_bonding_ad_num_ports Number of visible LACP/AD ports
# TYPE node_bonding_ad_num_ports gauge
## from /sys/devices/virtual/net/<master>/bonding/ad_num_ports
## - alert: BondLacpPortsDegraded
## expr: node_bonding_ad_num_ports != node_bonding_slaves
node_bonding_ad_num_ports{master="bond0"} 2
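As a rough illustration of how little is needed for the first metric, here is a minimal Go sketch that turns the contents of `/sys/class/net/<master>/bonding/slaves` (a single whitespace-separated line of interface names) into `node_bonding_slave_info` samples. The function names `parseSlaves` and `renderSlaveInfo` are hypothetical and not part of node_exporter or procfs. A real collector would read the file from sysfs and emit Prometheus metrics rather than strings.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parseSlaves splits the contents of a bonding "slaves" sysfs file into
// interface names. The result is sorted so the output order is stable.
// (Hypothetical helper, for illustration only.)
func parseSlaves(contents string) []string {
	slaves := strings.Fields(contents)
	sort.Strings(slaves)
	return slaves
}

// renderSlaveInfo formats one sample line per slave, using the metric and
// label names proposed in this issue.
func renderSlaveInfo(master string, slaves []string) []string {
	lines := make([]string, 0, len(slaves))
	for _, dev := range slaves {
		lines = append(lines,
			fmt.Sprintf("node_bonding_slave_info{master=%q,device=%q} 1", master, dev))
	}
	return lines
}

func main() {
	// Sample file contents; a real collector would use os.ReadFile on
	// /sys/class/net/<master>/bonding/slaves instead.
	for _, line := range renderSlaveInfo("bond0", parseSlaves("eno2 eno1\n")) {
		fmt.Println(line)
	}
}
```

The same pattern would presumably cover `ad_num_ports`, which is a single integer per bond.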
It could also make sense to add some more information in the same go. So far, we haven't required these in our alerting, but they may be useful nevertheless:
# HELP node_bonding_mode Kind of bonding (e.g. 1 for active-backup, 4 for lacp/ad)
# TYPE node_bonding_mode gauge
## from /sys/class/net/<master>/bonding/mode
## might be useful to have different monitoring for active-backup, lacp, etc.
node_bonding_mode{master="bond0"} 4
# HELP node_bonding_ad_info Contains the LACP/AD partner_mac. Value is always 1.
# TYPE node_bonding_ad_info gauge
## from /sys/devices/virtual/net/<master>/bonding/ad_partner_mac
## might be useful for detecting bad lacp configs (zero MAC) or building stats about switch MACs
node_bonding_ad_info{master="bond0",partner_mac="00:00:00:00:00:00"} 1
# HELP node_bonding_ad_port_partner_oper_port_state Bitfield denoting LACP/AD port state for the partner
# TYPE node_bonding_ad_port_partner_oper_port_state gauge
## from /sys/class/net/bond0/slave_eno1/bonding_slave/ad_partner_oper_port_state
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno2"} 61
# HELP node_bonding_ad_port_actor_oper_port_state Bitfield denoting LACP/AD port state for the actor
# TYPE node_bonding_ad_port_actor_oper_port_state gauge
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno2"} 61
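To make the port-state bitfield easier to read, the sketch below decodes a value such as 61 into named flags. It assumes the standard IEEE 802.3ad actor/partner state bit layout (bit 0 LACP activity up to bit 7 expired). The short flag names and the `decodePortState` function are my own choices for illustration, not an existing API.

```go
package main

import "fmt"

// portStateFlags lists the LACP port state bits from least to most
// significant, assuming the IEEE 802.3ad layout. The names are illustrative.
var portStateFlags = []struct {
	bit  uint8
	name string
}{
	{1 << 0, "activity"},
	{1 << 1, "timeout"},
	{1 << 2, "aggregation"},
	{1 << 3, "synchronization"},
	{1 << 4, "collecting"},
	{1 << 5, "distributing"},
	{1 << 6, "defaulted"},
	{1 << 7, "expired"},
}

// decodePortState returns the names of all flags set in state.
func decodePortState(state uint8) []string {
	var set []string
	for _, f := range portStateFlags {
		if state&f.bit != 0 {
			set = append(set, f.name)
		}
	}
	return set
}

func main() {
	// 61 is the sample value used for eno1 and eno2 above.
	fmt.Println(decodePortState(61))
}
```

Under that layout, 61 (binary 00111101) means activity, aggregation, synchronization, collecting and distributing are set, while timeout, defaulted and expired are clear. An alert could therefore compare the raw gauge against an expected value, or the exporter could expose one series per flag instead.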
We currently use a shell script and node_exporter's textfile collector to fill this gap. However, I think it would be useful to support these metrics out of the box, especially since only /sys files need to be read.
I'd volunteer to work on PRs against procfs and node_exporter. I would suggest adding this to the existing bonding_linux.go, as it is closely related. It looks like this would also mean converting the existing bonding collector to use procfs.
Related: #841
@SuperQ @discordianfish @pgier What do you think? Does it make sense in general? Should we only implement the first two metrics or the more generic approach? Do the suggested names make sense?