[Bug] Nodes fail to remove from notifier - ERR update not sent #2192

dustinblackman · 2024-10-11T17:32:46Z

Is this a support request?

This is not a support request

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

This is a two-part bug related to node notification updates. When a node loses its connection or its system goes to sleep, the logic expects Notifier.RemoveNode to be called before the next sendAll loop, so that the affected node is no longer in the list.

I have systems coming in and out of my network during the day, and when some go offline they are not correctly removed from the notifier. As a result, notifier.sendAll emits the following logs until a disconnect event is finally emitted or Headscale is restarted.

2024-10-11T13:14:50Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:20Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:39Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:44Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:49Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:59Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:04Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:24Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:31Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:40Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:52Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:12Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:14Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:27Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:35Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:18:08Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:12Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:17Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:34Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:37Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66

The first problem is that Notifier.RemoveNode is not always called when a node goes offline. The second problem follows from it: sendAll does not work as intended, because it returns early on the failed node, so any node after the failed node in n.nodes does not receive the update.

I have yet to pinpoint why the first issue occurs. My suspicion is that the desktop Tailscale clients do not correctly call disconnect when a machine goes to sleep, but eventually do at some point, since the errors above stop after a while once the sendAll functions are updated to continue looping through nodes rather than returning early.

Expected Behavior

When a node goes offline, it is correctly removed from the notifier's list of available nodes.
Even if a node is unreachable, the remaining nodes should continue to receive updates.

Steps To Reproduce

I've personally found this tricky to reproduce. I deploy using the deb installation package with systemd and have anywhere between 20 and 40 nodes on the network at a given time. Not every sleep event or ephemeral node removal causes the error.

Environment

- OS: Debian Bookworm
- Headscale version: 0.23.0
- Tailscale version: 1.74.0

Runtime environment

Headscale is behind a (reverse) proxy
Headscale runs in a container

Anything else?

No response

dustinblackman added the bug label Oct 11, 2024
