[Bug] Nodes fail to remove from notifier - ERR update not sent #2192

Open
2 of 4 tasks
dustinblackman opened this issue Oct 11, 2024 · 0 comments
Labels
bug Something isn't working


dustinblackman commented Oct 11, 2024

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

There is a two-part bug related to node notification updates. When a node loses its connection or its system goes to sleep, this logic expects Notifier.RemoveNode to be called before the next sendAll loop, so that the affected node is no longer present.

I have systems coming in and out of my network during the day, and when some go offline they're not correctly removed from the notifier, resulting in the notifier.sendAll function emitting the following logs until a disconnect event is finally emitted, or Headscale is rebooted.

2024-10-11T13:14:50Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:20Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:39Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:44Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:49Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:59Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:04Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:24Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:31Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:40Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:16:52Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:12Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:14Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:27Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:17:35Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:18:08Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:12Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:17Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:34Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66  
2024-10-11T13:31:37Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66

The first problem is that Notifier.RemoveNode is not called in every case where a node goes offline. The second problem follows from it: the logic in sendAll does not work as intended, because any node iterated after the failed node in n.nodes will not receive updates.

I have yet to pinpoint why the first issue occurs. I have a feeling the desktop Tailscale clients are not correctly calling disconnect when a machine goes to sleep, but they eventually do at some point: with the sendAll functions patched to continue looping through nodes rather than returning early, the errors above stop after a while.

Expected Behavior

  1. When a node goes offline, it is correctly removed from the notifier's list of available nodes.
  2. Even if a node is unreachable, the remaining nodes should continue to receive updates.

Steps To Reproduce

I've found this tricky to reproduce. My deployment uses the deb installation package with systemd, and there are anywhere between 20 and 40 nodes on the network at a given time. Not every sleep event on a machine, or removal of an ephemeral node, causes the error to occur.

Environment

- OS: Debian Bookworm
- Headscale version: 0.23.0
- Tailscale version: 1.74.0

Runtime environment

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

Anything else?

No response

@dustinblackman dustinblackman added the bug Something isn't working label Oct 11, 2024