fix node connected status flapping (#4587)
## Problem Statement

Node connection status has been observed to flap between connected and disconnected states due to race conditions in the `HeartbeatServer` and its interaction with the priority queue used for heartbeat management.

## Root Cause Analysis

The issue stems from non-atomic operations in the existing heartbeat message [handler](https://github.com/bacalhau-project/bacalhau/blob/44b44efed9ce84043db9be7b9365e98113b2014a/pkg/node/heartbeat/server.go#L202) implementation:

1. Checking for an older heartbeat
2. Removing the old heartbeat
3. Enqueuing a new heartbeat

This sequence of operations is vulnerable to race conditions when concurrent heartbeats arrive from the same node, potentially leaving multiple heartbeats for a single node in the queue. This results in unexpected behaviour, because the `HashedPriorityQueue` is expected to hold a single item per key (node).

## Why Now?

While this bug has existed since version 1.4, it only became apparent with the introduction of concurrent heartbeats in version 1.5. The new version requires nodes to heartbeat to two topics:

- The old topic (supported by 1.4 orchestrators)
- A new topic (supported by 1.5 orchestrators)

As a result, 1.5 orchestrators now receive two concurrent heartbeats from 1.5 compute nodes, exposing the race condition.

## Reproduction Steps

1. Set up a devstack environment with approximately 10 nodes
2. Observe the node connection status
3. Note the flapping between connected and disconnected states

## Solution

Instead of simply locking the `HeartbeatServer.Handle()` method, I've implemented a more comprehensive fix to address the underlying issues:

1. Modified `HashedPriorityQueue` to enforce a single item per key atomically within the queue
2. Introduced a `Peek` method to allow `HeartbeatServer` to examine the oldest item without removing it and without having to loop over all items using `DequeueWhere`
3. Corrected the priority and ordering of heartbeat events in the queue

These changes eliminate the need for manual checks, dequeues, and re-enqueues, while also improving the overall efficiency of the queue operations.

## Implementation Details

1. `HashedPriorityQueue` Modifications:
   - Ensure atomic operations for maintaining a single item per key
   - Implement version tracking for items so that enqueues remain fast, while dequeues lazily filter out and remove items that don't match the latest version for the same key
2. New `Peek` Method:
   - Allow examination of the oldest item without altering the queue state
   - Improve efficiency of `HeartbeatServer` operations by avoiding a loop over all items with `DequeueWhere`
3. Heartbeat Event Prioritization:
   - Adjust priority calculation to ensure the oldest events are dequeued first

## Testing Conducted

- Enhanced test coverage for `HashedPriorityQueue` to ensure unique items per key
- Improved concurrent heartbeat testing in `HeartbeatServer`
- Manual testing using devstack environments
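
The version-tracking idea described under Implementation Details can be illustrated with a minimal sketch. This is not the code from this commit: `KeyedQueue`, `entry`, `seenAt`, and the global version counter are hypothetical names and choices made for illustration, and the sketch assumes that a smaller timestamp means an older heartbeat. Each enqueue stamps the item with a fresh version and records it as the latest for its key; `Peek` and `Dequeue` lazily discard any item whose version is no longer the latest.

```go
package main

import (
	"container/heap"
	"fmt"
	"sync"
)

// entry is one queued heartbeat (hypothetical type, for illustration only).
type entry struct {
	key     string
	seenAt  int64  // heartbeat timestamp; smaller means older
	version uint64 // identifies the Enqueue call that produced this entry
}

// entryHeap is a min-heap on seenAt, so the oldest heartbeat sits at the root.
type entryHeap []entry

func (h entryHeap) Len() int           { return len(h) }
func (h entryHeap) Less(i, j int) bool { return h[i].seenAt < h[j].seenAt }
func (h entryHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *entryHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *entryHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// KeyedQueue keeps at most one live entry per key.
type KeyedQueue struct {
	mu     sync.Mutex
	items  entryHeap
	latest map[string]uint64 // key -> version of its live entry
	next   uint64            // monotonically increasing version counter
}

func NewKeyedQueue() *KeyedQueue {
	return &KeyedQueue{latest: make(map[string]uint64)}
}

// Enqueue adds a heartbeat and supersedes any earlier entry for the same key.
// The older entry stays in the heap but is now stale.
func (q *KeyedQueue) Enqueue(key string, seenAt int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.next++
	q.latest[key] = q.next
	heap.Push(&q.items, entry{key: key, seenAt: seenAt, version: q.next})
}

// dropStale pops superseded entries off the root. Caller must hold q.mu.
func (q *KeyedQueue) dropStale() {
	for len(q.items) > 0 {
		top := q.items[0]
		if v, ok := q.latest[top.key]; ok && v == top.version {
			return
		}
		heap.Pop(&q.items)
	}
}

// Peek returns the oldest live entry without removing it.
func (q *KeyedQueue) Peek() (string, int64, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.dropStale()
	if len(q.items) == 0 {
		return "", 0, false
	}
	return q.items[0].key, q.items[0].seenAt, true
}

// Dequeue removes and returns the oldest live entry.
func (q *KeyedQueue) Dequeue() (string, int64, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.dropStale()
	if len(q.items) == 0 {
		return "", 0, false
	}
	e := heap.Pop(&q.items).(entry)
	delete(q.latest, e.key)
	return e.key, e.seenAt, true
}

// Len reports the number of keys that currently have a live entry.
func (q *KeyedQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.latest)
}

func main() {
	q := NewKeyedQueue()
	var wg sync.WaitGroup
	// Two concurrent heartbeats per node, as when a node publishes to two topics.
	for _, node := range []string{"node-a", "node-b"} {
		for i := 0; i < 2; i++ {
			wg.Add(1)
			go func(n string, ts int64) {
				defer wg.Done()
				q.Enqueue(n, ts)
			}(node, int64(100+i))
		}
	}
	wg.Wait()
	fmt.Println("live keys:", q.Len())
}
```

Because the check-and-replace happens inside a single locked `Enqueue`, callers no longer need a separate check, dequeue, and re-enqueue sequence. Enqueue stays a plain heap push, and the cost of discarding superseded entries is paid later, when the root is inspected.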
Showing 7 changed files with 597 additions and 164 deletions.