fix(monitor): set monitors to nil after closing them #3388

derekbit · 2024-12-23T06:49:23Z

Which issue(s) this PR fixes:

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

coderabbitai · 2024-12-23T06:49:30Z

Walkthrough

The pull request introduces modifications to error handling and resource management across multiple files in the Longhorn controller. Changes include renaming the NodeMonitor to DiskMonitor, updating synchronization periods, and enhancing error logging in both disk and environment check monitors. The node controller has been adjusted to nullify monitor references during node deletion, preventing potential memory leaks and ensuring proper resource cleanup. Additionally, the testing framework for the node controller has been expanded to include environment checks.
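The "nullify monitor references" change described above can be illustrated with a minimal sketch. Everything here (`Monitor`, `fakeMonitor`, `nodeController`, `ensureMonitor`, `releaseMonitor`) is a hypothetical stand-in and not the actual Longhorn code; it only shows why clearing the reference after `Close()` lets a nil check recreate the monitor when the node comes back.

```go
package main

import "fmt"

// Monitor is a hypothetical stand-in for the controller's monitor interface.
type Monitor interface {
	Close()
}

// fakeMonitor records whether Close was called.
type fakeMonitor struct{ closed bool }

func (m *fakeMonitor) Close() { m.closed = true }

// nodeController is a simplified, hypothetical controller holding one monitor.
type nodeController struct {
	diskMonitor Monitor
	created     int // number of monitors created so far
}

// ensureMonitor creates a monitor only when none is currently held.
func (nc *nodeController) ensureMonitor() Monitor {
	if nc.diskMonitor == nil {
		nc.diskMonitor = &fakeMonitor{}
		nc.created++
	}
	return nc.diskMonitor
}

// releaseMonitor closes the monitor and clears the reference, so that
// ensureMonitor builds a fresh one the next time the node appears.
func (nc *nodeController) releaseMonitor() {
	if nc.diskMonitor != nil {
		nc.diskMonitor.Close()
		nc.diskMonitor = nil
	}
}

func main() {
	nc := &nodeController{}
	first := nc.ensureMonitor()
	nc.releaseMonitor()          // node deleted
	second := nc.ensureMonitor() // node added back
	fmt.Println(first != second, nc.created)
}
```

Without the `nc.diskMonitor = nil` assignment, the nil check in `ensureMonitor` would keep returning the closed monitor after the node is re-added.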

Changes

| File | Change Summary |
| --- | --- |
| `controller/monitor/disk_monitor.go` | Renamed `NodeMonitor` to `DiskMonitor`, updated method signatures and error handling, including specific logging for context cancellation and node retrieval issues. |
| `controller/monitor/environment_check_monitor.go` | Adjusted `NewEnvironmentCheckMonitor` to use `EnvironmentCheckMonitorSyncPeriod` and refined error handling in the `Start` method. |
| `controller/node_controller.go` | Enhanced cleanup in `syncNode` method by nullifying monitor references during node deletion. |
| `controller/node_controller_test.go` | Expanded test conditions in `NodeControllerSuite` to validate additional node states with the inclusion of `environmentCheckMonitor`. |
| `controller/monitor/fake_disk_monitor.go` | Renamed `NewFakeNodeMonitor` to `NewFakeDiskMonitor` and updated the return type and initialization logic. |
| `controller/monitor/fake_environment_check_monitor.go` | Introduced `FakeEnvironmentCheckMonitor` struct with methods for monitoring environmental conditions. |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Resolve node disk management issues [#10035] | ✅ | |
| Prevent resource leaks during node deletion | ✅ | |
| Improve error logging and monitoring | ✅ | |

Possibly related PRs

fix: move environment checks to a dedicated monitor #3369: Related to monitoring process error handling and logging improvements.

Suggested reviewers

COLDTURNIP
innobead
ChanYiLin
mantissahz

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10cf797 and 6ebe54b.

📒 Files selected for processing (6)

controller/monitor/disk_monitor.go (12 hunks)
controller/monitor/environment_check_monitor.go (2 hunks)
controller/monitor/fake_disk_monitor.go (1 hunks)
controller/monitor/fake_environment_check_monitor.go (1 hunks)
controller/node_controller.go (1 hunks)
controller/node_controller_test.go (19 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

controller/monitor/environment_check_monitor.go
controller/node_controller.go
controller/monitor/fake_environment_check_monitor.go

👮 Files not reviewed due to content moderation or server errors (3)

controller/monitor/fake_disk_monitor.go
controller/monitor/disk_monitor.go
controller/node_controller_test.go

derekbit · 2024-12-23T06:51:25Z

@mergify backport v1.8.x

mergify · 2024-12-23T06:51:29Z

backport v1.8.x

✅ Backports have been created

#3394 fix(monitor): set monitors to nil after closing them (backport #3388) has been created for branch v1.8.x

derekbit · 2024-12-23T08:18:52Z

test passed
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2292/

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (11)

controller/node_controller_test.go (11)

190-193: Consider factoring out repeated condition checks.

Lines 190-193 introduce repetitive checks for required packages, multipathd, kernel modules, and NFS client installation. This exact pattern also appears in other tests. Centralizing into a helper method or table-driven approach would reduce duplication and improve clarity.

278-281: Repeated condition checks detected again.

Similarly to lines 190-193, these lines reintroduce the same block of condition checks. Refactoring into a helper method would avoid repetition.

366-369: Repeated condition checks detected again.

Repeating these four lines across multiple test methods suggests consolidating them into a shared, reusable function or table-driven approach.

454-457: Repeated condition checks detected again.

This block mirrors earlier condition checks. A common function or data structure could streamline these tests.

577-580: Repeated condition checks detected again.

The same reasoning applies here; applying the DRY (Don't Repeat Yourself) principle would improve maintainability.

730-733: Repeated condition checks detected again.

Refactor this recurring pattern into a helper method to prevent code bloat.

889-892: Repeated condition checks detected again.

Once more, consider consolidating these lines into a singular, reusable function or table-driven test to reduce duplication.

1019-1022: Repeated condition checks detected again.

These lines illustrate the same test logic used in the prior segments. A shared helper method can improve long-term maintainability.

1166-1169: Repeated condition checks detected again.

Refactoring these lines into a common helper would make future updates easier and help avoid potential inconsistencies across tests.

1283-1286: Repeated condition checks detected again.

As before, centralizing the condition checks will simplify maintenance and testing.

2123-2126: Repeated condition checks detected again.

It would be beneficial to unify this pattern with previous blocks. Consolidation ensures consistency across tests and reduces duplicate upkeep.
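The helper suggested throughout the nitpicks above could look roughly like the following sketch. The `Condition` type and the four condition type names are hypothetical placeholders chosen for illustration; the real test file would use Longhorn's own condition types and constants.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a node condition.
type Condition struct {
	Type   string
	Status string
}

// Hypothetical condition type names, chosen for illustration only.
var environmentConditionTypes = []string{
	"RequiredPackages",
	"Multipathd",
	"KernelModulesLoaded",
	"NFSClientInstalled",
}

// expectedEnvironmentConditions returns one condition per environment check,
// all sharing the given status, so each test case calls a single helper
// instead of repeating the same four lines.
func expectedEnvironmentConditions(status string) []Condition {
	conds := make([]Condition, 0, len(environmentConditionTypes))
	for _, t := range environmentConditionTypes {
		conds = append(conds, Condition{Type: t, Status: status})
	}
	return conds
}

func main() {
	for _, c := range expectedEnvironmentConditions("True") {
		fmt.Printf("%s=%s\n", c.Type, c.Status)
	}
}
```

Adding a fifth environment check would then mean touching one slice rather than eleven test cases.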

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9b6e48 and 9649bed.

📒 Files selected for processing (4)

controller/monitor/disk_monitor.go (2 hunks)
controller/monitor/environment_check_monitor.go (1 hunks)
controller/node_controller.go (1 hunks)
controller/node_controller_test.go (11 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

controller/node_controller.go
controller/monitor/environment_check_monitor.go
controller/monitor/disk_monitor.go

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (7)

controller/monitor/fake_environment_check_monitor.go (2)
46-60: Ensure clear logging upon normal PollUntilContextCancel termination
When the context finishes naturally (instead of failing with an error), consider emitting a final info log message. This helps distinguish normal shutdowns from unexpected terminations.
 func (m *FakeEnvironmentCheckMonitor) Start() {
   if err := wait.PollUntilContextCancel(m.ctx, m.syncPeriod, true, func(context.Context) (bool, error) {
     if err := m.run(struct{}{}); err != nil {
       m.logger.WithError(err).Error("Stopped monitoring environment check")
     }
     return false, nil
   }); err != nil {
     if errors.Is(err, context.Canceled) {
       m.logger.WithError(err).Warning("Environment check monitor is stopped")
+      m.logger.Info("Environment check monitor gracefully exited")
     } else {
       m.logger.WithError(err).Error("Failed to start environment check monitor")
     }
   }
 }

73-83: Consider returning strongly typed data
Currently, GetCollectedData() returns an interface{} containing a slice of Conditions. For better clarity and type safety, you might return a well-defined struct or slice type. This improves readability and spares consumers of this method a type assertion.

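As a rough illustration of the typed-return suggestion, the sketch below contrasts the two shapes. `Condition`, `untypedMonitor`, `typedMonitor`, and `GetConditions` are hypothetical names, and the `(interface{}, error)` signature is an assumption about the existing method rather than something confirmed in this thread.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a node condition.
type Condition struct {
	Type   string
	Status string
}

// untypedMonitor mirrors the current shape: callers must type-assert.
type untypedMonitor struct{ conditions []Condition }

func (m *untypedMonitor) GetCollectedData() (interface{}, error) {
	return m.conditions, nil
}

// typedMonitor exposes the same data with a concrete return type.
type typedMonitor struct{ conditions []Condition }

// GetConditions returns a copy so callers cannot mutate the monitor's state.
func (m *typedMonitor) GetConditions() []Condition {
	out := make([]Condition, len(m.conditions))
	copy(out, m.conditions)
	return out
}

func main() {
	conds := []Condition{{Type: "KernelModulesLoaded", Status: "True"}}

	raw, _ := (&untypedMonitor{conditions: conds}).GetCollectedData()
	asserted, ok := raw.([]Condition) // the assertion can fail at runtime
	fmt.Println(ok, len(asserted))

	typed := (&typedMonitor{conditions: conds}).GetConditions() // checked at compile time
	fmt.Println(typed[0].Type)
}
```

If the monitor has to satisfy a shared interface that returns `interface{}`, the typed accessor could be offered alongside it rather than replacing it.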
controller/node_controller_test.go (1)

2340-2344: Use consistent naming conventions
environmentCheckMonitor is established alongside diskMonitor in the same function. Ensure consistent naming conventions or structuring, such as grouping all monitors, for readability and maintainability.

controller/monitor/fake_disk_monitor.go (1)

24-28: Clarify usage of contexts for concurrency
Within NewFakeDiskMonitor, you’re creating a context but only returning the constructed monitor. If you plan to start a polling loop or background task later, consider making that explicit, or return an additional “Start” method. This avoids confusion about whether the monitor is active by default.

controller/monitor/disk_monitor.go (3)

28-28: Make DiskMonitorSyncPeriod configurable
Hardcoding 30-second intervals is convenient for now, but consider a configuration mechanism if production usage requires dynamic intervals or environment-based overrides.

72-76: Defer the goroutine start until after construction
NewDiskMonitor launches m.Start() immediately in a goroutine. Instead, consider exposing separate "Start" and "Stop" methods so that callers control when the monitor is launched and can ensure all prerequisites are ready.

Line range hint 396-422: Review block vs filesystem branching
Multiple places switch on diskType. Validate that new disk types or future expansions follow the same logic or error out properly. Consider a shared function that handles unknown disk types in a single location for DRY compliance.
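The "Defer the goroutine start until after construction" nitpick above could be sketched as follows. This is a generic lifecycle pattern under assumed names (`monitor`, `newMonitor`, `syncFn`), not the actual `DiskMonitor` code: the constructor only builds the value, `Start` launches the loop, and `Stop` cancels it and waits for it to exit.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// monitor is a hypothetical polling monitor with an explicit lifecycle.
type monitor struct {
	period time.Duration
	syncFn func()
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// newMonitor only constructs the monitor; no goroutine is started here.
func newMonitor(period time.Duration, syncFn func()) *monitor {
	return &monitor{period: period, syncFn: syncFn}
}

// Start launches the polling loop. It is meant to be called once.
func (m *monitor) Start() {
	ctx, cancel := context.WithCancel(context.Background())
	m.cancel = cancel
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		ticker := time.NewTicker(m.period)
		defer ticker.Stop()
		for {
			m.syncFn() // run immediately, then once per tick
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}()
}

// Stop cancels the loop and blocks until the goroutine has exited.
func (m *monitor) Stop() {
	if m.cancel != nil {
		m.cancel()
		m.wg.Wait()
	}
}

func main() {
	runs := 0
	m := newMonitor(10*time.Millisecond, func() { runs++ })
	m.Start() // the caller decides when monitoring begins
	time.Sleep(35 * time.Millisecond)
	m.Stop() // after Stop returns, runs is no longer modified
	fmt.Println("sync runs:", runs)
}
```

Because `Stop` waits on the `WaitGroup`, reading `runs` after it returns is free of data races.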

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9649bed and 10cf797.

📒 Files selected for processing (6)

controller/monitor/disk_monitor.go (12 hunks)
controller/monitor/environment_check_monitor.go (2 hunks)
controller/monitor/fake_disk_monitor.go (1 hunks)
controller/monitor/fake_environment_check_monitor.go (1 hunks)
controller/node_controller.go (1 hunks)
controller/node_controller_test.go (19 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

controller/monitor/environment_check_monitor.go
controller/node_controller.go

🔇 Additional comments (21)

controller/monitor/fake_environment_check_monitor.go (1)

85-106: Validate handling of a missing or deleting node
The run() method relies on fetching the node object and logs an error if it cannot be retrieved. If the node is in the process of being deleted, consider handling that scenario explicitly. Otherwise, your logs might fill with repeated errors if the node stays unavailable.

controller/node_controller_test.go (17)

297-298: Same comment as lines 211-212

383-384: Same comment as lines 211-212

469-470: Same comment as lines 211-212

623-624: Same comment as lines 211-212

770-771: Same comment as lines 211-212

927-928: Same comment as lines 211-212

1049-1050: Same comment as lines 211-212

1195-1196: Same comment as lines 211-212

1289-1290: Same comment as lines 211-212

1383-1384: Same comment as lines 211-212

1472-1473: Same comment as lines 211-212

1561-1562: Same comment as lines 211-212

1650-1651: Same comment as lines 211-212

1739-1740: Same comment as lines 211-212

1834-1835: Same comment as lines 211-212

211-212: Verify environmentCheckMonitor is non-nil before invocation
Prior to calling environmentCheckMonitor.RunOnce(), consider verifying that the monitor is initialized. This avoids potential panics in edge cases if the controller’s monitor is unintentionally nil.

2334-2337: Check concurrency assumptions for diskMonitor
diskMonitor is assigned here after instantiation. Validate that no goroutine starts the monitor prior to this assignment, preventing a potential race condition if diskMonitor’s methods are called before fully assigned.

✅ Verification successful

Let me analyze the code to check for potential race conditions by searching for any goroutines that might access the diskMonitor before its assignment.

Let me search for more context about how the controller and monitor are initialized and used.

Let me try one more search to understand the controller's initialization and monitor usage pattern.

Let me try one final search to understand the monitor initialization and usage pattern.

Based on the gathered information, I can now provide a final response about the concurrency safety of the diskMonitor initialization.

No race condition risk with diskMonitor initialization
The monitor is safely initialized before use. The code shows that:

The monitor is initialized using NewFakeDiskMonitor and immediately assigned to nc.diskMonitor
The monitor uses a baseMonitor struct with proper synchronization primitives (mutex)
The only usage of diskMonitor in tests is through explicit RunOnce() calls, which happen after initialization
No evidence of background goroutines starting before the monitor assignment

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

#!/bin/bash
# Search for the diskMonitor usage in the test file and related files
rg -A 5 "diskMonitor" controller/node_controller_test.go

# Search for goroutine starts or go keyword usage near diskMonitor
rg -B 5 -A 5 "go \w+.*diskMonitor" controller/node_controller_test.go

# Look for the NewFakeDiskMonitor implementation to understand its behavior
ast-grep --pattern 'func NewFakeDiskMonitor($$$) ($_, $_) {
  $$$
}'

Length of output: 4863

Script:

#!/bin/bash
# Search for controller initialization and monitor usage
rg -B 10 -A 5 "nc.diskMonitor = diskMonitor" controller/node_controller_test.go

# Look for any Start or Run methods that might start goroutines
ast-grep --pattern 'func ($v *NodeController) Start($$$) {
  $$$
}'

# Search for monitor start/run patterns
rg -B 5 -A 5 "Start\(\)|Run\(\)" controller/monitor/fake_disk_monitor.go

Length of output: 720

Script:

#!/bin/bash
# Search for the test setup and controller initialization
rg -B 20 -A 10 "NewFakeNodeController" controller/node_controller_test.go

# Look for any monitor interface or base implementation
ast-grep --pattern 'type $_ interface {
  $$$
}'

# Search for RunOnce implementation
rg -B 5 -A 5 "func.*RunOnce" controller/monitor/fake_disk_monitor.go

Length of output: 65789

Script:

#!/bin/bash
# Search for monitor initialization and Start/Run methods
rg -B 5 -A 10 "type.*Monitor struct" controller/monitor/

# Look for any goroutine starts in the controller
rg -B 5 -A 5 "go \w+|Start\(\)" controller/node_controller_test.go

# Search for RunOnce usage context
rg -B 5 -A 5 "RunOnce\(\)" controller/node_controller_test.go

Length of output: 10995

controller/monitor/disk_monitor.go (3)

221-221: Validate concurrency in collectDiskData
collectDiskData calls m.newDiskServiceClients() and closes the clients with defer, which guarantees cleanup on every return path. Just ensure that no other goroutine references the clients after they have been closed; otherwise there is a race condition.

97-108: Confirm partial failures logging
The Start method logs an error and continues looping. If partial failures occur repeatedly, consider capturing them differently or adding a backoff strategy, so logs aren’t flooded.

136-139: Guard against node deletion
If the node is being deleted or is otherwise absent, the disk monitor logs an error. Repeatedly calling run() under those circumstances can lead to spammy logs. You may want to short-circuit gracefully when node deletion is detected.
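One way to read the "Guard against node deletion" comment is sketched below. The `node` type, `errNodeNotFound`, `errTransient`, and the `getNodeFunc` callback are hypothetical; in a real controller the checks would more likely be `apierrors.IsNotFound(err)` from `k8s.io/apimachinery/pkg/api/errors` and a non-nil `DeletionTimestamp`, which is an assumption about how this code base would express it.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for API-server responses.
var (
	errNodeNotFound = errors.New("node not found")
	errTransient    = errors.New("transient API error")
)

// node is a simplified, hypothetical node object.
type node struct{ deleting bool }

// getNodeFunc abstracts the datastore lookup for this sketch.
type getNodeFunc func(name string) (*node, error)

// run performs one monitor iteration. It returns nil without doing any work
// when the node is gone or being deleted, so the caller logs nothing.
func run(get getNodeFunc, name string) error {
	n, err := get(name)
	switch {
	case errors.Is(err, errNodeNotFound):
		return nil // node is gone: nothing to monitor
	case err != nil:
		return fmt.Errorf("failed to get node %s: %w", name, err)
	case n.deleting:
		return nil // node is being deleted: skip quietly
	}
	// ... collect disk data for the node here ...
	return nil
}

func main() {
	gone := func(string) (*node, error) { return nil, errNodeNotFound }
	broken := func(string) (*node, error) { return nil, errTransient }
	fmt.Println(run(gone, "node-1"))   // no error, nothing logged
	fmt.Println(run(broken, "node-1")) // genuine failures still surface
}
```

Only unexpected errors reach the caller's error log, which keeps repeated iterations on a deleted node from flooding it.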

mantissahz

LGTM

derekbit · 2024-12-23T15:05:54Z

@mergify backport v1.7.x v1.6.x

mergify · 2024-12-23T15:06:02Z

backport v1.7.x v1.6.x

✅ Backports have been created

#3395 fix(monitor): set monitors to nil after closing them (backport #3388) has been created for branch v1.7.x but encountered conflicts
#3396 fix(monitor): set monitors to nil after closing them (backport #3388) has been created for branch v1.6.x but encountered conflicts

derekbit self-assigned this Dec 23, 2024

derekbit changed the title ~~fix(monitor): make the warning and error messages more clear~~ fix(monitor): set monitors to nil after closing them Dec 23, 2024

derekbit marked this pull request as ready for review December 23, 2024 06:50

derekbit requested review from innobead, ChanYiLin, c3y1huang and james-munson December 23, 2024 06:51

derekbit force-pushed the issue-10035 branch from 094f527 to 9649bed Compare December 23, 2024 08:18

coderabbitai bot reviewed Dec 23, 2024

derekbit marked this pull request as draft December 23, 2024 09:29

derekbit force-pushed the issue-10035 branch from 9649bed to fb73905 Compare December 23, 2024 11:51

derekbit marked this pull request as ready for review December 23, 2024 11:52

derekbit added 2 commits December 23, 2024 19:53

fix(monitor): make the warning and error messages more clear

7ec7e9b

Longhorn 10035 Signed-off-by: Derek Su <[email protected]>

feat(monitor): start environment check immediately

93e049f

No need to delay the environment check. Longhorn 10035 Signed-off-by: Derek Su <[email protected]>

derekbit force-pushed the issue-10035 branch from fb73905 to 10cf797 Compare December 23, 2024 11:53

coderabbitai bot reviewed Dec 23, 2024

controller/monitor/disk_monitor.go (resolved)

mantissahz reviewed Dec 23, 2024

controller/monitor/environment_check_monitor.go (outdated, resolved)

fix(monitor): set monitors to nil after closing them

6ebe54b

Set monitors to nil after closing them, so the controller can recreate them when the node is back. Longhorn 10035 Signed-off-by: Derek Su <[email protected]>

derekbit force-pushed the issue-10035 branch from 10cf797 to 6ebe54b Compare December 23, 2024 13:58

derekbit requested a review from mantissahz December 23, 2024 13:58

mantissahz approved these changes Dec 23, 2024

derekbit merged commit a65a6ce into longhorn:master Dec 23, 2024
9 checks passed

mergify bot mentioned this pull request Dec 23, 2024

fix(monitor): set monitors to nil after closing them (backport #3388) #3394

Merged

derekbit mentioned this pull request Dec 23, 2024

[BUG][v1.8.x] Unable to add block disk after node deleted and added back longhorn/longhorn#10035

Closed

This was referenced Dec 23, 2024

fix(monitor): set monitors to nil after closing them (backport #3388) #3395

Closed

fix(monitor): set monitors to nil after closing them (backport #3388) #3396

Closed

coderabbitai bot mentioned this pull request Dec 23, 2024

fix(monitor): remove the close of monitors #3397

Merged
