Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(monitor): set monitors to nil after closing them #3388

Merged
merged 3 commits into from
Dec 23, 2024

Conversation

derekbit
Copy link
Member

Which issue(s) this PR fixes:

Issue longhorn/longhorn#10035

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

@derekbit derekbit self-assigned this Dec 23, 2024
Copy link

coderabbitai bot commented Dec 23, 2024

Walkthrough

The pull request introduces modifications to error handling and resource management across multiple files in the Longhorn controller. Changes include renaming the NodeMonitor to DiskMonitor, updating synchronization periods, and enhancing error logging in both disk and environment check monitors. The node controller has been adjusted to nullify monitor references during node deletion, preventing potential memory leaks and ensuring proper resource cleanup. Additionally, the testing framework for the node controller has been expanded to include environment checks.

Changes

File Change Summary
controller/monitor/disk_monitor.go Renamed NodeMonitor to DiskMonitor, updated method signatures and error handling, including specific logging for context cancellation and node retrieval issues.
controller/monitor/environment_check_monitor.go Adjusted NewEnvironmentCheckMonitor to use EnvironmentCheckMonitorSyncPeriod and refined error handling in the Start method.
controller/node_controller.go Enhanced cleanup in syncNode method by nullifying monitor references during node deletion.
controller/node_controller_test.go Expanded test conditions in NodeControllerSuite to validate additional node states with the inclusion of environmentCheckMonitor.
controller/monitor/fake_disk_monitor.go Renamed NewFakeNodeMonitor to NewFakeDiskMonitor and updated the return type and initialization logic.
controller/monitor/fake_environment_check_monitor.go Introduced FakeEnvironmentCheckMonitor struct with methods for monitoring environmental conditions.

Assessment against linked issues

Objective Addressed Explanation
Resolve node disk management issues [#10035]
Prevent resource leaks during node deletion
Improve error logging and monitoring

Possibly related PRs

Suggested reviewers

  • COLDTURNIP
  • innobead
  • ChanYiLin
  • mantissahz

Tip

CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command @coderabbitai generate docstrings to have CodeRabbit automatically generate docstrings for your pull request. We would love to hear your feedback on Discord.


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10cf797 and 6ebe54b.

📒 Files selected for processing (6)
  • controller/monitor/disk_monitor.go (12 hunks)
  • controller/monitor/environment_check_monitor.go (2 hunks)
  • controller/monitor/fake_disk_monitor.go (1 hunks)
  • controller/monitor/fake_environment_check_monitor.go (1 hunks)
  • controller/node_controller.go (1 hunks)
  • controller/node_controller_test.go (19 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • controller/monitor/environment_check_monitor.go
  • controller/node_controller.go
  • controller/monitor/fake_environment_check_monitor.go
👮 Files not reviewed due to content moderation or server errors (3)
  • controller/monitor/fake_disk_monitor.go
  • controller/monitor/disk_monitor.go
  • controller/node_controller_test.go

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@derekbit derekbit changed the title fix(monitor): make the warning and error messages more clear fix(monitor): set monitors to nil after closing them Dec 23, 2024
@derekbit derekbit marked this pull request as ready for review December 23, 2024 06:50
@derekbit
Copy link
Member Author

@mergify backport v1.8.x

Copy link

mergify bot commented Dec 23, 2024

backport v1.8.x

✅ Backports have been created

@derekbit
Copy link
Member Author

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (11)
controller/node_controller_test.go (11)

190-193: Consider factoring out repeated condition checks.

Lines 190-193 introduce repetitive checks for required packages, multipathd, kernel modules, and NFS client installation. This exact pattern also appears in other tests. Centralizing into a helper method or table-driven approach would reduce duplication and improve clarity.


278-281: Repeated condition checks detected again.

Similarly to lines 190-193, these lines reintroduce the same block of condition checks. Refactoring into a helper method would avoid repetition.


366-369: Repeated condition checks detected again.

Repeating these four lines across multiple test methods suggests consolidating them into a shared, reusable function or table-driven approach.


454-457: Repeated condition checks detected again.

This block mirrors earlier condition checks. A common function or data structure could streamline these tests.


577-580: Repeated condition checks detected again.

The same reasoning applies here; applying DRY (Don’t Repeat Yourself) principle would enhance maintainability.


730-733: Repeated condition checks detected again.

Refactor this recurring pattern into a helper method to prevent code bloat.


889-892: Repeated condition checks detected again.

Once more, consider consolidating these lines into a singular, reusable function or table-driven test to reduce duplication.


1019-1022: Repeated condition checks detected again.

These lines illustrate the same test logic used in the prior segments. A shared helper method can improve long-term maintainability.


1166-1169: Repeated condition checks detected again.

Refactoring these lines into a common helper would make future updates easier and help avoid potential inconsistencies across tests.


1283-1286: Repeated condition checks detected again.

As before, centralizing the condition checks will simplify maintenance and testing.


2123-2126: Repeated condition checks detected again.

It would be beneficial to unify this pattern with previous blocks. Consolidation ensures consistency across tests and reduces duplicate upkeep.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9b6e48 and 9649bed.

📒 Files selected for processing (4)
  • controller/monitor/disk_monitor.go (2 hunks)
  • controller/monitor/environment_check_monitor.go (1 hunks)
  • controller/node_controller.go (1 hunks)
  • controller/node_controller_test.go (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • controller/node_controller.go
  • controller/monitor/environment_check_monitor.go
  • controller/monitor/disk_monitor.go

@derekbit derekbit marked this pull request as draft December 23, 2024 09:29
@derekbit derekbit marked this pull request as ready for review December 23, 2024 11:52
No need to delay the environment check.

Longhorn 10035

Signed-off-by: Derek Su <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (7)
controller/monitor/fake_environment_check_monitor.go (2)

46-60: Ensure clear logging upon normal PollUntilContextCancel termination
When the context finishes naturally (instead of failing with an error), consider emitting a final info log message. This helps distinguish normal shutdowns from unexpected terminations.

 func (m *FakeEnvironmentCheckMonitor) Start() {
   if err := wait.PollUntilContextCancel(m.ctx, m.syncPeriod, true, func(context.Context) (bool, error) {
     if err := m.run(struct{}{}); err != nil {
       m.logger.WithError(err).Error("Stopped monitoring environment check")
     }
     return false, nil
   }); err != nil {
     if errors.Is(err, context.Canceled) {
       m.logger.WithError(err).Warning("Environment check monitor is stopped")
+      m.logger.Info("Environment check monitor gracefully exited")
     } else {
       m.logger.WithError(err).Error("Failed to start environment check monitor")
     }
   }
 }

73-83: Consider returning strongly typed data
Currently, GetCollectedData() returns an interface{} containing a slice of Conditions. For better clarity and type-safety, you might return a well-defined struct or slice type. This enhances readability and reduces type assertion steps for consumers of this method.

controller/node_controller_test.go (1)

2340-2344: Use consistent naming conventions
environmentCheckMonitor is established alongside diskMonitor in the same function. Ensure consistent naming conventions or structuring, such as grouping all monitors, for readability and maintainability.

controller/monitor/fake_disk_monitor.go (1)

24-28: Clarify usage of contexts for concurrency
Within NewFakeDiskMonitor, you’re creating a context but only returning the constructed monitor. If you plan to start a polling loop or background task later, consider making that explicit, or return an additional “Start” method. This avoids confusion about whether the monitor is active by default.

controller/monitor/disk_monitor.go (3)

28-28: Make DiskMonitorSyncPeriod configurable
Hardcoding 30-second intervals is convenient for now, but consider a configuration mechanism if production usage requires dynamic intervals or environment-based overrides.


72-76: Defer the goroutine start until after construction
NewDiskMonitor launches m.Start() immediately via go routine. Instead, consider exposing separate “Start” and “Stop” methods so that users can manage when the monitor is launched and ensure all prerequisites are ready.


Line range hint 396-422: Review block vs filesystem branching
Multiple places switch on diskType. Validate that new disk types or future expansions follow the same logic or error out properly. Consider a shared function that handles unknown disk types in a single location for DRY compliance.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9649bed and 10cf797.

📒 Files selected for processing (6)
  • controller/monitor/disk_monitor.go (12 hunks)
  • controller/monitor/environment_check_monitor.go (2 hunks)
  • controller/monitor/fake_disk_monitor.go (1 hunks)
  • controller/monitor/fake_environment_check_monitor.go (1 hunks)
  • controller/node_controller.go (1 hunks)
  • controller/node_controller_test.go (19 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • controller/monitor/environment_check_monitor.go
  • controller/node_controller.go
🔇 Additional comments (21)
controller/monitor/fake_environment_check_monitor.go (1)

85-106: Validate handle for missing or deleting node
The run() method relies on fetching the node object and logs an error if it cannot be retrieved. If the node is in the process of being deleted, consider handling that scenario explicitly. Otherwise, your logs might fill with repeated errors if the node stays unavailable.

controller/node_controller_test.go (17)

297-298: Same comment as lines 211-212


383-384: Same comment as lines 211-212


469-470: Same comment as lines 211-212


623-624: Same comment as lines 211-212


770-771: Same comment as lines 211-212


927-928: Same comment as lines 211-212


1049-1050: Same comment as lines 211-212


1195-1196: Same comment as lines 211-212


1289-1290: Same comment as lines 211-212


1383-1384: Same comment as lines 211-212


1472-1473: Same comment as lines 211-212


1561-1562: Same comment as lines 211-212


1650-1651: Same comment as lines 211-212


1739-1740: Same comment as lines 211-212


1834-1835: Same comment as lines 211-212


211-212: Verify environmentCheckMonitor is non-nil before invocation
Prior to calling environmentCheckMonitor.RunOnce(), consider verifying that the monitor is initialized. This avoids potential panics in edge cases if the controller’s monitor is unintentionally nil.


2334-2337: Check concurrency assumptions for diskMonitor
diskMonitor is assigned here after instantiation. Validate that no goroutine starts the monitor prior to this assignment, preventing a potential race condition if diskMonitor’s methods are called before fully assigned.

✅ Verification successful

Let me analyze the code to check for potential race conditions by searching for any goroutines that might access the diskMonitor before its assignment.


Let me search for more context about how the controller and monitor are initialized and used.


Let me try one more search to understand the controller's initialization and monitor usage pattern.


Let me try one final search to understand the monitor initialization and usage pattern.


Based on the gathered information, I can now provide a final response about the concurrency safety of the diskMonitor initialization.

No race condition risk with diskMonitor initialization
The monitor is safely initialized before use. The code shows that:

  1. The monitor is initialized using NewFakeDiskMonitor and immediately assigned to nc.diskMonitor
  2. The monitor uses a baseMonitor struct with proper synchronization primitives (mutex)
  3. The only usage of diskMonitor in tests is through explicit RunOnce() calls, which happen after initialization
  4. No evidence of background goroutines starting before the monitor assignment
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for the diskMonitor usage in the test file and related files
rg -A 5 "diskMonitor" controller/node_controller_test.go

# Search for goroutine starts or go keyword usage near diskMonitor
rg -B 5 -A 5 "go \w+.*diskMonitor" controller/node_controller_test.go

# Look for the NewFakeDiskMonitor implementation to understand its behavior
ast-grep --pattern 'func NewFakeDiskMonitor($$$) ($_, $_) {
  $$$
}'

Length of output: 4863


Script:

#!/bin/bash
# Search for controller initialization and monitor usage
rg -B 10 -A 5 "nc.diskMonitor = diskMonitor" controller/node_controller_test.go

# Look for any Start or Run methods that might start goroutines
ast-grep --pattern 'func ($v *NodeController) Start($$$) {
  $$$
}'

# Search for monitor start/run patterns
rg -B 5 -A 5 "Start\(\)|Run\(\)" controller/monitor/fake_disk_monitor.go

Length of output: 720


Script:

#!/bin/bash
# Search for the test setup and controller initialization
rg -B 20 -A 10 "NewFakeNodeController" controller/node_controller_test.go

# Look for any monitor interface or base implementation
ast-grep --pattern 'type $_ interface {
  $$$
}'

# Search for RunOnce implementation
rg -B 5 -A 5 "func.*RunOnce" controller/monitor/fake_disk_monitor.go

Length of output: 65789


Script:

#!/bin/bash
# Search for monitor initialization and Start/Run methods
rg -B 5 -A 10 "type.*Monitor struct" controller/monitor/

# Look for any goroutine starts in the controller
rg -B 5 -A 5 "go \w+|Start\(\)" controller/node_controller_test.go

# Search for RunOnce usage context
rg -B 5 -A 5 "RunOnce\(\)" controller/node_controller_test.go

Length of output: 10995

controller/monitor/disk_monitor.go (3)

221-221: Validate concurrency in collectDiskData
collectDiskData calls m.newDiskServiceClients() and closes them with defer, which is correct for concurrency. Just ensure that no other threads reference the clients after they’ve been closed, so there isn’t a race condition.


97-108: Confirm partial failures logging
The Start method logs an error and continues looping. If partial failures occur repeatedly, consider capturing them differently or adding a backoff strategy, so logs aren’t flooded.


136-139: Guard against node deletion
If the node is being deleted or is otherwise absent, the disk monitor logs an error. Repeatedly calling run() under those circumstances can lead to spammy logs. You may want to short-circuit gracefully when node deletion is detected.

controller/monitor/disk_monitor.go Show resolved Hide resolved
Set monitors to nil after closing them, so the controller can recreate
them when the node is back.

Longhorn 10035

Signed-off-by: Derek Su <[email protected]>
Copy link
Contributor

@mantissahz mantissahz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekbit
Copy link
Member Author

@mergify backport v1.7.x v1.6.x

Copy link

mergify bot commented Dec 23, 2024

backport v1.7.x v1.6.x

✅ Backports have been created

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants