fix: move environment checks to a dedicated monitor #3369

derekbit · 2024-12-20T05:45:55Z

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9983

What this PR does / why we need it:

Move environment checks to a dedicated monitor
Remove hugepage condition after disabling v2 data engine
Increase the environment check period to 30 minutes

Special notes for your reviewer:

Additional documentation or context

coderabbitai · 2024-12-20T05:46:03Z

Warning

Rate limit exceeded

@derekbit has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 3 minutes and 54 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 0560dc8 and d01d271.

📒 Files selected for processing (9)

controller/backup_backing_image_controller.go (1 hunks)
controller/monitor/disk_monitor.go (1 hunks)
controller/monitor/environment_check_monitor.go (1 hunks)
controller/monitor/monitor.go (1 hunks)
controller/monitor/snapshot_monitor.go (1 hunks)
controller/node_controller.go (7 hunks)
controller/node_controller_test.go (0 hunks)
engineapi/backup_backing_image.go (1 hunks)
types/condition.go (1 hunks)

Walkthrough

The pull request introduces significant changes to the Longhorn monitoring infrastructure, focusing on enhancing environment checks, error handling, and resource monitoring. Key modifications include renaming Close() methods to Stop() across multiple monitor implementations, adding a new EnvironmentCheckMonitor to periodically check node environment conditions, and improving error reporting and data collection mechanisms. The changes aim to provide more robust monitoring and better visibility into node and system configurations.

Changes

File	Change Summary
`controller/monitor/monitor.go`	Renamed `Close()` method to `Stop()` in `Monitor` interface
`controller/monitor/disk_monitor.go`	Updated `Close()` to `Stop()`, enhanced error handling and data collection methods
`controller/monitor/snapshot_monitor.go`	Renamed `Close()` to `Stop()`, improved task processing and configuration management
`controller/monitor/backup_backing_image_controller.go`	Updated method to use `monitor.Stop()` instead of `monitor.Close()`
`engineapi/backup_backing_image.go`	Renamed `Close()` to `Stop()` in `BackupBackingImageMonitor`
`controller/node_controller.go`	Added `environmentCheckMonitor`, updated monitoring logic
`controller/monitor/environment_check_monitor.go`	New file introducing `EnvironmentCheckMonitor` for periodic environment checks
`types/condition.go`	Added `RemoveCondition()` function for managing conditions

Assessment against linked issues

Objective	Addressed	Explanation
Periodic HugePages (2Mi) configuration check [#9983]	✅
Check HugePages registered as Kubernetes resource	✅
Update node condition for HugePages availability	✅

Possibly related PRs

fix(backup): modify backup volume/volume APIs and naming #3360: Relates to backup volume API modifications, which shares similarities with the backup backing image error handling improvements in this PR.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

controller/monitor/snapshot_monitor.go (1)

Line range hint 259-284: Consider potential race condition in configuration update

There's a potential race condition between checking the scheduler length and removing the job. The lock should encompass both operations to ensure thread safety.

-	m.RLock()
-	if dataIntegrityCronJob == m.existingDataIntegrityCronJob && m.checkScheduler.Len() > 0 {
-		m.RUnlock()
-		return nil
-	}
-	m.RUnlock()
-
-	if m.checkScheduler.Len() > 0 {
-		m.checkScheduler.Remove(m.checkSnapshots)
-	}
+	m.Lock()
+	defer m.Unlock()
+	
+	if dataIntegrityCronJob == m.existingDataIntegrityCronJob && m.checkScheduler.Len() > 0 {
+		return nil
+	}
+	
+	if m.checkScheduler.Len() > 0 {
+		m.checkScheduler.Remove(m.checkSnapshots)
+	}

-	job, err := m.checkScheduler.Cron(dataIntegrityCronJob).Do(m.checkSnapshots)
+	job, err := m.checkScheduler.Cron(dataIntegrityCronJob).Do(m.checkSnapshots)
 	if err != nil {
 		return errors.Wrap(err, "failed to schedule snapshot check job")
 	}
-
-	m.Lock()
-	previousDataIntegrityCronJob := m.existingDataIntegrityCronJob
-	m.existingDataIntegrityCronJob = dataIntegrityCronJob
-	m.Unlock()
+	
+	previousDataIntegrityCronJob := m.existingDataIntegrityCronJob
+	m.existingDataIntegrityCronJob = dataIntegrityCronJob

🧹 Nitpick comments (11)

controller/monitor/environment_check_monitor.go (4)

82-91: Consider a more robust restart mechanism
Currently, if the polling loop exits due to an error, it's logged but the monitor doesn't attempt to recover, leading to permanent monitoring loss. Consider implementing a retry or backoff to automatically restart polling in case of transient errors.

105-115: Document concurrency considerations for GetCollectedData
The code uses RLock and defers RUnlock for data retrieval, which is good. You might document that the returned data is a copy so that consumers know modifying it does not affect the original state.

173-260: Gracefully handle unknown package managers
You return early with a condition about an unknown OS image, but in certain edge cases, there might be additional fallback logic or user alerts. For example, consider offering a helpful error message or recommended manual steps for unrecognized distributions.

358-389: Veifying necessity of huge pages check
HugePages checks are relevant only when the v2-data-engine setting is true. You correctly skip the check if the engine is disabled, but if there's any plan to add more dynamic logic, consider documenting future use cases for these checks.

controller/node_controller.go (5)

59-60: Initialize monitors on NodeController creation
Two new fields (diskMonitor, environmentCheckMonitor) are introduced but only get created lazily. This is fine, but you might consider handling potential constructor errors to avoid partially initialized states if there's a problem.

459-462: Validate disk monitor creation
The code tries to create the disk monitor every sync cycle. Consider short-circuiting repeated attempts to avoid logs spam or repeated goroutine overhead if the monitor repeatedly fails to initialize.

690-717: Possibility of removing conditions
syncEnvironmentCheckConditions only adds or updates environment-based conditions. If conditions become stale and are no longer relevant, consider removing them from the node status to avoid confusion.

1305-1318: Reuse or unify monitor factory methods
Both createDiskMonitor and createEnvironmentCheckMonitor share a similar pattern. Consider extracting a helper or factory to broaden code reusability, especially if more monitors get introduced.

1470-1478: Ensure type checking for returned environment conditions
syncWithEnvironmentCheckMonitor unconditionally casts to []longhorn.Condition. This code is safe if you’re certain the monitor always returns that type. Consider returning a typed structure or checking the assertion with caution.
types/condition.go (1)
12-25: LGTM! Consider adding godoc.

The implementation is correct and handles edge cases well. Consider adding godoc to document the function's purpose and parameters.
+// RemoveCondition removes the condition with the specified type from the conditions slice.
+// If conditions is nil, returns nil. If no matching condition is found, returns the original slice.
+// Parameters:
+//   - conditions: slice of conditions to filter
+//   - conditionType: type of condition to remove
+// Returns: new slice with the matching condition removed
 func RemoveCondition(conditions []longhorn.Condition, conditionType string) []longhorn.Condition {
controller/monitor/snapshot_monitor.go (1)
Line range hint 319-332: Consider logging errors from getSnapshotDataIntegrity

While the error handling flow is correct, errors from getSnapshotDataIntegrity should be logged before returning to help with debugging potential issues.
 	dataIntegrity, err := m.getSnapshotDataIntegrity(task.volumeName)
 	if err != nil {
+		m.logger.WithField("monitor", monitorName).WithError(err).
+			Errorf("Failed to get snapshot data integrity for volume %v", task.volumeName)
 		return true
 	}

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between def8a3b and d0a65d0.

📒 Files selected for processing (8)

controller/backup_backing_image_controller.go (1 hunks)
controller/monitor/disk_monitor.go (1 hunks)
controller/monitor/environment_check_monitor.go (1 hunks)
controller/monitor/monitor.go (1 hunks)
controller/monitor/snapshot_monitor.go (1 hunks)
controller/node_controller.go (7 hunks)
engineapi/backup_backing_image.go (1 hunks)
types/condition.go (1 hunks)

🔇 Additional comments (8)

controller/monitor/disk_monitor.go (1)

108-108: Renamed method from Close() to Stop() aligns with usage convention
This rename maintains consistency across other monitors and clarifies the method's purpose. The method simply calls the cancel function.

controller/monitor/environment_check_monitor.go (1)

526-556: Confirm tolerance for older NFS versions
The check focuses on NFS v4.x. If you expect or must handle older versions (e.g., NFSv3), consider extending or clarifying. Otherwise, this coverage is acceptable.

controller/node_controller.go (1)

378-389: Ensure concurrency-safety when stopping monitors
Stopping each monitor upon node deletion is correct. However, if multiple code paths or goroutines also call Stop, double-canceling the same context might produce spurious logs. Confirm concurrency safety if multiple cancellations happen.

controller/monitor/monitor.go (1)

14-14: Stop method name improves clarity
Replacing Close() with Stop() consistently unifies naming. Ensure all references in docstrings or readmes are also updated to reduce confusion.

engineapi/backup_backing_image.go (1)

268-270: LGTM! Method rename is consistent.

The rename from Close to Stop aligns with the monitor interface changes and maintains the same functionality.

controller/backup_backing_image_controller.go (1)

472-472: LGTM! Method call updated consistently.

The update from monitor.Close() to monitor.Stop() maintains consistency with the monitor interface changes.

controller/monitor/snapshot_monitor.go (2)

Line range hint 245-250: LGTM: Clean implementation of the Stop method

The renamed method follows a proper shutdown sequence by cleaning up resources in the correct order: logging, shutting down the task queue, stopping the scheduler, and finally quitting the monitor.

Line range hint 334-609: LGTM: Robust snapshot integrity check implementation

The snapshot integrity check implementation is thorough and well-designed:

Comprehensive validation before processing
Proper handling of corrupted replicas
Robust checksum determination with voting mechanism
Clear error messages and appropriate event recording

controller/node_controller.go

c3y1huang

LGTM

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

controller/node_controller.go (1)
459-462: Consider error handling improvement

While the environment check monitor creation is implemented correctly, consider adding debug logging for monitor creation failures to aid troubleshooting.
 if _, err := nc.createEnvironmentCheckMonitor(); err != nil {
+    nc.logger.WithError(err).Debug("Failed to create environment check monitor")
     return err
 }

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97b067f and 0560dc8.

📒 Files selected for processing (1)

controller/node_controller.go (7 hunks)

🔇 Additional comments (5)

controller/node_controller.go (5)

59-60: LGTM: Monitor fields properly added

The addition of environmentCheckMonitor field alongside existing monitors follows consistent patterns and maintains good code organization.

378-390: LGTM: Proper cleanup of all monitors during node deletion

The code correctly stops all monitors (disk, environment check, and snapshot) during node deletion, preventing resource leaks.

478-483: Best-effort approach for environment checks

The environment check conditions are updated only if syncWithEnvironmentCheckMonitor succeeds. If that monitor fails or times out, the node conditions remain stale.

1305-1318: LGTM: Environment check monitor creation implementation

The implementation follows the established pattern used by other monitors, with proper nil checks and error handling.

690-717: Verify v2-data-engine setting access error handling

The code handles v2-data-engine setting access error by defaulting to false, but consider if this is the desired behavior in all cases.

✅ Verification successful

Based on the search results, I can now provide a final assessment of the v2-data-engine setting error handling:

Error handling for v2-data-engine setting is consistent with codebase patterns

The current error handling approach (defaulting to false) is the correct behavior because:

It matches the consistent pattern used across the codebase where v2-data-engine is treated as an optional feature that defaults to disabled when there are issues
The error handling is defensive and safe - defaulting to false prevents enabling v2 data engine features when the setting cannot be reliably determined
Other components in the codebase (node validator, environment check monitor, etc.) handle the setting error in the same way by defaulting to false
The setting is used for feature gating, so defaulting to false on error maintains system stability by falling back to the base v1 functionality

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if v2-data-engine setting is used elsewhere with different error handling
rg -A 3 "GetSettingAsBool.*v2-data-engine"

Length of output: 44

Script:

#!/bin/bash
# Search for v2-data-engine setting usage with more context and variations
rg -i "v2.*data.*engine" -A 5 -B 2

Length of output: 70251

Script:

#!/bin/bash
# Search for GetSettingAsBool usage patterns to understand error handling
rg "GetSettingAsBool" -A 3

Length of output: 19638

Script:

#!/bin/bash
# Look for setting type definition
ast-grep --pattern 'const SettingNameV2DataEngine = $_'

Length of output: 57

controller/node_controller.go

- Move environment checks to a dedicated monitor - Remove hugepage condition after disabling v2 data engine Longhorn 9983 Signed-off-by: Derek Su <[email protected]>

The environment check is best-effort, so no need to check too frequently. Longhorn 9983 Signed-off-by: Derek Su <[email protected]>

Longhorn 9983 Signed-off-by: Derek Su <[email protected]>

controller/monitor/environment_check_monitor.go

mantissahz

LGTM

derekbit · 2024-12-20T07:34:52Z

@mergify backport v1.8.x

mergify · 2024-12-20T07:35:09Z

backport v1.8.x

✅ Backports have been created

#3370 fix: move environment checks to a dedicated monitor (backport #3369) has been created for branch v1.8.x

derekbit · 2024-12-20T08:18:19Z

@mergify backport v1.7.x

mergify · 2024-12-20T08:18:30Z

backport v1.7.x

✅ Backports have been created

#3371 fix: move environment checks to a dedicated monitor (backport #3369) has been created for branch v1.7.x but encountered conflicts

derekbit requested review from COLDTURNIP, innobead, ChanYiLin, mantissahz and c3y1huang December 20, 2024 05:45

derekbit self-assigned this Dec 20, 2024

coderabbitai bot reviewed Dec 20, 2024

View reviewed changes

controller/node_controller.go Show resolved Hide resolved

c3y1huang previously approved these changes Dec 20, 2024

View reviewed changes

derekbit dismissed c3y1huang’s stale review via 97b067f December 20, 2024 06:08

derekbit requested a review from c3y1huang December 20, 2024 06:28

coderabbitai bot reviewed Dec 20, 2024

View reviewed changes

controller/node_controller.go Outdated Show resolved Hide resolved

derekbit force-pushed the issue-9983 branch from 0560dc8 to 091f5ca Compare December 20, 2024 06:40

derekbit added 4 commits December 20, 2024 14:41

fix: move environment checks to a dedicated monitor

3ca5a02

- Move environment checks to a dedicated monitor - Remove hugepage condition after disabling v2 data engine Longhorn 9983 Signed-off-by: Derek Su <[email protected]>

fix: increase the environment check period to 30 minutes

ceaffc0

The environment check is best-effort, so no need to check too frequently. Longhorn 9983 Signed-off-by: Derek Su <[email protected]>

fix: update node_controller_test.go

8b60565

Longhorn 9983 Signed-off-by: Derek Su <[email protected]>

fix: change info message to warning level

d01d271

Longhorn 9983 Signed-off-by: Derek Su <[email protected]>

derekbit force-pushed the issue-9983 branch from 091f5ca to d01d271 Compare December 20, 2024 06:42

mantissahz reviewed Dec 20, 2024

View reviewed changes

controller/monitor/environment_check_monitor.go Show resolved Hide resolved

c3y1huang approved these changes Dec 20, 2024

View reviewed changes

mantissahz approved these changes Dec 20, 2024

View reviewed changes

mergify bot merged commit 1b74376 into longhorn:master Dec 20, 2024
9 checks passed

mergify bot mentioned this pull request Dec 20, 2024

fix: move environment checks to a dedicated monitor (backport #3369) #3370

Merged

mergify bot mentioned this pull request Dec 20, 2024

fix: move environment checks to a dedicated monitor (backport #3369) #3371

Merged

This was referenced Dec 23, 2024

fix(monitor): set monitors to nil after closing them #3388

Merged

fix(monitor): remove the close of monitors #3397

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: move environment checks to a dedicated monitor #3369

fix: move environment checks to a dedicated monitor #3369

derekbit commented Dec 20, 2024 •

edited

Loading

coderabbitai bot commented Dec 20, 2024 •

edited

Loading

Rate limit exceeded

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

coderabbitai bot left a comment

c3y1huang left a comment

coderabbitai bot left a comment

mantissahz left a comment

derekbit commented Dec 20, 2024

mergify bot commented Dec 20, 2024 •

edited

Loading

derekbit commented Dec 20, 2024

mergify bot commented Dec 20, 2024 •

edited

Loading

fix: move environment checks to a dedicated monitor #3369

fix: move environment checks to a dedicated monitor #3369

Conversation

derekbit commented Dec 20, 2024 • edited Loading

Which issue(s) this PR fixes:

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

coderabbitai bot commented Dec 20, 2024 • edited Loading

Rate limit exceeded

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

c3y1huang left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

mantissahz left a comment

Choose a reason for hiding this comment

derekbit commented Dec 20, 2024

mergify bot commented Dec 20, 2024 • edited Loading

✅ Backports have been created

derekbit commented Dec 20, 2024

mergify bot commented Dec 20, 2024 • edited Loading

✅ Backports have been created

derekbit commented Dec 20, 2024 •

edited

Loading

coderabbitai bot commented Dec 20, 2024 •

edited

Loading

mergify bot commented Dec 20, 2024 •

edited

Loading

mergify bot commented Dec 20, 2024 •

edited

Loading