-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: move environment checks to a dedicated monitor #3369
Conversation
Warning Rate limit exceeded@derekbit has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 3 minutes and 54 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (9)
WalkthroughThe pull request introduces significant changes to the Longhorn monitoring infrastructure, focusing on enhancing environment checks, error handling, and resource monitoring. Key modifications include renaming Changes
Assessment against linked issues
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
controller/monitor/snapshot_monitor.go (1)
Line range hint
259-284
: Consider potential race condition in configuration updateThere's a potential race condition between checking the scheduler length and removing the job. The lock should encompass both operations to ensure thread safety.
- m.RLock() - if dataIntegrityCronJob == m.existingDataIntegrityCronJob && m.checkScheduler.Len() > 0 { - m.RUnlock() - return nil - } - m.RUnlock() - - if m.checkScheduler.Len() > 0 { - m.checkScheduler.Remove(m.checkSnapshots) - } + m.Lock() + defer m.Unlock() + + if dataIntegrityCronJob == m.existingDataIntegrityCronJob && m.checkScheduler.Len() > 0 { + return nil + } + + if m.checkScheduler.Len() > 0 { + m.checkScheduler.Remove(m.checkSnapshots) + } - job, err := m.checkScheduler.Cron(dataIntegrityCronJob).Do(m.checkSnapshots) + job, err := m.checkScheduler.Cron(dataIntegrityCronJob).Do(m.checkSnapshots) if err != nil { return errors.Wrap(err, "failed to schedule snapshot check job") } - - m.Lock() - previousDataIntegrityCronJob := m.existingDataIntegrityCronJob - m.existingDataIntegrityCronJob = dataIntegrityCronJob - m.Unlock() + + previousDataIntegrityCronJob := m.existingDataIntegrityCronJob + m.existingDataIntegrityCronJob = dataIntegrityCronJob
🧹 Nitpick comments (11)
controller/monitor/environment_check_monitor.go (4)
82-91
: Consider a more robust restart mechanism
Currently, if the polling loop exits due to an error, it's logged but the monitor doesn't attempt to recover, leading to permanent monitoring loss. Consider implementing a retry or backoff to automatically restart polling in case of transient errors.
105-115
: Document concurrency considerations for GetCollectedData
The code uses RLock and defers RUnlock for data retrieval, which is good. You might document that the returned data is a copy so that consumers know modifying it does not affect the original state.
173-260
: Gracefully handle unknown package managers
You return early with a condition about an unknown OS image, but in certain edge cases, there might be additional fallback logic or user alerts. For example, consider offering a helpful error message or recommended manual steps for unrecognized distributions.
358-389
: Veifying necessity of huge pages check
HugePages checks are relevant only when the v2-data-engine setting is true. You correctly skip the check if the engine is disabled, but if there's any plan to add more dynamic logic, consider documenting future use cases for these checks.controller/node_controller.go (5)
59-60
: Initialize monitors on NodeController creation
Two new fields (diskMonitor, environmentCheckMonitor) are introduced but only get created lazily. This is fine, but you might consider handling potential constructor errors to avoid partially initialized states if there's a problem.
459-462
: Validate disk monitor creation
The code tries to create the disk monitor every sync cycle. Consider short-circuiting repeated attempts to avoid logs spam or repeated goroutine overhead if the monitor repeatedly fails to initialize.
690-717
: Possibility of removing conditions
syncEnvironmentCheckConditions only adds or updates environment-based conditions. If conditions become stale and are no longer relevant, consider removing them from the node status to avoid confusion.
1305-1318
: Reuse or unify monitor factory methods
Both createDiskMonitor and createEnvironmentCheckMonitor share a similar pattern. Consider extracting a helper or factory to broaden code reusability, especially if more monitors get introduced.
1470-1478
: Ensure type checking for returned environment conditions
syncWithEnvironmentCheckMonitor unconditionally casts to []longhorn.Condition. This code is safe if you’re certain the monitor always returns that type. Consider returning a typed structure or checking the assertion with caution.types/condition.go (1)
12-25
: LGTM! Consider adding godoc.The implementation is correct and handles edge cases well. Consider adding godoc to document the function's purpose and parameters.
+// RemoveCondition removes the condition with the specified type from the conditions slice. +// If conditions is nil, returns nil. If no matching condition is found, returns the original slice. +// Parameters: +// - conditions: slice of conditions to filter +// - conditionType: type of condition to remove +// Returns: new slice with the matching condition removed func RemoveCondition(conditions []longhorn.Condition, conditionType string) []longhorn.Condition {controller/monitor/snapshot_monitor.go (1)
Line range hint
319-332
: Consider logging errors from getSnapshotDataIntegrityWhile the error handling flow is correct, errors from
getSnapshotDataIntegrity
should be logged before returning to help with debugging potential issues.dataIntegrity, err := m.getSnapshotDataIntegrity(task.volumeName) if err != nil { + m.logger.WithField("monitor", monitorName).WithError(err). + Errorf("Failed to get snapshot data integrity for volume %v", task.volumeName) return true }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
controller/backup_backing_image_controller.go
(1 hunks)controller/monitor/disk_monitor.go
(1 hunks)controller/monitor/environment_check_monitor.go
(1 hunks)controller/monitor/monitor.go
(1 hunks)controller/monitor/snapshot_monitor.go
(1 hunks)controller/node_controller.go
(7 hunks)engineapi/backup_backing_image.go
(1 hunks)types/condition.go
(1 hunks)
🔇 Additional comments (8)
controller/monitor/disk_monitor.go (1)
108-108
: Renamed method from Close() to Stop() aligns with usage convention
This rename maintains consistency across other monitors and clarifies the method's purpose. The method simply calls the cancel function.
controller/monitor/environment_check_monitor.go (1)
526-556
: Confirm tolerance for older NFS versions
The check focuses on NFS v4.x. If you expect or must handle older versions (e.g., NFSv3), consider extending or clarifying. Otherwise, this coverage is acceptable.
controller/node_controller.go (1)
378-389
: Ensure concurrency-safety when stopping monitors
Stopping each monitor upon node deletion is correct. However, if multiple code paths or goroutines also call Stop, double-canceling the same context might produce spurious logs. Confirm concurrency safety if multiple cancellations happen.
controller/monitor/monitor.go (1)
14-14
: Stop method name improves clarity
Replacing Close() with Stop() consistently unifies naming. Ensure all references in docstrings or readmes are also updated to reduce confusion.
engineapi/backup_backing_image.go (1)
268-270
: LGTM! Method rename is consistent.
The rename from Close
to Stop
aligns with the monitor interface changes and maintains the same functionality.
controller/backup_backing_image_controller.go (1)
472-472
: LGTM! Method call updated consistently.
The update from monitor.Close()
to monitor.Stop()
maintains consistency with the monitor interface changes.
controller/monitor/snapshot_monitor.go (2)
Line range hint 245-250
: LGTM: Clean implementation of the Stop method
The renamed method follows a proper shutdown sequence by cleaning up resources in the correct order: logging, shutting down the task queue, stopping the scheduler, and finally quitting the monitor.
Line range hint 334-609
: LGTM: Robust snapshot integrity check implementation
The snapshot integrity check implementation is thorough and well-designed:
- Comprehensive validation before processing
- Proper handling of corrupted replicas
- Robust checksum determination with voting mechanism
- Clear error messages and appropriate event recording
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
controller/node_controller.go (1)
459-462
: Consider error handling improvementWhile the environment check monitor creation is implemented correctly, consider adding debug logging for monitor creation failures to aid troubleshooting.
if _, err := nc.createEnvironmentCheckMonitor(); err != nil { + nc.logger.WithError(err).Debug("Failed to create environment check monitor") return err }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
controller/node_controller.go
(7 hunks)
🔇 Additional comments (5)
controller/node_controller.go (5)
59-60
: LGTM: Monitor fields properly added
The addition of environmentCheckMonitor
field alongside existing monitors follows consistent patterns and maintains good code organization.
378-390
: LGTM: Proper cleanup of all monitors during node deletion
The code correctly stops all monitors (disk, environment check, and snapshot) during node deletion, preventing resource leaks.
478-483
: Best-effort approach for environment checks
The environment check conditions are updated only if syncWithEnvironmentCheckMonitor succeeds. If that monitor fails or times out, the node conditions remain stale.
1305-1318
: LGTM: Environment check monitor creation implementation
The implementation follows the established pattern used by other monitors, with proper nil checks and error handling.
690-717
: Verify v2-data-engine setting access error handling
The code handles v2-data-engine setting access error by defaulting to false, but consider if this is the desired behavior in all cases.
✅ Verification successful
Based on the search results, I can now provide a final assessment of the v2-data-engine setting error handling:
Error handling for v2-data-engine setting is consistent with codebase patterns
The current error handling approach (defaulting to false) is the correct behavior because:
- It matches the consistent pattern used across the codebase where v2-data-engine is treated as an optional feature that defaults to disabled when there are issues
- The error handling is defensive and safe - defaulting to false prevents enabling v2 data engine features when the setting cannot be reliably determined
- Other components in the codebase (node validator, environment check monitor, etc.) handle the setting error in the same way by defaulting to false
- The setting is used for feature gating, so defaulting to false on error maintains system stability by falling back to the base v1 functionality
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if v2-data-engine setting is used elsewhere with different error handling
rg -A 3 "GetSettingAsBool.*v2-data-engine"
Length of output: 44
Script:
#!/bin/bash
# Search for v2-data-engine setting usage with more context and variations
rg -i "v2.*data.*engine" -A 5 -B 2
Length of output: 70251
Script:
#!/bin/bash
# Search for GetSettingAsBool usage patterns to understand error handling
rg "GetSettingAsBool" -A 3
Length of output: 19638
Script:
#!/bin/bash
# Look for setting type definition
ast-grep --pattern 'const SettingNameV2DataEngine = $_'
Length of output: 57
- Move environment checks to a dedicated monitor - Remove hugepage condition after disabling v2 data engine Longhorn 9983 Signed-off-by: Derek Su <[email protected]>
The environment check is best-effort, so no need to check too frequently. Longhorn 9983 Signed-off-by: Derek Su <[email protected]>
Longhorn 9983 Signed-off-by: Derek Su <[email protected]>
Longhorn 9983 Signed-off-by: Derek Su <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@mergify backport v1.8.x |
✅ Backports have been created
|
@mergify backport v1.7.x |
✅ Backports have been created
|
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9983
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context