Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: move environment checks to a dedicated monitor #3369

Merged
merged 4 commits into from
Dec 20, 2024

Conversation

derekbit
Copy link
Member

@derekbit derekbit commented Dec 20, 2024

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9983

What this PR does / why we need it:

  • Move environment checks to a dedicated monitor
  • Remove hugepage condition after disabling v2 data engine
  • Increase the environment check period to 30 minutes

Special notes for your reviewer:

Additional documentation or context

Copy link

coderabbitai bot commented Dec 20, 2024

Warning

Rate limit exceeded

@derekbit has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 3 minutes and 54 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 0560dc8 and d01d271.

📒 Files selected for processing (9)
  • controller/backup_backing_image_controller.go (1 hunks)
  • controller/monitor/disk_monitor.go (1 hunks)
  • controller/monitor/environment_check_monitor.go (1 hunks)
  • controller/monitor/monitor.go (1 hunks)
  • controller/monitor/snapshot_monitor.go (1 hunks)
  • controller/node_controller.go (7 hunks)
  • controller/node_controller_test.go (0 hunks)
  • engineapi/backup_backing_image.go (1 hunks)
  • types/condition.go (1 hunks)

Walkthrough

The pull request introduces significant changes to the Longhorn monitoring infrastructure, focusing on enhancing environment checks, error handling, and resource monitoring. Key modifications include renaming Close() methods to Stop() across multiple monitor implementations, adding a new EnvironmentCheckMonitor to periodically check node environment conditions, and improving error reporting and data collection mechanisms. The changes aim to provide more robust monitoring and better visibility into node and system configurations.

Changes

File Change Summary
controller/monitor/monitor.go Renamed Close() method to Stop() in Monitor interface
controller/monitor/disk_monitor.go Updated Close() to Stop(), enhanced error handling and data collection methods
controller/monitor/snapshot_monitor.go Renamed Close() to Stop(), improved task processing and configuration management
controller/monitor/backup_backing_image_controller.go Updated method to use monitor.Stop() instead of monitor.Close()
engineapi/backup_backing_image.go Renamed Close() to Stop() in BackupBackingImageMonitor
controller/node_controller.go Added environmentCheckMonitor, updated monitoring logic
controller/monitor/environment_check_monitor.go New file introducing EnvironmentCheckMonitor for periodic environment checks
types/condition.go Added RemoveCondition() function for managing conditions

Assessment against linked issues

Objective Addressed Explanation
Periodic HugePages (2Mi) configuration check [#9983]
Check HugePages registered as Kubernetes resource
Update node condition for HugePages availability

Possibly related PRs


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
controller/monitor/snapshot_monitor.go (1)

Line range hint 259-284: Consider potential race condition in configuration update

There's a potential race condition between checking the scheduler length and removing the job. The lock should encompass both operations to ensure thread safety.

-	m.RLock()
-	if dataIntegrityCronJob == m.existingDataIntegrityCronJob && m.checkScheduler.Len() > 0 {
-		m.RUnlock()
-		return nil
-	}
-	m.RUnlock()
-
-	if m.checkScheduler.Len() > 0 {
-		m.checkScheduler.Remove(m.checkSnapshots)
-	}
+	m.Lock()
+	defer m.Unlock()
+	
+	if dataIntegrityCronJob == m.existingDataIntegrityCronJob && m.checkScheduler.Len() > 0 {
+		return nil
+	}
+	
+	if m.checkScheduler.Len() > 0 {
+		m.checkScheduler.Remove(m.checkSnapshots)
+	}

-	job, err := m.checkScheduler.Cron(dataIntegrityCronJob).Do(m.checkSnapshots)
+	job, err := m.checkScheduler.Cron(dataIntegrityCronJob).Do(m.checkSnapshots)
 	if err != nil {
 		return errors.Wrap(err, "failed to schedule snapshot check job")
 	}
-
-	m.Lock()
-	previousDataIntegrityCronJob := m.existingDataIntegrityCronJob
-	m.existingDataIntegrityCronJob = dataIntegrityCronJob
-	m.Unlock()
+	
+	previousDataIntegrityCronJob := m.existingDataIntegrityCronJob
+	m.existingDataIntegrityCronJob = dataIntegrityCronJob
🧹 Nitpick comments (11)
controller/monitor/environment_check_monitor.go (4)

82-91: Consider a more robust restart mechanism
Currently, if the polling loop exits due to an error, it's logged but the monitor doesn't attempt to recover, leading to permanent monitoring loss. Consider implementing a retry or backoff to automatically restart polling in case of transient errors.


105-115: Document concurrency considerations for GetCollectedData
The code uses RLock and defers RUnlock for data retrieval, which is good. You might document that the returned data is a copy so that consumers know modifying it does not affect the original state.


173-260: Gracefully handle unknown package managers
You return early with a condition about an unknown OS image, but in certain edge cases, there might be additional fallback logic or user alerts. For example, consider offering a helpful error message or recommended manual steps for unrecognized distributions.


358-389: Veifying necessity of huge pages check
HugePages checks are relevant only when the v2-data-engine setting is true. You correctly skip the check if the engine is disabled, but if there's any plan to add more dynamic logic, consider documenting future use cases for these checks.

controller/node_controller.go (5)

59-60: Initialize monitors on NodeController creation
Two new fields (diskMonitor, environmentCheckMonitor) are introduced but only get created lazily. This is fine, but you might consider handling potential constructor errors to avoid partially initialized states if there's a problem.


459-462: Validate disk monitor creation
The code tries to create the disk monitor every sync cycle. Consider short-circuiting repeated attempts to avoid logs spam or repeated goroutine overhead if the monitor repeatedly fails to initialize.


690-717: Possibility of removing conditions
syncEnvironmentCheckConditions only adds or updates environment-based conditions. If conditions become stale and are no longer relevant, consider removing them from the node status to avoid confusion.


1305-1318: Reuse or unify monitor factory methods
Both createDiskMonitor and createEnvironmentCheckMonitor share a similar pattern. Consider extracting a helper or factory to broaden code reusability, especially if more monitors get introduced.


1470-1478: Ensure type checking for returned environment conditions
syncWithEnvironmentCheckMonitor unconditionally casts to []longhorn.Condition. This code is safe if you’re certain the monitor always returns that type. Consider returning a typed structure or checking the assertion with caution.

types/condition.go (1)

12-25: LGTM! Consider adding godoc.

The implementation is correct and handles edge cases well. Consider adding godoc to document the function's purpose and parameters.

+// RemoveCondition removes the condition with the specified type from the conditions slice.
+// If conditions is nil, returns nil. If no matching condition is found, returns the original slice.
+// Parameters:
+//   - conditions: slice of conditions to filter
+//   - conditionType: type of condition to remove
+// Returns: new slice with the matching condition removed
 func RemoveCondition(conditions []longhorn.Condition, conditionType string) []longhorn.Condition {
controller/monitor/snapshot_monitor.go (1)

Line range hint 319-332: Consider logging errors from getSnapshotDataIntegrity

While the error handling flow is correct, errors from getSnapshotDataIntegrity should be logged before returning to help with debugging potential issues.

 	dataIntegrity, err := m.getSnapshotDataIntegrity(task.volumeName)
 	if err != nil {
+		m.logger.WithField("monitor", monitorName).WithError(err).
+			Errorf("Failed to get snapshot data integrity for volume %v", task.volumeName)
 		return true
 	}
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between def8a3b and d0a65d0.

📒 Files selected for processing (8)
  • controller/backup_backing_image_controller.go (1 hunks)
  • controller/monitor/disk_monitor.go (1 hunks)
  • controller/monitor/environment_check_monitor.go (1 hunks)
  • controller/monitor/monitor.go (1 hunks)
  • controller/monitor/snapshot_monitor.go (1 hunks)
  • controller/node_controller.go (7 hunks)
  • engineapi/backup_backing_image.go (1 hunks)
  • types/condition.go (1 hunks)
🔇 Additional comments (8)
controller/monitor/disk_monitor.go (1)

108-108: Renamed method from Close() to Stop() aligns with usage convention
This rename maintains consistency across other monitors and clarifies the method's purpose. The method simply calls the cancel function.

controller/monitor/environment_check_monitor.go (1)

526-556: Confirm tolerance for older NFS versions
The check focuses on NFS v4.x. If you expect or must handle older versions (e.g., NFSv3), consider extending or clarifying. Otherwise, this coverage is acceptable.

controller/node_controller.go (1)

378-389: Ensure concurrency-safety when stopping monitors
Stopping each monitor upon node deletion is correct. However, if multiple code paths or goroutines also call Stop, double-canceling the same context might produce spurious logs. Confirm concurrency safety if multiple cancellations happen.

controller/monitor/monitor.go (1)

14-14: Stop method name improves clarity
Replacing Close() with Stop() consistently unifies naming. Ensure all references in docstrings or readmes are also updated to reduce confusion.

engineapi/backup_backing_image.go (1)

268-270: LGTM! Method rename is consistent.

The rename from Close to Stop aligns with the monitor interface changes and maintains the same functionality.

controller/backup_backing_image_controller.go (1)

472-472: LGTM! Method call updated consistently.

The update from monitor.Close() to monitor.Stop() maintains consistency with the monitor interface changes.

controller/monitor/snapshot_monitor.go (2)

Line range hint 245-250: LGTM: Clean implementation of the Stop method

The renamed method follows a proper shutdown sequence by cleaning up resources in the correct order: logging, shutting down the task queue, stopping the scheduler, and finally quitting the monitor.


Line range hint 334-609: LGTM: Robust snapshot integrity check implementation

The snapshot integrity check implementation is thorough and well-designed:

  • Comprehensive validation before processing
  • Proper handling of corrupted replicas
  • Robust checksum determination with voting mechanism
  • Clear error messages and appropriate event recording

controller/node_controller.go Show resolved Hide resolved
c3y1huang
c3y1huang previously approved these changes Dec 20, 2024
Copy link
Contributor

@c3y1huang c3y1huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
controller/node_controller.go (1)

459-462: Consider error handling improvement

While the environment check monitor creation is implemented correctly, consider adding debug logging for monitor creation failures to aid troubleshooting.

 if _, err := nc.createEnvironmentCheckMonitor(); err != nil {
+    nc.logger.WithError(err).Debug("Failed to create environment check monitor")
     return err
 }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97b067f and 0560dc8.

📒 Files selected for processing (1)
  • controller/node_controller.go (7 hunks)
🔇 Additional comments (5)
controller/node_controller.go (5)

59-60: LGTM: Monitor fields properly added

The addition of environmentCheckMonitor field alongside existing monitors follows consistent patterns and maintains good code organization.


378-390: LGTM: Proper cleanup of all monitors during node deletion

The code correctly stops all monitors (disk, environment check, and snapshot) during node deletion, preventing resource leaks.


478-483: Best-effort approach for environment checks

The environment check conditions are updated only if syncWithEnvironmentCheckMonitor succeeds. If that monitor fails or times out, the node conditions remain stale.


1305-1318: LGTM: Environment check monitor creation implementation

The implementation follows the established pattern used by other monitors, with proper nil checks and error handling.


690-717: Verify v2-data-engine setting access error handling

The code handles v2-data-engine setting access error by defaulting to false, but consider if this is the desired behavior in all cases.

✅ Verification successful

Based on the search results, I can now provide a final assessment of the v2-data-engine setting error handling:

Error handling for v2-data-engine setting is consistent with codebase patterns

The current error handling approach (defaulting to false) is the correct behavior because:

  1. It matches the consistent pattern used across the codebase where v2-data-engine is treated as an optional feature that defaults to disabled when there are issues
  2. The error handling is defensive and safe - defaulting to false prevents enabling v2 data engine features when the setting cannot be reliably determined
  3. Other components in the codebase (node validator, environment check monitor, etc.) handle the setting error in the same way by defaulting to false
  4. The setting is used for feature gating, so defaulting to false on error maintains system stability by falling back to the base v1 functionality
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if v2-data-engine setting is used elsewhere with different error handling
rg -A 3 "GetSettingAsBool.*v2-data-engine"

Length of output: 44


Script:

#!/bin/bash
# Search for v2-data-engine setting usage with more context and variations
rg -i "v2.*data.*engine" -A 5 -B 2

Length of output: 70251


Script:

#!/bin/bash
# Search for GetSettingAsBool usage patterns to understand error handling
rg "GetSettingAsBool" -A 3

Length of output: 19638


Script:

#!/bin/bash
# Look for setting type definition
ast-grep --pattern 'const SettingNameV2DataEngine = $_'

Length of output: 57

controller/node_controller.go Outdated Show resolved Hide resolved
- Move environment checks to a dedicated monitor
- Remove hugepage condition after disabling v2 data engine

Longhorn 9983

Signed-off-by: Derek Su <[email protected]>
The environment check is best-effort, so no need to check too
frequently.

Longhorn 9983

Signed-off-by: Derek Su <[email protected]>
Longhorn 9983

Signed-off-by: Derek Su <[email protected]>
Longhorn 9983

Signed-off-by: Derek Su <[email protected]>
Copy link
Contributor

@mantissahz mantissahz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mergify mergify bot merged commit 1b74376 into longhorn:master Dec 20, 2024
9 checks passed
@derekbit
Copy link
Member Author

@mergify backport v1.8.x

Copy link

mergify bot commented Dec 20, 2024

backport v1.8.x

✅ Backports have been created

@derekbit
Copy link
Member Author

@mergify backport v1.7.x

Copy link

mergify bot commented Dec 20, 2024

backport v1.7.x

✅ Backports have been created

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants