fix: detached volume replicas not evicted #3293
Conversation
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
controller/volume_eviction_controller.go (2)
209-211: Ensure proper handling when no disk candidates are found during eviction
By adding the condition if vec.hasDiskCandidateForReplicaEviction(replicas, vol), the attachment ticket is only created if there are valid disk candidates. Consider implementing additional logic to handle scenarios where no disk candidates are available, such as updating the volume's condition status or notifying the user that the eviction cannot proceed. This will improve transparency and allow users to take corrective actions.
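As a rough illustration of that suggestion, a guard like the following could surface the failure to the user; the eventRecorder field and the event reason are assumptions for this sketch, not code from the PR:

if !vec.hasDiskCandidateForReplicaEviction(replicas, vol) {
	msg := fmt.Sprintf("cannot evict replicas of volume %v: no disk candidates available", vol.Name)
	vec.logger.Warn(msg)
	// Assumed field: most Longhorn controllers carry a Kubernetes event recorder,
	// so a warning event (or a volume condition update) makes the failure visible.
	vec.eventRecorder.Event(vol, corev1.EventTypeWarning, "ReplicaEvictionFailed", msg)
	return nil
}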
219-242: Consider evaluating all replicas before returning
Currently, the hasDiskCandidateForReplicaEviction function returns false immediately upon encountering a replica without disk candidates. It might be beneficial to evaluate all replicas to identify every replica that cannot be scheduled for eviction. This approach provides comprehensive feedback, which can be valuable for debugging and informing users about all scheduling issues.
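A minimal sketch of that variant, reusing the FindDiskCandidates call shown later in this review; the helper name replicasWithoutDiskCandidates is hypothetical:

// Sketch: evaluate every replica and report all of them that cannot be
// rescheduled, instead of returning on the first failure.
func (vec *VolumeEvictionController) replicasWithoutDiskCandidates(replicas map[string]*longhorn.Replica, volume *longhorn.Volume) []string {
	var blocked []string
	for name, replica := range replicas {
		replicaCopy := replica.DeepCopy()
		replicaCopy.Spec.HardNodeAffinity = ""
		diskCandidates, _, err := vec.scheduler.FindDiskCandidates(replicaCopy, replicas, volume)
		if err != nil || len(diskCandidates) == 0 {
			blocked = append(blocked, name)
		}
	}
	if len(blocked) > 0 {
		vec.logger.Warnf("No disk candidates for evicting replicas %v of volume %v", blocked, volume.Name)
	}
	return blocked
}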
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (2)
controller/volume_eviction_controller.go (4 hunks)
k8s/pkg/apis/longhorn/v1beta2/node.go (1 hunks)
🔇 Additional comments (4)
k8s/pkg/apis/longhorn/v1beta2/node.go (1)
59-59: LGTM! Verify the usage of this new error constant.
The new error constant follows the existing naming pattern and provides a clear message for replica eviction failures, which aligns with the PR objectives.
Let's verify how this constant is used in the codebase:
✅ Verification successful
Let me check the usage context in the volume eviction controller to ensure proper implementation.
The error constant is properly implemented and used in the volume eviction controller
The constant is used appropriately in controller/volume_eviction_controller.go when there are no disk candidates available for evicting a replica. The error is aggregated with other potential errors and logged with proper context, following the codebase's error handling patterns.
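Concretely, the aggregation pattern referred to here has the shape below, taken from the change proposed later in this review; the constant's exact message string lives in node.go and is not reproduced here:

// The new constant seeds a MultiError, the scheduler's per-check reasons are
// appended, and the joined result is logged for the affected replica.
aggregated := util.NewMultiError(longhorn.ErrorReplicaScheduleEvictReplicaFailed)
if multiError != nil {
	aggregated.Append(multiError)
}
vec.logger.Warnf("No disk candidates for evicting replica %v: %v", replica.Name, aggregated.Join())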
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check the usage of the new error constant in the codebase
# Expected: This constant should be used in volume eviction related code
# Search for direct usage of the constant
rg "ErrorReplicaScheduleEvictReplicaFailed" --type go
# Search for the error message to catch string literal usage
rg "evict replica failed" --type go
Length of output: 472
Script:
#!/bin/bash
# Check the implementation context in volume_eviction_controller.go
rg -B 5 -A 5 "ErrorReplicaScheduleEvictReplicaFailed" controller/volume_eviction_controller.go
Length of output: 586
controller/volume_eviction_controller.go (3)
24-25: Addition of scheduler and util package imports is appropriate
The imports for "github.com/longhorn/longhorn-manager/scheduler" and "github.com/longhorn/longhorn-manager/util" are necessary for the new scheduling functionality and utility methods.
43-43: Introduce scheduler field in VolumeEvictionController
Adding the scheduler *scheduler.ReplicaScheduler field to the VolumeEvictionController struct integrates the ReplicaScheduler for managing replica evictions.
69-70: Initialize ReplicaScheduler in NewVolumeEvictionController
Initializing the scheduler with scheduler.NewReplicaScheduler(ds) ensures that the controller can utilize the scheduling logic for replica eviction decisions.
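In outline (with unrelated fields elided and the exact constructor arguments assumed), the wiring looks like:

// The eviction controller builds its own ReplicaScheduler on top of the shared
// datastore so eviction decisions reuse the normal placement logic.
vec := &VolumeEvictionController{
	ds: ds, // assumed field name for the shared *datastore.DataStore
	// ... other fields elided ...
}
vec.scheduler = scheduler.NewReplicaScheduler(ds)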
LGTM. Let @PhanLe1010 and @shuo-wu review as well.
longhorn/longhorn-9781 longhorn/longhorn-9857 Signed-off-by: Chin-Ya Huang <[email protected]>
97ff87a to a4af88d
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
controller/volume_eviction_controller.go (2)
69-70: Consider adding error handling for scheduler initialization
While the scheduler initialization is correct, consider handling potential initialization errors by modifying the scheduler's constructor to return an error, similar to other components in the codebase.
-vec.scheduler = scheduler.NewReplicaScheduler(ds)
+var err error
+vec.scheduler, err = scheduler.NewReplicaScheduler(ds)
+if err != nil {
+	return nil, errors.Wrap(err, "failed to initialize replica scheduler")
+}
219-242: Consider performance optimizations for disk candidate checking
The implementation is solid, but could benefit from the following optimizations:
- Add early return for empty replicas map
- Consider parallel processing for large replica sets
- Add context support for potential cancellation
-func (vec *VolumeEvictionController) hasDiskCandidateForReplicaEviction(replicas map[string]*longhorn.Replica, volume *longhorn.Volume) bool {
+func (vec *VolumeEvictionController) hasDiskCandidateForReplicaEviction(ctx context.Context, replicas map[string]*longhorn.Replica, volume *longhorn.Volume) bool {
+	if len(replicas) == 0 {
+		return true
+	}
+
+	type result struct {
+		replica *longhorn.Replica
+		found   bool
+		err     error
+	}
+
+	resultChan := make(chan result, len(replicas))
+	for _, replica := range replicas {
+		go func(r *longhorn.Replica) {
+			replicaCopy := r.DeepCopy()
+			replicaCopy.Spec.HardNodeAffinity = ""
+
+			diskCandidates, multiError, err := vec.scheduler.FindDiskCandidates(replicaCopy, replicas, volume)
+			if err != nil {
+				resultChan <- result{replica: r, found: false, err: err}
+				return
+			}
+
+			if len(diskCandidates) == 0 {
+				aggregatedReplicaScheduledError := util.NewMultiError(longhorn.ErrorReplicaScheduleEvictReplicaFailed)
+				if multiError != nil {
+					aggregatedReplicaScheduledError.Append(multiError)
+				}
+				vec.logger.Warnf("No disk candidates for evicting replica %q: %v", r.Name, aggregatedReplicaScheduledError.Join())
+				resultChan <- result{replica: r, found: false, err: nil}
+				return
+			}
+			resultChan <- result{replica: r, found: true, err: nil}
+		}(replica)
+	}
+
+	for range replicas {
+		select {
+		case <-ctx.Done():
+			return false
+		case res := <-resultChan:
+			if res.err != nil {
+				vec.logger.WithError(res.err).Warnf("Failed to find disk candidates for evicting replica %q", res.replica.Name)
+				return false
+			}
+			if !res.found {
+				return false
+			}
+		}
+	}
+
+	vec.logger.Infof("Found disk candidates for evicting replicas of volume %q", volume.Name)
+	return true
+}
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (2)
controller/volume_eviction_controller.go (4 hunks)
k8s/pkg/apis/longhorn/v1beta2/node.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- k8s/pkg/apis/longhorn/v1beta2/node.go
🔇 Additional comments (3)
controller/volume_eviction_controller.go (3)
24-25: LGTM: Clean struct extension
The addition of the scheduler dependency and corresponding struct field is well-organized and follows Go conventions.
Also applies to: 43-43
209-211: LGTM: Improved eviction control logic
The additional check for disk candidates before creating the attachment ticket aligns well with the PR objectives. This should prevent unnecessary eviction attempts when no suitable disk candidates are available.
Let's verify the impact on volume eviction:
✅ Verification successful
Based on the search results, I can now verify the impact of the volume eviction changes. The code shows that:
- The volume eviction controller creates attachment tickets to coordinate volume attachment/detachment during eviction
- The attachment ticket system is used consistently across multiple controllers (backup, clone, snapshot, etc.)
- The condition hasDiskCandidateForReplicaEviction() is used to prevent unnecessary attachment tickets when no suitable disk candidates are available
The change aligns with the existing attachment ticket pattern and adds an important optimization to prevent unnecessary eviction attempts.
LGTM: Verified eviction control logic
The additional check for disk candidates before creating the attachment ticket is a valid optimization that prevents unnecessary volume attachment/detachment cycles when no suitable disk candidates are available for eviction.
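Schematically, the guard described here wraps the ticket creation like the snippet below; createEvictionAttachmentTicket is a stand-in name for however the controller actually populates the ticket:

// Only ask for an attachment when the scheduler can actually place the
// evicted replicas somewhere else; otherwise skip the attach/detach cycle.
if vec.hasDiskCandidateForReplicaEviction(replicas, vol) {
	vec.createEvictionAttachmentTicket(va, vol) // hypothetical helper
}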
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any existing volume eviction patterns or issues
# Look for patterns of volume eviction in the codebase
# Check for volume eviction related error patterns
rg -A 3 "volume.*evict|evict.*volume" --type go
# Check for attachment ticket usage patterns
rg -A 3 "AttachmentTicket" --type go
Length of output: 67463
224-224: Verify scheduler's FindDiskCandidates implementation
Let's ensure the scheduler's FindDiskCandidates method properly handles all edge cases for replica eviction.
✅ Verification successful
FindDiskCandidates implementation is robust and well-tested
The implementation in scheduler/replica_scheduler.go is thoroughly designed with:
- Comprehensive error handling using MultiError for non-fatal cases
- Proper node candidate filtering with multiple validation checks
- Disk candidate selection with proper resource and tag validation
- Well-documented interface and implementation
- Proper test coverage in scheduler/replica_scheduler_test.go
The usage in volume_eviction_controller.go correctly handles all error cases and empty disk candidates scenarios.
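For reference, the caller-side contract that this verification checks looks roughly as follows, based on the three return values visible in the diff above; the surrounding error handling is a sketch, not the controller's literal code:

// FindDiskCandidates returns candidate disks, a MultiError of non-fatal
// scheduling reasons, and a fatal error; the empty-candidates branch is where
// the eviction controller backs off.
diskCandidates, multiError, err := vec.scheduler.FindDiskCandidates(replicaCopy, replicas, volume)
if err != nil {
	return errors.Wrapf(err, "failed to find disk candidates for replica %v", replicaCopy.Name)
}
if len(diskCandidates) == 0 {
	vec.logger.Warnf("No disk candidates for evicting replica %v: %v", replicaCopy.Name, multiError)
	return nil
}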
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check scheduler's FindDiskCandidates implementation
# Look for the implementation and its test coverage
# Find the implementation
ast-grep --pattern 'func $_($$) (_, _, error) {
$$$
FindDiskCandidates
$$$
}'
# Check test coverage
rg -A 5 "TestFindDiskCandidates" --type go
Length of output: 128
Script:
#!/bin/bash
# Let's try a different approach to find the FindDiskCandidates implementation
# Search for the method declaration
ast-grep --pattern 'func $_FindDiskCandidates($$$) ($$$) {
$$$
}'
# Also search for interface definition
ast-grep --pattern 'type $_ interface {
$$$
FindDiskCandidates($$$)
$$$
}'
# Backup search using ripgrep
rg "FindDiskCandidates" -A 10 --type go
Length of output: 4115
Script:
#!/bin/bash
# Let's examine the scheduler implementation more thoroughly
# Check the getNodeCandidates implementation since it's crucial for FindDiskCandidates
ast-grep --pattern 'func $_ getNodeCandidates($$$) ($$$) {
$$$
}'
# Look for test coverage of the scheduler
rg "TestReplicaScheduler" -A 5 --type go
# Check for any potential error handling patterns
rg "MultiError" -B 2 -A 2 scheduler/replica_scheduler.go
Length of output: 7348
@mergify backport v1.7.x v1.6.x
✅ Backports have been created
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9781, longhorn/longhorn#9857
What this PR does / why we need it:
We need to allow the volume eviction controller to handle detached volumes. This PR proposes another approach to address longhorn/longhorn#9781: proceed with volume attachment for replica eviction only when suitable node/disk candidates are available.
Special notes for your reviewer:
Additional documentation or context