[YUNIKORN-2253] Support retry when bind volume failed case instead of failing the task #890

Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #890      +/-   ##
==========================================
+ Coverage   68.07%   68.27%   +0.19%
==========================================
  Files          70       70
  Lines        7575     7616      +41
==========================================
+ Hits         5157     5200      +43
+ Misses       2203     2201       -2
  Partials      215      215

☔ View full report in Codecov by Sentry.
I'm not against the change, but we need to greatly simplify the retry logic. We can re-use the existing one from K8s, which is properly tested and gives us a much simpler solution.
pkg/cache/context.go
Outdated
for i := 0; i < maxRetries; i++ {
	err = ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
	if err == nil {
		return nil
	}

	log.Log(log.ShimContext).Error("Failed to bind pod volumes",
		zap.String("podName", assumedPod.Name),
		zap.String("nodeName", assumedPod.Spec.NodeName),
		zap.Int("dynamicProvisions", len(volumes.DynamicProvisions)),
		zap.Int("staticBindings", len(volumes.StaticBindings)),
		zap.Int("retryCount", i+1),
		zap.Error(err))

	if i == maxRetries-1 {
		log.Log(log.ShimContext).Error("Failed to bind pod volumes after retry",
			zap.String("podName", assumedPod.Name),
			zap.String("nodeName", assumedPod.Spec.NodeName),
			zap.Int("dynamicProvisions", len(volumes.DynamicProvisions)),
			zap.Int("staticBindings", len(volumes.StaticBindings)),
			zap.Error(err))
		return err
	}

	delay := baseDelay * time.Duration(1<<uint(i))
	if delay > maxDelay {
		delay = maxDelay
	}

	retryStrategy.Sleep(delay) // Use the retry strategy
}
We have retry logic in the K8s codebase that we can reuse:

import (
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

backoff := wait.Backoff{
	Steps:    5,
	Duration: time.Second,
	Factor:   2.0,
	Jitter:   0,
}
err := retry.OnError(backoff, func(_ error) bool {
	return true // retry on all errors
}, func() error {
	return ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
})
There's retry.DefaultRetry and retry.DefaultBackoff, but those don't look suitable for us. With no network delay this retries 5 times with a total wait time of 30 seconds.
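For context, these are roughly the built-in policies in k8s.io/client-go/util/retry (exact values may differ between client-go versions); their millisecond-scale delays are tuned for API conflict retries, which is why they do not fit volume binding:

// Approximate definitions from k8s.io/client-go/util/retry:
var DefaultRetry = wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Millisecond,
	Factor:   1.0,
	Jitter:   0.1,
}

var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}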
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this; perhaps we're better off with a normal, non-exponential retry (Steps: 5, Factor: 1.0, Duration: 10*time.Second).
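A minimal sketch of that flat retry, assuming the same retry.OnError wrapper suggested above and the same call site (ctx, assumedPod and volumes come from the surrounding bind code):

// Flat (non-exponential) retry: up to 5 attempts, 10 seconds between attempts.
backoff := wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Second,
	Factor:   1.0,
	Jitter:   0,
}
err := retry.OnError(backoff, func(_ error) bool {
	return true // retry on every error
}, func() error {
	return ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
})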
This is a good wrapper, I will take a look at replacing my version with it.
I think exponential retry is enough, because each retry has its own internal timeout inside K8s itself.
Addressed: applied the new API in the latest update.
pkg/cache/context.go
Outdated
type RetryStrategy interface {
	// Sleep function used for retry delays
	Sleep(duration time.Duration)
}

// DefaultRetryStrategy is a simple retry strategy that sleeps for a fixed duration.
// We can extend this to support more advanced retry strategies in the future and also for testing purposes.
type DefaultRetryStrategy struct{}

func (r *DefaultRetryStrategy) Sleep(duration time.Duration) {
	time.Sleep(duration)
}
A lot of extra code, not needed.
Thanks @pbacsko for the review. I added this to let the test code mock the sleep interval, since we don't have a mock class for the context.
Addressed: applied the new API in the latest update and also removed the extra code.
pkg/cache/context_test.go
Outdated
type MockRetryStrategy struct {
	totalSleep time.Duration
}

func (m *MockRetryStrategy) Sleep(duration time.Duration) {
	m.totalSleep += duration
}
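A hypothetical test sketch (the helper bindVolumesBackoffDelays below is illustrative and not part of the PR) showing how the mock lets a test verify the capped exponential backoff without any real sleeping:

package cache

import (
	"testing"
	"time"
)

// bindVolumesBackoffDelays is a hypothetical helper mirroring the delay math in the
// retry loop above: exponential growth from baseDelay, capped at maxDelay.
func bindVolumesBackoffDelays(attempts int, baseDelay, maxDelay time.Duration, rs RetryStrategy) {
	for i := 0; i < attempts; i++ {
		delay := baseDelay * time.Duration(1<<uint(i))
		if delay > maxDelay {
			delay = maxDelay
		}
		rs.Sleep(delay)
	}
}

func TestMockRetryStrategyAccumulatesSleep(t *testing.T) {
	mock := &MockRetryStrategy{}
	// 4 delays with a 1s base capped at 5s: 1s + 2s + 4s + 5s = 12s, with no real sleeping.
	bindVolumesBackoffDelays(4, time.Second, 5*time.Second, mock)
	if mock.totalSleep != 12*time.Second {
		t.Fatalf("expected 12s of accumulated sleep, got %v", mock.totalSleep)
	}
}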
Extra code, not needed
I added this to let the test code mock the sleep interval, since we don't have a mock class for the context.
Addressed: applied the new API in the latest update and also removed the extra code.
Another thing we can consider, as a follow-up, is wrapping the entire pod binding in a retry as well.
Thanks @pbacsko for the review. This is a good idea, we can follow up in the future: we could provide a general retry logic for other cases where a task fails, maybe behind a specific config to enable it.
@pbacsko I also added a follow-up Jira targeting 1.7.0.
Not sure I agree with the direction
I think that is the only correct way to handle this. It is a larger change, but anything else is a simple bandaid. We also already have the option to increase the bind timeout via the config, so wrapping the retry that is already in the binder in another retry is not a good idea. If it takes too long, increase the timeout. The check runs every second, so increasing the configured timeout from 10s to 30s will only affect these failure cases. The documentation for the BindPodVolumes call shows:
So even the default scheduler just dumps it back into the scheduling cycle and retries if it has still failed after the timeout.
Thanks @wilfred-s for clarifying. Is my understanding correct?
Correct, we need to have the option to reject the allocation from the k8shim this late in the cycle.
@zhuqi-lucas I had a quick discussion w/ Wilfred, and we do believe this PR can be closed. There is a retry loop inside the volume binder itself; the solution is to change service.volumeBindTimeout.
Sure @pbacsko, @wilfred-s, closing this PR now. I will do a follow-up.
[YUNIKORN-2253] Support retry when bind volume failed case instead of failing the task
What is this PR for?
Currently, we support passing a timeout parameter for volume binding, but we should also support retrying the volume bind, because the timeout is one of the errors that makes the volume bind fail.
We will benefit a lot if a retry succeeds, because the task will not fail.
And I can see more cases where customers want to retry when the volume bind fails, such as:
https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1723510519206459
What type of PR is it?
Todos
What is the Jira issue?
[YUNIKORN-2253] Support retry when bind volume failed case instead of failing the task
How should this be tested?
Screenshots (if appropriate)
Questions: