
[YUNIKORN-2253] Support retry when bind volume failed case instead of… #890

Closed
wants to merge 5 commits

Conversation

zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Aug 14, 2024

… failing the task

What is this PR for?

Currently, we support passing a timeout parameter for volume binding, but we should also support retrying the bind, because a timeout is only one of the errors that can make a volume bind fail.

We will benefit a lot if a retry succeeds, since the task will not be failed.

I can also see more cases where customers want a retry when the volume bind fails, such as:
https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1723510519206459

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

  • Task

What is the Jira issue?

How should this be tested?

Screenshots (if appropriate)

Questions:

  • The license files need updating.
  • There are breaking changes for older versions.
  • It needs documentation.


codecov bot commented Aug 14, 2024

Codecov Report

Attention: Patch coverage is 91.30435% with 4 lines in your changes missing coverage. Please review.

Project coverage is 68.27%. Comparing base (eed4ea1) to head (01644ba).
Report is 1 commit behind head on master.

Files Patch % Lines
pkg/cache/context.go 80.95% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #890      +/-   ##
==========================================
+ Coverage   68.07%   68.27%   +0.19%     
==========================================
  Files          70       70              
  Lines        7575     7616      +41     
==========================================
+ Hits         5157     5200      +43     
+ Misses       2203     2201       -2     
  Partials      215      215              


@zhuqi-lucas zhuqi-lucas requested review from pbacsko, chenyulin0719 and craigcondit and removed request for craigcondit, pbacsko and chenyulin0719 August 14, 2024 09:26
Contributor

@pbacsko pbacsko left a comment

I'm not against the change, but we need to greatly simplify the retry logic. We can reuse the existing one from K8s, which is properly tested and gives us a much simpler solution.

Comment on lines 790 to 820
for i := 0; i < maxRetries; i++ {
	err = ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
	if err == nil {
		return nil
	}

	log.Log(log.ShimContext).Error("Failed to bind pod volumes",
		zap.String("podName", assumedPod.Name),
		zap.String("nodeName", assumedPod.Spec.NodeName),
		zap.Int("dynamicProvisions", len(volumes.DynamicProvisions)),
		zap.Int("staticBindings", len(volumes.StaticBindings)),
		zap.Int("retryCount", i+1),
		zap.Error(err))

	if i == maxRetries-1 {
		log.Log(log.ShimContext).Error("Failed to bind pod volumes after retry",
			zap.String("podName", assumedPod.Name),
			zap.String("nodeName", assumedPod.Spec.NodeName),
			zap.Int("dynamicProvisions", len(volumes.DynamicProvisions)),
			zap.Int("staticBindings", len(volumes.StaticBindings)),
			zap.Error(err))
		return err
	}

	delay := baseDelay * time.Duration(1<<uint(i))
	if delay > maxDelay {
		delay = maxDelay
	}

	retryStrategy.Sleep(delay) // Use the retry strategy
}

Contributor

We have retry logic in the K8s codebase that we can reuse:

import "k8s.io/client-go/util/retry"

backoff := wait.Backoff{
	Steps:    5,
	Duration: time.Second,
	Factor:   2.0,
	Jitter:   0,
}
err := retry.OnError(backoff, func(_ error) bool {
	return true // retry on all error
}, func() error {
	return ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
})

There's retry.DefaultRetry and retry.DefaultBackoff, but those don't look suitable for us. With no network delay this retries 5 times with a total wait time of 30 seconds.
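
For context, the defaults referred to here are roughly the following in k8s.io/client-go/util/retry (values as found in recent client-go releases; worth verifying against the vendored copy):

var DefaultRetry = wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Millisecond,
	Factor:   1.0,
	Jitter:   0.1,
}

var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}

Both use millisecond-scale base delays, which is far too short for waiting on PV provisioning, hence the custom Backoff in the snippet above.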

Contributor

I was thinking about this; perhaps we're better off with a normal, non-exponential retry (Steps: 5, Factor: 1.0, Duration: 10*time.Second).
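
A rough sketch of that flat-interval variant, assuming the same retry.OnError wrapper and the ctx/assumedPod/volumes names from the diff above:

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// Flat retry: 5 attempts, a fixed 10-second pause between them, no exponential growth.
backoff := wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Second,
	Factor:   1.0,
	Jitter:   0,
}
err := retry.OnError(backoff, func(_ error) bool {
	return true // retry on every error
}, func() error {
	return ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
})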

Contributor Author

This is a good wrapper, I will take a look at switching to it.

Contributor Author

I think the exponential retry is enough, because each retry already has its own internal timeout inside K8s.

Contributor Author

Addressed this with the new API in the latest update to the PR.

Comment on lines 767 to 778
type RetryStrategy interface {
	// Sleep function used for retry delays
	Sleep(duration time.Duration)
}

// DefaultRetryStrategy is a simple retry strategy that sleeps for a fixed duration
// We can extend this to support more advanced retry strategies in the future and also for testing purposes
type DefaultRetryStrategy struct{}

func (r *DefaultRetryStrategy) Sleep(duration time.Duration) {
	time.Sleep(duration)
}

Contributor

@pbacsko pbacsko Aug 14, 2024

A lot of extra code, not needed.

Contributor Author

Thanks @pbacsko for the review. I added this so the test code can mock the sleep interval, since we don't have a mock class for the context.

Contributor Author

Addressed this with the new API in the latest update and removed the extra code.

Comment on lines 2437 to 2443
type MockRetryStrategy struct {
	totalSleep time.Duration
}

func (m *MockRetryStrategy) Sleep(duration time.Duration) {
	m.totalSleep += duration
}

Contributor

Extra code, not needed

Contributor Author

I added this so the test code can mock the sleep interval, since we don't have a mock class for the context.

Contributor Author

Addressed this with the new API in the latest update and removed the extra code.

@pbacsko
Contributor

pbacsko commented Aug 14, 2024

Another thing we can consider is wrapping the entire bindPodVolumes() in a retry loop, though I'm not sure if that makes sense.

As a follow-up, we can think about retrying while doing the pod binding in Task.postTaskAllocated().
Another thing can be a more generic allocation retry where a failed volume/pod binding does not result in a failed Task. Instead, we cancel the allocation from the shim and let the core re-schedule it at a later time.

@zhuqi-lucas
Contributor Author

Another thing we can consider is wrapping the entire bindPodVolumes() in a retry loop, though I'm not sure if that makes sense.

As a follow-up, we can think about retrying while doing the pod binding in Task.postTaskAllocated(). Another thing can be a more generic allocation retry where a failed volume/pod binding does not result in a failed Task. Instead, we cancel the allocation from the shim and let the core re-schedule it at a later time.

Thanks @pbacsko for the review.
I first wanted to do this in Task.postTaskAllocated(), but that function includes a lot of fine-grained operations besides the volume bind, so I chose to retry only the fine-grained function that does the volume bind.

This is a good idea; we can follow up on it in the future. We could provide general retry logic for other cases where a task fails, and it may need a specific config to enable it.

@zhuqi-lucas
Contributor Author

@pbacsko I also added a follow-up Jira targeting 1.7.0:
https://issues.apache.org/jira/browse/YUNIKORN-2804

@zhuqi-lucas zhuqi-lucas requested a review from pbacsko August 15, 2024 03:52
@wilfred-s
Contributor

Not sure I agree with the direction

Another thing can be a more generic allocation retry where a failed volume/pod binding does not result in a failed Task. Instead, we cancel the allocation from the shim and let the core re-schedule it at a later time.

I think that is the only correct way to handle this. It is a larger change, but anything else is a simple band-aid.

We also already have the option to increase the bind timeout via the config, so wrapping the retry that is already in the binder in another retry loop is not a good idea. If it takes too long, increase the timeout. The check runs every second, so increasing the configured timeout from 10s to 30s will only affect these failure cases.

The documentation for the BindPodVolumes call shows:

//     i.  BindPodVolumes() is called first in PreBind phase. It makes all the necessary API updates and waits for
//     PV controller to fully bind and provision the PVCs. If binding fails, the Pod is sent
//     back through the scheduler.

So even the default scheduler just dumps the pod back into the scheduling cycle and retries if it has failed after the timeout.
Looking at the code, the reason for the error might be something that cannot be solved by retrying. For instance, the node selected for the pod might not work for the volume.

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Aug 15, 2024

Thanks @wilfred-s for clarifying, do I understand this correctly?
Even if we increase the timeout and retry, we can't fully fix the problem; we should move the retry back into the scheduling cycle, because sometimes the failure is caused by something a retry cannot recover from. For example:
the volume doesn't match the node selected at that time, so we should go back to the scheduling cycle and retry there to pick a suitable node?

@wilfred-s
Contributor

Correct, we need the option for the k8shim to reject the allocation this late in the cycle.
We need to check what is needed for that first. Currently we fail the task. That should not happen, but that is only one part on the k8shim side; we might also need to revert things in the caches etc.
We also need to re-trigger the scheduling in the core. The allocation is already “done”, so we need to revert that.
With Craig’s changes from YUNIKORN-2460 in place we might be able to do that as an allocation update from the k8shim to the core which removes the node.
Not sure, that would need to be investigated and properly planned.

Contributor

@pbacsko pbacsko left a comment

@zhuqi-lucas I had a quick discussion with Wilfred, and we believe this PR can be closed. There is a retry loop inside the volume binder itself; the solution is to change service.volumeBindTimeout.
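
For reference, a minimal sketch of that change, assuming the flat-key yunikorn-configs ConfigMap layout documented for recent YuniKorn releases (the namespace and the 30s value are only examples, not part of this PR):

apiVersion: v1
kind: ConfigMap
metadata:
  name: yunikorn-configs
  namespace: yunikorn
data:
  # Example: raise the volume bind wait from the 10s default to 30s
  # so the binder's own internal retry loop has more time to succeed.
  service.volumeBindTimeout: "30s"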

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Aug 16, 2024

Sure @pbacsko, @wilfred-s, closing this PR now. I will follow up with:

  1. Add troubleshooting docs.
  2. Implement the general retry policy for the YuniKorn 1.7.0 release.
