
[YUNIKORN-2253] Support retry when bind volume failed case instead of… #890

Closed
wants to merge 5 commits

Conversation

zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Aug 14, 2024

… failing the task

What is this PR for?

Currently, we support passing a timeout parameter for volume binding, but we should also support retrying the bind, because a timeout is only one of the errors that can make a volume bind fail.

We will benefit a lot if a retry succeeds, since the task will not be failed.

I can also see more cases where customers want a retry when the volume bind fails, such as:
https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1723510519206459

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

  • Task

What is the Jira issue?

How should this be tested?

Screenshots (if appropriate)

Questions:

  • The license files need updating.
  • There are breaking changes for older versions.
  • It needs documentation.


codecov bot commented Aug 14, 2024

Codecov Report

Attention: Patch coverage is 91.30435% with 4 lines in your changes missing coverage. Please review.

Project coverage is 68.27%. Comparing base (eed4ea1) to head (01644ba).
Report is 1 commit behind head on master.

Files Patch % Lines
pkg/cache/context.go 80.95% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #890      +/-   ##
==========================================
+ Coverage   68.07%   68.27%   +0.19%     
==========================================
  Files          70       70              
  Lines        7575     7616      +41     
==========================================
+ Hits         5157     5200      +43     
+ Misses       2203     2201       -2     
  Partials      215      215              


@zhuqi-lucas zhuqi-lucas requested review from pbacsko, chenyulin0719 and craigcondit and removed request for craigcondit, pbacsko and chenyulin0719 August 14, 2024 09:26
Contributor

@pbacsko pbacsko left a comment

I'm not against the change, but we need to greatly simplify the retry logic. We can reuse the existing one from K8s, which is properly tested and gives us a much simpler solution.

Comment on lines 790 to 820
for i := 0; i < maxRetries; i++ {
	err = ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
	if err == nil {
		return nil
	}

	log.Log(log.ShimContext).Error("Failed to bind pod volumes",
		zap.String("podName", assumedPod.Name),
		zap.String("nodeName", assumedPod.Spec.NodeName),
		zap.Int("dynamicProvisions", len(volumes.DynamicProvisions)),
		zap.Int("staticBindings", len(volumes.StaticBindings)),
		zap.Int("retryCount", i+1),
		zap.Error(err))

	if i == maxRetries-1 {
		log.Log(log.ShimContext).Error("Failed to bind pod volumes after retry",
			zap.String("podName", assumedPod.Name),
			zap.String("nodeName", assumedPod.Spec.NodeName),
			zap.Int("dynamicProvisions", len(volumes.DynamicProvisions)),
			zap.Int("staticBindings", len(volumes.StaticBindings)),
			zap.Error(err))
		return err
	}

	delay := baseDelay * time.Duration(1<<uint(i))
	if delay > maxDelay {
		delay = maxDelay
	}

	retryStrategy.Sleep(delay) // Use the retry strategy
}

Contributor

We have retry logic in the K8s codebase that we can reuse:

import "k8s.io/client-go/util/retry"

backoff := wait.Backoff{
	Steps:    5,
	Duration: time.Second,
	Factor:   2.0,
	Jitter:   0,
}
err := retry.OnError(backoff, func(_ error) bool {
	return true // retry on all error
}, func() error {
	return ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
})

There's retry.DefaultRetry and retry.DefaultBackoff, but those don't look suitable for us. With no network delay this retries 5 times with a total wait time of 30 seconds.
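
For context, the defaults referred to here are roughly the following in k8s.io/client-go/util/retry (values as found in recent client-go releases; worth verifying against the vendored copy):

var DefaultRetry = wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Millisecond,
	Factor:   1.0,
	Jitter:   0.1,
}

var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}

Both use millisecond-scale base delays, which is far too short for waiting on PV provisioning, hence the custom Backoff in the snippet above.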

Contributor

I was thinking about this; perhaps we're better off with a normal, non-exponential retry (Steps: 5, Factor: 1.0, Duration: 10*time.Second).
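
A rough sketch of that flat-interval variant, assuming the same retry.OnError wrapper and the ctx/assumedPod/volumes names from the diff above:

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// Flat retry: 5 attempts, a fixed 10-second pause between them, no exponential growth.
backoff := wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Second,
	Factor:   1.0,
	Jitter:   0,
}
err := retry.OnError(backoff, func(_ error) bool {
	return true // retry on every error
}, func() error {
	return ctx.apiProvider.GetAPIs().VolumeBinder.BindPodVolumes(context.Background(), assumedPod, volumes)
})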

Contributor Author

This is a good wrapper, I will take a look at switching to it.

Contributor Author

I think the exponential retry is enough, because each retry already has its own internal timeout inside K8s.

Contributor Author

Addressed this with the new API in the latest update to the PR.

Comment on lines 767 to 778
type RetryStrategy interface {
	// Sleep function used for retry delays
	Sleep(duration time.Duration)
}

// DefaultRetryStrategy is a simple retry strategy that sleeps for a fixed duration
// We can extend this to support more advanced retry strategies in the future and also for testing purposes
type DefaultRetryStrategy struct{}

func (r *DefaultRetryStrategy) Sleep(duration time.Duration) {
	time.Sleep(duration)
}

Contributor

@pbacsko pbacsko Aug 14, 2024

A lot of extra code, not needed.

Contributor Author

Thanks @pbacsko for the review. I added this so the test code can mock the sleep interval, since we don't have a mock class for the context.

Contributor Author

Addressed this with the new API in the latest update and removed the extra code.

Comment on lines 2437 to 2443
type MockRetryStrategy struct {
	totalSleep time.Duration
}

func (m *MockRetryStrategy) Sleep(duration time.Duration) {
	m.totalSleep += duration
}

Contributor

Extra code, not needed

Contributor Author

I added this so the test code can mock the sleep interval, since we don't have a mock class for the context.

Contributor Author

Addressed this with the new API in the latest update and removed the extra code.

@pbacsko
Contributor

pbacsko commented Aug 14, 2024

Another thing we can consider is wrapping the entire bindPodVolumes() in a retry loop, though I'm not sure if that makes sense.

As a follow-up, we can think about retrying while doing the pod binding in Task.postTaskAllocated().
Another thing can be a more generic allocation retry where a failed volume/pod binding does not result in a failed Task. Instead, we cancel the allocation from the shim and let the core re-schedule it at a later time.

@zhuqi-lucas
Contributor Author

Another thing we can consider is wrapping the entire bindPodVolumes() in a retry loop, though I'm not sure if that makes sense.

As a follow-up, we can think about retrying while doing the pod binding in Task.postTaskAllocated(). Another thing can be a more generic allocation retry where a failed volume/pod binding does not result in a failed Task. Instead, we cancel the allocation from the shim and let the core re-schedule it at a later time.

Thanks @pbacsko for the review.
I first wanted to do this in Task.postTaskAllocated(), but that function includes a lot of fine-grained operations besides the volume bind, so I chose to retry only the fine-grained function that does the volume bind.

This is a good idea; we can follow up on it in the future. We could provide general retry logic for other cases where a task fails, and it may need a specific config to enable it.

@zhuqi-lucas
Contributor Author

@pbacsko I also added a follow-up Jira targeting 1.7.0:
https://issues.apache.org/jira/browse/YUNIKORN-2804

@zhuqi-lucas zhuqi-lucas requested a review from pbacsko August 15, 2024 03:52
@wilfred-s
Contributor

Not sure I agree with the direction

Another thing can be a more generic allocation retry where a failed volume/pod binding does not result in a failed Task. Instead, we cancel the allocation from the shim and let the core re-schedule it at a later time.

I think that is the only correct way to handle this. It is a larger change, but anything else is a simple band-aid.

We also already have the option to increase the bind timeout via the config, so wrapping the retry that is already in the binder in another retry loop is not a good idea. If it takes too long, increase the timeout. The check runs every second, so increasing the configured timeout from 10s to 30s will only affect these failure cases.

The documentation for the BindPodVolumes call shows:

//     i.  BindPodVolumes() is called first in PreBind phase. It makes all the necessary API updates and waits for
//     PV controller to fully bind and provision the PVCs. If binding fails, the Pod is sent
//     back through the scheduler.

So even the default scheduler just dumps the pod back into the scheduling cycle and retries if it has failed after the timeout.
Looking at the code, the reason for the error might be something that cannot be solved by retrying. For instance, the node selected for the pod might not work for the volume.

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Aug 15, 2024

Thanks @wilfred-s for clarifying, do I understand this correctly?
Even if we increase the timeout and retry, we can't fully fix the problem; we should move the retry back into the scheduling cycle, because sometimes the failure is caused by something a retry cannot recover from. For example:
the volume doesn't match the node selected at that time, so we should go back to the scheduling cycle and retry there to pick a suitable node?

@wilfred-s
Contributor

Correct, we need the option for the k8shim to reject the allocation this late in the cycle.
We need to check what is needed for that first. Currently we fail the task. That should not happen, but that is only one part on the k8shim side; we might also need to revert things in the caches etc.
We also need to re-trigger the scheduling in the core. The allocation is already “done”, so we need to revert that.
With Craig’s changes from YUNIKORN-2460 in place we might be able to do that as an allocation update from the k8shim to the core which removes the node.
Not sure, that would need to be investigated and properly planned.

Contributor

@pbacsko pbacsko left a comment

@zhuqi-lucas I had a quick discussion with Wilfred, and we believe this PR can be closed. There is a retry loop inside the volume binder itself; the solution is to change service.volumeBindTimeout.
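
For reference, a minimal sketch of that change, assuming the flat-key yunikorn-configs ConfigMap layout documented for recent YuniKorn releases (the namespace and the 30s value are only examples, not part of this PR):

apiVersion: v1
kind: ConfigMap
metadata:
  name: yunikorn-configs
  namespace: yunikorn
data:
  # Example: raise the volume bind wait from the 10s default to 30s
  # so the binder's own internal retry loop has more time to succeed.
  service.volumeBindTimeout: "30s"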

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Aug 16, 2024

Sure @pbacsko, @wilfred-s, closing this PR now. I will follow up with:

  1. Add troubleshooting docs.
  2. Implement the general retry policy for the YuniKorn 1.7.0 release.
