From 3692bfaa1079890982d9de6931184ba79beffcd9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabio=20Gr=C3=A4tz?= Date: Thu, 27 Jul 2023 18:30:21 +0200 Subject: [PATCH 1/8] Add RFC to simplify retry behaviour and unify it between plugins MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Fabio Grätz --- rfc/system/0000-simplify-retry-behaviour.md | 120 ++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 rfc/system/0000-simplify-retry-behaviour.md diff --git a/rfc/system/0000-simplify-retry-behaviour.md b/rfc/system/0000-simplify-retry-behaviour.md new file mode 100644 index 0000000000..ae73351def --- /dev/null +++ b/rfc/system/0000-simplify-retry-behaviour.md @@ -0,0 +1,120 @@ +# [RFC Template] Simplify retry behaviour and unify it between plugins + +**Authors:** + +- [@fellhorn](https://github.com/fellhorn) +- [@hamersaw](https://github.com/hamersaw) +- [@fg91](https://github.com/fg91) + +## 1 Executive Summary + +Flyte implements a retry mechanism to make workflows robust against failure. This retry mechanism has two different budgets: one for user failures, for which the user defines the maximum number of retries in the `@task` decorator, and one for system failures, which is defined on the platform side. + +> We distinguish between Flytekit/system-level exceptions and user exceptions. For instance, if Flytekit encounters an issue while uploading outputs, it is considered a system exception. On the other hand, if a user raises a ValueError due to an unexpected input in the task code, it is classified as a user exception. [(Source)](https://docs.flyte.org/projects/flytekit/en/latest/design/authoring.html#exception-handling) + +Especially when it comes to interruptions/node terminations, the details of the retry behaviour (which budget a retry counts against and how many retries remain) are opaque and difficult to understand. The behavior is unfortunately also not consistent between plugins or even within the Pod plugin. + +The goal of this RFC is to make the behaviour easy to understand and consistent between plugins. + +## 2 Motivation + +For regular `PythonFunctionTasks` the interruptible behavior (in case of a preemption) is as follows: + +The Pod plugin tries to [demystify failures](https://github.com/flyteorg/flyteplugins/blob/dfdf6f95aef7bebff160d6660f5c62f5832c39e4/go/tasks/plugins/k8s/pod/plugin.go#L182) and, in case a preemption is detected, [returns a system retriable failure](https://github.com/flyteorg/flyteplugins/blob/dfdf6f95aef7bebff160d6660f5c62f5832c39e4/go/tasks/pluginmachinery/flytek8s/pod_helper.go#L628). + +* For retries 1 through SystemRetryBudget-1, the task runs on a Spot instance. Failed attempts are shown to the user as follows: + + + +* The last retry [happens on a non-interruptible machine to ensure completion of the task](https://github.com/flyteorg/flytepropeller/blob/a3c6e91f19c19601a957b29891437112868845de/pkg/controller/nodes/node_exec_context.go#L213) + +(Unintuitively, in case of a `SIGKILL` received during node termination, the failure [is counted towards the user-defined retry budget](https://github.com/flyteorg/flyteplugins/blob/dfdf6f95aef7bebff160d6660f5c62f5832c39e4/go/tasks/pluginmachinery/flytek8s/pod_helper.go#L645), in contrast to a preemption.) + +When using, for instance, the kubeflow operators plugin (e.g. 
Pytorch Task), all preemptions are counted towards the [user retry budget](https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/plugins/k8s/kfoperators/common/common_operator.go#L79). The last attempt is not performed on a non-preemptible node. + +Preempted attempts are shown as follows in the console: + +The incoherent behaviour is opaque and counterintuitive for platform users. As one can see in the previous screenshots, the user can't distinguish which retry budget an attempt counted against. We as platform engineers have been approached multiple times with questions such as: *"Why did my task retry only x times when I specified y times?"* + +Several users were also surprised to learn that the `retries` parameter in the `@task` decorator has no effect on how many times a task can get preempted - at least in the case of python tasks (not in the case of e.g. pytorch tasks). For longer training runs, a larger number of preemptions can be acceptable to a user. Some of our users remarked they would like to have control over the maximum number of preemptions. + +* We would like Flyte to handle interruptible tasks the same, no matter which plugin is responsible. +* We would like users to be able to control how often a task can get preempted. +* We would like the retry mechanism and budgets to be easier to understand for platform users. + + +## 3 Proposed Implementation + +We propose to introduce a new simplified retry behaviour (which will have to be activated in the platform configuration and which will **not** be the default behaviour in order to not make a breaking change). + +This simplified retry behavior does not distinguish between user and system failures but considers a single maximum number of retries (defined by the user in the task decorator) regardless of the cause. + +This will require a surprisingly small amount of changes to Flyte: + +We propose to: + +* Add a configuration flag in FlytePropeller for counting system and user retries the same. +* Apply the new behaviour in the [`isEligibleForRetry`](https://github.com/flyteorg/flytepropeller/blob/294f47a18c9b11892f2d1a3573019e85257c3b19/pkg/controller/nodes/executor.go#L440) function where, in contrast to the current logic, if the flag is set, we check if the number of attempts + the number of system failures is larger than the max number of retries set in the task decorator. For the last retries (details see section 8), non-spot instances are used. +* Add configuration for default number of retries in FlytePropeller. Currently the default maximum number of attempts is hardcoded. This should be exposed in the configuration. + +A minimal sketch of the proposed eligibility check is shown below, after section 5. + +In addition to a much easier to understand behaviour, all plugins will automatically count preemptions the same way and users will have control over the maximum number of preemptions acceptable for their workflow. + +## 4 Metrics & Dashboards + +_NA_ + +## 5 Drawbacks + +Currently the behaviour is opaque and confusing for platform users. We as platform engineers have to look deep into the code to understand which budget a certain failure counts against. Users cannot increase the number of accepted preemptions when their training runs longer. Different plugins handle preemptions in different ways. +We should not leave these quirks untackled. 
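To make the eligibility check proposed in section 3 concrete, the following is a minimal, self-contained Go sketch of the intended logic. It is an illustration only: the type, field, and flag names (`NodeStatus`, `SystemFailures`, `IgnoreRetryCause`) are assumptions for this sketch and do not mirror the actual FlytePropeller code.

```go
package main

import "fmt"

// Illustrative types only; they do not mirror FlytePropeller's actual data structures.
type NodeStatus struct {
	Attempts       uint32 // failures counted against the user retry budget
	SystemFailures uint32 // failures counted against the system retry budget (e.g. preemptions)
}

type Config struct {
	IgnoreRetryCause bool // proposed flag: count user and system failures against one budget
}

// isEligibleForRetry sketches the proposed check: with the flag enabled, the sum of
// user and system failures is compared against the single budget from the @task
// decorator. The separate system-retry budget of the current behaviour is omitted.
func isEligibleForRetry(cfg Config, status NodeStatus, maxAttempts uint32) bool {
	attempts := status.Attempts
	if cfg.IgnoreRetryCause {
		attempts += status.SystemFailures
	}
	return attempts < maxAttempts
}

func main() {
	status := NodeStatus{Attempts: 1, SystemFailures: 2} // one user error, two preemptions
	fmt.Println(isEligibleForRetry(Config{IgnoreRetryCause: false}, status, 3)) // true: only the user failure counts
	fmt.Println(isEligibleForRetry(Config{IgnoreRetryCause: true}, status, 3))  // false: the single budget of 3 is exhausted
}
```

With the flag disabled, only failures counted against the user budget consume the limit shown here; with the flag enabled, a single user-facing number bounds all failures, including preemptions.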
+ +## 6 Alternatives + +### 6.1 Count preemptions towards user retry budget + +If we didn't want to try to make the retry behaviour easier to understand but only wanted to 1) remove the incoherence between plugins when it comes to preemption handling and 2) give users the ability to control the number of accepted preemptions, we could simply always count preemptions towards the user retry budget. + +In this case, [here](https://github.com/flyteorg/flyteplugins/blob/dfdf6f95aef7bebff160d6660f5c62f5832c39e4/go/tasks/pluginmachinery/flytek8s/pod_helper.go#L628), we would return a `PhaseInfoRetryableFailure` instead of a `PhaseInfoSystemRetryableFailure`. + +* This would mean that most users wouldn't need to know about the system retry budget and they could adapt the number of allowed preemptions to the length of their training. +* This would also mean that the Pod plugin and e.g. the kubeflow plugin would deal with preemptions the same way. +* If preemptions counted against the user retry budget, we would have to switch to non-preemptible instances for the last retry of the user retry budget [instead of the system retry budget](https://github.com/flyteorg/flytepropeller/blob/a3c6e91f19c19601a957b29891437112868845de/pkg/controller/nodes/node_exec_context.go#L213). + +The downside of this approach is that the general retry behaviour remains as complex and difficult to understand as it is now. + +### 6.2 Demystify failure of Pods belonging to plugins/CRDs + +We discussed ways to give plugins access to the status of not only the custom resource they create (e.g. `PyTorchJob`) but also to the statuses of the pods created from the custom resource by the respective third-party operator (e.g. the kubeflow training operator). (This might require changes to the plugin interface, see details [here](https://github.com/flyteorg/flyte/discussions/3793).) + +This approach would allow plugins to demystify failures of tasks by looking at the underlying pods which, in contrast to the custom resource, do contain hints of preemptions. This way, we could unify the retry behaviour upon preemptions of the different plugins. However, it wouldn't make the retry behaviour easier to understand, and users still wouldn't have control over the number of preemptions, which is why we opted not to continue these discussions either. + + +### 6.3 Inform the user which retry budget an attempt counted against + +We considered showing the user in the Flyte console whether an attempt counted against the user retry budget or the system retry budget. + +This only improves the transparency and does not remove the different behavior between plugins. Doing so is not necessary when every failure is counted against the same retry budget. + +![budgets](https://github.com/flyteorg/flyte/assets/26092524/3e425528-ce3f-42af-8cc3-f0ca0cdab660) + + + +## 7 Potential Impact and Dependencies + +By having to activate the new retry behaviour in the platform configuration, we will not break existing workflows. + +## 8 Unresolved questions + +An open question is how we deal with the `interruptibleFailureThreshold`. Currently, this value is a positive integer that is the number of retries that are interruptible. For example, 4 (system failure) retries are allowed where 3 of them are interruptible. Defining it this way does not make as much sense when we allow for task-level configuration of the retries. + +We could define the number of retries that ARE NOT interruptible. 
For example a value of `nonInterruptibleRetries` that is set to 1 means that the last retry is not interruptible regardless of how many retries there are. We also have to discuss whether this value would be configured on the platform side (only for the new retry behaviour) or whether we would add this value to task-level metadata and expose it to users. Adding this flag while ensuring backwards compatibility will require further discussion. + +## 9 Conclusion + +With the proposed new retry behaviour + +* Flyte could handle interruptible tasks the same, no matter which plugin is responsible. +* users would be able to control how often a task can get preempted. +* the retry mechanism and budgets would be much easier to understand for platform users. From 1355a81dee0ec9568e16ba86b8c0561637473618 Mon Sep 17 00:00:00 2001 From: "Fabio M. Graetz, Ph.D" Date: Thu, 3 Aug 2023 15:53:46 +0200 Subject: [PATCH 2/8] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com> Signed-off-by: Fabio M. Graetz, Ph.D. Signed-off-by: Fabio Grätz --- rfc/system/0000-simplify-retry-behaviour.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfc/system/0000-simplify-retry-behaviour.md b/rfc/system/0000-simplify-retry-behaviour.md index ae73351def..8085391c8f 100644 --- a/rfc/system/0000-simplify-retry-behaviour.md +++ b/rfc/system/0000-simplify-retry-behaviour.md @@ -1,4 +1,4 @@ -# [RFC Template] Simplify retry behaviour and unify it between plugins +# [RFC Template] Simplify retry behavior and unify it between plugins **Authors:** @@ -12,7 +12,7 @@ Flyte implements a retry mechanism to make workflows robust against failure. Thi > We distinguish between Flytekit/system-level exceptions and user exceptions. For instance, if Flytekit encounters an issue while uploading outputs, it is considered a system exception. On the other hand, if a user raises a ValueError due to an unexpected input in the task code, it is classified as a user exception. [(Source)](https://docs.flyte.org/projects/flytekit/en/latest/design/authoring.html#exception-handling) -Especially when it comes to interruptions/node terminations, the details of the retry behaviour (which budget a retry counts against and how many retries remain) are opaque and difficult to understand. The behavior is unfortunately also not consistent between plugins or even within the Pod plugin. +Especially when it comes to interruptions/node terminations, the details of the retry behavior (which budget a retry counts against and how many retries remain) are opaque and difficult to understand. The behavior is unfortunately also not consistent between plugins or even within the Pod plugin. From 4d1a8c9c018fa526eddb17d2edba44ebe691fe4a Mon Sep 17 00:00:00 2001 From: "Fabio M. Graetz, Ph.D" Date: Thu, 3 Aug 2023 15:54:59 +0200 Subject: [PATCH 3/8] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com> Signed-off-by: Fabio M. Graetz, Ph.D. 
Signed-off-by: Fabio Grätz --- rfc/system/0000-simplify-retry-behaviour.md | 24 ++++++++++----------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/rfc/system/0000-simplify-retry-behaviour.md b/rfc/system/0000-simplify-retry-behaviour.md index 8085391c8f..8bd3d72ad6 100644 --- a/rfc/system/0000-simplify-retry-behaviour.md +++ b/rfc/system/0000-simplify-retry-behaviour.md @@ -14,7 +14,7 @@ Flyte implements a retry mechanism to make workflows robust against failure. Thi Especially when it comes to interruptions/node terminations, the details of the retry behavior (which budget a retry counts against and how many retry possibilities are remaining) are intransparent and difficult to understand. The behavior is unfortunately also not consistent between plugins or even within the Pod plugin. -The goal of this RFC is to make the behaviour easy to understand and consistent between plugins. +The goal of this RFC is to make the behavior easy to understand and consistent between plugins. ## 2 Motivation @@ -35,7 +35,7 @@ When using for instance the kubeflow operators plugin (e.g. Pytorch Task) all pr Preempted attempts are shown as follows in the console: -The incoherent behaviour is intransparent and counterintuitive for platform users. As one can see in the previous screenshots the user can't distinguish which retry budget an attempt counted against. We as platform engineers have been approached multiple times with questions such as: *"Why did my task retry only x times when I specified y times."* +The incoherent behavior is intransparent and counterintuitive for platform users. As one can see in the previous screenshots the user can't distinguish which retry budget an attempt counted against. We as platform engineers have been approached multiple times with questions such as: *"Why did my task retry only x times when I specified y times."* Several users were also surprised to learn that the `retries` parameter in the `@task` decorator has no effect on how many times a task can get preempted - at least in the case of python tasks (not in the case of e.g. pytorch tasks). For longer trainings, a larger number of preemptions can be acceptable to a user. Some of our users remarked they would like to have control over the maximum number of preemptions. @@ -46,7 +46,7 @@ Several users were also surprised to learn that the `retries` parameter in the ` ## 3 Proposed Implementation -We propose to introduce a new simplified retry behaviour (which will have to be activated in the platform configuration and which will **not** be the default behaviour in order to not make a breaking change). +We propose to introduce a new simplified retry behavior (which will have to be activated in the platform configuration and which will **not** be the default behavior in order to not make a breaking change). This simplified retry behavior does not distinguish between user and system failures but considers a single maximum number of retries (defined by the user in the task decorator) regardless of the cause. @@ -55,11 +55,11 @@ This will require a surprisingly small amount of changes to Flyte: We propose to: * Add a configuration flag in FlytePropeller for counting system and user retries the same. 
-* Apply the new behaviour in the [`isEligibleForRetry`](https://github.com/flyteorg/flytepropeller/blob/294f47a18c9b11892f2d1a3573019e85257c3b19/pkg/controller/nodes/executor.go#L440) function where, in contrast to the current logic, if the flag is set, we check if the number of attempts + the number of system failures is larger than the max number of retries set in the task decorator. For the last retries (details see section 8), non-spot instances are used. +* Apply the new behavior in the [`isEligibleForRetry`](https://github.com/flyteorg/flytepropeller/blob/294f47a18c9b11892f2d1a3573019e85257c3b19/pkg/controller/nodes/executor.go#L440) function where, in contrast to the current logic, if the flag is set, we check if the number of attempts + the number of system failures is larger than the max number of retries set in the task decorator. For the last retries (details see section 8), non-spot instances are used. * Add configuration for default number of retries in FlytePropeller. Currently the default maximum number of attempts is hardcoded. This should be exposed in the configuration. -In addition to a much easier to understand behaviour, all plugins will automatically count preemptions the same way and users will have control over the maximum number of preemptions acceptable for their workflow. +In addition to a much easier to understand behavior, all plugins will automatically count preemptions the same way and users will have control over the maximum number of preemptions acceptable for their workflow. ## 4 Metrics & Dashboards @@ -67,14 +67,14 @@ _NA_ ## 5 Drawbacks -Currently the behaviour is intransparent and confusing for platform users. We as platform engineers have to look deep into the code to understand which budget a certain failure counts against. Users cannot increase the number of accepted preemptions when their training runs longer. Different plugins handle preemptions in different ways. +Currently the behavior is intransparent and confusing for platform users. We as platform engineers have to look deep into the code to understand which budget a certain failure counts against. Users cannot increase the number of accepted preemptions when their training runs longer. Different plugins handle preemptions in different ways. We should not leave these quirks untackled. ## 6 Alternatives ### 6.1 Count preemptions towards user retry budget -If we didn't want to try to make the retry behaviour easier to understand but only wanted to 1) remove the incoherence between plugins when it comes to preemption handling and 2) give users the ability the control the number of accepted preemptions, we could simply always count preemptions towards the user retry budget. +If we didn't want to try to make the retry behavior easier to understand but only wanted to 1) remove the incoherence between plugins when it comes to preemption handling and 2) give users the ability the control the number of accepted preemptions, we could simply always count preemptions towards the user retry budget. In this case, [here](https://github.com/flyteorg/flyteplugins/blob/dfdf6f95aef7bebff160d6660f5c62f5832c39e4/go/tasks/pluginmachinery/flytek8s/pod_helper.go#L628), we would return a `PhaseInfoRetryableFailure` instead of a `PhaseInfoSystemRetryableFailure`. @@ -82,13 +82,13 @@ In this case, [here](https://github.com/flyteorg/flyteplugins/blob/dfdf6f95aef7b * This would also mean that the Pod plugin and e.g. the kubeflow plugin would deal with preemptions the same way. 
* If preemptions counted against the user retry budget, we would have to switch to non-preemptible instances for the last retry of the user retry budget [instead of the system retry budget](https://github.com/flyteorg/flytepropeller/blob/a3c6e91f19c19601a957b29891437112868845de/pkg/controller/nodes/node_exec_context.go#L213). -The downside of this approach is that the general retry behaviour remains as complex and difficult to understand as it is now. +The downside of this approach is that the general retry behavior remains as complex and difficult to understand as it is now. ### 6.2 Demistify failure of Pods belonging to plugins/CRDs We discussed ways to give plugins access to the status of not only the custom resource they create (e.g. `PyTorchJob`) but also to the statuses of the pods created from the custom resource by the respective third-party operator (e.g. the kubeflow training operator). (This might require changes to the plugin interface, see details [here](https://github.com/flyteorg/flyte/discussions/3793).) -This approach would allow plugins to demistify failures of tasks by looking at the underlying pods which, in contrast to the custom resource, do contain hints of preemptions. This way, we could unify the retry behaviour upon preemptions of the different plugins. However, wouldn't make the retry behaviour easier to understand and users still wouldn't have control over the number of preemptions which is why we opted to not continue these discussions either. +This approach would allow plugins to demistify failures of tasks by looking at the underlying pods which, in contrast to the custom resource, do contain hints of preemptions. This way, we could unify the retry behavior upon preemptions of the different plugins. However, wouldn't make the retry behavior easier to understand and users still wouldn't have control over the number of preemptions which is why we opted to not continue these discussions either. ### 6.3 Inform the user which retry budget an attempt counted against @@ -103,17 +103,17 @@ This only improves the transparency and does not remove the different behavior b ## 7 Potential Impact and Dependencies -By having to activate the new retry behaviour in the platform configuration, we will not break existing workflows. +By having to activate the new retry behavior in the platform configuration, we will not break existing workflows. ## 8 Unresolved questions An open question is how we deal with the `interruptibleFailureThreshold`. Currently, this value is a positive integer that is the number of retries that are interruptible. For example, 4 (system failure) retries are allowed where 3 of them are interruptible. Defining it this way does not make as much sense when we allow for task-level configuration of the retries. -We could define the number of retries that ARE NOT interruptible. For example a value of `nonInterruptibleRetries` that is set to 1 means that the last retry is not interruptible regardless of how many retries there are. We also have to discuss whether this value would be configured on the platform side (only for the new retry behaviour) or whether we would add this value to to task-level metadata and expose it to users. Adding this flag while ensuring backwards compatibility will require further discussion. +We could define the number of retries that ARE NOT interruptible. For example a value of `nonInterruptibleRetries` that is set to 1 means that the last retry is not interruptible regardless of how many retries there are. 
We also have to discuss whether this value would be configured on the platform side (only for the new retry behavior) or whether we would add this value to to task-level metadata and expose it to users. Adding this flag while ensuring backwards compatibility will require further discussion. ## 9 Conclusion -With the proposed new retry behaviour +With the proposed new retry behavior * Flyte could handle interruptible tasks the same, no matter which plugin is responsible. * users would be able to control how often a task can get preempted. From 8cb45ac35a0aa58d3b0e361ab255eb450128f72e Mon Sep 17 00:00:00 2001 From: "Fabio M. Graetz, Ph.D" Date: Thu, 3 Aug 2023 15:55:40 +0200 Subject: [PATCH 4/8] Update rfc/system/0000-simplify-retry-behaviour.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Fabio M. Graetz, Ph.D. Signed-off-by: Fabio Grätz --- rfc/system/0000-simplify-retry-behaviour.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/rfc/system/0000-simplify-retry-behaviour.md b/rfc/system/0000-simplify-retry-behaviour.md index 8bd3d72ad6..7bef4cdda5 100644 --- a/rfc/system/0000-simplify-retry-behaviour.md +++ b/rfc/system/0000-simplify-retry-behaviour.md @@ -91,13 +91,6 @@ We discussed ways to give plugins access to the status of not only the custom re This approach would allow plugins to demistify failures of tasks by looking at the underlying pods which, in contrast to the custom resource, do contain hints of preemptions. This way, we could unify the retry behavior upon preemptions of the different plugins. However, wouldn't make the retry behavior easier to understand and users still wouldn't have control over the number of preemptions which is why we opted to not continue these discussions either. -### 6.3 Inform the user which retry budget an attempt counted against - -We considered showing the user in the Flyte console whether an attempt counted against the user retry budget or the system retry budget. - -This only improves the transparency and does not remove the different behavior between plugins. Doing so is not necessary when every failure is counted against the same retry budget. - -![budgets](https://github.com/flyteorg/flyte/assets/26092524/3e425528-ce3f-42af-8cc3-f0ca0cdab660) From e752b9b619d5f660d6e8833837548a45ec70141a Mon Sep 17 00:00:00 2001 From: "Fabio M. Graetz, Ph.D" Date: Thu, 3 Aug 2023 15:58:34 +0200 Subject: [PATCH 5/8] Update rfc/system/0000-simplify-retry-behaviour.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Fabio M. Graetz, Ph.D. Signed-off-by: Fabio Grätz --- rfc/system/0000-simplify-retry-behaviour.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfc/system/0000-simplify-retry-behaviour.md b/rfc/system/0000-simplify-retry-behaviour.md index 7bef4cdda5..5652b4b114 100644 --- a/rfc/system/0000-simplify-retry-behaviour.md +++ b/rfc/system/0000-simplify-retry-behaviour.md @@ -104,6 +104,8 @@ An open question is how we deal with the `interruptibleFailureThreshold`. Curren We could define the number of retries that ARE NOT interruptible. For example a value of `nonInterruptibleRetries` that is set to 1 means that the last retry is not interruptible regardless of how many retries there are. We also have to discuss whether this value would be configured on the platform side (only for the new retry behavior) or whether we would add this value to to task-level metadata and expose it to users. 
Adding this flag while ensuring backwards compatibility will require further discussion. +Note: while unifying the retry budgets, the new retry behaviour still distinguishes between retriable and non-retriable exceptions and should do so consistently across plugins. + ## 9 Conclusion With the proposed new retry behavior From 0c818a64f698640ac44542e1057455c37fd32d6e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabio=20Gr=C3=A4tz?= Date: Sun, 10 Sep 2023 11:15:27 +0200 Subject: [PATCH 6/8] Rename RFC document with PR number MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Fabio Grätz --- ...mplify-retry-behaviour.md => 3902-simplify-retry-behaviour.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename rfc/system/{0000-simplify-retry-behaviour.md => 3902-simplify-retry-behaviour.md} (100%) diff --git a/rfc/system/0000-simplify-retry-behaviour.md b/rfc/system/3902-simplify-retry-behaviour.md similarity index 100% rename from rfc/system/0000-simplify-retry-behaviour.md rename to rfc/system/3902-simplify-retry-behaviour.md From 4484dccddb736c95f6883f194e961e517918c89d Mon Sep 17 00:00:00 2001 From: "Fabio M. Graetz, Ph.D" Date: Wed, 27 Sep 2023 19:45:28 +0200 Subject: [PATCH 7/8] Update rfc/system/3902-simplify-retry-behaviour.md Signed-off-by: Fabio M. Graetz, Ph.D. --- rfc/system/3902-simplify-retry-behaviour.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/rfc/system/3902-simplify-retry-behaviour.md b/rfc/system/3902-simplify-retry-behaviour.md index 5652b4b114..993baeae93 100644 --- a/rfc/system/3902-simplify-retry-behaviour.md +++ b/rfc/system/3902-simplify-retry-behaviour.md @@ -98,13 +98,6 @@ This approach would allow plugins to demistify failures of tasks by looking at t By having to activate the new retry behavior in the platform configuration, we will not break existing workflows. -## 8 Unresolved questions - -An open question is how we deal with the `interruptibleFailureThreshold`. Currently, this value is a positive integer that is the number of retries that are interruptible. For example, 4 (system failure) retries are allowed where 3 of them are interruptible. Defining it this way does not make as much sense when we allow for task-level configuration of the retries. - -We could define the number of retries that ARE NOT interruptible. For example a value of `nonInterruptibleRetries` that is set to 1 means that the last retry is not interruptible regardless of how many retries there are. We also have to discuss whether this value would be configured on the platform side (only for the new retry behavior) or whether we would add this value to to task-level metadata and expose it to users. Adding this flag while ensuring backwards compatibility will require further discussion. - -Note: while unifying the retry budgets, the new retry behaviour still distinguishes between retriable and non-retriable exceptions and should do so consistently across plugins. ## 9 Conclusion From bda17f2b002d745ba3ca31d0521df5e089c78330 Mon Sep 17 00:00:00 2001 From: "Fabio M. Graetz, Ph.D" Date: Wed, 27 Sep 2023 19:45:38 +0200 Subject: [PATCH 8/8] Update rfc/system/3902-simplify-retry-behaviour.md Signed-off-by: Fabio M. Graetz, Ph.D. 
--- rfc/system/3902-simplify-retry-behaviour.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/rfc/system/3902-simplify-retry-behaviour.md b/rfc/system/3902-simplify-retry-behaviour.md index 993baeae93..c15bc185b6 100644 --- a/rfc/system/3902-simplify-retry-behaviour.md +++ b/rfc/system/3902-simplify-retry-behaviour.md @@ -55,7 +55,8 @@ This will require a surprisingly small amount of changes to Flyte: We propose to: * Add a configuration flag in FlytePropeller for counting system and user retries the same. -* Apply the new behavior in the [`isEligibleForRetry`](https://github.com/flyteorg/flytepropeller/blob/294f47a18c9b11892f2d1a3573019e85257c3b19/pkg/controller/nodes/executor.go#L440) function where, in contrast to the current logic, if the flag is set, we check if the number of attempts + the number of system failures is larger than the max number of retries set in the task decorator. For the last retries (details see section 8), non-spot instances are used. +* Apply the new behavior in the [`isEligibleForRetry`](https://github.com/flyteorg/flytepropeller/blob/294f47a18c9b11892f2d1a3573019e85257c3b19/pkg/controller/nodes/executor.go#L440) function where, in contrast to the current logic, if the flag is set, we check if the number of attempts + the number of system failures is larger than the max number of retries set in the task decorator. For the last retries, non-spot instances are used. + * To not introduce an additional platform config flag, the existing flag `interruptibleFailureThreshold` is reused to decide when to switch to a regular instance. With the existing retry behavior, this number is a positive integer that is the number of retries that are interruptible. For example, 4 (system failure) retries are allowed where 3 of them are interruptible. With the new retry behavior, a negative number such as `-2` means that the last two retries are non-interruptible. * Add configuration for default number of retries in FlytePropeller. Currently the default maximum number of attempts is hardcoded. This should be exposed in the configuration.
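To illustrate the reuse of `interruptibleFailureThreshold` described in the bullet above, the following minimal, self-contained Go sketch shows how both conventions could be interpreted when deciding whether an attempt may run on an interruptible (spot) instance. The function name and signature are assumptions for illustration and do not reflect the actual FlytePropeller implementation.

```go
package main

import "fmt"

// canRunInterruptible decides whether a given attempt (0-indexed) may run on a
// spot instance. A non-negative threshold follows the existing convention (the
// first `threshold` attempts are interruptible); a negative threshold follows
// the proposed convention (the last |threshold| attempts are not interruptible).
func canRunInterruptible(attempt, maxAttempts uint32, threshold int32) bool {
	if threshold >= 0 {
		return attempt < uint32(threshold)
	}
	nonInterruptible := uint32(-threshold)
	if nonInterruptible >= maxAttempts {
		return false // every attempt runs on a regular instance
	}
	return attempt < maxAttempts-nonInterruptible
}

func main() {
	// Existing convention: 4 attempts, threshold 3 -> only the last attempt is non-interruptible.
	// Proposed convention: 4 attempts, threshold -1 -> likewise only the last attempt falls back.
	for attempt := uint32(0); attempt < 4; attempt++ {
		fmt.Printf("attempt %d: existing=%v proposed=%v\n",
			attempt, canRunInterruptible(attempt, 4, 3), canRunInterruptible(attempt, 4, -1))
	}
}
```

Keeping a single setting whose sign selects the convention is what would allow the new behavior to be introduced without an additional platform flag, as proposed above.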