From 48dc724ec59ebe6d00cd57a9e7a7d4243edb1541 Mon Sep 17 00:00:00 2001
From: "Fabio M. Graetz, Ph.D"
Date: Sun, 10 Sep 2023 11:32:04 +0200
Subject: [PATCH] Document simplified retry behaviour introduced in #3902
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Fabio M. Graetz, Ph.D.
Signed-off-by: Fabio Grätz
---
 rsts/concepts/tasks.rst | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/rsts/concepts/tasks.rst b/rsts/concepts/tasks.rst
index 1ca43d5ea8..e38e16839d 100644
--- a/rsts/concepts/tasks.rst
+++ b/rsts/concepts/tasks.rst
@@ -106,6 +106,10 @@ System retry can be of two types:
 
 Recoverable vs. Non-Recoverable failures: Recoverable failures will be retried and counted against the task's retry count. Non-recoverable failures will just fail, i.e., the task isn’t retried irrespective of user/system retry configurations. All user exceptions are considered non-recoverable unless the exception is a subclass of FlyteRecoverableException.
 
+.. note::
+
+   `RFC 3902 `_ implements an alternative, simplified retry behaviour with which both system and user retries are counted towards a single retry budget defined in the task decorator (thus, without a second retry budget defined in the platform configuration). The last retries are always performed on non-spot instances to guarantee completion. To activate this behaviour, set ``TODO`` to ``TODO`` in the helm values.
+
 **Timeouts**
 
 To ensure that the system is always making progress, tasks must be guaranteed to end gracefully/successfully. The system defines a default timeout period for the tasks. It is possible for task authors to define a timeout period, after which the task is marked as ``failure``. Note that a timed-out task will be retried if it has a retry strategy defined. The timeout can be handled in the `TaskMetadata `__.