fixed [Docs] Spot/interruptible docs imply retries come from the user retry budget #3956

Signed-off-by: Anirban Pal <[email protected]>
ap0calypse8 committed Oct 29, 2024
1 parent 553a702 commit fc5f5ac
Showing 2 changed files with 74 additions and 0 deletions.
33 changes: 33 additions & 0 deletions docs/user_guide/concepts/main_concepts/tasks.rst
@@ -123,3 +123,36 @@ Caching/Memoization

Flyte supports memoization of task outputs to ensure that identical invocations of a task are not executed repeatedly, thereby saving compute resources and execution time. For example, if you wish to run the same piece of code multiple times, you can reuse the output instead of re-computing it.
For more information on memoization, refer to the :std:doc:`/user_guide/development_lifecycle/caching`.

### Retries and Spot Instances

Tasks can define a retry strategy to handle different types of failures:

1. **System Retries**: Used for infrastructure-level failures outside of user control:
   - Spot instance preemptions
   - Network issues
   - Service unavailability
   - Hardware failures

   *Important*: When running on spot/interruptible instances, preemptions count against the system retry budget, not the user retry budget. The last system retry automatically runs on a non-preemptible instance, so that attempt cannot itself be preempted.

2. **User Retries**: Specified in the `@task` decorator (via the `retries` parameter), used for:
   - Application-level errors
   - Invalid input handling
   - Business logic failures

```python
from flytekit import task

@task(retries=3)  # Sets the user retry budget to 3
def my_task() -> None:
    ...
```
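
The rule that the last system retry moves off spot capacity can be pictured with a small standalone sketch. This is not Flyte's scheduling code: `interruptible_schedule` is a hypothetical helper, and the system retry budget of 3 is an assumed value chosen only for illustration.

```python
from typing import List


def interruptible_schedule(system_retries: int) -> List[bool]:
    """Return, for each attempt, whether it may run on a spot instance.

    Index 0 is the initial attempt; indices 1..system_retries are the
    system retries. Hypothetical model of the documented behavior only.
    """
    if system_retries < 0:
        raise ValueError("system_retries must be non-negative")
    schedule = [True] * (system_retries + 1)
    if system_retries > 0:
        # The last system retry is placed on a non-preemptible instance.
        schedule[-1] = False
    return schedule


# Assumed budget of 3 system retries (illustrative value, not a Flyte default).
for attempt, on_spot in enumerate(interruptible_schedule(3)):
    print(f"attempt {attempt}: {'spot' if on_spot else 'non-preemptible'}")
```

In this model only the final attempt is shielded from preemption; every earlier attempt can still be interrupted and will draw from the system budget when that happens.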

### Alternative Retry Behavior

Starting with RFC 3902, Flyte offers a simplified retry behavior in which both system and user retries count towards a single retry budget defined in the task decorator. To enable this:

1. Set `configmap.core.propeller.node-config.ignore-retry-cause` to `true` in the Helm values.
2. Define `retries` in the task decorator to set the total retry budget.

With this setting, the last retries automatically run on non-spot instances.

This makes retry behavior easier to predict, because every failure draws from the same budget regardless of its cause, while the final attempts remain protected from preemption.
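
As a rough illustration of where that setting sits, the sketch below expands the dotted key into the nested mapping a Helm values file would hold. It assumes the dotted path maps one-to-one onto nested keys, which is the usual Helm convention; `expand_dotted_key` is a hypothetical helper, not part of Flyte or Helm.

```python
from typing import Any, Dict


def expand_dotted_key(path: str, value: Any) -> Dict[str, Any]:
    """Turn 'a.b.c' and a value into {'a': {'b': {'c': value}}}."""
    nested: Any = value
    # Wrap the value from the innermost key outwards.
    for key in reversed(path.split(".")):
        nested = {key: nested}
    return nested


helm_values = expand_dotted_key(
    "configmap.core.propeller.node-config.ignore-retry-cause", True
)
print(helm_values)
# {'configmap': {'core': {'propeller': {'node-config': {'ignore-retry-cause': True}}}}}
```

The resulting structure mirrors the indentation levels the key would occupy in a YAML values file.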
41 changes: 41 additions & 0 deletions docs/user_guide/flyte_fundamentals/optimizing_tasks.md
@@ -273,6 +273,47 @@ the resources that you need. In this case, that need is distributed
training, but Flyte also provides integrations for {ref}`Spark <plugins-spark-k8s>`,
{ref}`Ray <kube-ray-op>`, {ref}`MPI <kf-mpi-op>`, {ref}`Snowflake <snowflake_agent>`, and more.

## Retries and Spot Instances

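When running tasks on spot/interruptible instances, it's important to understand which retry budget each kind of failure draws from. In the task below, `retries` sets the user retry budget and `interruptible=True` allows the task to be scheduled on spot instances: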

```python
from flytekit import task

@task(
    retries=3,  # User retry budget
    interruptible=True,  # Enables running on spot instances
)
def my_task() -> None:
    ...
```

### Default Retry Behavior
- Spot instance preemptions count against the system retry budget, not the user retry budget
- The last system retry automatically runs on a non-preemptible instance
- User retries (specified in the `@task` decorator) are only used for application errors

### Simplified Retry Behavior
Flyte also offers a simplified retry model where both system and user retries count towards a single budget:

```python
from flytekit import task

@task(
    retries=5,  # Total retry budget for both system and user errors
    interruptible=True,
)
def my_task() -> None:
    ...
```

To enable this behavior:
1. Set `configmap.core.propeller.node-config.ignore-retry-cause=true` in the platform config
2. Define the total retry budget in the task decorator

With this setting, the last retries automatically run on non-spot instances.

Choose the retry model that best fits your use case:
- Default: Separate budgets for system vs user errors
- Simplified: Single retry budget shared by system and user errors, with the last retries on non-spot instances
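
To make the difference between the two models concrete, here is a minimal accounting sketch. It is a simplified model of the budgets described above, not Flyte's implementation: `has_attempts_left` is a hypothetical helper, and the system retry budget of 3 is an assumed value used only for the example.

```python
from typing import Iterable


def has_attempts_left(
    failures: Iterable[str],
    user_retries: int,
    system_retries: int,
    single_budget: bool,
) -> bool:
    """Return True if the task may still retry after the given failures.

    Each failure is "system" (for example a preemption) or "user".
    """
    user_left, system_left = user_retries, system_retries
    shared_left = user_retries  # simplified model: `retries` is the whole budget
    for cause in failures:
        if cause not in ("system", "user"):
            raise ValueError(f"unknown failure cause: {cause!r}")
        if single_budget:
            if shared_left == 0:
                return False
            shared_left -= 1
        elif cause == "system":
            if system_left == 0:
                return False
            system_left -= 1
        else:
            if user_left == 0:
                return False
            user_left -= 1
    return True


failures = ["system", "system", "user", "user"]
# Default model: two preemptions and two application errors fit their own budgets.
print(has_attempts_left(failures, user_retries=3, system_retries=3, single_budget=False))
# Simplified model: the same four failures exceed a shared budget of 3.
print(has_attempts_left(failures, user_retries=3, system_retries=3, single_budget=True))
```

Under these assumptions the first call prints `True` and the second prints `False`, which is why a task that moves to the simplified model may need a larger `retries` value to tolerate the same mix of failures.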

Even though Flyte itself is a powerful compute engine and orchestrator for
data engineering, machine learning, and analytics, perhaps you have existing
code that leverages other platforms. Flyte recognizes the pain of migrating code,