
[BUG] Failing Databricks tasks are not retried #5210

Closed · rambrus opened this issue Apr 10, 2024 · 4 comments

Labels: backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working)
rambrus commented Apr 10, 2024

Describe the bug

I created a Flyte task running on Databricks, with task retries set to 3 like this:

@task(task_config=databricks_conf, retries=3)
def db_bad_luck_task():
    raise ValueError("Bad luck, this one failed")

Unfortunately, this task is never retried; we only get a single attempt to run the task.

As a cross-check, I created a plain Python task running on K8s (no task_config, so it runs as a regular container task) with a similar configuration:

@task(retries=3)
def k8s_bad_luck_task():
    raise ValueError("Bad luck, this one failed")

In this case, we get 3 attempts to run the task (this is the behavior we would expect for the Databricks task as well).

I assume the plugin implementation needs to define which LifeCycleState / ResultState values are retryable. Is that correct?

Expected behavior

Flyte will attempt to run the Databricks task 3 times.


rambrus added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Apr 10, 2024

dosubot bot commented Apr 10, 2024


It appears that the issue you're encountering with task retries not working as expected for Databricks tasks in Flyte might be related to how the Databricks plugin or Databricks itself handles job failures, rather than an issue with Flyte's retry mechanism. Flyte's retry functionality is designed to work at a higher level and should automatically retry tasks based on the defined retry policy, without requiring specific handling within the plugin code.

If retries are functioning correctly for Kubernetes tasks but not for Databricks tasks, this suggests a potential discrepancy in how task failures are recognized or handled between the two environments. It's also possible that there's an issue with the version of the Databricks plugin you're using. Checking the failure messages or logs when a task fails on Databricks could provide more insight into why retries aren't being triggered as expected.

For further investigation, it would be beneficial to know the specific version of the Databricks plugin in use, as there may have been updates or fixes in newer versions that address this issue. Additionally, reviewing the TaskMetadata class in the Flyte codebase and the Databricks plugin code could offer more insights into how retries are handled and implemented for Databricks tasks.
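
As context for where retries live on the flytekit side, here is a minimal sketch (assuming flytekit's public task metadata surface; the task body is illustrative only):

from flytekit import task

@task(retries=3)
def flaky_task() -> None:
    raise ValueError("Bad luck, this one failed")

# The retry count declared on @task is carried on the task's metadata;
# Flyte propeller, not the plugin, consumes this retry budget.
print(flaky_task.metadata.retries)  # 3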


wild-endeavor added the backlogged (For internal use. Reserved for contributor team workflow.) label and removed the untriaged label on Apr 11, 2024
pingsutw (Member) commented

@rambrus that's because we always return non-recoverable failure here.

This is also expected behavior: when you raise a ValueError in a regular task, Flyte won't retry either.

@task(retries=3)
def t1():
    raise ValueError("Bad luck, this one failed")
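
For a regular task, a retry can be triggered by raising a recoverable exception instead (a minimal sketch, assuming flytekit's FlyteRecoverableException, which marks a user error as retryable):

from flytekit import task
from flytekit.exceptions.user import FlyteRecoverableException

@task(retries=3)
def t2():
    # Recoverable user errors consume the retry budget, so this task
    # is reattempted up to the configured retry count.
    raise FlyteRecoverableException("Transient failure, retry me")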

rambrus (Author) commented Apr 19, 2024

I did some investigation on the Databricks side and found that if the Databricks cluster fails to start for any reason, the Get a single job run API returns this state:

...
    "state": {
        "life_cycle_state": "INTERNAL_ERROR",
        "result_state": "FAILED",
        "state_message": "<Error details>",
        "user_cancelled_or_timedout": false
    },
...
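
This state can be fetched directly from the Databricks Jobs 2.1 API (a minimal sketch; the host, token, and run_id are placeholders for illustration):

import os
import requests

# GET /api/2.1/jobs/runs/get returns the run, including its "state" block.
resp = requests.get(
    f"https://{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    params={"run_id": 12345},  # placeholder run_id
)
resp.raise_for_status()
state = resp.json()["state"]

# A failed cluster start surfaces as INTERNAL_ERROR / FAILED.
print(state["life_cycle_state"], state["result_state"])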

In fact, Databricks' built-in orchestration framework, Databricks Workflows, classifies life_cycle_state: INTERNAL_ERROR as a retryable error (see here), so it would probably make sense to provide identical behavior in Flyte.

I'm proposing this change in plugins/webapi/databricks/plugin.go:

	case "INTERNAL_ERROR":
		return core.PhaseInfoRetryableFailure(string(rune(http.StatusInternalServerError)), message, taskInfo), nil
	}

@pingsutw How does that sound to you?

pingsutw (Member) commented

#5277

pingsutw self-assigned this May 6, 2024