
[BUG] Failing Databricks tasks are not retried #5210

Closed · rambrus opened this issue Apr 10, 2024 · 4 comments

Labels: backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working)
rambrus commented Apr 10, 2024

Describe the bug

I created a Flyte task running on Databricks, with task retries set to 3 like this:

@task(task_config=databricks_conf, retries=3)
def db_bad_luck_task():
    raise ValueError("Bad luck, this one failed")

Unfortunately, this task is never retried; we only get a single attempt to run the task.

As a cross-check, I created a plain Python task running on K8s (no task_config, so it runs as a regular container task) with a similar configuration:

@task(retries=3)
def k8s_bad_luck_task():
    raise ValueError("Bad luck, this one failed")

In this case, we get 3 attempts to run the task (this is the behavior we would expect for the Databricks task as well).

I assume the plugin implementation needs to define which LifeCycleState / ResultState values are retryable. Is that correct?

Expected behavior

Flyte will attempt to run the Databricks task 3 times.


rambrus added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Apr 10, 2024

dosubot bot commented Apr 10, 2024


It appears that the issue you're encountering with task retries not working as expected for Databricks tasks in Flyte might be related to how the Databricks plugin or Databricks itself handles job failures, rather than an issue with Flyte's retry mechanism. Flyte's retry functionality is designed to work at a higher level and should automatically retry tasks based on the defined retry policy, without requiring specific handling within the plugin code.

If retries are functioning correctly for Kubernetes tasks but not for Databricks tasks, this suggests a potential discrepancy in how task failures are recognized or handled between the two environments. It's also possible that there's an issue with the version of the Databricks plugin you're using. Checking the failure messages or logs when a task fails on Databricks could provide more insight into why retries aren't being triggered as expected.

For further investigation, it would be beneficial to know the specific version of the Databricks plugin in use, as there may have been updates or fixes in newer versions that address this issue. Additionally, reviewing the TaskMetadata class in the Flyte codebase and the Databricks plugin code could offer more insights into how retries are handled and implemented for Databricks tasks.
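
As context for where retries live on the flytekit side, here is a minimal sketch (assuming flytekit's public task metadata surface; the task body is illustrative only):

from flytekit import task

@task(retries=3)
def flaky_task() -> None:
    raise ValueError("Bad luck, this one failed")

# The retry count declared on @task is carried on the task's metadata;
# Flyte propeller, not the plugin, consumes this retry budget.
print(flaky_task.metadata.retries)  # 3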


wild-endeavor added the backlogged (For internal use. Reserved for contributor team workflow.) label and removed the untriaged label on Apr 11, 2024
pingsutw (Member) commented

@rambrus that's because we always return non-recoverable failure here.

This is also expected behavior: when you raise a ValueError in a regular task, Flyte won't retry either.

@task(retries=3)
def t1():
    raise ValueError("Bad luck, this one failed")
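
For a regular task, a retry can be triggered by raising a recoverable exception instead (a minimal sketch, assuming flytekit's FlyteRecoverableException, which marks a user error as retryable):

from flytekit import task
from flytekit.exceptions.user import FlyteRecoverableException

@task(retries=3)
def t2():
    # Recoverable user errors consume the retry budget, so this task
    # is reattempted up to the configured retry count.
    raise FlyteRecoverableException("Transient failure, retry me")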

rambrus (Author) commented Apr 19, 2024

I did some investigation on the Databricks side and found that if the Databricks cluster fails to start for any reason, the Get a single job run API returns this state:

...
    "state": {
        "life_cycle_state": "INTERNAL_ERROR",
        "result_state": "FAILED",
        "state_message": "<Error details>",
        "user_cancelled_or_timedout": false
    },
...
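
This state can be fetched directly from the Databricks Jobs 2.1 API (a minimal sketch; the host, token, and run_id are placeholders for illustration):

import os
import requests

# GET /api/2.1/jobs/runs/get returns the run, including its "state" block.
resp = requests.get(
    f"https://{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    params={"run_id": 12345},  # placeholder run_id
)
resp.raise_for_status()
state = resp.json()["state"]

# A failed cluster start surfaces as INTERNAL_ERROR / FAILED.
print(state["life_cycle_state"], state["result_state"])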

In fact, Databricks' built-in orchestration framework, Databricks Workflows, classifies life_cycle_state: INTERNAL_ERROR as a retryable error (see here), so it would probably make sense to provide identical behavior in Flyte.

I'm proposing this change in plugins/webapi/databricks/plugin.go:

	case "INTERNAL_ERROR":
		return core.PhaseInfoRetryableFailure(string(rune(http.StatusInternalServerError)), message, taskInfo), nil
	}

@pingsutw How does that sound to you?

pingsutw (Member) commented

#5277

pingsutw self-assigned this May 6, 2024