[BUG] Failing Databricks tasks are not retried #5210
Hey @rambrus! 😄 Long time no see in the issue wilderness! How's it going in the coding jungle?

It appears that the issue you're encountering with task retries not working as expected for Databricks tasks in Flyte might be related to how the Databricks plugin or Databricks itself handles job failures, rather than an issue with Flyte's retry mechanism. Flyte's retry functionality works at a higher level and should automatically retry tasks based on the defined retry policy, without requiring specific handling within the plugin code.

If retries are functioning correctly for Kubernetes tasks but not for Databricks tasks, this suggests a discrepancy in how task failures are recognized or handled between the two environments. It's also possible that there's an issue with the version of the Databricks plugin you're using. Checking the failure messages or logs when a task fails on Databricks could provide more insight into why retries aren't being triggered as expected.

For further investigation, it would help to know the specific version of the Databricks plugin in use, as newer versions may include fixes that address this issue. Additionally, reviewing the TaskMetadata class in the Flyte codebase and the Databricks plugin code could offer more insight into how retries are handled and implemented for Databricks tasks.
I did some investigation on the Databricks side and found that if the Databricks cluster start fails for any reason, the Get a single job run API returns this state:
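(The captured page dropped the snippet here; below is a representative shape of the `state` object from the *Get a single job run* response for a cluster launch failure, based on the documented `life_cycle_state` / `result_state` fields — the exact `state_message` varies.)

```json
"state": {
  "life_cycle_state": "INTERNAL_ERROR",
  "result_state": "FAILED",
  "state_message": "Cluster failed to launch: ..."
}
```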
In fact, Databricks' built-in orchestration framework, Databricks Workflows, classifies this kind of failure as retryable. I'm proposing this change in plugins/webapi/databricks/plugin.go:
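To make the proposal concrete, here is a minimal sketch (not the verbatim diff) of the idea, assuming the plugin's status handler maps the run's `life_cycle_state` to a Flyte phase. `phaseForRunState` is a hypothetical helper name for illustration; `core.PhaseInfoRetryableFailure`, `core.PhaseInfoSuccess`, and `core.PhaseInfoRunning` are existing flyteplugins helpers:

```go
package databricks

import (
	"github.com/flyteorg/flyte/flyteplugins/go/tasks/pluginmachinery/core"
)

// phaseForRunState maps a Databricks run state onto a Flyte phase.
// Illustrative only; the real plugin derives these values from the
// runs/get response and attaches a richer TaskInfo.
func phaseForRunState(lifeCycleState, resultState, stateMessage string, taskInfo *core.TaskInfo) core.PhaseInfo {
	switch lifeCycleState {
	case "TERMINATED":
		if resultState == "SUCCESS" {
			return core.PhaseInfoSuccess(taskInfo)
		}
		return core.PhaseInfoRetryableFailure("TaskFailed", stateMessage, taskInfo)
	case "INTERNAL_ERROR":
		// Cluster start failures surface as INTERNAL_ERROR. Reporting them
		// as a retryable failure (instead of a permanent one) lets Flyte's
		// retry policy re-attempt the task.
		return core.PhaseInfoRetryableFailure("ClusterStartFailure", stateMessage, taskInfo)
	default:
		return core.PhaseInfoRunning(core.DefaultPhaseVersion, taskInfo)
	}
}
```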
@pingsutw How does that sound to you?
Describe the bug
I created a Flyte task running on Databricks, with task retries set to 3, like this:
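(The original snippet didn't survive the page capture; this is a minimal sketch of such a task, assuming the `Databricks` task config from `flytekitplugins-spark` — the Spark and cluster settings shown are placeholders.)

```python
from flytekit import task
from flytekitplugins.spark import Databricks  # assumes flytekitplugins-spark is installed


@task(
    task_config=Databricks(
        spark_conf={"spark.driver.memory": "1000M"},  # placeholder settings
        databricks_conf={
            "run_name": "flyte-databricks-task",
            # real new_cluster spec elided here
        },
    ),
    retries=3,  # the retry budget that is not being honored
)
def hello_databricks() -> str:
    return "hello from databricks"
```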
Unfortunately, this task is never retried; we only get a single attempt to run the task.
Just to double-check, I created a Python task running on Kubernetes with a similar config:
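(Also reconstructed: a plain Python task with the same retry budget, as a sketch.)

```python
from flytekit import task


@task(retries=3)
def hello_k8s() -> str:
    # Fail on purpose to exercise retries; on Kubernetes this task is
    # re-attempted as expected.
    raise RuntimeError("forced failure to test retries")
```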
In this case, we get 3 attempts to run the task (this is the behavior we expect for the Databricks task as well).
I assume the plugin implementation needs to define which LifeCycleState / ResultState values are retryable. Is that correct?
Expected behavior
Flyte will attempt to run the Databricks task 3 times.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?