[Optimization][dinky-getaway] After a task fails, it is automatically deployed #3953

Closed
3 tasks done
jianjun159 opened this issue Nov 26, 2024 · 6 comments
Labels
Optimization, Waiting for reply

Comments

@jianjun159
Contributor

Search before asking

  • I have searched the issues and found no similar optimization requirement.

Description

When I start a task in k8s application mode and an error in the task prevents the program from starting, the k8s pod is not cleaned up automatically. As a result, the next start fails with an exception saying that the resource already exists. I would like the deployment of the current job to be cleared after an abnormal start, so that the job can be restarted without running into leftover resources.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

jianjun159 added the Optimization and Waiting for reply labels Nov 26, 2024
@Jam804
Contributor

Jam804 commented Nov 26, 2024

I have two solutions to fix this issue:

 1.  Capture exceptions when starting a service and delete the service for all exceptions except timeouts.
 2.  Launch a background thread that performs a full scan of services every minute and deletes unhealthy services.

I believe both mechanisms should be implemented. The first one should promptly delete services that fail at startup, while the second should delete services that fail during runtime.
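The two mechanisms can be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeCluster`, `StartTimeout`, `start_with_cleanup`, `scan_once`, and `run_scanner` are hypothetical names, and the cluster is an in-memory dictionary rather than a real k8s client or any Dinky API.

```python
from dataclasses import dataclass, field


class StartTimeout(Exception):
    """Hypothetical error: startup timed out, the service may still come up."""


@dataclass
class FakeCluster:
    """In-memory stand-in for a k8s cluster: service name -> healthy flag."""
    services: dict = field(default_factory=dict)

    def delete(self, name):
        self.services.pop(name, None)


def start_with_cleanup(cluster, name, start_fn):
    """Mechanism 1: delete the service on any startup error except a timeout."""
    cluster.services[name] = False  # resources exist as soon as we submit
    try:
        start_fn()
        cluster.services[name] = True
    except StartTimeout:
        raise  # keep the service, it may still be starting
    except Exception:
        cluster.delete(name)  # failed at startup: clean up right away
        raise


def scan_once(cluster):
    """Mechanism 2: one pass of the full scan; returns the deleted names."""
    unhealthy = [name for name, ok in cluster.services.items() if not ok]
    for name in unhealthy:
        cluster.delete(name)
    return unhealthy


def run_scanner(cluster, stop_event, interval_seconds=60.0):
    """Run scan_once every interval until stop_event (threading.Event) is set."""
    while not stop_event.wait(interval_seconds):
        scan_once(cluster)
```

In this sketch `run_scanner` would be started on a daemon thread with a `threading.Event` as `stop_event`; the one-minute interval comes from the proposal above.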

Which approach do you think is more appropriate?

@aiwenmo @Zzm0809 @zackyoungh

@gaoyan1998
Contributor

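I have two solutions to fix this issue:

 1.  Capture exceptions when starting a service and delete the service for all exceptions except timeouts.
 2.  Launch a background thread that performs a full scan of services every minute and deletes unhealthy services.

I believe both mechanisms should be implemented. The first one should promptly delete services that fail at startup, while the second should delete services that fail during runtime.

Which approach do you think is more appropriate?

@aiwenmo @Zzm0809 @zackyoungh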

Automatically deleting failed tasks is inappropriate. When something goes wrong with a task, the user has to go to the k8s cluster and check the logs to troubleshoot the error. If the task has already been deleted, that results in a very bad experience, because k8s doesn't keep anything.

@Jam804
Contributor

Jam804 commented Nov 27, 2024

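I have two solutions to fix this issue:

 1.  Capture exceptions when starting a service and delete the service for all exceptions except timeouts.
 2.  Launch a background thread that performs a full scan of services every minute and deletes unhealthy services.

I believe both mechanisms should be implemented. The first one should promptly delete services that fail at startup, while the second should delete services that fail during runtime.

Which approach do you think is more appropriate?

@aiwenmo @Zzm0809 @zackyoungh

Automatically deleting failed tasks is inappropriate. When something goes wrong with a task, the user has to go to the k8s cluster and check the logs to troubleshoot the error. If the task has already been deleted, that results in a very bad experience, because k8s doesn't keep anything.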

Then I can meet the requirements by deleting the corresponding service before submitting the task, right?

@jianjun159
Contributor Author

I have two solutions to fix this issue:

 1.  Capture exceptions when starting a service and delete the service for all exceptions except timeouts.
 2.  Launch a background thread that performs a full scan of services every minute and deletes unhealthy services.

I believe both mechanisms should be implemented. The first one should promptly delete services that fail at startup, while the second should delete services that fail during runtime.

Which approach do you think is more appropriate?

@aiwenmo @Zzm0809 @zackyoungh

Automatically deleting failed tasks is inappropriate. When something goes wrong with a task, the user has to go to the k8s cluster and check the logs to troubleshoot the error. If the task has already been deleted, that results in a very bad experience, because k8s doesn't keep anything.

I think it is fine not to delete the pod after the task fails, because the user needs to see the logs. However, if you modify the task and run it again, it fails because the pod already exists. So I suggest checking whether the pod already exists when the task starts, and removing it if its state is unhealthy.
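A minimal sketch of that pre-start check, assuming an in-memory dictionary in place of the cluster. `PodState`, `PodAlreadyRunning`, and `submit_task` are hypothetical names used only for illustration, not Dinky or k8s APIs, and refusing to replace a healthy pod is an assumption of this sketch rather than something decided in this thread.

```python
from enum import Enum


class PodState(Enum):
    RUNNING = "running"
    FAILED = "failed"


class PodAlreadyRunning(Exception):
    """Hypothetical error: a healthy pod with this name is already running."""


def submit_task(pods, name, launch):
    """Check for a leftover pod before starting; remove it only if unhealthy.

    `pods` is a plain dict (name -> PodState) standing in for the cluster,
    and `launch` is a callable that starts the task and returns its state.
    """
    state = pods.get(name)
    if state is PodState.RUNNING:
        # Assumption: never replace a healthy pod implicitly.
        raise PodAlreadyRunning(name)
    if state is not None:
        # Leftover from a failed run. It stayed in place so the logs could be
        # read, and is removed only now, right before the resubmission.
        del pods[name]
    pods[name] = launch()
    return pods[name]
```

With this ordering a failed pod stays available for troubleshooting until the user resubmits, which addresses both the log concern and the "already exists" error.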

@gaoyan1998
Contributor

Then I can meet the requirements by deleting the corresponding service before submitting the task, right?

@Jam804 Yes, that's a good idea.

@gaoyan1998
Contributor

Then I can meet the requirements by deleting the corresponding service before submitting the task, right?

Are you interested in doing this work? I can assign this issue to you, @Jam804.

3 participants