[Optimization][dinky-getaway] After a task fails, it is automatically deployed #3953

jianjun159 · 2024-11-26T08:43:57Z

Search before asking

I have searched the issues and found no similar optimization request.

Description

When I start a task in k8s application mode and an error in the task causes the program to fail to start, the k8s pod is not cleaned up automatically. The next time I start the task, it fails with an exception saying the container already exists. I would like the deployment of the current job to be cleared after a failed start, so that the job can be restarted without running into the container that is left over.

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

Jam804 · 2024-11-26T09:04:22Z

I have two solutions to fix this issue:

 1.  Capture exceptions when starting a service and delete the service for all exceptions except timeouts.
 2.  Launch a background thread that performs a full scan of services every minute and deletes unhealthy services.

I believe both mechanisms should be implemented. The first one should promptly delete services that fail at startup, while the second should delete services that fail during runtime.
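
To make the two mechanisms concrete, here is a minimal sketch in Python. It is not Dinky code: `FakeCluster`, `StartTimeout`, `start_with_cleanup` and `sweep_unhealthy` are hypothetical names, and the in-memory dictionary only stands in for whatever the real cluster API reports.

```python
from dataclasses import dataclass, field


class StartTimeout(Exception):
    """Hypothetical error: the job did not report ready in time."""


@dataclass
class FakeCluster:
    """In-memory stand-in for a cluster API; not a real k8s client."""
    services: dict = field(default_factory=dict)  # name -> "healthy" | "unhealthy"

    def deploy(self, name, error=None):
        # The service is registered before we know whether the job is good.
        self.services[name] = "unhealthy" if error else "healthy"
        if error:
            raise error

    def delete(self, name):
        self.services.pop(name, None)


def start_with_cleanup(cluster, name, error=None):
    """Mechanism 1: delete the service on any start-up error except a timeout."""
    try:
        cluster.deploy(name, error)
        return True
    except StartTimeout:
        raise  # the job may still come up, so the service is left in place
    except Exception:
        cluster.delete(name)
        return False


def sweep_unhealthy(cluster):
    """Mechanism 2: one pass of the periodic scan; returns the deleted names."""
    doomed = [n for n, state in cluster.services.items() if state == "unhealthy"]
    for n in doomed:
        cluster.delete(n)
    return doomed
```

In this sketch a scheduler would call `sweep_unhealthy` once a minute, while `start_with_cleanup` wraps every submission.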

Which approach do you think is more appropriate?

@aiwenmo @Zzm0809 @zackyoungh

gaoyan1998 · 2024-11-27T04:00:31Z

Automatically deleting failed tasks is inappropriate. When something goes wrong with a task, the user has to go to the k8s cluster and check the logs to troubleshoot the error. If the task has already been deleted, that is a very bad experience, because k8s does not keep anything.

Jam804 · 2024-11-27T05:52:00Z

Then I can meet the requirements by deleting the corresponding service before submitting the task, right?

jianjun159 · 2024-11-27T06:27:24Z

I think it is OK not to delete the pod after the task fails, because the user needs to see the log. But if the task is run again after being modified, it will hit the problem that the pod already exists. So I think the check can happen when the task starts: see whether the pod already exists, and remove it if its state is unhealthy.
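
A minimal sketch of that check-on-start idea, in Python rather than Dinky's own code. The `submit` function, the `deployments` dictionary and the `deploy` callable are all hypothetical stand-ins for querying and changing the cluster. Refusing to replace a healthy deployment is an assumption of the sketch, not something stated in this thread.

```python
def submit(deployments, name, deploy):
    """Submit `name`, first removing a leftover deployment if it is unhealthy.

    `deployments` maps a job name to "healthy" or "unhealthy" (a stand-in for
    asking the cluster); `deploy` is a callable that starts the job and
    returns its resulting state.
    """
    state = deployments.get(name)
    if state == "healthy":
        # Assumption: a running job is never replaced silently.
        raise RuntimeError(f"{name} is already running")
    if state == "unhealthy":
        # Leftover from a failed run; its logs stayed available until now.
        del deployments[name]
    deployments[name] = deploy(name)
    return deployments[name]
```

The failed pod stays in place for troubleshooting and is only removed at the moment the user resubmits.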

gaoyan1998 · 2024-11-27T06:49:50Z

@Jam804 Yes, that's a good idea.

gaoyan1998 · 2024-11-27T06:53:56Z

Are you interested in doing this work? I can assign this issue to you, @Jam804.

jianjun159 added the Optimization and Waiting for reply labels Nov 26, 2024

Jam804 mentioned this issue Nov 27, 2024

[Optimization][dinky-getaway] Delete the previously failed cluster when resubmitting the task. #3969

Merged

gaoyan1998 assigned Jam804 Nov 27, 2024

gaoyan1998 closed this as completed Nov 27, 2024
