Restartjob Action Not Working with PodFailed Event When Job Phase Entering Finished Job State #3967

Open
yuyue9284 opened this issue Jan 13, 2025 · 3 comments
Labels
kind/question Categorizes issue related to a new question

Comments

@yuyue9284

Please describe your problem in detail

I have a vcjob with one worker pod and a policy like the following. If the pod fails, the vcjob transitions into the Failed state without restarting the job or the pod.

      policies:
        - events:
          - PodFailed
          - PodEvicted
          action: RestartJob

According to the code, the finished state does not execute the action defined in the policies. I just want to confirm whether this is the intended, by-design behaviour:

func (ps *finishedState) Execute(action v1alpha1.Action) error {
	// In finished state, e.g. Completed, always kill the whole job.
	return KillJob(ps.job, PodRetainPhaseSoft, nil)
}

Any other relevant information

No response

yuyue9284 added the kind/question label Jan 13, 2025

@Monokaix
Member

We are working on that #3813

@yuyue9284
Author

Hi @Monokaix, in the current implementation with the above setup, when the pod state changes from Running to Failed, the PodFailed event should trigger the RestartJob action on the job. However, if the Volcano controller restarts and the default out-of-sync request is handled first, the job's state becomes finished before the request generated by the PodFailed event is processed, so the restart action is never executed. Will this be covered by #3813?

Happy path:
job running -> pod failed event -> request with restart -> running state with restart job action executed.

Corner case:
job running -> pod failed event -> out-of-sync request is handled first -> job status changed to failed -> request with restart (generated by pod failed) -> finished state won't execute the action.
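
To make the two orderings concrete, here is a minimal sketch of the dispatch behaviour as I understand it. The types and names (`Job`, `handle`, `simulate`, the `Restarting` phase) are hypothetical simplifications for illustration, not Volcano's actual API, and the rule "a sync with a failed pod marks the job Failed" is an assumption based on the behaviour described above.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the controller's types.
// These names are illustrative only and are not Volcano's real API.
type Action string
type Phase string

const (
	RestartJob Action = "RestartJob"
	SyncJob    Action = "SyncJob"

	Running    Phase = "Running"
	Restarting Phase = "Restarting"
	Failed     Phase = "Failed"
)

type Job struct {
	Phase      Phase
	FailedPods int
	Restarts   int
}

// handle applies one request to the job, dispatching on the current phase.
func handle(job *Job, action Action) {
	switch job.Phase {
	case Running:
		if action == RestartJob {
			// The running state honours the policy action.
			job.Restarts++
			job.FailedPods = 0
			job.Phase = Restarting
			return
		}
		// Assumption: a plain sync that sees a failed pod marks the job Failed.
		if job.FailedPods > 0 {
			job.Phase = Failed
		}
	case Failed:
		// Finished state: the requested action is ignored, mirroring
		// finishedState.Execute, which always kills the job.
	}
}

// simulate starts from a running job whose only pod has failed and
// applies the requests in the order given.
func simulate(actions ...Action) Job {
	job := Job{Phase: Running, FailedPods: 1}
	for _, a := range actions {
		handle(&job, a)
	}
	return job
}

func main() {
	happy := simulate(RestartJob)
	corner := simulate(SyncJob, RestartJob)
	fmt.Println("happy path: ", happy.Phase, "restarts:", happy.Restarts)
	fmt.Println("corner case:", corner.Phase, "restarts:", corner.Restarts)
}
```

In this sketch the only difference between the two runs is whether the sync request is processed before the restart request, which is the ordering dependence I am asking about.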

@Monokaix
Member

We are working on that #3813

You can configure it like this

spec:
  policies:
  - event: PodFailed
    action: RestartPod
  - event: PodEvicted
    action: RestartJob
    timeout: 10m

once that is merged.
