Restartjob Action Not Working with PodFailed Event When Job Phase Entering Finished Job State #3967

yuyue9284 · 2025-01-13T23:51:01Z

Please describe your problem in detail

A vcjob with one worker pod and a policy like the following transitions to the Failed state when the pod fails, without restarting the job or the pod.

      policies:
        - events:
          - PodFailed
          - PodEvicted
          action: RestartJob

According to the code, the finished state does not execute the action defined in the policies. I just want to confirm whether this is the intended behaviour:

volcano/pkg/controllers/job/state/finished.go

Lines 28 to 31 in 68fba2c

    
    func (ps *finishedState) Execute(action v1alpha1.Action) error {
    	// In finished state, e.g. Completed, always kill the whole job.
    	return KillJob(ps.job, PodRetainPhaseSoft, nil)
    }
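
To make the behaviour in question concrete, here is a minimal, self-contained sketch of a state dispatch in which the finished state discards the requested action. The types and names (`Action`, `State`, `runningState`, and the string return values) are hypothetical simplifications, not Volcano's actual API. Only the idea that `finishedState.Execute` ignores its `action` argument is taken from the snippet above.

```go
package main

import "fmt"

// Action is a simplified stand-in for v1alpha1.Action (assumption).
type Action string

const (
	RestartJob Action = "RestartJob"
	SyncJob    Action = "SyncJob"
)

// State mirrors the shape of the Execute method quoted above. For
// illustration it returns a description of what it did, not an error.
type State interface {
	Execute(action Action) string
}

// runningState is a hypothetical state that honours the requested action.
type runningState struct{}

func (runningState) Execute(action Action) string {
	switch action {
	case RestartJob:
		return "restart job"
	default:
		return "sync job"
	}
}

// finishedState ignores the action entirely, as in the quoted code.
type finishedState struct{}

func (finishedState) Execute(action Action) string {
	return "kill job"
}

func main() {
	states := []State{runningState{}, finishedState{}}
	for _, s := range states {
		// The same RestartJob request has a different effect per state.
		fmt.Printf("%T: %s\n", s, s.Execute(RestartJob))
	}
}
```

In this model, a `RestartJob` request only has an effect if it reaches the job while it is still in a state that dispatches on the action.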

Any other relevant information

No response

Monokaix · 2025-01-14T01:24:45Z

We are working on that #3813

yuyue9284 · 2025-01-14T22:24:33Z

Hi @Monokaix, in the current implementation with the above setup, when the pod state changes from running to failed, the failed event should trigger a restart action on the job. However, if the Volcano controller restarts and the default out-of-sync request is handled first, the job's state becomes finished before the request generated by the failed event is processed, which prevents the restart action from being executed. Will this be covered by #3813?

Happy path:
job running -> pod failed event -> request with restart -> running state with restart job action executed.

Corner case:
job running -> pod failed event -> out-of-sync request somehow handled first -> job status changed to failed -> request with restart (generated by the pod failed event) -> finished state won't execute the action.
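
The two orderings above can be sketched as a small simulation. This is a toy model under stated assumptions, not the controller's real logic: `Phase`, `Request`, and `process` are hypothetical names, and the rule that a sync request moves a job whose only pod has failed into the failed phase is taken from the description of the corner case rather than from the Volcano source.

```go
package main

import "fmt"

// Phase and Request are hypothetical simplifications for illustration.
type Phase string

const (
	Running    Phase = "Running"
	Restarting Phase = "Restarting"
	Failed     Phase = "Failed"
)

type Request struct {
	Action string // "RestartJob" or "SyncJob"
}

// process applies requests in order to a running job whose single pod has
// already failed. It returns the final phase and whether a restart ran.
func process(reqs []Request) (Phase, bool) {
	phase := Running
	restarted := false
	for _, r := range reqs {
		switch phase {
		case Failed:
			// Finished state: the requested action is ignored.
		case Running:
			if r.Action == "RestartJob" {
				restarted = true
				phase = Restarting
			} else {
				// Assumed: a sync with the only pod failed marks the job Failed.
				phase = Failed
			}
		}
	}
	return phase, restarted
}

func main() {
	happy := []Request{{Action: "RestartJob"}}
	corner := []Request{{Action: "SyncJob"}, {Action: "RestartJob"}}

	phase, restarted := process(happy)
	fmt.Println("happy path: ", phase, restarted)

	phase, restarted = process(corner)
	fmt.Println("corner case:", phase, restarted)
}
```

In this model the only difference between the two runs is the order in which the requests are processed: once the sync request has moved the job to a finished phase, the later restart request is dropped.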

Monokaix · 2025-01-15T01:26:41Z

We are working on that #3813

Once that is merged, you can configure it like this:

    spec:
      policies:
      - event: PodFailed
        action: RestartPod
      - event: PodEvicted
        action: RestartJob
        timeout: 10m

yuyue9284 added the kind/question label · Jan 13, 2025
