Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.0] Proper handling of waiting jobs when set to be killed #7690

Merged
merged 4 commits into from
Aug 13, 2024

Conversation

michmx
Copy link

@michmx michmx commented Jun 24, 2024

When jobs are in status SUBMITTING, WAITING, etc, and they are set to be killed, they are not added to the list markKilledJobList. This pull request fix the identation when kill instead of delete is used.

BEGINRELEASENOTES

*WMS
FIX: Proper killing of jobs when not matched, running or stalled

ENDRELEASENOTES

@DIRACGridBot DIRACGridBot added the alsoTargeting:integration Cherry pick this PR to integration after merge label Jun 24, 2024
@michmx michmx changed the title fix(wms): proper handling of markKilledJobList when killing jobs with… [8.0] Proper handling of waiting jobs when set to be killed Jun 24, 2024
@iueda
Copy link
Contributor

iueda commented Jun 24, 2024

I think the current indentation is correct, for eg. SUBMITTING is not allowed to go to KILLED:

            SUBMITTING: State(0, [RECEIVED, CHECKING, DELETED], defState=SUBMITTING),  # initial state

WAITING is allowed to go to KILLED since #7276, so it should be moved from this list of job statuses which are not allowed to go to KILLED to that which can go to KILLED.

Or, can't the possible states transitions (https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Client/JobStatus.py) be checked instead of these hard-coded list?

@fstagni
Copy link
Contributor

fstagni commented Jun 25, 2024

I think @iueda is correct. For checking the state transitions,

def filterJobStateTransition(jobIDs, candidateState, jobMonitoringClient=None):
(`‎filterJobStateTransition) can be used.

@michmx
Copy link
Author

michmx commented Jun 25, 2024

@fstagni we have a fundamental question before moving on with this PR or a more elaborated implementation. When you have a job in Waiting status, is it meaningful to use sendKillCommand=True in killJob()?

I ask because with the current indentation, only statuses not hard-coded are appended in the markKilledJobList, which use sendKillCommand=False (this was the behavior on DIRAC v7r2 for waiting jobs).

Moving the Waiting status to line 520 would add them to killJobList, using sendKillCommand=True.

@fstagni
Copy link
Contributor

fstagni commented Jun 26, 2024

sendKillCommand=True is effectively only meaningful when the job is already running. For any other state it does not make sense.

@iueda
Copy link
Contributor

iueda commented Jul 4, 2024

According to WorkloadManagementSystem/Client/JobStatus.py,

the states that can go to Killed are

  • COMPLETING
  • STALLED
  • RUNNING
  • MATCHED
  • WAITING
  • STAGING
  • SCOUTING

The current code does not kill jobs in 'WAITING' state

  • RUNNING, MATCHED, STALLED => killJobList.append(jobID)
  • WAITING => if not right == RIGHT_KILL; deleteJobList.append(jobID)
  • COMPLETING, STAGING, SCOUTING => markKilledJobList.append(jobID)

Do we want
a) to treat them all the same, eg. killJobList.append(jobID),
for sendKillCommand=True or False doesn't matter for jobs not running,
or
b) to do as follows?

  • RUNNING, MATCHED, STALLED, WAITING => killJobList.append(jobID)
  • COMPLETING, STAGING, SCOUTING => markKilledJobList.append(jobID)

If the latter (b), then we would need to keep the hard-coding.

@fstagni
Copy link
Contributor

fstagni commented Jul 4, 2024

I would go for option "a".

@michmx
Copy link
Author

michmx commented Aug 6, 2024

I would go for option "a".

Sorry for the delay @fstagni @iueda . I am back on this issue. Now the new proposal uses filterJobStateTransition() to check if the job can go to killed, plus deleted if requested.

Also, all jobs to be killed (after the filtering) go to killJobList, it means, no usage of sendKillCommand=False

@fstagni fstagni force-pushed the fix-wms-kill-jobs branch from 6e3e831 to e9ec307 Compare August 8, 2024 14:38
@fstagni
Copy link
Contributor

fstagni commented Aug 8, 2024

I took the freedom to push for adding a unit test.

@fstagni fstagni force-pushed the fix-wms-kill-jobs branch from e9ec307 to 4cf8434 Compare August 10, 2024 08:28
@michmx
Copy link
Author

michmx commented Aug 12, 2024

I took the freedom to push for adding a unit test.

Thanks a lot!

@fstagni fstagni merged commit 8715e1e into DIRACGrid:rel-v8r0 Aug 13, 2024
26 checks passed
@DIRACGridBot DIRACGridBot added sweep:done All sweeping actions have been done for this PR sweep:failed Sweeping failed and needs manual intervention labels Aug 13, 2024
@DIRACGridBot
Copy link

Sweep summary

Sweep ran in https://github.com/DIRACGrid/DIRAC/actions/runs/10367347752

Failed:

  • integration
    cherry-pick 8715e1e into integration failed
    check merge conflicts on a local copy of this repository
    git fetch upstream
    git checkout upstream/integration -b cherry-pick-2-8715e1e0e-integration
    git cherry-pick -x -m 1 8715e1e0e
    # Fix the conflicts
    git cherry-pick --continue
    git commit --amend -m 'sweep: #7690 Proper handling of waiting jobs when set to be killed' --author=''
    git push -u origin cherry-pick-2-8715e1e0e-integration
    
    # If you have the GitHub CLI installed the PR can be made with
    gh pr create \
         --label 'sweep:from rel-v8r0' \
         --base integration \
         --repo DIRACGrid/DIRAC \
         --title '[sweep:integration] Proper handling of waiting jobs when set to be killed' \
         --body 'Sweep #7690 `Proper handling of waiting jobs when set to be killed` to `integration`.
    
    Adding original author @michmx as watcher.
    
    BEGINRELEASENOTES
    
    *WMS
    FIX: Proper killing of jobs when not matched, running or stalled
    
    ENDRELEASENOTES
    Closes #7749'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alsoTargeting:integration Cherry pick this PR to integration after merge sweep:done All sweeping actions have been done for this PR sweep:failed Sweeping failed and needs manual intervention
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants