From 83518f5c672cfa87e44059eab898094997e85fbc Mon Sep 17 00:00:00 2001
From: Jaime Frey
Date: Thu, 17 Oct 2024 14:12:25 -0500
Subject: [PATCH] HTCONDOR-1323 Additional docs about held jobs

---
 docs/v23/troubleshooting/common-issues.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/docs/v23/troubleshooting/common-issues.md b/docs/v23/troubleshooting/common-issues.md
index 1cd4be2f..6cf2102d 100644
--- a/docs/v23/troubleshooting/common-issues.md
+++ b/docs/v23/troubleshooting/common-issues.md
@@ -422,7 +422,7 @@ Notice the failures in the above message: `Remote Mapping: gsi@unmapped` and `Au
 
 ### Jobs go on hold
 
-Jobs will be put on held with a `HoldReason` attribute that can be inspected with
+Jobs can be put on hold with a `HoldReason` attribute that can be inspected with
 [condor\_ce\_q](debugging-tools.md#condor_ce_q):
 
 ``` console
@@ -430,6 +430,20 @@ user@host $ condor_ce_q -l -attr HoldReason
 HoldReason = "CE job in status 5 put on hold by SYSTEM_PERIODIC_HOLD due to no matching routes, route job limit, or route failure threshold."
 ```
 
+The CE (and CE client) will put a job on hold when it encounters a problem
+with the job that it doesn't know how to resolve.
+
+If the HTCondor schedd believes that the existing job it has submitted
+to a remote queue may be recoverable, then it will leave the remote job
+queued and keep the `GridJobId` attribute defined in the local job ad.
+If you release the local job (with `condor_ce_release`), then the schedd
+will attempt to re-establish contact with the remote scheduler.
+
+If the schedd believes the existing remote job is not recoverable, then it
+will remove the job from the remote queue and set `GridJobId` to `Undefined`
+in the local job ad. If you release the local job, then a new job instance
+will be submitted to the remote scheduler.
+
 #### Held jobs: no matching routes, route job limit, or route failure threshold
 
 Jobs on the CE will be put on hold if they are not claimed by the job router within 30 minutes.
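
The release behavior described in the added paragraphs reduces to a single check on `GridJobId`. The sketch below is illustrative only: `release_outcome` is a hypothetical helper, not part of HTCondor-CE or the HTCondor Python bindings. The job ad is modeled as a plain dictionary, a missing key or `None` stands in for a ClassAd `Undefined` value, and the `GridJobId` string is a made-up placeholder rather than a real identifier format.

```python
# Hypothetical helper: models the rule from the patch text, not an HTCondor API.
# Assumption: a job ad is represented as a dict, and a missing key or None
# stands in for a ClassAd attribute whose value is Undefined.

def release_outcome(job_ad: dict) -> str:
    """Predict what releasing a held local job would do.

    "reconnect": GridJobId is still defined, so the schedd would try to
                 re-establish contact with the existing remote job.
    "resubmit":  GridJobId is Undefined, so a new job instance would be
                 submitted to the remote scheduler.
    """
    if job_ad.get("GridJobId") is None:
        return "resubmit"
    return "reconnect"


# Placeholder ads; JobStatus 5 is the held state quoted in the HoldReason example.
recoverable_ad = {"JobStatus": 5, "GridJobId": "placeholder-remote-job-id"}
unrecoverable_ad = {"JobStatus": 5}  # GridJobId was set to Undefined

for ad in (recoverable_ad, unrecoverable_ad):
    print(release_outcome(ad))
```

In practice the same check can be done by eye: inspecting `GridJobId` alongside `HoldReason` in the job ad before running `condor_ce_release` indicates which of the two paths the schedd is likely to take.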