From 1fb086fd521fc4ed6fc80df77c9017147b06e532 Mon Sep 17 00:00:00 2001
From: Jaime Frey
Date: Thu, 17 Oct 2024 12:53:43 -0500
Subject: [PATCH 1/2] HTCONDOR-1323 Add batch job removal debugging help

---
 docs/v23/troubleshooting/common-issues.md | 34 +++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/docs/v23/troubleshooting/common-issues.md b/docs/v23/troubleshooting/common-issues.md
index d818cb3d..1cd4be2f 100644
--- a/docs/v23/troubleshooting/common-issues.md
+++ b/docs/v23/troubleshooting/common-issues.md
@@ -550,6 +550,40 @@ This means that the `condor_job_router_info` (note this is not the CE version),
 2. You have installed HTCondor in a non-standard location that is not in your `PATH`.
 3. The `condor_job_router_info` tool itself wasn't available until Condor-8.2.3-1.1 (available in osg-upcoming).
 
+### Jobs removed from the local batch system
+
+When the CE removes a job from the local batch system, it may be due to
+a problem the CE encountered with managing the job or it may be at the
+behest of the submitter to the CE (which may be a remote HTCondor
+Access Point).
+
+Given a specific job ID in the CE logs, first find the job ad in the CE
+queue with the `condor_ce_q` tool and check the value of the `GridJobId`
+attribute:
+
+``` console
+user@host $ condor_ce_q -af GridJobId
+```
+
+If the job is no longer in the queue, you will have to check the history
+using the `condor_ce_history` tool:
+
+``` console
+user@host $ condor_ce_history -af GridJobId
+```
+
+If the `GridJobId` is *undefined*, then the CE did the removal due to a
+problem interacting with the local batch system.
+Check the `HoldReason` and `LastHoldReason` attributes for why the CE
+removed the job.
+
+If `GridJobId` is not *undefined*, and is set to some value, then the
+submitter to the CE removed the job.
+If the submitter is a remote HTCondor Access Point, its daemons may have
+done the removal as part of putting its local job on hold.
+In that case, the `HoldReason` attribute in the remote job queue should
+indicate the source of the problem.
+
 Getting Help
 ------------
 
From 83518f5c672cfa87e44059eab898094997e85fbc Mon Sep 17 00:00:00 2001
From: Jaime Frey
Date: Thu, 17 Oct 2024 14:12:25 -0500
Subject: [PATCH 2/2] HTCONDOR-1323 Additional docs about held jobs

---
 docs/v23/troubleshooting/common-issues.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/docs/v23/troubleshooting/common-issues.md b/docs/v23/troubleshooting/common-issues.md
index 1cd4be2f..6cf2102d 100644
--- a/docs/v23/troubleshooting/common-issues.md
+++ b/docs/v23/troubleshooting/common-issues.md
@@ -422,7 +422,7 @@ Notice the failures in the above message: `Remote Mapping: gsi@unmapped` and `Au
 
 ### Jobs go on hold
 
-Jobs will be put on held with a `HoldReason` attribute that can be inspected with
+Jobs can be put on hold with a `HoldReason` attribute that can be inspected with
 [condor\_ce\_q](debugging-tools.md#condor_ce_q):
 
 ``` console
@@ -430,6 +430,20 @@ user@host $ condor_ce_q -l -attr HoldReason
 HoldReason = "CE job in status 5 put on hold by SYSTEM_PERIODIC_HOLD due to no matching routes, route job limit, or route failure threshold."
 ```
 
+The CE (and CE client) will put a job on hold when it encounters a problem
+with the job that it doesn't know how to resolve.
+
+If the HTCondor schedd believes that the existing job it has submitted
+to a remote queue may be recoverable, then it will leave the remote job
+queued and keep the `GridJobId` attribute defined in the local job ad.
+If you release the local job (with `condor_ce_release`), then the schedd
+will attempt to re-establish contact with the remote scheduler.
+
+If the schedd believes the existing remote job is not recoverable, then it
+will remove the job from the remote queue and set `GridJobId` to `Undefined`
+in the local job ad. If you release the local job, then a new job instance
+will be submitted to the remote scheduler.
+
 #### Held jobs: no matching routes, route job limit, or route failure threshold
 
 Jobs on the CE will be put on hold if they are not claimed by the job router within 30 minutes.
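
The `GridJobId` rule that this series documents can be sketched as a small Python helper. This is only an illustration and is not part of the patch: the names `removal_source` and `NEXT_CHECK` are hypothetical, and the sketch assumes that the `-af` output renders an unset attribute as the literal text `undefined`.

```python
from typing import Optional

# Assumption: an unset ClassAd attribute is printed as "undefined"
# (or as nothing at all) by the -af query shown in the documentation.
_UNSET = {"", "undefined"}

# What to inspect next, taken from the documentation text in the patch.
NEXT_CHECK = {
    "ce": "HoldReason and LastHoldReason in the CE job ad",
    "submitter": "HoldReason in the submitter's (remote) job queue",
}


def removal_source(grid_job_id: Optional[str]) -> str:
    """Return "ce" or "submitter" for a GridJobId value read from the CE job ad.

    Hypothetical helper: it only encodes the documented rule and does not
    talk to HTCondor itself.
    """
    value = (grid_job_id or "").strip().lower()
    if value in _UNSET:
        # GridJobId undefined: the CE removed the job because of a problem
        # interacting with the local batch system.
        return "ce"
    # GridJobId still set: the submitter to the CE removed the job.
    return "submitter"


if __name__ == "__main__":
    # "placeholder-id" stands in for a real GridJobId value.
    for raw in ("undefined", "placeholder-id"):
        source = removal_source(raw)
        print(f"{raw!r}: removed by {source}; check {NEXT_CHECK[source]}")
```

Keeping the rule in a pure function makes it easy to feed in output captured from `condor_ce_q` or `condor_ce_history` without tying the sketch to a particular way of invoking those tools.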