HTCONDOR-1323 job removal debug #616

Merged
50 changes: 49 additions & 1 deletion docs/v23/troubleshooting/common-issues.md
@@ -422,14 +422,28 @@ Notice the failures in the above message: `Remote Mapping: gsi@unmapped` and `Au

### Jobs go on hold

Jobs will be put on held with a `HoldReason` attribute that can be inspected with
Jobs can be put on hold with a `HoldReason` attribute that can be inspected with
[condor\_ce\_q](debugging-tools.md#condor_ce_q):

``` console
user@host $ condor_ce_q -l <JOB-ID> -attr HoldReason
HoldReason = "CE job in status 5 put on hold by SYSTEM_PERIODIC_HOLD due to no matching routes, route job limit, or route failure threshold."
```

The CE (and CE client) will put a job on hold when it encounters a problem
with the job that it doesn't know how to resolve.

If the HTCondor schedd believes that the existing job it has submitted
to a remote queue may be recoverable, then it will leave the remote job
queued and keep the `GridJobId` attribute defined in the local job ad.
If you release the local job (with `condor_ce_release`), then the schedd
will attempt to re-establish contact with the remote scheduler.

If the schedd believes the existing remote job is not recoverable, then it
will remove the job from the remote queue and set `GridJobId` to `Undefined`
in the local job ad. If you release the local job, then a new job instance
will be submitted to the remote scheduler.
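
The two cases above can be told apart by looking at `GridJobId` before releasing the job. The following is a minimal sketch, not part of the HTCondor-CE tooling: `release_outcome` is a hypothetical helper name, and it assumes that the `-af` (autoformat) option prints `undefined` for an attribute that is not set.

```shell
# Hypothetical helper: given the GridJobId value printed by
#   condor_ce_q <JOB_ID> -af GridJobId
# report what releasing the held job is expected to do.
release_outcome() {
    # Lowercase the value so "Undefined" and "undefined" are treated alike.
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$value" in
        "")
            # No output at all: the job ad was probably not found in the queue.
            echo "unknown"
            ;;
        undefined)
            # The remote job was removed; release submits a new job instance.
            echo "resubmit"
            ;;
        *)
            # The remote job is still queued; release tries to reconnect to it.
            echo "reconnect"
            ;;
    esac
}
```

Pass the command output as the single argument, for example `release_outcome "$(condor_ce_q 123.0 -af GridJobId)"`, where `123.0` is a placeholder job ID.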

#### Held jobs: no matching routes, route job limit, or route failure threshold

Jobs on the CE will be put on hold if they are not claimed by the job router within 30 minutes.
@@ -550,6 +564,40 @@ This means that the `condor_job_router_info` (note this is not the CE version),
2. You have installed HTCondor in a non-standard location that is not in your `PATH`.
3. The `condor_job_router_info` tool itself wasn't available until Condor-8.2.3-1.1 (available in osg-upcoming).

### Jobs removed from the local batch system

When the CE removes a job from the local batch system, it may be due to
a problem the CE encountered with managing the job or it may be at the
behest of the submitter to the CE (which may be a remote HTCondor
Access Point).

Given a specific job ID from the CE logs, first find the job ad in the CE
queue with the `condor_ce_q` tool and check the value of the `GridJobId`
attribute:

``` console
user@host $ condor_ce_q <JOB_ID> -af GridJobId
```

If the job is no longer in the queue, you will have to check the history
using the `condor_ce_history` tool:

``` console
user@host $ condor_ce_history <JOB_ID> -af GridJobId
```
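
The queue-then-history lookup can be wrapped in a single function. This is a sketch under stated assumptions: `ce_grid_job_id` is a hypothetical helper name, and it assumes that `condor_ce_q` prints nothing when the job ID is no longer in the queue.

```shell
# Hypothetical helper: print the GridJobId of a CE job, checking the queue
# first and falling back to the history if the queue returns nothing.
ce_grid_job_id() {
    job_id=$1
    result=$(condor_ce_q "$job_id" -af GridJobId 2>/dev/null)
    if [ -z "$result" ]; then
        # Assumption: empty output means no matching job ad in the queue.
        result=$(condor_ce_history "$job_id" -af GridJobId 2>/dev/null)
    fi
    printf '%s\n' "$result"
}
```

Empty output from the function means the job ad was found in neither the queue nor the history.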

If the `GridJobId` is *undefined*, then the CE did the removal due to a
problem interacting with the local batch system.
Check the `HoldReason` and `LastHoldReason` attributes for why the CE
removed the job.

If `GridJobId` is set to a value other than *undefined*, then the
submitter to the CE removed the job.
If the submitter is a remote HTCondor Access Point, its daemons may have
done the removal as part of putting its local job on hold.
In that case, the `HoldReason` attribute in the remote job queue should
indicate the source of the problem.
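
The decision described in this section can be summarized as a small function. It is an illustrative sketch only: `removal_source` is a hypothetical name, and it assumes the `-af` option prints `undefined` for an unset attribute.

```shell
# Hypothetical helper: given the GridJobId value of a removed job, report
# which side most likely initiated the removal.
removal_source() {
    # Lowercase the value so the comparison is case-insensitive.
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$value" in
        "")
            # No output: the job ad was not found, so nothing can be concluded.
            echo "unknown"
            ;;
        undefined)
            # The CE removed the job; check HoldReason and LastHoldReason on the CE.
            echo "ce"
            ;;
        *)
            # The submitter removed the job; check HoldReason in the remote job queue.
            echo "submitter"
            ;;
    esac
}
```

For a job that has left the queue, call it as `removal_source "$(condor_ce_history 123.0 -af GridJobId)"`, where `123.0` is a placeholder job ID.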

Getting Help
------------
