FAQ frequently asked questions

This wiki page collects the most common questions and answers related to Workload Management operations.

As a reminder, these are the usual monitoring tools:

WMAgent monitoring: https://monit-grafana.cern.ch/d/lhVKAhNik/cms-wmagent-monitoring?from=now-2d&orgId=11&refresh=5m&to=now
WMCore Workflow monitoring: https://cmsweb.cern.ch/wmstats/index.html
CMS Job monitoring: MonIT_JobMonitoring
Job/Condor pool monitoring: https://cms-gwmsmon.cern.ch/
Production Condor pool summary: http://cms-htcondor-monitor.t2.ucsd.edu/letts/production.html

Why are there so many workflows stuck in `acquired` state?

There is no single clear answer to this question, but the monitoring information is usually enough to reach a conclusion.

From the monitoring links above, open the Production Condor pool summary, go to the Site Table and check the last row of the IdleCpus column. At the time of writing the value was 3723, meaning 3723 CPUs were free in the system. The likely reason they were not used is that some workflows are not properly dimensioned, sometimes taking more memory than the usual 2.5GB/core.
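The effect of an over-dimensioned memory request can be sketched with a little arithmetic. The snippet below is only an illustration: it assumes that memory is provisioned in proportion to cores at the usual 2.5GB/core, and the function name and sample requests are hypothetical, not taken from WMCore or HTCondor.

```python
import math

# The "usual 2.5GB/core" from the text, expressed in MB.
# Assumption: slots offer memory in proportion to their cores at this ratio.
MEMORY_PER_CORE_MB = 2.5 * 1024


def cores_blocked(request_memory_mb, request_cpus, per_core_mb=MEMORY_PER_CORE_MB):
    """Estimate how many cores a job ties up once its memory request is honoured.

    A job needs at least request_cpus cores, but it also needs enough
    cores' worth of memory to cover request_memory_mb.
    """
    if request_cpus <= 0 or request_memory_mb <= 0:
        raise ValueError("requests must be positive")
    cores_for_memory = math.ceil(request_memory_mb / per_core_mb)
    return max(request_cpus, cores_for_memory)


# Hypothetical requests: (memory in MB, cores)
for memory_mb, cpus in [(2000, 1), (8000, 1), (16000, 8)]:
    blocked = cores_blocked(memory_mb, cpus)
    print(f"{cpus} core(s), {memory_mb} MB -> blocks {blocked} core(s), "
          f"{blocked - cpus} left idle")
```

Under these assumptions, a single-core job asking for 8000 MB occupies the memory of 4 cores while using only one, which is one way idle CPUs can pile up in the pool.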

The WMAgent monitoring also has some interesting plots in this respect, especially "GQ elements by priority". At the time of writing, that plot showed thousands of GQEs in Available above the 80k priority, which would also explain why 80k workflows are not going through.
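The same check can be sketched in a few lines of code when a list of queue elements is at hand. This is a minimal illustration only: the dictionary keys (`status`, `priority`) and the sample data are hypothetical and do not claim to match the actual WMCore element schema or any monitoring API.

```python
from collections import Counter


def available_above(elements, priority):
    """Count Available elements per priority, keeping only priorities above the given one."""
    counts = Counter(
        e["priority"]
        for e in elements
        if e["status"] == "Available" and e["priority"] > priority
    )
    # Highest priority first, since those elements are served before the rest.
    return dict(sorted(counts.items(), reverse=True))


# Hypothetical sample of queue elements
sample = [
    {"status": "Available", "priority": 110000},
    {"status": "Available", "priority": 110000},
    {"status": "Available", "priority": 85000},
    {"status": "Running", "priority": 110000},
    {"status": "Available", "priority": 80000},
]

ahead = available_above(sample, 80000)
print(ahead)
print("elements ahead of 80k workflows:", sum(ahead.values()))
```

A large total here suggests that 80k workflows are simply waiting behind higher-priority work rather than being stuck for another reason.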
