WMCore debugging tools
This wiki is meant to list debugging use cases, either to solve/debug Operations issues or internal Dev ones.
Problem: Ops asks us to check why the workflow hasn't processed 100% of the lumi sections, even though all the failures have been recovered via ACDCs.
Solution: first we need to make sure that ACDCs have been created AND executed for every single task path (fileset_name, in terms of ACDC collection).
Details: what we need to retrieve/check is:
- did the ACDCs get created after the initial/original workflow moved to completed status?
- list the number of jobs/lumis in each fileset_name, from the ACDC collection
- query reqmgr2 for ACDC workflows recovering that workflow (and fetch their InitialTaskPath)
- make sure that those ACDC workflows are in completed status
- anything else
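The cross-check between the ACDC collection and the ACDC workflows can be sketched as below. This is only an illustration: the input shapes (a dict of fileset_name to number of lumis, and a list of request dicts carrying InitialTaskPath and RequestStatus) are assumptions, and fetching them from the ACDC collection and from reqmgr2 is left out. The function name uncovered_task_paths is hypothetical.

```python
from collections import defaultdict


def uncovered_task_paths(acdc_filesets, acdc_workflows, done_statuses=("completed",)):
    """Report task paths that still have lumis to recover but no finished ACDC.

    acdc_filesets:  {fileset_name: lumis_to_recover}  (assumed shape)
    acdc_workflows: list of dicts with "InitialTaskPath" and "RequestStatus"
    """
    # Group the status of every ACDC workflow by the task path it recovers
    statuses = defaultdict(list)
    for wf in acdc_workflows:
        statuses[wf["InitialTaskPath"]].append(wf["RequestStatus"])

    report = {}
    for fileset_name, lumis in acdc_filesets.items():
        if lumis == 0:
            continue  # nothing to recover for this task path
        found = statuses.get(fileset_name, [])
        if not found:
            report[fileset_name] = "no ACDC workflow created"
        elif not any(status in done_statuses for status in found):
            report[fileset_name] = "ACDC not completed: " + ", ".join(sorted(found))
    return report
```

An empty report means every task path with recoverable lumis has at least one ACDC workflow in one of the done_statuses. Workflows that have already moved past completed would need those later statuses added to done_statuses.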
Problem: Ops asks us to investigate why the output datasets are missing statistics, even though there are no job failures reported (or they have all been recovered).
Solution: not necessarily a solution. However, part of the solution above also applies here: check whether all lumis have been recovered. In addition, we could have a tool that takes a workflow as input, finds all the run/lumis meant to be processed, randomly selects one output dataset and compares it against the input dataset, finally yielding a list of run/lumis missing from the output dataset.
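The comparison step of such a tool could look like the sketch below. It assumes the run/lumis have already been retrieved as dicts mapping a run number to a list of lumi numbers. The retrieval itself (and the random choice of the output dataset) is not shown, and the function name missing_lumis is hypothetical.

```python
def missing_lumis(expected, produced):
    """Return {run: [lumis]} present in `expected` but absent from `produced`.

    Both arguments map a run number to an iterable of lumi numbers.
    """
    missing = {}
    for run, lumis in expected.items():
        # A run that is entirely absent from the output counts as fully missing
        diff = set(lumis) - set(produced.get(run, ()))
        if diff:
            missing[run] = sorted(diff)
    return missing
```

For example, with expected = {1: [1, 2, 3], 2: [10, 11]} and produced = {1: [1, 3]}, the result is {1: [2], 2: [10, 11]}.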
Problem: When completing the agent draining procedure, there are rare cases where subscriptions are stuck in the unfinished state (finished=0). This usually also means that there is at least one GQ workqueue element in Running state (and its equivalent LQ workqueue/workqueue_inbox element).
Solution: there are many possible reasons for a subscription being stuck, so there is no common solution. Among the checks we can perform: correlate the subscription with its fileset and workflow task; check whether it still has files in either the available or the acquired tables.
Details: further details can be extracted from this github issue: https://github.com/dmwm/WMCore/issues/9568
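The checks above can be sketched as a single query. The example runs against an in-memory SQLite database as a stand-in for the agent database. The table and column names (wmbs_subscription, wmbs_workflow, wmbs_fileset, wmbs_sub_files_available, wmbs_sub_files_acquired) are assumptions modelled on the WMBS naming and should be verified against the actual agent schema before use.

```python
import sqlite3

# Simplified stand-in schema; the real agent tables have more columns (assumption)
SCHEMA = """
CREATE TABLE wmbs_workflow (id INTEGER PRIMARY KEY, name TEXT, task TEXT);
CREATE TABLE wmbs_fileset (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, fileset INTEGER,
                                workflow INTEGER, finished INTEGER);
CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
CREATE TABLE wmbs_sub_files_acquired (subscription INTEGER, fileid INTEGER);
"""

# One row per unfinished subscription, with its workflow, task, fileset and
# the number of files still sitting in the available/acquired tables
STUCK_QUERY = """
SELECT s.id, w.name, w.task, f.name,
       (SELECT COUNT(*) FROM wmbs_sub_files_available a WHERE a.subscription = s.id),
       (SELECT COUNT(*) FROM wmbs_sub_files_acquired q WHERE q.subscription = s.id)
FROM wmbs_subscription s
INNER JOIN wmbs_workflow w ON s.workflow = w.id
INNER JOIN wmbs_fileset f ON s.fileset = f.id
WHERE s.finished = 0
ORDER BY s.id
"""


def stuck_subscriptions(conn):
    """Return (sub_id, workflow, task, fileset, n_available, n_acquired) rows."""
    return conn.execute(STUCK_QUERY).fetchall()


def make_demo_db():
    """Build a tiny in-memory database with one finished and one stuck subscription."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO wmbs_workflow VALUES (1, 'wf_demo', '/wf_demo/Task1')")
    conn.executemany("INSERT INTO wmbs_fileset VALUES (?, ?)",
                     [(1, 'fileset_done'), (2, 'fileset_stuck')])
    conn.executemany("INSERT INTO wmbs_subscription VALUES (?, ?, ?, ?)",
                     [(10, 1, 1, 1), (11, 2, 1, 0)])
    conn.executemany("INSERT INTO wmbs_sub_files_available VALUES (?, ?)",
                     [(11, 100), (11, 101)])
    conn.execute("INSERT INTO wmbs_sub_files_acquired VALUES (11, 102)")
    return conn
```

Calling stuck_subscriptions(make_demo_db()) lists only subscription 11, together with its file counts. Non-zero counts point at files that were never turned into jobs or never left the acquired state.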
Problem: There might be a situation where we need to invalidate (in PhEDEx and DBS) blocks produced by a given workflow. One possible reason is that two workflows were writing to the same output (like a duplicate ACDC).
Solution: we need to find out which agents were processing that given workflow. With that information in hand, we can then query their local SQL database and list all the output blocks (from all the tasks). What to do with the output blocks afterwards is out of the scope of this debugging.
Details: a SQL query like the following can yield all the output blocks (starting from files associated with blocks) for a given workflow:
-- DISTINCT avoids returning the same block once per file it contains
SELECT DISTINCT dbsbuffer_block.id AS blockid, dbsbuffer_block.blockname AS blockname
FROM dbsbuffer_block
INNER JOIN dbsbuffer_file ON dbsbuffer_block.id = dbsbuffer_file.block_id
INNER JOIN dbsbuffer_workflow ON dbsbuffer_file.workflow = dbsbuffer_workflow.id
WHERE dbsbuffer_workflow.name='cmsunified_ACDC0_Run2016B-v2-ZeroBias2-21Feb2020_UL2016_HIPM_1068p1_200313_133114_5167';
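When the same lookup has to be repeated on several agents, it may help to wrap the query in a small parameterized function. The sketch below uses an in-memory SQLite database purely as a stand-in, with only the columns referenced by the query above. The `?` placeholder is the sqlite3 style and would differ for the driver of the real agent database. The helper names output_blocks and make_demo_db are hypothetical.

```python
import sqlite3

BLOCKS_QUERY = """
SELECT DISTINCT dbsbuffer_block.id AS blockid, dbsbuffer_block.blockname AS blockname
FROM dbsbuffer_block
INNER JOIN dbsbuffer_file ON dbsbuffer_block.id = dbsbuffer_file.block_id
INNER JOIN dbsbuffer_workflow ON dbsbuffer_file.workflow = dbsbuffer_workflow.id
WHERE dbsbuffer_workflow.name = ?
ORDER BY dbsbuffer_block.id
"""


def output_blocks(conn, workflow_name):
    """Return the distinct (blockid, blockname) pairs written by one workflow."""
    return conn.execute(BLOCKS_QUERY, (workflow_name,)).fetchall()


def make_demo_db():
    """Stand-in tables holding only the columns used by the query (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dbsbuffer_block (id INTEGER PRIMARY KEY, blockname TEXT);
        CREATE TABLE dbsbuffer_workflow (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dbsbuffer_file (id INTEGER PRIMARY KEY, block_id INTEGER,
                                     workflow INTEGER);
    """)
    conn.executemany("INSERT INTO dbsbuffer_block VALUES (?, ?)",
                     [(1, '/A/B/C#1'), (2, '/A/B/C#2'), (3, '/X/Y/Z#1')])
    conn.executemany("INSERT INTO dbsbuffer_workflow VALUES (?, ?)",
                     [(1, 'wf_demo'), (2, 'other_wf')])
    # Two files in block 1, one in block 2 (both wf_demo), one in block 3 (other_wf)
    conn.executemany("INSERT INTO dbsbuffer_file VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 1, 1), (3, 2, 1), (4, 3, 2)])
    return conn
```

With the demo data, output_blocks(make_demo_db(), 'wf_demo') returns blocks 1 and 2 once each, even though block 1 holds two files.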