
WMCore debugging tools

Alan Malta Rodrigues edited this page Mar 25, 2020 · 11 revisions

This wiki lists debugging use cases, covering both Operations issues and internal Dev ones.

[Ops] Debug whether all jobs have been recovered via ACDCs

Problem: Ops asks us to check why a workflow hasn't processed 100% of its lumi sections, even though all the failures have been recovered via ACDCs.

Solution: first we need to make sure that ACDCs have been created AND executed for every single task path (fileset_name, in terms of ACDC collection).

Details: what we need to retrieve/check is:

  • did the ACDCs get created after the initial/original workflow moved to completed status?
  • list the number of jobs/lumis in each fileset_name, from the ACDC collection
  • query reqmgr2 for ACDC workflows recovering that workflow (and fetch their InitialTaskPath)
  • make sure that those ACDC workflows are in completed status
  • anything else
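The checks above can be sketched as a small helper that works on data already fetched from the ACDC collection and from reqmgr2. This is only an illustration: the function names are hypothetical, the keys `RequestName`, `RequestStatus` and `InitialTaskPath` are assumed to match what reqmgr2 returns for ACDC workflows, and retrieving the data is deliberately left out.

```python
def acdc_coverage(fileset_jobs, acdc_requests, done_statuses=("completed",)):
    """Report, per fileset_name, whether a finished ACDC workflow covers it.

    fileset_jobs:  dict mapping fileset_name (task path) to the number of
                   jobs/lumis found in the ACDC collection (already retrieved).
    acdc_requests: list of dicts describing the ACDC workflows, with the
                   assumed keys RequestName, RequestStatus and InitialTaskPath.
    done_statuses: statuses that count as "executed" (an assumption; extend
                   it if later statuses should also be accepted).
    """
    report = {}
    for fileset_name, njobs in fileset_jobs.items():
        # ACDC workflows whose InitialTaskPath points at this fileset
        matching = [req for req in acdc_requests
                    if req.get("InitialTaskPath") == fileset_name]
        finished = [req["RequestName"] for req in matching
                    if req.get("RequestStatus") in done_statuses]
        report[fileset_name] = {
            "jobs": njobs,
            "acdc_created": bool(matching),
            "acdc_finished": finished,
        }
    return report


def uncovered_filesets(report):
    """Return the fileset names that have no finished ACDC workflow."""
    return sorted(name for name, info in report.items()
                  if not info["acdc_finished"])
```

Any fileset returned by `uncovered_filesets` is a task path whose failures were never recovered, or whose ACDC has not finished yet.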

[Ops] Find out which run/lumi is missing in the output dataset

Problem: Ops asks us to investigate why the output datasets are missing statistics, even though there are no job failures reported (or they have all been recovered).

Solution: not necessarily a solution. However, part of the solution above applies here as well, so first check whether all lumis have been recovered. In addition, we could have a tool that takes a workflow as input, finds all the run/lumis meant to be processed, randomly selects one output dataset and compares it against the input dataset, finally yielding a list of run/lumis missing in the output dataset.
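The comparison step of such a tool could look like the sketch below. It assumes the run/lumi content of the input and of the chosen output dataset has already been fetched into plain dictionaries; the function name and the data layout are hypothetical.

```python
def missing_lumis(input_lumis, output_lumis):
    """Return {run: sorted lumis} present in the input but not in the output.

    Both arguments map a run number to an iterable of lumi section numbers.
    """
    missing = {}
    for run, lumis in input_lumis.items():
        # Lumis expected for this run that never made it to the output
        diff = set(lumis) - set(output_lumis.get(run, ()))
        if diff:
            missing[run] = sorted(diff)
    return missing
```

A run that is completely absent from the output shows up with all of its lumis, and runs that are fully processed are left out of the result.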

[Dev] Debugging subscriptions not finished

Problem: When we are completing the agent draining procedure, there are some rare cases where subscriptions are stuck in an unfinished state (finished=0). It also usually means that there is at least one GQ workqueue element in Running state (and its equivalent LQ workqueue/workqueue_inbox element).

Solution: there are many possible reasons for having a subscription stuck, so there is no common solution. Among the checks we can perform are: correlate the subscription to its fileset and workflow task; check whether they have files either in the available or acquired tables.

Details: further details can be extracted from this github issue: https://github.com/dmwm/WMCore/issues/9568
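The checks listed in the solution can be sketched as a single query that correlates each unfinished subscription with its fileset and workflow task, and counts its files in the available and acquired tables. The sketch runs against an in-memory SQLite database with a simplified schema; the table and column names are assumptions modelled on the agent's WMBS tables and should be verified against the real database before use.

```python
import sqlite3

# Simplified stand-in for the agent's WMBS schema (assumed names and columns).
SCHEMA = """
CREATE TABLE wmbs_workflow (id INTEGER PRIMARY KEY, name TEXT, task TEXT);
CREATE TABLE wmbs_fileset (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, fileset INTEGER,
                                workflow INTEGER, finished INTEGER);
CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
CREATE TABLE wmbs_sub_files_acquired (subscription INTEGER, fileid INTEGER);
"""

# One row per unfinished subscription, with its fileset, workflow task and
# the number of files still sitting in the available/acquired tables.
QUERY = """
SELECT s.id, f.name, w.name, w.task,
       (SELECT COUNT(*) FROM wmbs_sub_files_available a
         WHERE a.subscription = s.id) AS available,
       (SELECT COUNT(*) FROM wmbs_sub_files_acquired q
         WHERE q.subscription = s.id) AS acquired
FROM wmbs_subscription s
  INNER JOIN wmbs_fileset f ON s.fileset = f.id
  INNER JOIN wmbs_workflow w ON s.workflow = w.id
WHERE s.finished = 0
ORDER BY s.id
"""


def unfinished_subscriptions(conn):
    """Return the rows describing every subscription with finished=0."""
    return conn.execute(QUERY).fetchall()


def demo():
    """Build a tiny example database and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO wmbs_workflow VALUES (1, 'wf_A', '/wf_A/Task1')")
    conn.execute("INSERT INTO wmbs_fileset VALUES (10, 'fileset_A')")
    conn.execute("INSERT INTO wmbs_subscription VALUES (100, 10, 1, 0)")
    conn.execute("INSERT INTO wmbs_subscription VALUES (101, 10, 1, 1)")
    conn.execute("INSERT INTO wmbs_sub_files_available VALUES (100, 7)")
    conn.execute("INSERT INTO wmbs_sub_files_acquired VALUES (100, 8)")
    conn.execute("INSERT INTO wmbs_sub_files_acquired VALUES (100, 9)")
    return unfinished_subscriptions(conn)
```

In the demo data only subscription 100 is unfinished, with one available and two acquired files. Non-zero counts point at files that were never turned into jobs or whose jobs never reported back, which narrows down where to look next.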

[Dev/Ops] Find all the output blocks for a given workflow name

Problem: There might be a situation where we need to invalidate (in PhEDEx and DBS) blocks produced by a given workflow. Among the reasons, it could be that there were two workflows writing to the same output (like a duplicate ACDC).

Solution: we need to find out which agents were processing that given workflow. With that information in hand, we can then query their local SQL database and list all the output blocks (from all the tasks). What to do with the output blocks afterwards is out of the scope of this debugging.

Details: a SQL query like the following can yield all the output blocks (starting from files associated to blocks) for a given workflow. DISTINCT is used because the join goes through the files, which would otherwise return each block once per file:

SELECT DISTINCT dbsbuffer_block.id AS blockid, dbsbuffer_block.blockname AS blockname FROM dbsbuffer_block
  INNER JOIN dbsbuffer_file ON dbsbuffer_block.id = dbsbuffer_file.block_id
  INNER JOIN dbsbuffer_workflow ON dbsbuffer_file.workflow = dbsbuffer_workflow.id
  WHERE dbsbuffer_workflow.name='cmsunified_ACDC0_Run2016B-v2-ZeroBias2-21Feb2020_UL2016_HIPM_1068p1_200313_133114_5167';
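When the same lookup has to be repeated for several workflows or agents, the query can be wrapped in a function that takes the workflow name as a bound parameter. The sketch below is an assumption-laden illustration: it uses an in-memory SQLite database with a reduced version of the three tables, the `?` placeholder is SQLite's style (the agent's database driver may use a different one), `DISTINCT` is added so that a block is not returned once per file, and the function names are hypothetical.

```python
import sqlite3

# Same joins as the query above, with the workflow name as a bound parameter.
BLOCKS_QUERY = """
SELECT DISTINCT dbsbuffer_block.id AS blockid,
                dbsbuffer_block.blockname AS blockname
FROM dbsbuffer_block
  INNER JOIN dbsbuffer_file ON dbsbuffer_block.id = dbsbuffer_file.block_id
  INNER JOIN dbsbuffer_workflow ON dbsbuffer_file.workflow = dbsbuffer_workflow.id
WHERE dbsbuffer_workflow.name = ?
ORDER BY dbsbuffer_block.id
"""


def output_blocks(conn, workflow_name):
    """Return the names of all output blocks written by workflow_name."""
    rows = conn.execute(BLOCKS_QUERY, (workflow_name,))
    return [blockname for _blockid, blockname in rows]


def demo():
    """Create a reduced example schema and list the blocks of one workflow."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dbsbuffer_workflow (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dbsbuffer_block (id INTEGER PRIMARY KEY, blockname TEXT);
        CREATE TABLE dbsbuffer_file (id INTEGER PRIMARY KEY,
                                     block_id INTEGER, workflow INTEGER);
        INSERT INTO dbsbuffer_workflow VALUES (1, 'wf_A');
        INSERT INTO dbsbuffer_workflow VALUES (2, 'wf_B');
        INSERT INTO dbsbuffer_block VALUES (10, '/Prim/Proc/TIER#block-1');
        INSERT INTO dbsbuffer_block VALUES (11, '/Prim/Proc/TIER#block-2');
        INSERT INTO dbsbuffer_file VALUES (100, 10, 1);
        INSERT INTO dbsbuffer_file VALUES (101, 10, 1);
        INSERT INTO dbsbuffer_file VALUES (102, 11, 2);
    """)
    return output_blocks(conn, "wf_A")
```

In the demo, workflow `wf_A` has two files in the same block, and the block name is returned only once.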