
Logbook for production issues

Alan Malta Rodrigues edited this page Jun 24, 2020 · 6 revisions

This wiki page keeps track of WMCore-related problems affecting the central production infrastructure. It provides a history of such problems and makes them easier to identify after many weeks or months have passed.

Jobs executing where there is no pileup (exit code: 8029 - NoSecondaryFiles)

Problem: as reported in GH issue #9658, the HG2004 CMSWEB production deployment in April fully enabled MSTransferor and MSMonitor; at the same time, automatic input data placement was disabled on the Unified side. As a result, workflows are no longer assigned only to sites where the data is available, so it is quite common to have a large SiteWhitelist while only a couple of sites host the pileup dataset. In addition, we noticed that MSTransferor does NOT enforce placement of primary blocks at the same sites that host the pileup dataset, which can eventually leave jobs with an empty list of secondary files.

Solution: PR #9659 intersects the MCFakeFile locations with the secondary (pileup) locations. Thus, jobs without an input dataset but with a secondary dataset at a later stage only get executed at sites that also host part of the secondary data. This fix is only available starting with the WMAgent 1.3.3 releases.
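The intersection described above can be illustrated with a minimal sketch. This is not the code from the PR: the function name `restrict_to_pileup_sites` and the site lists are hypothetical and chosen only for illustration.

```python
def restrict_to_pileup_sites(fake_file_locations, secondary_locations):
    """Return the sites where a job without primary input may run.

    Hypothetical helper: keeps only the candidate sites that also host
    at least part of the secondary (pileup) dataset.
    """
    return set(fake_file_locations) & set(secondary_locations)


# Illustrative values only: a large SiteWhitelist, pileup hosted at two sites.
site_whitelist = ["T1_US_FNAL", "T2_CH_CERN", "T2_DE_DESY", "T2_US_Nebraska"]
pileup_sites = ["T1_US_FNAL", "T2_CH_CERN"]

allowed_sites = restrict_to_pileup_sites(site_whitelist, pileup_sites)
```

With these inputs, only the two sites hosting the pileup remain candidates. An empty result means that no site in the whitelist hosts the secondary data.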

Impacted dates: input data placement by Unified was disabled on 8 April 2020, so any WQE acquired between that date and 24 April 2020 could have been affected.

WMAgent setting an unsupported configuration parameter: enforceGUIDInFileName (exit code: 8009 - Configuration)

Problem: the enforceGUIDInFileName parameter is a new feature requested in GH ticket #9468. However, our implementation did not cover all the use cases, and WMAgent was setting the parameter for source modules that do not support it, thus raising a Configuration exception.

Solution: a second PR, #9660, adds an explicit check for the source type. With this, we do not expect any more Configuration problems involving this parameter. This feature/fix is only available starting with the WMAgent 1.3.3 releases.
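A source-type check of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the WMAgent implementation: the helper `apply_enforce_guid` is hypothetical, the source configuration is modelled as a plain dictionary, and the set of supporting source types is assumed.

```python
# Assumption: the source types that accept the parameter. The real list
# lives in the WMCore code and may differ.
SUPPORTED_SOURCE_TYPES = {"PoolSource"}


def apply_enforce_guid(source_type, source_params, enforce=True):
    """Return a copy of source_params, adding enforceGUIDInFileName
    only when the source type is known to support it (hypothetical helper)."""
    updated = dict(source_params)
    if source_type in SUPPORTED_SOURCE_TYPES:
        updated["enforceGUIDInFileName"] = enforce
    return updated


pool_params = apply_enforce_guid("PoolSource", {"fileNames": []})
empty_params = apply_enforce_guid("EmptySource", {})
```

Here `pool_params` gains the parameter, while `empty_params` is left untouched, so a source module that does not declare the parameter never receives it.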

Impacted dates: any workflow that had its sandbox created between 22 April 2020 and 24 April 2020 is affected. Because those sandboxes have not been patched, an agent that keeps pulling WQEs for the same workflow over a long period will likely keep hitting the same problem.
