Deterministic Pileup
The latest modification to deterministic pileup occurred here: https://github.com/dmwm/WMCore/pull/5954/files
This reference page explains how the concept of deterministic pileup is implemented in the WMAgent system. The first section can be skipped if you are familiar with the WMAgent and WorkQueue and how pileup is treated; otherwise read on.
It is based on the idea described in this document.
In a nutshell, when a request specifies an MCPileup or DataPileup dataset, the workload definition includes this information in the processing task in order to have the files in these datasets included in the PSet file for the cmsRun processes in the jobs.
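For illustration, these are passed as request arguments; a hypothetical fragment of such a request (the dataset names are made up) could look like:

# Hypothetical request arguments; only the pileup-related ones discussed on
# this page are shown, and the dataset names are invented.
requestArgs = {
    "MCPileup": "/MinBias_TuneZ2star_8TeV-pythia6/Summer12-START50_V13-v3/GEN-SIM",
    "DataPileup": "/ZeroBias/Run2012A-v1/RAW",
}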
The role of the WorkQueue for workloads with pileup is to read from DBS2 the list of all blocks in the dataset, with their locations and files, and to store this information in a JSON file that is used as a payload for the jobs.
At runtime, each job reads the JSON payload and filters the blocks that are present at the site where it is currently running. The files that are expected at the site's SE are included in the fileNames attribute of the corresponding mixing module. For example, the following modifications are made to the PSet (in the process attribute) if both MCPileup and DataPileup are specified (this is a modified version of the code in the repository):
# Assumes "import FWCore.ParameterSet.Config as cms"; self.process is the
# cms.Process object loaded from the job's PSet.
def modifyPSetForPileup(self, dataFilesInSE, mcFilesInSE):
    # First we find the MixingModules and DataMixingModules
    mixModules, dataMixModules = [], []
    prodsAndFilters = {}
    prodsAndFilters.update(self.process.producers)
    prodsAndFilters.update(self.process.filters)
    for key, value in prodsAndFilters.items():
        if value.type_() == "MixingModule":
            mixModules.append(value)
        if value.type_() == "DataMixingModule":
            dataMixModules.append(value)
    # Then we add the files to the modules separately depending on the type
    for m in mixModules:
        inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
        if inputTypeAttrib is None:
            continue
        inputTypeAttrib.fileNames = cms.untracked.vstring()
        for lfn in mcFilesInSE:
            inputTypeAttrib.fileNames.append(lfn)
    for m in dataMixModules:
        inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
        if inputTypeAttrib is None:
            continue
        inputTypeAttrib.fileNames = cms.untracked.vstring()
        for lfn in dataFilesInSE:
            inputTypeAttrib.fileNames.append(lfn)
    return
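For context, the per-site file lists passed to the function above are derived from the JSON payload that the WorkQueue produced. The following is a minimal sketch of that filtering step; the payload layout used here (block name mapped to file list, locations, and event count) is an assumption for illustration, and the dataset, file, and site names are made up:

import json

def filesAtSite(pileupPayload, siteSE):
    """Collect the LFNs of all pileup blocks hosted at the given SE."""
    lfns = []
    for blockName, blockInfo in pileupPayload.items():
        if siteSE in blockInfo["PhEDExNodeNames"]:
            lfns.extend(blockInfo["FileList"])
    return lfns

# Hypothetical payload for a data pileup dataset with two blocks
payload = json.loads("""
{
  "/ZeroBias/Run2012A-v1/RAW#block1": {
    "FileList": ["/store/data/fileA.root", "/store/data/fileB.root"],
    "PhEDExNodeNames": ["T1_US_FNAL_MSS"],
    "NumberOfEvents": 2000
  },
  "/ZeroBias/Run2012A-v1/RAW#block2": {
    "FileList": ["/store/data/fileC.root"],
    "PhEDExNodeNames": ["T2_CH_CERN"],
    "NumberOfEvents": 1000
  }
}
""")

dataFilesInSE = filesAtSite(payload, "T2_CH_CERN")  # ["/store/data/fileC.root"]
# The per-block event counts are what the deterministic pileup logic described
# below sums up to get the total number of pileup events available at the site.
totalEvents = sum(b["NumberOfEvents"] for b in payload.values()
                  if "T2_CH_CERN" in b["PhEDExNodeNames"])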
There is an argument for ReDigi workflows which indicates to the WMAgent that the DataPileup should be handled differently; the objective is to have reproducible pileup mixing. The rest of this page describes the changes in the workflow when dealing with this deterministic pileup.
On the WMAgent side, there is an addition to the JobSplitting for the LumiBased and EventAwareLumiBased algorithms. The splitting algorithms keep track of the number of processing jobs that have been created and issue a number of events to skip for each job, equal to:
# Assuming job N
eventsToSkipInPileup = ((N-1) * eventsPerLumi * lumisPerJob)
The runtime payload includes two new elements for this workflow:
- Number of events to skip in the data pileup
- Number of events in the pileup dataset per block
Since we filter the blocks that are present at the site, we first calculate the total number of events in the blocks we will use from the pileup dataset (usually the total number of events in the dataset). Then we determine the number of events to skip in the pileup by taking the skip count modulo this total, so that we roll back to the beginning of the pileup dataset if needed.
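A minimal sketch of this calculation, with hypothetical values for the splitting parameters and the pileup dataset size (none of these numbers come from WMCore):

# Hypothetical numbers, for illustration only
eventsPerLumi = 100
lumisPerJob = 8
totalPileupEvents = 5000   # events in the data pileup blocks available at the site

def eventsToSkip(jobNumber):
    """Events to skip in the data pileup for processing job N (1-based)."""
    rawSkip = (jobNumber - 1) * eventsPerLumi * lumisPerJob
    # Wrap around to the beginning of the pileup dataset when we run past its end
    return rawSkip % totalPileupEvents

for n in range(1, 9):
    print(n, eventsToSkip(n))
# Job 1 skips 0 events, job 2 skips 800, ..., job 7 skips 4800,
# and job 8 wraps around and skips 600.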
The modifications to the PSet look like this:
def modifyPSetForPileup(self, dataFilesInSE, mcFilesInSE, eventsToSkip):
    # First we find the MixingModules and DataMixingModules
    ...
    # Then we add the files to the modules separately depending on the type;
    # only the data pileup handling changes
    ...
    for m in dataMixModules:
        inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
        if inputTypeAttrib is None:
            continue
        inputTypeAttrib.fileNames = cms.untracked.vstring()
        # We use Python sorting so the files are always added in the same order in all jobs
        for lfn in sorted(dataFilesInSE):
            inputTypeAttrib.fileNames.append(lfn)
        # Then we do the modifications for deterministic pileup
        inputTypeAttrib.skipEvents = cms.untracked.uint32(eventsToSkip)
        inputTypeAttrib.sequential = cms.untracked.bool(True)
    return
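Taken together, the sorted file order, the skipEvents offset, and sequential reading mean that each job consumes a reproducible slice of the data pileup. As a rough illustration, reusing the hypothetical numbers from the sketch above and assuming one pileup event is read per processed event (the premise behind the skip formula):

# Hypothetical: each job processes eventsPerLumi * lumisPerJob = 800 events,
# so it reads 800 consecutive pileup events starting at its skip offset.
eventsPerJob = 100 * 8
totalPileupEvents = 5000

for n in (1, 2, 3):
    start = ((n - 1) * eventsPerJob) % totalPileupEvents
    print("job %d reads pileup events [%d, %d)" % (n, start, start + eventsPerJob))
# job 1 reads pileup events [0, 800)
# job 2 reads pileup events [800, 1600)
# job 3 reads pileup events [1600, 2400)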
There are some known limitations to this approach:
- In the WorkQueue, the input dataset is split by blocks and each block is acquired individually by the WMAgents. It is possible for blocks from the input dataset to land in different WMAgents; since the job counts are not shared between WMAgents, the jobs in each WMAgent will start using the pileup dataset from the beginning.
- If the input dataset doesn't have a uniform number of events per lumi (e.g. due to filter efficiencies in MC datasets), then the calculation of events to skip in the pileup dataset won't be accurate and there could be holes in the intervals of events used from the pileup dataset across jobs.