Deterministic Pileup
The latest modification to deterministic pileup occurred here: https://github.com/dmwm/WMCore/pull/5954/files
This reference page explains how the concept of deterministic pileup is implemented in the WMAgent system. The first section can be skipped if you are familiar with the WMAgent and WorkQueue and how pileup is treated; otherwise read on.
It is based on the idea described in this document.
In a nutshell, when a request specifies an MCPileup or DataPileup dataset, the workload definition includes this information in the processing task in order to have the files in these datasets included in the PSet file for the cmsRun processes in the jobs.
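For illustration, these are passed as request arguments; a hypothetical fragment of such a request (the dataset names are made up) could look like:

# Hypothetical request arguments; only the pileup-related ones discussed on
# this page are shown, and the dataset names are invented.
requestArgs = {
    "MCPileup": "/MinBias_TuneZ2star_8TeV-pythia6/Summer12-START50_V13-v3/GEN-SIM",
    "DataPileup": "/ZeroBias/Run2012A-v1/RAW",
}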
The role of the WorkQueue for workloads with pileup is to read from DBS2 the list of all blocks in the dataset, with their locations and files, and to store this information in a JSON file that is used as a payload for the jobs.
At runtime, each job reads the JSON payload and filters the blocks that are present at the site where it is currently running. The files that are expected at the site's SE are included in the fileNames attribute of the corresponding mixing module. For example, the following modifications are made to the PSet (in the process attribute) if both MCPileup and DataPileup are specified (this is a modified version of the code in the repository):
# Assumes "import FWCore.ParameterSet.Config as cms"; self.process is the
# cms.Process object loaded from the job's PSet.
def modifyPSetForPileup(self, dataFilesInSE, mcFilesInSE):
    # First we find the MixingModules and DataMixingModules
    mixModules, dataMixModules = [], []
    prodsAndFilters = {}
    prodsAndFilters.update(self.process.producers)
    prodsAndFilters.update(self.process.filters)
    for key, value in prodsAndFilters.items():
        if value.type_() == "MixingModule":
            mixModules.append(value)
        if value.type_() == "DataMixingModule":
            dataMixModules.append(value)
    # Then we add the files to the modules separately depending on the type
    for m in mixModules:
        inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
        if inputTypeAttrib is None:
            continue
        inputTypeAttrib.fileNames = cms.untracked.vstring()
        for lfn in mcFilesInSE:
            inputTypeAttrib.fileNames.append(lfn)
    for m in dataMixModules:
        inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
        if inputTypeAttrib is None:
            continue
        inputTypeAttrib.fileNames = cms.untracked.vstring()
        for lfn in dataFilesInSE:
            inputTypeAttrib.fileNames.append(lfn)
    return
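For context, the per-site file lists passed to the function above are derived from the JSON payload that the WorkQueue produced. The following is a minimal sketch of that filtering step; the payload layout used here (block name mapped to file list, locations, and event count) is an assumption for illustration, and the dataset, file, and site names are made up:

import json

def filesAtSite(pileupPayload, siteSE):
    """Collect the LFNs of all pileup blocks hosted at the given SE."""
    lfns = []
    for blockName, blockInfo in pileupPayload.items():
        if siteSE in blockInfo["PhEDExNodeNames"]:
            lfns.extend(blockInfo["FileList"])
    return lfns

# Hypothetical payload for a data pileup dataset with two blocks
payload = json.loads("""
{
  "/ZeroBias/Run2012A-v1/RAW#block1": {
    "FileList": ["/store/data/fileA.root", "/store/data/fileB.root"],
    "PhEDExNodeNames": ["T1_US_FNAL_MSS"],
    "NumberOfEvents": 2000
  },
  "/ZeroBias/Run2012A-v1/RAW#block2": {
    "FileList": ["/store/data/fileC.root"],
    "PhEDExNodeNames": ["T2_CH_CERN"],
    "NumberOfEvents": 1000
  }
}
""")

dataFilesInSE = filesAtSite(payload, "T2_CH_CERN")  # ["/store/data/fileC.root"]
# The per-block event counts are what the deterministic pileup logic described
# below sums up to get the total number of pileup events available at the site.
totalEvents = sum(b["NumberOfEvents"] for b in payload.values()
                  if "T2_CH_CERN" in b["PhEDExNodeNames"])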
There is an argument for ReDigi workflows which indicates to the WMAgent that the DataPileup should be handled differently; the objective is to have reproducible pileup mixing. The rest of this page describes the changes in the workflow when dealing with this deterministic pileup.
On the WMAgent side, there is an addition to the JobSplitting for the LumiBased and EventAwareLumiBased algorithms. The splitting algorithms keep track of the number of processing jobs that have been created and issue a number of events to skip for each job, equal to:
# Assuming job N
eventsToSkipInPileup = ((N-1) * eventsPerLumi * lumisPerJob)
The runtime payload includes two new elements for this workflow:
- Number of events to skip in the data pileup
- Number of events in the pileup dataset per block
Since we filter the blocks that are present at the site, we first calculate the total number of events in the blocks we will use from the pileup dataset (usually the total number of events in the dataset). Then we determine the number of events to skip in the pileup by taking the skip count modulo this total, so that we roll back to the beginning of the pileup dataset if needed.
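A minimal sketch of this calculation, with hypothetical values for the splitting parameters and the pileup dataset size (none of these numbers come from WMCore):

# Hypothetical numbers, for illustration only
eventsPerLumi = 100
lumisPerJob = 8
totalPileupEvents = 5000   # events in the data pileup blocks available at the site

def eventsToSkip(jobNumber):
    """Events to skip in the data pileup for processing job N (1-based)."""
    rawSkip = (jobNumber - 1) * eventsPerLumi * lumisPerJob
    # Wrap around to the beginning of the pileup dataset when we run past its end
    return rawSkip % totalPileupEvents

for n in range(1, 9):
    print(n, eventsToSkip(n))
# Job 1 skips 0 events, job 2 skips 800, ..., job 7 skips 4800,
# and job 8 wraps around and skips 600.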
The modifications to the PSet look like this:
def modifyPSetForPileup(self, dataFilesInSE, mcFilesInSE, eventsToSkip):
    # First we find the MixingModules and DataMixingModules
    ...
    # Then we add the files to the modules separately depending on the type;
    # only the data pileup handling changes
    ...
    for m in dataMixModules:
        inputTypeAttrib = getattr(m, "input", None) or getattr(m, "secsource", None)
        if inputTypeAttrib is None:
            continue
        inputTypeAttrib.fileNames = cms.untracked.vstring()
        # We use Python sorting so the files are always added in the same order in all jobs
        for lfn in sorted(dataFilesInSE):
            inputTypeAttrib.fileNames.append(lfn)
        # Then we do the modifications for deterministic pileup
        inputTypeAttrib.skipEvents = cms.untracked.uint32(eventsToSkip)
        inputTypeAttrib.sequential = cms.untracked.bool(True)
    return
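Taken together, the sorted file order, the skipEvents offset, and sequential reading mean that each job consumes a reproducible slice of the data pileup. As a rough illustration, reusing the hypothetical numbers from the sketch above and assuming one pileup event is read per processed event (the premise behind the skip formula):

# Hypothetical: each job processes eventsPerLumi * lumisPerJob = 800 events,
# so it reads 800 consecutive pileup events starting at its skip offset.
eventsPerJob = 100 * 8
totalPileupEvents = 5000

for n in (1, 2, 3):
    start = ((n - 1) * eventsPerJob) % totalPileupEvents
    print("job %d reads pileup events [%d, %d)" % (n, start, start + eventsPerJob))
# job 1 reads pileup events [0, 800)
# job 2 reads pileup events [800, 1600)
# job 3 reads pileup events [1600, 2400)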
There are some known limitations to this approach:
- In the WorkQueue, the input dataset is split by blocks and each block is acquired individually by the WMAgents. It is possible for blocks from the input dataset to land in different WMAgents; since the job counts are not shared between WMAgents, the jobs in each WMAgent will start using the pileup dataset from the beginning.
- If the input dataset doesn't have a uniform number of events per lumi (e.g. due to filter efficiencies in MC datasets), then the calculation of events to skip in the pileup dataset won't be accurate and there could be holes in the intervals of events used from the pileup dataset across jobs.