CMS H→µµ search implemented using a columnar approach and efficient parallelization of data processing


🐍 copperhead - Columnar Parallel Pythonic framEwork for Run3 H→µµ Decay search

The framework is currently under development, with the goals of integrating both main Higgs production modes (ggH and VBF) and preparing for the analysis of Run 3 data as it becomes available.

Framework structure, data formats, and packages used

The input data for the framework should be in NanoAOD format.

The analysis workflow contains three stages:

  • Stage 1 includes event and object selection, application of corrections, and construction of new variables. Data processing is implemented via a columnar approach, making use of the tools provided by the coffea package. The data columns are handled via coffea's NanoEvents format, which relies on the jagged arrays implemented in the Awkward Array package. After event selection, the jagged arrays are converted to flat pandas DataFrames and saved to Apache Parquet files (an illustrative sketch of this step is given after this list).

  • Stage 2 (WIP) contains / will contain evaluation of MVA methods (boosted decision trees, deep neural networks), event categorization, and production of histograms. The Stage 2 workflow is structured as follows (an illustrative sketch of the per-partition loop is also given after this list):

    • Outputs of Stage 1 (Parquet files) are loaded as partitions of a Dask DataFrame (similar to a pandas DataFrame, but partitioned and "lazy").
    • The Dask DataFrame is (optionally) re-partitioned to decrease the number of partitions.
    • The partitions are processed in parallel; for each partition, the following sequence is executed:
      • The partition of the Dask DataFrame is "computed" (converted to a pandas DataFrame).
      • Evaluation of MVA models (can also be done after categorization). The current implementation includes evaluation of PyTorch DNN models and/or XGBoost BDT models. Other methods can be implemented, but one has to verify that they work well in a distributed environment (e.g. TensorFlow sessions are not well suited for that).
      • Definition of event categories and/or MVA bins.
      • Creating histograms using scikit-hep/hist.
      • Saving histograms.
      • (Optionally) Saving individual columns (can be used later for unbinned fits).
  • Stage 3 (WIP) contains / will contain plotting, parametric fits, and preparation of datacards for statistical analysis. The plotting is done via scikit-hep/mplhep.
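
As a rough illustration of the Stage 1 step above, the sketch below loads a NanoAOD file into coffea's NanoEvents, applies a placeholder muon and event selection, builds the dimuon mass, and writes a flat pandas DataFrame to Parquet. The input file name, cut values, and saved columns are assumptions, not the actual copperhead selection.

# Illustrative only: a simplified Stage-1-style selection, not the actual
# copperhead processor. File name, cuts, and columns are placeholders.
import awkward as ak
import pandas as pd
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

# Load a NanoAOD file into coffea's NanoEvents (jagged Awkward arrays)
events = NanoEventsFactory.from_root(
    "nanoaod_example.root",  # hypothetical input file
    schemaclass=NanoAODSchema,
).events()

# Object selection: "good" muons (placeholder thresholds)
muons = events.Muon
good = muons[(muons.pt > 20) & (abs(muons.eta) < 2.4) & muons.mediumId]

# Event selection: keep events with at least two good muons
mask = ak.num(good) >= 2
good, events = good[mask], events[mask]

# Construct a new variable: invariant mass of the two leading muons
mu1, mu2 = good[:, 0], good[:, 1]
dimuon_mass = (mu1 + mu2).mass

# Flatten to a pandas DataFrame and save to Apache Parquet
df = pd.DataFrame(
    {
        "event": ak.to_numpy(events.event),
        "mu1_pt": ak.to_numpy(mu1.pt),
        "mu2_pt": ak.to_numpy(mu2.pt),
        "dimuon_mass": ak.to_numpy(dimuon_mass),
    }
)
df.to_parquet("stage1_output.parquet")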
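
Similarly, here is a minimal sketch of the Stage 2 per-partition loop, assuming an XGBoost BDT for the MVA and scikit-hep/hist for histogramming. The model path, feature list, and category boundary are placeholder assumptions.

# Illustrative only: a simplified Stage-2-style per-partition loop.
import dask
import dask.dataframe as dd
import xgboost as xgb
from hist import Hist

# Load Stage 1 outputs (Parquet files) as a partitioned, lazy Dask DataFrame
ddf = dd.read_parquet("stage1_output/*.parquet")
ddf = ddf.repartition(npartitions=8)  # optional re-partitioning

def process_partition(pdf):
    # pdf is a "computed" partition, i.e. a plain pandas DataFrame
    booster = xgb.Booster()
    booster.load_model("bdt_model.json")  # hypothetical trained BDT
    features = ["dimuon_mass", "mu1_pt", "mu2_pt"]
    pdf = pdf.copy()
    pdf["bdt_score"] = booster.predict(xgb.DMatrix(pdf[features]))

    # Event categorization by MVA score (placeholder boundary)
    pdf["category"] = (pdf["bdt_score"] > 0.5).astype(int)

    # Fill a histogram of the dimuon mass per category
    h = (
        Hist.new.IntCat([0, 1], name="category")
        .Reg(80, 110, 150, name="dimuon_mass", label="m(µµ) [GeV]")
        .Double()
    )
    h.fill(category=pdf["category"], dimuon_mass=pdf["dimuon_mass"])
    return h

# Process the partitions in parallel and merge the resulting histograms
delayed_hists = [dask.delayed(process_partition)(p) for p in ddf.to_delayed()]
results = dask.compute(*delayed_hists)
total = results[0]
for h in results[1:]:
    total = total + h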

Job parallelization

The analysis workflow is efficiently parallelised using dask/distributed with either a local cluster (using the CPUs of the node where the job is launched) or a distributed Slurm cluster initialized over multiple computing nodes. The instructions for the Dask client initialization in both modes can be found here.

It is possible to create a cluster with other batch submission systems (HTCondor, PBS, etc., see full list in Dask-Jobqueue API).
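
For illustration, a minimal sketch of the two client modes mentioned above is given below; the worker counts, queue name, and resource requests are placeholder values.

# Illustrative only: minimal Dask client setup for the two modes described
# above; worker counts, queue name, and resource requests are placeholders.
from dask.distributed import Client, LocalCluster

# Local cluster: uses the CPUs of the node where the job is launched
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# Alternatively, a Slurm cluster over multiple nodes (via dask-jobqueue):
# from dask_jobqueue import SLURMCluster
# cluster = SLURMCluster(queue="normal", cores=1, memory="4GB")
# cluster.scale(100)  # ask Slurm for 100 workers
# client = Client(cluster)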

Installation instructions

Work from a conda environment to avoid version conflicts:

module load anaconda/5.3.1-py37
conda create --name hmumu python=3.7
source activate hmumu

Installation:

git clone https://github.com/Run3HmmAnalysis/copperhead
cd copperhead
python3 -m pip install --user --upgrade -r requirements.txt

If accessing datasets via XRootD is needed:

source /cvmfs/cms.cern.ch/cmsset_default.sh
. setup_proxy.sh
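
Once the grid proxy is set up, remote NanoAOD files can be opened directly through an XRootD redirector, for example with uproot (which coffea uses for reading ROOT files). The sketch below is illustrative; the redirector and dataset path are placeholders.

# Illustrative only: opening a remote NanoAOD file over XRootD once a valid
# grid proxy is available (requires the XRootD Python bindings).
import uproot

fname = "root://cms-xrd-global.cern.ch//store/mc/.../NANOAODSIM/.../file.root"
with uproot.open(fname) as f:
    print(f["Events"].num_entries)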

Test runs

Run each stage individually, or run the full analysis workflow, on a single input file:

python3 -W ignore tests/test_stage1.py
python3 -W ignore tests/test_stage2.py
python3 -W ignore tests/test_stage3.py
python3 -W ignore tests/test_continuous.py

Credits
