- The framework inherits the analysis code originally designed for the Run 2 H→µµ decay search with the CMS detector at the LHC. The published Run 2 results in the channel targeting the VBF Higgs production mode were reproduced with 1% precision.
- The tools developed for this framework were also used for the HL-LHC projections of the H→µµ search sensitivity.
Currently the framework is under development to integrate both of the main Higgs production modes (ggH and VBF), and to prepare for analyzing the Run 3 data when it becomes available.
The input data for the framework should be in the NanoAOD format.
The analysis workflow contains three stages:
- Stage 1 includes event and object selection, application of corrections, and construction of new variables. The data processing is implemented via a columnar approach, making use of the tools provided by the coffea package. The data columns are handled via coffea's NanoEvents format, which relies on jagged arrays implemented in the Awkward Array package. After event selection, the jagged arrays are converted to flat pandas DataFrames and saved into Apache Parquet files (a code sketch of this data flow is shown after the list of stages).
- Stage 2 (WIP) contains / will contain evaluation of MVA methods (boosted decision trees, deep neural networks), event categorization, and production of histograms. The Stage 2 workflow is structured as follows (see the per-partition sketch after the list of stages):
  - Outputs of Stage 1 (Parquet files) are loaded as partitions of a Dask DataFrame (similar to a pandas DataFrame, but partitioned and "lazy").
  - The Dask DataFrame is (optionally) re-partitioned to decrease the number of partitions.
  - The partitions are processed in parallel; for each partition, the following sequence is executed:
    - The partition of the Dask DataFrame is "computed" (converted to a pandas DataFrame).
    - Evaluation of MVA models (can also be done after categorization). The current implementation includes evaluation of PyTorch DNN models and/or XGBoost BDT models. Other methods can be implemented, but one has to verify that they would work well in a distributed environment (e.g. TensorFlow sessions are not very good for that).
    - Definition of event categories and/or MVA bins.
    - Creating histograms using scikit-hep/hist.
    - Saving histograms.
    - (Optionally) saving individual columns (can be used later for unbinned fits).
- Stage 3 (WIP) contains / will contain plotting, parametric fits, and preparation of datacards for statistical analysis. The plotting is done via scikit-hep/mplhep (see the plotting sketch after the list of stages).
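A minimal, simplified sketch of the Stage 1 data flow, assuming a local NanoAOD file and a plain dimuon selection (the file name, cuts, and column names are placeholders and do not reproduce the actual copperhead selection):

```python
# Illustration only: read NanoAOD into coffea NanoEvents, select events,
# build a new variable, flatten to pandas, and write a Parquet file.
import awkward as ak
import pandas as pd
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

events = NanoEventsFactory.from_root(
    "nano.root",  # placeholder NanoAOD input file
    schemaclass=NanoAODSchema,
).events()

# Object selection: muons passing basic kinematic and ID requirements
muons = events.Muon[
    (events.Muon.pt > 20) & (abs(events.Muon.eta) < 2.4) & events.Muon.mediumId
]

# Event selection: keep events with exactly two selected muons
mumu = muons[ak.num(muons) == 2]

# New variable: dimuon invariant mass (jagged arrays -> flat columns)
dimuon = mumu[:, 0] + mumu[:, 1]
df = pd.DataFrame(
    {
        "dimuon_mass": ak.to_numpy(dimuon.mass),
        "mu1_pt": ak.to_numpy(mumu[:, 0].pt),
        "mu2_pt": ak.to_numpy(mumu[:, 1].pt),
    }
)

# Stage 1 output: flat DataFrame saved in Apache Parquet format
df.to_parquet("stage1_output.parquet")
```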
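A minimal sketch of the Stage 2 per-partition pattern, assuming the columns produced by the Stage 1 sketch above (the paths, binning, and the trivial selection standing in for categorization are placeholders; MVA evaluation would happen inside the per-partition function):

```python
# Illustration only: load Stage 1 Parquet outputs as a Dask DataFrame,
# process each partition in parallel, and fill one histogram per partition.
import dask
import dask.dataframe as dd
from dask.distributed import Client
from hist import Hist


def process_partition(df):
    # df is a pandas DataFrame holding one "computed" partition.
    # This is where MVA models (PyTorch DNN, XGBoost BDT) would be evaluated
    # and event categories defined; here we only apply a simple selection.
    df = df[df["dimuon_mass"] > 110]
    h = Hist.new.Reg(80, 110, 150, name="dimuon_mass").Double()
    h.fill(dimuon_mass=df["dimuon_mass"])
    return h


if __name__ == "__main__":
    client = Client()  # local Dask cluster; see the parallelization notes below
    ddf = dd.read_parquet("stage1_output/*.parquet")  # placeholder path
    ddf = ddf.repartition(npartitions=4)  # optional re-partitioning
    # One delayed task per partition, executed in parallel on the cluster
    tasks = [dask.delayed(process_partition)(part) for part in ddf.to_delayed()]
    hists = dask.compute(*tasks)
    total = sum(hists[1:], hists[0])  # merge per-partition histograms
    print(total)
```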
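A minimal plotting sketch using scikit-hep/mplhep, with a toy histogram standing in for a Stage 2 output (the styling, labels, and output file name are illustrative):

```python
# Illustration only: draw a histogram with mplhep using a CMS-like style.
import numpy as np
import matplotlib.pyplot as plt
import mplhep as hep
from hist import Hist

# Toy histogram standing in for a Stage 2 output
h = Hist.new.Reg(80, 110, 150, name="dimuon_mass").Double()
h.fill(dimuon_mass=np.random.normal(125, 2, 10_000))

hep.style.use("CMS")  # CMS plotting style shipped with mplhep
fig, ax = plt.subplots()
hep.histplot(h, ax=ax, histtype="step", label="toy data")
ax.set_xlabel(r"$m_{\mu\mu}$ [GeV]")
ax.set_ylabel("Events")
ax.legend()
fig.savefig("dimuon_mass.png")
```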
The analysis workflow is efficiently parallelised using dask/distributed, with either a local cluster (uses the CPUs of the node where the job is launched) or a distributed Slurm cluster initialized over multiple computing nodes. The instructions for the Dask client initialization in both modes can be found here.
It is possible to create a cluster with other batch submission systems (HTCondor, PBS, etc.; see the full list in the Dask-Jobqueue API).
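A minimal sketch of the two client-initialization modes, assuming dask-jobqueue is installed for the Slurm case (the queue name, resources, and walltime are placeholders that depend on the computing site):

```python
from dask.distributed import Client, LocalCluster

# Mode 1: local cluster, using the CPUs of the node where the job is launched
client = Client(LocalCluster(n_workers=4, threads_per_worker=1))

# Mode 2: distributed cluster spanning multiple nodes via Slurm (dask-jobqueue)
# from dask_jobqueue import SLURMCluster
# cluster = SLURMCluster(queue="somequeue", cores=1, memory="4GB", walltime="02:00:00")
# cluster.scale(100)   # request up to 100 Slurm worker jobs
# client = Client(cluster)
```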
Work from a conda environment to avoid version conflicts:
```bash
module load anaconda/5.3.1-py37
conda create --name hmumu python=3.7
source activate hmumu
```
Installation:
```bash
git clone https://github.com/Run3HmmAnalysis/copperhead
cd copperhead
python3 -m pip install --user --upgrade -r requirements.txt
```
If access to datasets via XRootD will be needed:
```bash
source /cvmfs/cms.cern.ch/cmsset_default.sh
. setup_proxy.sh
```
Run each stage individually, or run the full analysis workflow, on a single input file:
```bash
python3 -W ignore tests/test_stage1.py
python3 -W ignore tests/test_stage2.py
python3 -W ignore tests/test_stage3.py
python3 -W ignore tests/test_continuous.py
```
- Original developer: Dmitry Kondratyev
- Contributors: Arnab Purohit, Stefan Piperov