- The framework inherits the analysis code originally designed for the Run 2 H→µµ decay search with the CMS detector at the LHC. The published Run 2 results in the channel targeting the VBF Higgs production mode were reproduced with 1% precision.
- The tools developed for this framework were also used for the HL-LHC projections of the H→µµ search sensitivity.
Currently the framework is under development to integrate both of the main Higgs production modes (ggH and VBF), and to prepare for analyzing the Run 3 data when it becomes available.
The input data for the framework should be in the NanoAOD format.
The analysis workflow contains three stages:
- Stage 1 includes event and object selection, application of corrections, and construction of new variables. The data processing is implemented via a columnar approach, making use of the tools provided by the coffea package. The data columns are handled via coffea's NanoEvents format, which relies on jagged arrays implemented in the Awkward Array package. After event selection, the jagged arrays are converted to flat pandas DataFrames and saved into Apache Parquet files (a code sketch of this data flow is shown after the list of stages).
- Stage 2 (WIP) contains / will contain evaluation of MVA methods (boosted decision trees, deep neural networks), event categorization, and production of histograms. The Stage 2 workflow is structured as follows (see the per-partition sketch after the list of stages):
  - Outputs of Stage 1 (Parquet files) are loaded as partitions of a Dask DataFrame (similar to a pandas DataFrame, but partitioned and "lazy").
  - The Dask DataFrame is (optionally) re-partitioned to decrease the number of partitions.
  - The partitions are processed in parallel; for each partition, the following sequence is executed:
    - The partition of the Dask DataFrame is "computed" (converted to a pandas DataFrame).
    - Evaluation of MVA models (can also be done after categorization). The current implementation includes evaluation of PyTorch DNN models and/or XGBoost BDT models. Other methods can be implemented, but one has to verify that they would work well in a distributed environment (e.g. TensorFlow sessions are not very good for that).
    - Definition of event categories and/or MVA bins.
    - Creating histograms using scikit-hep/hist.
    - Saving histograms.
    - (Optionally) saving individual columns (can be used later for unbinned fits).
- Stage 3 (WIP) contains / will contain plotting, parametric fits, and preparation of datacards for statistical analysis. The plotting is done via scikit-hep/mplhep (see the plotting sketch after the list of stages).
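A minimal, simplified sketch of the Stage 1 data flow, assuming a local NanoAOD file and a plain dimuon selection (the file name, cuts, and column names are placeholders and do not reproduce the actual copperhead selection):

```python
# Illustration only: read NanoAOD into coffea NanoEvents, select events,
# build a new variable, flatten to pandas, and write a Parquet file.
import awkward as ak
import pandas as pd
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

events = NanoEventsFactory.from_root(
    "nano.root",  # placeholder NanoAOD input file
    schemaclass=NanoAODSchema,
).events()

# Object selection: muons passing basic kinematic and ID requirements
muons = events.Muon[
    (events.Muon.pt > 20) & (abs(events.Muon.eta) < 2.4) & events.Muon.mediumId
]

# Event selection: keep events with exactly two selected muons
mumu = muons[ak.num(muons) == 2]

# New variable: dimuon invariant mass (jagged arrays -> flat columns)
dimuon = mumu[:, 0] + mumu[:, 1]
df = pd.DataFrame(
    {
        "dimuon_mass": ak.to_numpy(dimuon.mass),
        "mu1_pt": ak.to_numpy(mumu[:, 0].pt),
        "mu2_pt": ak.to_numpy(mumu[:, 1].pt),
    }
)

# Stage 1 output: flat DataFrame saved in Apache Parquet format
df.to_parquet("stage1_output.parquet")
```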
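A minimal sketch of the Stage 2 per-partition pattern, assuming the columns produced by the Stage 1 sketch above (the paths, binning, and the trivial selection standing in for categorization are placeholders; MVA evaluation would happen inside the per-partition function):

```python
# Illustration only: load Stage 1 Parquet outputs as a Dask DataFrame,
# process each partition in parallel, and fill one histogram per partition.
import dask
import dask.dataframe as dd
from dask.distributed import Client
from hist import Hist


def process_partition(df):
    # df is a pandas DataFrame holding one "computed" partition.
    # This is where MVA models (PyTorch DNN, XGBoost BDT) would be evaluated
    # and event categories defined; here we only apply a simple selection.
    df = df[df["dimuon_mass"] > 110]
    h = Hist.new.Reg(80, 110, 150, name="dimuon_mass").Double()
    h.fill(dimuon_mass=df["dimuon_mass"])
    return h


if __name__ == "__main__":
    client = Client()  # local Dask cluster; see the parallelization notes below
    ddf = dd.read_parquet("stage1_output/*.parquet")  # placeholder path
    ddf = ddf.repartition(npartitions=4)  # optional re-partitioning
    # One delayed task per partition, executed in parallel on the cluster
    tasks = [dask.delayed(process_partition)(part) for part in ddf.to_delayed()]
    hists = dask.compute(*tasks)
    total = sum(hists[1:], hists[0])  # merge per-partition histograms
    print(total)
```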
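A minimal plotting sketch using scikit-hep/mplhep, with a toy histogram standing in for a Stage 2 output (the styling, labels, and output file name are illustrative):

```python
# Illustration only: draw a histogram with mplhep using a CMS-like style.
import numpy as np
import matplotlib.pyplot as plt
import mplhep as hep
from hist import Hist

# Toy histogram standing in for a Stage 2 output
h = Hist.new.Reg(80, 110, 150, name="dimuon_mass").Double()
h.fill(dimuon_mass=np.random.normal(125, 2, 10_000))

hep.style.use("CMS")  # CMS plotting style shipped with mplhep
fig, ax = plt.subplots()
hep.histplot(h, ax=ax, histtype="step", label="toy data")
ax.set_xlabel(r"$m_{\mu\mu}$ [GeV]")
ax.set_ylabel("Events")
ax.legend()
fig.savefig("dimuon_mass.png")
```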
The analysis workflow is efficiently parallelised using dask/distributed, with either a local cluster (uses the CPUs of the node where the job is launched) or a distributed Slurm cluster initialized over multiple computing nodes. The instructions for the Dask client initialization in both modes can be found here.
It is possible to create a cluster with other batch submission systems (HTCondor, PBS, etc.; see the full list in the Dask-Jobqueue API).
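A minimal sketch of the two client-initialization modes, assuming dask-jobqueue is installed for the Slurm case (the queue name, resources, and walltime are placeholders that depend on the computing site):

```python
from dask.distributed import Client, LocalCluster

# Mode 1: local cluster, using the CPUs of the node where the job is launched
client = Client(LocalCluster(n_workers=4, threads_per_worker=1))

# Mode 2: distributed cluster spanning multiple nodes via Slurm (dask-jobqueue)
# from dask_jobqueue import SLURMCluster
# cluster = SLURMCluster(queue="somequeue", cores=1, memory="4GB", walltime="02:00:00")
# cluster.scale(100)   # request up to 100 Slurm worker jobs
# client = Client(cluster)
```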
Work from a conda environment to avoid version conflicts:
```bash
module load anaconda/5.3.1-py37
conda create --name hmumu python=3.7
source activate hmumu
```
Installation:
```bash
git clone https://github.com/Run3HmmAnalysis/copperhead
cd copperhead
python3 -m pip install --user --upgrade -r requirements.txt
```
If access to datasets via XRootD will be needed:
```bash
source /cvmfs/cms.cern.ch/cmsset_default.sh
. setup_proxy.sh
```
Run each stage individually, or run the full analysis workflow, on a single input file:
```bash
python3 -W ignore tests/test_stage1.py
python3 -W ignore tests/test_stage2.py
python3 -W ignore tests/test_stage3.py
python3 -W ignore tests/test_continuous.py
```
- Original developer: Dmitry Kondratyev
- Contributors: Arnab Purohit, Stefan Piperov