CMS Open Data for ML

This shows an exemplary way to read CMS Open Data MiniAOD, and end up with parquet files containing the data in the form of awkward arrays. This is a storage-friendly alternative to just saving in numpy / pandas etc. which would already fill empty or non-existing features (varying number of particles per event...) by padding with some placeholder value to create flat tuples.

Running on OpenData requires at least some CMS-software release to read and process the samples, packed in a virtual machine or from /cvmfs, if you have access. Then convert them to easier-to-process root-files, and make them usable by any deep learning framework.

Tested (local + batch system) at RWTH HPC (CLAIX), tested local setup at lx3agpu.

Overview, local setup and example

CMS-part

The magic that needs to be done first:

source /cvmfs/cms.cern.ch/cmsset_default.sh
cmsrel CMSSW_10_6_30
cd CMSSW_10_6_30/src/
cmsenv
git cms-init
git cms-merge-topic 39040
git clone -b master [email protected]:ErUM-AISafety/OpenNano.git PhysicsTools/OpenNano
scram b -j 18

(If you encounter problems with cloning via ssh, either setup an ssh-key between the device and github.com, or try https instead, for the latter, replace [email protected]: with https://github.com/).

Using cmsRun on a "non-official" site though means a local site-configuration is necessary. Let's trick cmsRun to think we are an actual site, despite working locally (getting lucky is not a crime, there appears to be a config for RWTH-HPC already):

mkdir -p /home/um106329/BMBF_AISafety/OpenDataAISafety/CMS/CMSSW_10_6_30/src/SITECONF/local/JobConfig
cp /cvmfs/cms.cern.ch/SITECONF/T2_DE_RWTH/RWTH-HPC/JobConfig/site-local-config.xml /home/um106329/BMBF_AISafety/OpenDataAISafety/CMS/CMSSW_10_6_30/src/SITECONF/local/JobConfig
export CMS_PATH=/home/um106329/BMBF_AISafety/OpenDataAISafety/CMS/CMSSW_10_6_30/src
scram b -j 18

(kudos to https://twiki.cern.ch/twiki/bin/view/Main/RobinGitlabCICMSSW which is a similar use case). Of course you need to modify these paths for your own filesystem.

Conda-Setup

No conda setup yet? Then do

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

We will need to use some conda environment containing relevant packages. The full suite is delivered with the repository and can be installed like so:

conda env create -f env.yml

Optional for CMS-members:

Optional (e.g. as CMS-member, not necessary at all): With the help of a conda environment and your certificate .pem in the ~/.globus directory, create a proxy (e.g. voms-proxy-init --voms cms --vomses .grid-security/vomses --valid=192:00) and copy the proxy into some accessible directory (cp /tmp/x509up_u40434 /home/um106329/BMBF_AISafety/OpenDataAISafety/CMS). This would allow you to use the framework also for official samples.

Problems with that? Do

mkdir ~/.grid-security
cp -r /cvmfs/grid.cern.ch/etc/grid-security/vomses ~/.grid-security
cp -r /cvmfs/grid.cern.ch/etc/grid-security/vomsdir ~/.grid-security/

A first configuration with example file

Now assume there already is a MiniAOD file (after doing xrdcp, or if you used the https-option of the opendata-client).
Example: xrdcp root://eospublic.cern.ch//eos/opendata/cms/mc/RunIIFall15MiniAODv2/TTToSemiLeptonic_13TeV-powheg/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext1-v1/00000/001CCEB6-4EC4-E511-B8DC-00259074AEAC.root /home/um106329/BMBF_AISafety/OpenDataAISafety/CMS/CMSSW_10_6_30/src/PhysicsTools/OpenNano/test/tt_miniaod.root Then one can finally do:

cmsDriver.py --python_filename custom_tt_cfg.py --eventcontent NANOAODSIM --datatier NANOAODSIM \
  --fileout file:custom_tt_nanoaod.root --conditions 102X_mcRun2_asymptotic_v8 --step NANO \
  --filein file:tt_miniaod.root --era Run2_25ns,run2_nanoAOD_106X2015 --no_exec --mc -n 100 \
  --customise PhysicsTools/OpenNano/opennano_cff.Opennano_customizeMC_allPF_add_CustomTagger_and_Truth

and voilà: cmsRun custom_tt_cfg.py finally works, also inside a SLURM job.

Processing OpenNano more systematically

From root to something else

The output of the previous step (a NanoAOD-like root-file) can be further processed, examples provided in test/process_opennano.py.

Prepare dataset lists (to process batch-wise)

It's convenient to have a set of files to run over, stored in a .txt file. Use something similar to that: xrdfs eospublic.cern.ch ls /eos/opendata/cms/mc/RunIIFall15MiniAODv2/TTToSemiLeptonic_13TeV-powheg/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext1-v1/00000 > samples_MiniAOD/ttsemi.txt
Some hints: the running number 00000 is not necessarily the only one, look for the next ones in line like 00001... and if you want, concatenate them together into a super-txt file containing more than 1K lines / filepaths.

Do relevant steps (after initial setup) via SLURM

Now one would like to do all those steps without manually pasting such commands for every file. That's what sbatch test/opennano_rocky_multiFile.sh is for. A set of filelists can be stored in samples_MiniAOD for that purpose. And because OpenData does not require authentication as CMS member, this will even work without the active proxy. When running over a new kind of dataset, make sure to call the correct path to the .txt and also give it a new name (directory) to store the resulting files.

Name		Name	Last commit message	Last commit date
Latest commit History 156 Commits
interface		interface
plugins		plugins
python		python
samples_MiniAOD		samples_MiniAOD
test		test
.gitignore		.gitignore
BuildFile.xml		BuildFile.xml
LICENSE		LICENSE
README.md		README.md
env.yml		env.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CMS Open Data for ML

Overview, local setup and example

CMS-part

Conda-Setup

A first configuration with example file

Processing OpenNano more systematically

From root to something else

Prepare dataset lists (to process batch-wise)

Do relevant steps (after initial setup) via SLURM

About

Releases 1

Packages

Languages

License

AnnikaStein/OpenNano

Folders and files

Latest commit

History

Repository files navigation

CMS Open Data for ML

Overview, local setup and example

CMS-part

Conda-Setup

A first configuration with example file

Processing OpenNano more systematically

From root to something else

Prepare dataset lists (to process batch-wise)

Do relevant steps (after initial setup) via SLURM

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages