-
Notifications
You must be signed in to change notification settings - Fork 393
settinguptheanalysis
The input to combine, which defines the details of the experiment, is a datacard file is a plain ASCII file. This is true whether the experiment is a simple counting experiment or a shape analysis.
The file data/tutorials/counting/realistic-counting-experiment.txt shows an example of a counting experiment.
The first lines can be used as a description and are not parsed by the program. They have to begin with a "#":
# Simple counting experiment, with one signal and a few background processes
# Simplified version of the 35/pb H->WW analysis for mH = 160 GeV
Then one declares the number of observables, imax
, that are present in the model used to calculate limits/significances. The number of observables will typically be the number of channels in a counting experiment or the number of bins in a binned shape fit. (If one specifies for imax
the value *
it means "figure it out from the rest of the datacard", but in order to better catch mistakes it's recommended to specify it explicitly)
imax 1 number of channels
Then one declares the number of background sources to be considered, jmax
, and the number of independent sources of systematic uncertainties, kmax
:
jmax 3 number of backgrounds
kmax 5 number of nuisance parameters (sources of systematic uncertainties)
In the example there is 1 channel, there are 3 background sources, and there are 5 independent sources of systematic uncertainty.
Then there are the lines describing what is actually observed: the number of events observed in each channel. The first line, starting with bin
defines the label used for each channel. In the example we have 1 channel, labelled 1
, and in the following line, observation
, are listed the events observed, 0
in this example:
# we have just one channel, in which we observe 0 events
bin bin1
observation 0
Following is the part related to the number of events expected, for each bin and process, arranged in (#channels)*(#processes) columns.
bin bin1 bin1 bin1 bin1
process ggH qqWW ggWW others
process 0 1 2 3
rate 1.47 0.63 0.06 0.22
- The
bin
line identifies the channel the column is referring to. It goes from1
to theimax
declared above. - The first
process
line contains the labels of the various sources - The second
process
line must have a positive number for backgrounds, and0
or a negative number for the signals. You should use different process ids for different processes. - The last line,
rate
, tells the expected yield of events in the specified bin and process
All bins should be declared in increasing order, and within each bin one should include all processes in increasing order, specifying a 0 for processes that do not contribute.
The last section contains the description of the systematic uncertainties:
lumi lnN 1.11 - 1.11 - lumi affects both signal and gg->WW (mc-driven). lnN = lognormal
xs_ggH lnN 1.16 - - - gg->H cross section + signal efficiency + other minor ones.
WW_norm gmN 4 - 0.16 - - WW estimate of 0.64 comes from sidebands: 4 events in sideband times 0.16 (=> ~50% statistical uncertainty)
xs_ggWW lnN - - 1.50 - 50% uncertainty on gg->WW cross section
bg_others lnN - - - 1.30 30% uncertainty on the rest of the backgrounds
- the first columns is a label identifying the uncertainty
- the second column identifies the type of distribution used
-
lnN
stands for Log-normal, which is the recommended choice for multiplicative corrections (efficiencies, cross sections, ...). If Δx/x is the relative uncertainty on the multiplicative correction, one should put 1+Δx/x in the column corresponding to the process and channel. Asymetric log-normals are supported by providing κdown/κup where κdown is the ratio of the the yield to the nominal value for a -1σ deviation of the nuisance and κup is the ratio of the the yield to the nominal value for a +1σ deviation. -
gmN
stands for Gamma, and is the recommended choice for the statistical uncertainty on a background coming from the number of events in a control region (or in a MC sample with limited statistics). If the control region or MC contains N events, and the extrapolation factor from the control region to the signal region is α then one shoud put N just after thegmN
keyword, and then the value of α in the proper column. Also, the yield in therate
row should match with Nα -
lnU
stands for log-uniform distribution. A value of 1+ε in the column will imply that the yield of this background is allowed to float freely between x(1+ε) and x/(1+ε) (in particular, if ε is small, then this is approximately (x-Δx,x+Δx) with ε=Δx/x ) This is normally useful when you want to set a large a-priori uncertainty on a given background and then rely on the correlation between channels to constrain it. Beware that while Gaussian-like uncertainties behave in a similar way under profiling and marginalization, uniform uncertainties do not, so the impact of the uncertainty on the result will depend on how the nuisances are treated.
-
- then there are (#channels)*(#processes) columns reporting the relative effect of the systematic uncertainty on the rate of each process in each channel. The columns are aligned with the ones in the previous lines declaring bins, processes and rates.
In the example, there are 5 uncertainties:
- the first uncertainty affects the signal by 11%, and affects the
ggWW
process by 11% - the second uncertainty affects the signal by 16% leaving the backgrounds unaffected
- the third line specifies that the
qqWW
background comes from a sideband with 4 observed events and an extrapolation factor of 0.16; the resulting uncertainty on the expected yield is$1/\sqrt{4+1}$ = 45% - the fourth uncertainty does not affect the signal, affects the
ggWW
background by 50%, leaving the other backgrounds unaffected - the last uncertainty does not affect the signal, affects by 30% the
others
backgrounds, leaving the rest of the backgrounds unaffected
The datacard has to be supplemented with two extensions:
- A new block of lines defining how channels and processes are mapped into shapes
- The block for systematics that can contain also rows with shape uncertainties.
The expected shape can be parametric or not parametric. In the first case the parametric pdfs have to be given as input to the tool. In the latter case, for each channel, histograms have to be provided for the expected shape of each process. For what concerns data, they have to be provided as input to the tool as a histogram to perform a binned shape analysis and as a RooDataSet to perform an unbinned shape analysis.
[danger] If using RooFit based inputs (RooDataHists/RooDataSets/RooAbsPdfs) then you should be careful to use different RooRealVars as the observable in each category being combined. It is possible to use the same RooRealVar if the observable has the same range (and binning if using binned data) in each category though in most cases it is simpler to avoid doing so.
As with the counting experiment, the total nominal rate of a given process must be identified in the rate line of the datacard. However, there are special options for shape based analyses as follows
- A value of -1 in the rate line indicates to combine to calculate the rate from the input TH1 (via TH1::Integral) or RooDataSet/RooDataHist (via RooAbsData::sumEntries)
- For parametric shapes (RooAbsPdf), if a parameter is found in the input workspace with the name pdfname**_norm** the rate will be multiplied by the value of that parameter. Note that since this parameter can be freely floating, the normalization of a shape can be made to freely float this way. This can also be achieved through the use of
rateParams
For each channel, histograms have to be provided for the observed shape and for the expected shape of each process.
- Within each channel, all histograms must have the same binning.
- The normalization of the data histogram must correspond to the number of observed events
- The normalization of the expected histograms must match the expected yields
The combine tool can take as input histograms saved as TH1 or as RooAbsHist in a RooFit workspace (an example of how to create a RooFit workspace and save histograms is available in github).
The block of lines defining the mapping (first block in the datacard) contains one or more rows in the form
- **shapes process channel file histogram [histogram_with_systematics] **
In this line
- process is any one the process names, or * for all processes, or data_obs for the observed data
- channel is any one the process names, or * for all channels
-
file, histogram and histogram_with_systematics identify the names of the files and of the histograms within the file, after doing some replacements (if any are found):
- $PROCESS is replaced with the process name (or "data_obs" for the observed data)
- $CHANNEL is replaced with the channel name
- $SYSTEMATIC is replaced with the name of the systematic + (Up, Down)
- $MASS is replaced with the higgs mass value which is passed as option in the command line used to run the limit tool
In addition, user defined keywords can be included to be replaced. Any word in the datacard $WORD will be replaced by VALUE when including the option --keyword-value WORD=VALUE
. The option can be repeated multiple times for multiple keywords.
Shape uncertainties can be taken into account by vertical interpolation of the histograms. The shapes are interpolated quadratically for shifts below 1σ and linearly beyond. The normalizations are interpolated linearly in log scale just like we do for log-normal uncertainties.
For each shape uncertainty and process/channel affected by it, two additional input shapes have to be provided, obtained shifting that parameter up and down by one standard deviation. When building the likelihood, each shape uncertainty is associated to a nuisance parameter taken from a unit gaussian distribution, which is used to interpolate or extrapolate using the specified histograms.
For each given source of shape uncertainty, in the part of the datacard containing shape uncertainties (last block), there must be a row
- ** name shape effect_for_each_process_and_channel **
The effect can be "-" or 0 for no effect, 1 for normal effect, and possibly something different from 1 to test larger or smaller effects (in that case, the unit gaussian is scaled by that factor before using it as parameter for the interpolation)
The datacard in data/tutorials/shapes/simple-shapes-TH1.txt is a clear example of how to include shapes in the datacard. In the first block the following line specifies the shape mapping:
shapes * * simple-shapes-TH1.root $PROCESS $PROCESS_$SYSTEMATIC
The last block concerns the treatment of the systematics affecting shapes. In this part the two uncertainties effecting on the shape are listed.
alpha shape - 1 uncertainty on background shape and normalization
sigma shape 0.5 - uncertainty on signal resolution. Assume the histogram is a 2 sigma shift,
# so divide the unit gaussian by 2 before doing the interpolation
There are two options for the interpolation algorithm in the "shape" uncertainty. Putting shape
will result in a quadratic interpolation (within +/-1 sigma) and a linear extrapolation (beyond +/-1 sigma) of the fraction of events in each bin - i.e the histograms are first normalised before interpolation. Putting shapeN
while instead base the interpolation on the logs of the fraction in each bin. For both shape
and shapeN
, the total normalisation is interpolated using an asymmetric log-normal so that the effect of the systematic on both the shape and normalisation are accounted for. The following image shows a comparison of those two algorithms for this datacard.
Show comparison
In this case there are two processes, signal and background, and two uncertainties affecting background (alpha) and signal shape (sigma). Within the root file 2 histograms per systematic have to be provided, they are the shape obtained, for the specific process, shifting up and down the parameter associated to the uncertainty: background_alphaUp
and background_alphaDown
, signal_sigmaUp
and signal_sigmaDown
.
This is the content of the root file simple-shapes-TH1.root associated to the datacard data/tutorials/shapes/simple-shapes-TH1.txt:
root [0]
Attaching file simple-shapes-TH1.root as _file0...
root [1] _file0->ls()
TFile** simple-shapes-TH1.root
TFile* simple-shapes-TH1.root
KEY: TH1F signal;1 Histogram of signal__x
KEY: TH1F signal_sigmaUp;1 Histogram of signal__x
KEY: TH1F signal_sigmaDown;1 Histogram of signal__x
KEY: TH1F background;1 Histogram of background__x
KEY: TH1F background_alphaUp;1 Histogram of background__x
KEY: TH1F background_alphaDown;1 Histogram of background__x
KEY: TH1F data_obs;1 Histogram of data_obs__x
KEY: TH1F data_sig;1 Histogram of data_sig__x
For example, without shape uncertainties you could have just one row with
shapes * * shapes.root $CHANNEL/$PROCESS
Then for a simple example for two channels "e", "mu" with three processes "higgs", "zz", "top" you should create a rootfile that contains the following
histogram | meaning |
---|---|
e/data_obs |
observed data in electron channel |
e/higgs |
expected shape for higgs in electron channel |
e/zz |
expected shape for ZZ in electron channel |
e/top |
expected shape for top in electron channel |
mu/data_obs |
observed data in muon channel |
mu/higgs |
expected shape for higgs in muon channel |
mu/zz |
expected shape for ZZ in muon channel |
mu/top |
expected shape for top in muon channel |
If you also have one uncertainty that affects the shape, e.g. jet energy scale, you should create shape histograms for the jet energy scale shifted up by one sigma, you could for example do one folder for each process and write a like like
shapes * * shapes.root $CHANNEL/$PROCESS/nominal $CHANNEL/$PROCESS/$SYSTEMATIC
or just attach a postifx to the name of the histogram
shapes * * shapes.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC
For a detailed example of a template based binned analysis see the H→ττ 2014 DAS tutorial
In some cases, it can be convenient to describe the expected signal and background shapes in terms of analytical functions rather than templates; a typical example are the searches where the signal is apparent as a narrow peak over a smooth continuum background. In this context, uncertainties affecting the shapes of the signal and backgrounds can be implemented naturally as uncertainties on the parameters of those analytical functions. It is also possible to adapt an agnostic approach in which the parameters of the background model are left freely floating in the fit to the data, i.e. only requiring the background to be well described by a smooth function.
Technically, this is implemented by means of the RooFit package, that allows writing generic probability density functions, and saving them into ROOT files. The pdfs can be either taken from RooFit's standard library of functions (e.g. Gaussians, polynomials, ...) or hand-coded in C++, and combined together to form even more complex shapes.
In the datacard using templates, the column after the file name would have been the name of the histogram. For the parametric analysis we need two names to identify the mapping, separated by a colon (:
).
shapes process channel shapes.root workspace_name:pdf_name
The first part identifies the name of the input RooWorkspace containing the pdf, and the second part the name of the RooAbsPdf inside it (or, for the observed data, the RooAbsData). There can be multiple input workspaces, just as there can be multiple input root files. You can use any of the usual RooFit pre-defined pdfs for your signal and background models.
[danger] If you are using RooAddPdfs in your model in which the coefficients are not defined recursively, combine will not interpret them properly. You can add the option
--X-rtd ADDNLL_RECURSIVE=0
to any combine command in order to recover the correct interpretation, however we recommend that you instead redefine your pdf so that the coefficients are recursive (as described on the RooAddPdf documentation) and keep the total normalisation (i.e extended term) as a separate object as in the case of the tutorial datacard.
For example, take a look at the data/tutorials/shapes/simple-shapes-parametric.txt. We see the following line.
shapes * * simple-shapes-parametric_input.root w:$PROCESS
[...]
bin 1 1
process sig bkg
which indicates that the input file simple-shapes-parametric_input.root
should contain an input workspace (w
) with pdfs named sig
and bkg
since these are the names of the two processes in the datacard. Additionally, we expect there to be a dataset named data_obs
. If we look at the contents of the workspace inside data/tutorials/shapes/simple-shapes-parametric_input.root
, this is indeed what we see...
root [1] w->Print()
RooWorkspace(w) w contents
variables
---------
(MH,bkg_norm,cc_a0,cc_a1,cc_a2,rmdj,vogian_sigma,vogian_width)
p.d.f.s
-------
RooChebychev::bkg[ x=rmdj coefList=(cc_a0,cc_a1,cc_a2) ] = 2.6243
RooVoigtian::sig[ x=rmdj mean=MH width=vogian_width sigma=vogian_sigma ] = 0.000639771
datasets
--------
RooDataSet::data_obs(rmdj)
In this datacard, the signal is parameterised in terms of the hypothesised mass (MH
). Combine will use this variable, instead of creating its own, which will be interpreted as the value for -m
. For this reason, we should add the option -m 30
(or something else within the observable range) when running combine. You will also see there is a variable named bkg_norm
. This is used to normalize the background rate (see the section on Rate parameters below for details).
[warning] Combine will not accept RooExtendedPdfs as an input. This is to alleviate a bug that lead to improper treatment of normalization when using multiple RooExtendedPdfs to describe a single process. You should instead use RooAbsPdfs and provide the rate as a separate object (see the Rate parameters section).
The part of the datacard related to the systematics can include lines with the syntax
- name param X Y
These lines encode uncertainties on the parameters of the signal and background pdfs. The parameter is to be assigned a Gaussian uncertainty of Y around its mean value of X. One can change the mean value from 0 to 1 (or really any value, if one so chooses) if the parameter in question is multiplicative instead of additive.
In the data/tutorials/shapes/simple-shapes-parametric.txt datacard, there are lines for one such parametric uncertainty,
sigma param 1.0 0.1
meaning there is a parameter already contained in the input workspace called sigma
which should be constrained with a Gaussian centered at 1.0 with a width of 0.1. Note that, the exact interpretation (i.e all combine knows is that 1.0 should be the most likely value and 0.1 is its 1σ uncertainy) of these parameters is left to the user since the signal pdf is constructed externally by you. Asymmetric uncertainties are written as with lnN
using the syntax -1σ/+1σ in the datacard.
If one wants to specify a parameter that is freely floating across its given range, and not gaussian constrained, the following syntax is used:
- **name flatParam **
Though this is not strictly necessary in frequentist methods using profiled likelihoods as combine will still profile these nuisances when performing fits (as is the case for the simple-shapes-parametric.txt
datacard).
[danger] All parameters which are floating or constant in the user's input workspaces will remain floating or constant. Combine will not modify those for you!
A full example of a parametric analysis can be found in this H→γγ 2014 DAS tutorial
Users should be aware of a feature that affects the use of parametric pdfs together with binned datasets.
RooFit uses the integral of the pdf, computed analytically (or numerically, but disregarding the binning), to normalize it, but then computes the expected event yield in each bin evaluating only the pdf at the bin center. This means that if the variation of the pdf is sizeable within the bin then there is a mismatch between the sum of the event yields per bin and the pdf normalization, and that can cause a bias in the fits (more properly, the bias is there if the contribution of the second derivative integrated on the bin size is not negligible, since for linear functions evaluating them at the bin center is correct). There are two reccomended ways to work around this ...
1. Use narrow bins
It is recommended to use bins that are significantly finer than the characteristic scale of the pdfs - which would anyway be the recommended thing even in the absence of this feature. Obviously, this caveat does not apply to analyses using templates (they're constant across each bin, so there's no bias), or using unbinned datasets.
2. Use a RooParametricShapeBinPdf
Another solution (currently implemented for 1-dimensional histograms only) is to use a custom pdf which performs the correct integrals internally as in RooParametricShapeBinPdf
Note that this pdf class now allows parameters that are themselves RooAbsReal objects (i.e. functions of other variables). The integrals are handled internally by calling the underlying pdf’s createIntegral()
method with named ranges created for each of the bins. This means that if the analytical integrals for the underlying pdf are available, they will be used.
The constructor for this class requires a RooAbsReal (eg any RooAbsPdf)along with a list of RooRealVars (the parameters, excluding the observable
RooParametricShapeBinPdf(const char *name, const char *title, RooAbsReal& _pdf, RooAbsReal& _x, RooArgList& _pars, const TH1 &_shape )
Below is a comparison of a fit to a binned dataset containing 1000 events with one observable
Show Comparison
In the upper plot, the data are binned in 100 evenly spaced bins, while in the lower plot, there are 3 irregular bins. The blue lines show the result of the fit when using the RooExponential directly while the red shows the result when wrapping the pdf inside a RooParametricShapeBinPdf. In the narrow binned case, the two agree well while for wide bins, accounting for the integral over the bin yields a better fit.
You should note that using this class will result in slower fits so you should first decide if the added accuracy is enough to justify the reduced efficiency.
Datacards can be extended in order to provide additional functionality and flexibility during runtime. These can also allow for the production of more complicated models and performing advanced computation of results beyond limits and significances.
The overall rate "expected" of a particular process in a particular bin does not necessarily need to be a fixed quantity. Scale factors can be introduced to modify the rate directly in the datacards for ANY type of analysis. This can be achieved using the directive rateParam
in the datacard with the following syntax,
name rateParam bin process initial_value [min,max]
The [min,max]
argument is optional and if not included, combine will remove the range of this parameter. This will produce a new parameter in the model (unless it already exists) which multiplies the rate of that particular process in the given bin by its value.
You can attach the same rateParam
to multiple processes/bins by either using a wild card (eg *
will match everything, QCD_*
will match everything starting with QCD_
etc.) in the name of the bin and/or process or by repeating the rateParam
line in the datacard for different bins/processes with the same name.
[warning]
rateParam
is not a shortcut to evaluate the post-fit yield of a process since other nuisances can also change the normalisation. E.g., finding that therateParam
best-fit value is 0.9 does not necessarily imply that the process yield is 0.9 times the initial one. The best is to evaluate the yield taking into account the values of all nuisance parameters using--saveNormalizations
.
This parameter is by default, freely floating. It is possible to include a Gaussian constraint on any rateParam
which is floating (i.e not a formula
or spline) by adding a param
nuisance line in the datacard with the same name.
In addition to rate modifiers which are freely floating, modifiers which are functions of other parameters can be included using the following syntax,
name rateParam bin process formula args
where args
is a comma separated list of the arguments for the string formula
. You can include other nuisance parameters in the formula
, including ones which are Gaussian constrained (i,e via the param
directive.)
Below is an example datacard which uses the rateParam
directive to implement an ABCD like method in combine. For a more realistic description of it's use for ABCD, see the single-lepton SUSY search implementation described here
Show ABCD datacard
imax 4 number of channels
jmax 0 number of processes -1
kmax * number of nuisance parameters (sources of systematical uncertainties)
-------
bin B C D A
observation 50 100 500 10
-------
bin B C D A
process bkg bkg bkg bkg
process 1 1 1 1
rate 1 1 1 1
-------
alpha rateParam A bkg (@0*@1/@2) beta,gamma,delta
beta rateParam B bkg 50
gamma rateParam C bkg 100
delta rateParam D bkg 500
For more examples of using rateParam
(eg for fitting process normalisations in control regions and signal regions simultaneously) see this 2016 CMS tutorial
Finally, any pre-existing RooAbsReal inside some rootfile with a workspace can be imported using the following
name rateParam bin process rootfile:workspacename
The name should correspond to the name of the object which is being picked up inside the RooWorkspace. A simple example using the SM XS and BR splines available in HiggsAnalysis/CombinedLimit can be found under data/tutorials/rate_params/simple_sm_datacard.txt
After running text2workspace.py
on your datacard, you can check the normalisation objects using the tool test/printWorkspaceNormalisations.py
. See the example below for the data/tutorials/shapes/simple-shapes-parametric.txt datacard.
text2workspace.py data/tutorials/shapes/simple-shapes-parametric.txt
python test/printWorkspaceNormalisations.py data/tutorials/shapes/simple-shapes-parametric.root
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Channel - bin1
---------------------------------------------------------------------------
Top-level normalisation for process bkg -> n_exp_final_binbin1_proc_bkg
-------------------------------------------------------------------------
RooProduct::n_exp_final_binbin1_proc_bkg[ n_exp_binbin1_proc_bkg * shapeBkg_bkg_bin1__norm ] = 521.163
... is a product, which contains n_exp_binbin1_proc_bkg
RooRealVar::n_exp_binbin1_proc_bkg = 1 C L(-INF - +INF)
-------------------------------------------------------------------------
default value = 521.163204829
---------------------------------------------------------------------------
Top-level normalisation for process sig -> n_exp_binbin1_proc_sig
-------------------------------------------------------------------------
Dumping ProcessNormalization n_exp_binbin1_proc_sig @ 0x464f700
nominal value: 1
log-normals (1):
kappa = 1.1, logKappa = 0.0953102, theta = lumi = 0
asymm log-normals (0):
other terms (1):
term r (class RooRealVar), value = 1
-------------------------------------------------------------------------
default value = 1.0
This tells us that the normalisation for the background process, named n_exp_final_binbin1_proc_bkg
is a product of two objects n_exp_binbin1_proc_bkg * shapeBkg_bkg_bin1__norm
. The first object is just from the rate line in the datacard (equal to 1) and the second is a floating parameter. For the signal, the normalisation is called n_exp_binbin1_proc_sig
and is a ProcessNormalization
object which contains the rate modifications due to the systematic uncertainties. You can see that it also has a "nominal value" which again is just from the value given in the rate line of the datacard (again=1).
If a parameter is intended to be used and it is not a user defined param
or rateParam
, it can be picked up by first issuing an extArgs
directive before this line in the datacard. The syntax for extArgs
is
name extArg rootfile:workspacename
The string ":RecycleConflictNodes" can be added at the end of the final argument (i.e. rootfile:workspacename:RecycleConflictNodes) to apply the corresponding RooFit option when the object is imported into the workspace. It is also possible to simply add a RooRealVar using extArg
for use in function rateParams
with the following
name extArg init [min,max]
Note that the [min,max]
argument is optional and if not included, the code will remove the range of this parameter.
It can often be useful to modify datacards, or the runtime behavior, without having to modify individual systematics lines. This can be acheived through the following.
If a nuisance parameter needs to be renamed for certain processes/channels, it can be done so using a single nuisance edit
directive at the end of a datacard
nuisance edit rename process channel oldname newname
This will have the effect that nuisance parameter effecting a given process/channel will be renamed, thereby de-correlating it from other processes/channels. Use options ifexists
to skip/avoid error if nuisance not found. Other edits are also supported as follows,
-
nuisance edit add process channel name pdf value [options]
-> add a new or add to a nuisance. If options isaddq
, value will be added in quadrature to this nuisance for this process/channel. If options isoverwrite
, the nuisance value will be replaced with this value -
nuisance edit drop process channel name [options]
-> remove this nuisance from the process/channel. Use optionsifexists
to skip/avoid error if nuisance not found. -
nuisance edit changepdf name newpdf
-> change the pdf type of a given nuisance tonewpdf
. -
nuisance edit split process channel oldname newname1 newname2 value1 value2
-> split a nuisance line into two separate nuisances callednewname1
andnewname2
with valuesvalue1
andvalue2
. Will produce two separate lines to that the original nuisanceoldname
becomes two uncorrelated nuisances. -
nuisance edit freeze name [options]
-> set nuisance to frozen by default. Can be over-ridden incombine
command line using--floatNuisances
option Use optionsifexists
to skip/avoid error if nuisance not found. -
nuisance edit merge process channel name1 name2
-> merge systematicname2
intoname1
by adding their values in quadrature and removingname2
. This only works if, for each process and channel included, they go in the same direction. For example, you can add 1.1 to 1.2, but not to 0.9.
Note that the wildcard (*) can be used for either/both of process and channel.
The above edits support nuisances which are any of shape[N]
, lnN
, lnU
, gmN
, param
, flatParam
, rateParam
or discrete
types.
Often it is desirable to freeze one or more nuisances to check the impact they have on limits, likelihood scans, significances etc.
However, for large groups of nuisances (eg everything associated to theory) it is easier to define nuisance groups in the datacard. The following line in a datacard will, for example, produce a group of nuisances with the group name
theory
which contains two parameters, QCDscale
and pdf
.
theory group = QCDscale pdf
Multiple groups can be defined in this way. It is also possible to extend nuisance groups in datacards using += in place of =.
These groups can be manipulated at runtime (eg for freezing all nuisances associated to a group at runtime, see Running the tool). You can find more info on groups of nuisances here
Note that when using the automatic addition of statistical uncertainties (autoMCStats), the corresponding nuisance parameters are created by text2workspace.py
and so do not exist in the datacards. It is therefore not possible to add autoMCStats parameters to groups of nuisances in the way described above. However, text2workspace.py
will automatically create a group labelled autoMCStats
which contains all autoMCStats parameters.
This group is useful for freezing all parameters created by autoMCStats. For freezing subsets of the parameters, for example if the datacard contains two categories, cat_label_1 and cat_label_2, to only freeze the autoMCStat parameters created for category cat_label_1 the regular expression features can be used. In this example this can be achieved by using --freezeParameters 'rgx{prop_bincat_label_1_bin.*}'
.
If you have separate channels each with it's own datacard, it is possible to produce a combined datacard using the script combineCards.py
The syntax is simple: combineCards.py Name1=card1.txt Name2=card2.txt .... > card.txt
If the input datacards had just one bin each, then the output channels will be called Name1
, Name2
, and so on. Otherwise, a prefix Name1_
... Name2_
will be added to the bin labels in each datacard. The supplied bin names Name1
, Name2
, etc. must themselves conform to valid C++/python identifier syntax.
[warning] Systematics which have different names will be assumed to be uncorrelated, and the ones with the same name will be assumed 100% correlated. A systematic correlated across channels must have the same p.d.f. in all cards (i.e. always
lnN
, or allgmN
with sameN
)
The combineCards.py
script will complain if you are trying to combine a shape datacard with a counting datacard. You can however convert a counting datacard in an equivalent shape-based one by adding a line shapes * * FAKE
in the datacard after the imax
, jmax
and kmax
section. Alternatively, you can add the option -S
in combineCards.py
which will do this for you while making the combination.
For complicated analyses or cases in which multiple datacards are needed (e.g. optimisation studies), you can avoid writing these by hand. The object Datacard defines the analysis and can be created as a python object. The template python script below will produce the same workspace as running textToWorkspace.py
(see the section on Physics Models) on the realistic-counting-experiment.txt datacard.
Show python script
from HiggsAnalysis.CombinedLimit.DatacardParser import *
from HiggsAnalysis.CombinedLimit.ModelTools import *
from HiggsAnalysis.CombinedLimit.ShapeTools import *
from HiggsAnalysis.CombinedLimit.PhysicsModel import *
from sys import exit
from optparse import OptionParser
parser = OptionParser()
addDatacardParserOptions(parser)
options,args = parser.parse_args()
options.bin = True # make a binary workspace
DC = Datacard()
MB = None
############## Setup the datacard (must be filled in) ###########################
DC.bins = ['bin1'] # <type 'list'>
DC.obs = {'bin1': 0.0} # <type 'dict'>
DC.processes = ['ggH', 'qqWW', 'ggWW', 'others'] # <type 'list'>
DC.signals = ['ggH'] # <type 'list'>
DC.isSignal = {'qqWW': False, 'ggWW': False, 'ggH': True, 'others': False} # <type 'dict'>
DC.keyline = [('bin1', 'ggH', True), ('bin1', 'qqWW', False), ('bin1', 'ggWW', False), ('bin1', 'others', False)] # <type 'list'>
DC.exp = {'bin1': {'qqWW': 0.63, 'ggWW': 0.06, 'ggH': 1.47, 'others': 0.22}} # <type 'dict'>
DC.systs = [('lumi', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 1.11, 'ggH': 1.11, 'others': 0.0}}), ('xs_ggH', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 0.0, 'ggH': 1.16, 'others': 0.0}}), ('WW_norm', False, 'gmN', [4], {'bin1': {'qqWW': 0.16, 'ggWW': 0.0, 'ggH': 0.0, 'others': 0.0}}), ('xs_ggWW', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 1.5, 'ggH': 0.0, 'others': 0.0}}), ('bg_others', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 0.0, 'ggH': 0.0, 'others': 1.3}})] # <type 'list'>
DC.shapeMap = {} # <type 'dict'>
DC.hasShapes = False # <type 'bool'>
DC.flatParamNuisances = {} # <type 'dict'>
DC.rateParams = {} # <type 'dict'>
DC.extArgs = {} # <type 'dict'>
DC.rateParamsOrder = set([]) # <type 'set'>
DC.frozenNuisances = set([]) # <type 'set'>
DC.systematicsShapeMap = {} # <type 'dict'>
DC.nuisanceEditLines = [] # <type 'list'>
DC.groups = {} # <type 'dict'>
DC.discretes = [] # <type 'list'>
###### User defined options #############################################
options.out = "combine_workspace.root" # Output workspace name
options.fileName = "./" # Path to input ROOT files
options.verbose = "1" # Verbosity
##########################################################################
if DC.hasShapes:
MB = ShapeBuilder(DC, options)
else:
MB = CountingModelBuilder(DC, options)
# Set physics models
MB.setPhysics(defaultModel)
MB.doModel()
Any existing datacard can be converted into such a template python script by using the --dump-datacard
option in text2workspace.py
in case a more complicated template is needed.
[warning] The above is not advised for final results as this script is not easily combined with other analyses so should only be used for internal studies.
For the automatic generation of datacards (which are combinable), you should instead use the CombineHarvester package which includes many features for producing complex datacards in a reliable, automated way.
- Introduction
- Getting started
- Setting up the analysis
- Running Combine
- Useful Links
- FAQ