\documentclass[11pt]{article}
\usepackage[lmargin=1.5cm,rmargin=1.5cm,tmargin=2.50cm,bmargin=2.50cm]{geometry}
\usepackage{graphicx}
\usepackage{color}
\usepackage{amsmath,amssymb,url}
\begin{document}
\begin{center}
\Large{\bf{Bootstrap and cumulants in ROOT}}\\
\end{center}
\begin{center}
{\small\today}
\end{center}
\bigskip
\noindent In this ROOT exercise we cover the following classes:
%
\begin{enumerate}
\item TF2 (\url{https://root.cern.ch/doc/master/classTF2.html})
\item TH2F (\url{https://root.cern.ch/doc/master/classTH2F.html})
\item TF3 (\url{https://root.cern.ch/doc/master/classTF3.html})
\item TH3F (\url{https://root.cern.ch/doc/master/classTH3F.html})
\end{enumerate}
%
\bigskip\bigskip
\noindent A frequently encountered problem, after the experiment is over, is how to report the statistical uncertainty of the measured observables of interest. This is particularly difficult for compound observables. For instance, if the directly measured observables are $x$ and $y$, but we are interested in the quantity $z\equiv x^2 + e^{-y} + \cos xy$, it is very difficult, if not impossible, to perform the standard error propagation and obtain the statistical uncertainty of $z$ from the statistical uncertainties $\sigma_x$ and $\sigma_y$ and their covariance $\mathrm{Cov}(x,y)$, which can be obtained directly.
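%
To first order, the standard error propagation that one would have to carry out reads
%
\begin{equation}
\sigma_z^2 \approx \left(\frac{\partial z}{\partial x}\right)^2\sigma_x^2 + \left(\frac{\partial z}{\partial y}\right)^2\sigma_y^2 + 2\,\frac{\partial z}{\partial x}\frac{\partial z}{\partial y}\,\mathrm{Cov}(x,y)\,,
\end{equation}
%
and already in the above example the partial derivatives $\partial z/\partial x = 2x - y\sin xy$ and $\partial z/\partial y = -e^{-y} - x\sin xy$ make this evaluation cumbersome. This sort of problem can in general be resolved with the {\it bootstrap} technique.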
There is a long story behind how to apply the bootstrap technique in data analysis (see for instance \url{https://en.wikipedia.org/wiki/Bootstrapping_(statistics)}). To make this long story short, in ALICE we have endorsed a simplified (``hijacked'') version of this technique. That is the main subject of this exercise.
The simplified version of the bootstrap technique consists of the following steps:
%
\begin{enumerate}
\item You divide the initial sample into 10 subsamples, such that each subsample has about the same statistics;
%
\item For each subsample you perform an independent measurement and obtain for your observable of interest the results $\mu_0$, $\mu_1$, \ldots, $\mu_9$, respectively;
%
\item You then compute an unbiased variance from these 10 subsamples, i.e.
%
\begin{equation}
\mathrm{Var} = \frac{1}{10-1} \sum_{i=0}^{9} (\mu_i-\mu)^2\,,
\end{equation}
%
where $\mu$ denotes the result for your observable of interest obtained from the full starting sample.
%
\item The final result for your observable of interest and its statistical uncertainty are:
%
\begin{equation}
\mu \pm \sqrt{\frac{\mathrm{Var}}{10}}\,.
\end{equation}
%
\end{enumerate}
%
If the available statistics permits, one can divide the initial sample into $N > 10$ subsamples, but the results do not change much (it is easy to implement the above steps for arbitrary $N$). On the other hand, for $N < 10$ subsamples the bootstrap technique is not stable and therefore cannot be used.
In practice:
%
\begin{enumerate}
\item For a simple fundamental observable, whatever you get from ROOT for the statistical error (i.e. from a histogram or a profile) is perfectly fine;
%
\item For a compound observable, please do not bother with error propagation and covariance matrices; simply use the bootstrap (see the sketch after this list). Here, in addition, you only need to develop some clever algorithm that divides your total dataset into 10 subsets of approximately equal size, but this is typically straightforward.
\end{enumerate}
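\noindent As a minimal sketch of steps 3 and 4 above (all variable names are hypothetical, and the subsample results are assumed to be already available):
{\small
%
\begin{verbatim}
const Int_t nSub = 10;
Double_t mu[nSub] = {0.}; // results of the 10 independent subsample measurements
Double_t muAll = 0.;      // result for the full starting sample
// ... fill mu[0..9] and muAll from your analysis ...
Double_t var = 0.; // unbiased variance from the 10 subsamples
for(Int_t i = 0; i < nSub; i++)
{
 var += TMath::Power(mu[i]-muAll,2.);
}
var /= (nSub - 1.);
Double_t statError = TMath::Sqrt(var/nSub); // report: muAll +/- statError
\end{verbatim}
%
} % small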
In general, two random observables $x$ and $y$, with sample spaces $X$ and $Y$, are said to be statistically independent if their joint 2-variate p.d.f. $f_{xy}(x,y)$ fully factorizes into the product of two marginal p.d.f.'s $f_{x}(x)$ and $f_{y}(y)$:
%
\begin{equation}
f_{xy}(x,y) = f_{x}(x)f_{y}(y)\,.
\end{equation}
%
In the above equation, $f_{x}(x)$ and $f_{y}(y)$ are two marginal distributions defined as:
%
\begin{eqnarray}
f_x(x) &=& \int_Y f_{xy}(x,y)\,dy\,,\nonumber\\
f_y(y) &=& \int_X f_{xy}(x,y)\,dx\,.
\label{eq:marginalizedPDFs}
\end{eqnarray}
%
In general, $f_{x}(x)$ and $f_{y}(y)$ can have completely different functional forms.
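%
\noindent For instance, for the factorizing 2-variate p.d.f. $f_{xy}(x,y) = \lambda\mu\,e^{-\lambda x}e^{-\mu y}$ defined on $[0,\infty)\times[0,\infty)$, Eqs.~(\ref{eq:marginalizedPDFs}) immediately give back the two exponential marginals,
%
\begin{eqnarray}
f_x(x) &=& \lambda e^{-\lambda x}\int_0^\infty \mu e^{-\mu y}\,dy = \lambda e^{-\lambda x}\,,\nonumber\\
f_y(y) &=& \mu e^{-\mu y}\int_0^\infty \lambda e^{-\lambda x}\,dx = \mu e^{-\mu y}\,,\nonumber
\end{eqnarray}
%
so that $f_{xy}(x,y) = f_x(x)f_y(y)$ holds trivially, and $x$ and $y$ are statistically independent.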
\bigskip\bigskip\bigskip
\noindent\textbf{Question 1:} {\it Consider the following 2-variate p.d.f.'s: a) $f_{xy}(x,y) = e^{-(x-4)^2}\log y$, and b) $f_{xy}(x,y) = e^{-(x-4)^2}+\log y$. Demonstrate that in the first case observables $x$ and $y$ are statistically independent, and that in the second case they are not statistically independent.}
\bigskip
\noindent\textbf{Hint \#1}: Implement the 2-variate p.d.f. in ROOT as a TF2 object, and sample the observables $x$ and $y$ simultaneously from it. Fill a 2D histogram with all sampled pairs (see e.g. the class TH2F). Normalize that 2D histogram, and then obtain the two marginal histograms by making projections along the $x$ and $y$ axes, i.e. schematically:
{\small
%
\begin{verbatim}
TH1D *marginalX = f2DhistNormalized->ProjectionX("marginalX");
TH1D *marginalY = f2DhistNormalized->ProjectionY("marginalY");
\end{verbatim}
%
}
\noindent Note that if the starting 2D histogram is normalized, the two 1D marginal histograms obtained from it are automatically normalized. Finally, inspect the 2D ratio (e.g. introduce a new TH2F object for the ratio) between the starting 2D histogram and the product of the two 1D marginal histograms; a sketch is given below. If this ratio is consistent with 1 everywhere, $x$ and $y$ are statistically independent.
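\noindent A minimal sketch of this ratio check, assuming the objects from the snippets above (the name \texttt{ratio} is hypothetical):
{\small
%
\begin{verbatim}
TH2F *ratio = (TH2F*)f2DhistNormalized->Clone("ratio");
for(Int_t bx = 1; bx <= ratio->GetNbinsX(); bx++)
{
 for(Int_t by = 1; by <= ratio->GetNbinsY(); by++)
 {
  Double_t prod = marginalX->GetBinContent(bx)*marginalY->GetBinContent(by);
  if(prod > 0.) // guard against empty bins
  {
   ratio->SetBinContent(bx,by,f2DhistNormalized->GetBinContent(bx,by)/prod);
  }
 }
}
\end{verbatim}
%
} % small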
\bigskip
\noindent\textbf{Hint \#2}: In order to sample 2 observables simultaneously from a 2-variate p.d.f., the following code snippet might be helpful:
{\small
%
\begin{verbatim}
Double_t x = 0., y = 0.;
for(Int_t n = 0; n < nSamplings; n++)
{
 f2D->GetRandom2(x,y); // f2D was declared as TF2 object
 f2Dhist->Fill(x,y);   // f2Dhist was declared as TH2F object
}
\end{verbatim}
%
} % small
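\noindent For completeness, the two objects referenced above could be declared e.g. as follows, for case a) of Question 1 (the ranges and binning are illustrative assumptions; note that $y \geq 1$ keeps $\log y$ non-negative):
{\small
%
\begin{verbatim}
TF2 *f2D = new TF2("f2D","exp(-pow(x-4.,2.))*log(y)",0.,8.,1.,10.);
TH2F *f2Dhist = new TH2F("f2Dhist","sampled pairs",100,0.,8.,100,1.,10.);
\end{verbatim}
%
} % small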
\noindent This procedure can be straightforwardly generalized to the 3D case (see the classes TF3 and TH3F). On the other hand, going to 4D or even higher dimensions is not that straightforward in ROOT. While a general $n$-dimensional histogram is available in the class THnSparse (see \url{https://root.cern.ch/doc/master/classTHnSparse.html}), a new implementation of multidimensional sampling, which would correspond to a general ``TFN'' object, requires some (non-trivial) work.
\newpage
\noindent \textbf{Hint \#3}: In order to check whether a histogram is normalized, check the value of:
{\small
%
\begin{verbatim}
hist->Integral();
\end{verbatim}
%
} % small
\noindent The same member function is available for the 2D and 3D cases as well.
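\noindent If the integral differs from 1, the histogram can be normalized in place, e.g.:
{\small
%
\begin{verbatim}
Double_t integral = f2Dhist->Integral();
if(integral > 0.) // guard against an empty histogram
{
 f2Dhist->Scale(1./integral);
}
\end{verbatim}
%
} % small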
\bigskip
\noindent Building on top of this example, in real life we can straightforwardly deduce, for any 2D histogram obtained in an experiment, whether it was filled with two statistically independent or two correlated observables.
\bigskip
\begin{center}
\Large{\bf{Cumulants from hell}}\\
\end{center}
If two random observables are not statistically independent (i.e. if they are correlated), without loss of generality we can always write:
%
\begin{equation}
f_{xy}(x,y) = f_{x}(x)f_{y}(y) + c_{xy}(x,y)\,.
\end{equation}
%
Therefore, by construction, any direct correlation between the observables $x$ and $y$ is encoded solely in $c_{xy}(x,y)$, which is by definition the 2-particle cumulant, or the genuine 2-particle correlation. Cumulants cannot be measured directly, but from the above expressions we trivially obtain:
%
\begin{equation}
c_{xy}(x,y) = f_{xy}(x,y) - f_{x}(x)f_{y}(y)\,.
\label{eq:cumulant}
\end{equation}
%
In terms of expectation values (averages), Eq.~(\ref{eq:cumulant}) reads:
%
\begin{equation}
\left<xy\right>_c = \left<xy\right> - \left<x\right>\left<y\right>\,.
\label{eq:cumulant:averages}
\end{equation}
%
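\noindent As a minimal sketch, the right-hand side of Eq.~(\ref{eq:cumulant:averages}) can be estimated directly from the sampled pairs, reusing the sampling loop from Hint \#2 (all names are hypothetical):
{\small
%
\begin{verbatim}
Double_t x = 0., y = 0.;
Double_t sumX = 0., sumY = 0., sumXY = 0.;
for(Int_t n = 0; n < nSamplings; n++)
{
 f2D->GetRandom2(x,y);
 sumX += x; sumY += y; sumXY += x*y;
}
// <xy>_c = <xy> - <x><y>:
Double_t c2 = sumXY/nSamplings - (sumX/nSamplings)*(sumY/nSamplings);
\end{verbatim}
%
} % small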
This reasoning can be straightforwardly generalized to the multivariate case. Without going into the derivation/details now, the final results for the relevant 2- and 3-particle cumulants, for a system consisting of three random observables $x$, $y$ and $z$, are given by the following formulas:
%
\begin{eqnarray}
\left<xy\right>_c&=&\left<xy\right> - \left<x\right>\left<y\right>\,,\nonumber\\
\left<xz\right>_c&=&\left<xz\right> - \left<x\right>\left<z\right>\,,\nonumber\\
\left<yz\right>_c&=&\left<yz\right> - \left<y\right>\left<z\right>\,,\nonumber\\
\left<xyz\right>_c&=&\left<xyz\right>\nonumber\\
&-&\left<xy\right>\left<z\right>-\left<xz\right>\left<y\right>-\left<yz\right>\left<x\right>\nonumber\\
&+& 2\left<x\right>\left<y\right>\left<z\right>\,.\nonumber
\end{eqnarray}
%
\bigskip
\noindent\textbf{Question 2:} {\it Please implement in ROOT the 3-variate p.d.f. $f_{xyz}(x,y,z) \equiv e^{xy+xz+yz} = e^{xy}e^{xz}e^{yz}$ by using the class TF3 (a starting sketch is given below). Since this p.d.f. only partially factorizes, it is clear that by design it describes only the genuine 2-particle correlations among $x$ and $y$, $x$ and $z$, and $y$ and $z$, respectively. Demonstrate, by developing the code in ROOT, that all 2-particle cumulants in this example are non-zero, while the 3-particle cumulant is zero.}
\bigskip
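\noindent As a starting point, the 3-variate p.d.f. could be declared and sampled as follows (the ranges are illustrative assumptions):
{\small
%
\begin{verbatim}
TF3 *f3D = new TF3("f3D","exp(x*y+x*z+y*z)",0.,1.,0.,1.,0.,1.);
Double_t x = 0., y = 0., z = 0.;
f3D->GetRandom3(x,y,z); // analogous to TF2::GetRandom2
\end{verbatim}
%
} % small
\bigskip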
\noindent Building on top of this example, in an experiment we can deduce whether there exists a genuine 3-body interaction in the system, which cannot be expressed as a superposition of lower-order two- and single-particle terms. A non-vanishing 3-particle cumulant unambiguously points to the existence of a genuine 3-body interaction.
\end{document}