chapter_notes/BDA_notes_ch11.tex

\documentclass[a4paper,11pt,english]{article}

\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{amsmath}
\usepackage{microtype}
\usepackage{url}
\urlstyle{same}

\usepackage[bookmarks=false]{hyperref}
\hypersetup{%
  bookmarksopen=true,
  bookmarksnumbered=true,
  pdftitle={Bayesian data analysis},
  pdfsubject={Comments},
  pdfauthor={Aki Vehtari},
  pdfkeywords={Bayesian probability theory, Bayesian inference, Bayesian data analysis},
  pdfstartview={FitH -32768},
  colorlinks=true,
  linkcolor=black,
  citecolor=black,
  filecolor=black,
  urlcolor=black
}


% if not draft, smaller printable area makes the paper more readable
\topmargin -4mm
\oddsidemargin 0mm
\textheight 225mm
\textwidth 160mm

%\parskip=\baselineskip

\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\Sd}{Sd}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\Bin}{Bin}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Invchi2}{Inv-\chi^2}
\DeclareMathOperator{\NInvchi2}{N-Inv-\chi^2}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\N}{N}
\DeclareMathOperator{\U}{U}
\DeclareMathOperator{\tr}{tr}
%\DeclareMathOperator{\Pr}{Pr}
\DeclareMathOperator{\trace}{trace}

\def\eff{\mathrm{eff}}

\pagestyle{empty}

\begin{document}
\thispagestyle{empty}

\section*{Bayesian data analysis -- reading instructions 11} 
\smallskip
{\bf Aki Vehtari}
\smallskip

\subsection*{Chapter 11}

Outline of the chapter 11
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item Markov chain simulation: before section 11.1, pages 275-276
\item 11.1 Gibbs sampler (an example of simple MCMC method)
\item 11.2 Metropolis and Metropolis-Hastings (an example of simple MCMC method)
\item 11.3 Using Gibbs and Metropolis as building blocks (can be skipped)
\item 11.4 Inference and assessing convergence (important)
\item 11.5 Effective number of simulation draws (important)
\item 11.6 Example: hierarchical normal model (skip this)
\end{list}

Demos 
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item demo11\_1: Gibbs sampling
\item demo11\_2: Metropolis sampling
\item demo11\_3: Convergence of Markov chain
\item demo11\_4: potential scale reduction $\hat{R}$
\end{list}

Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things unclear for you or things you
think might be unclear for others. 
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item Markov chain
\item Markov chain Monte Carlo
\item random walk
\item starting point
\item transition distribution
\item jumping / proposal distribution
\item to converge, convergence, assessing convergence
\item stationary distribution, stationarity
\item effective number of simulations
\item Gibbs sampler
\item Metropolis sampling / algorithm
\item Metropolis-Hastings algorithm
\item acceptance / rejection rule
\item acceptance / rejection rate
%\item (irreducible, aperiodic and not-tranisient are explained in slides)
\item within-sequence correlation, serial correlation
\item warm-up / burn-in
\item to thin, thinned
\item overdispersed starting points
\item mixing
\item to diagnose convergence
\item between- and within-sequence variances
\item potential scale reduction, $\hat{R}$
\item the variance of the average of a correlated sequence
\item autocorrelation
\item variogram
\item $n_{\mathrm{eff}}$
\end{list}

\subsection*{Basics of Markov chains}

Slides by J. Virtamo for the course S-38.143 Queueing Theory have a nice
review of the fundamental terms and Finnish translations for them (in
English
\url{http://www.netlab.tkk.fi/opetus/s38143/luennot/english.shtml} and
in Finnish
\url{http://www.netlab.hut.fi/opetus/s38143/luennot/index.shtml}). See
specially the slides for the lecture 4. To prove that Metropolis
algorithm works, it is sufficient to show that chain is irreducible,
aperiodic and not transient.

\subsection*{Animations}

Nice animations with discussion
\url{http://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/}

~\\
And just the animations with more options to experiment
\url{https://chi-feng.github.io/mcmc-demo/}

\subsection*{Metropolis algorithm}

There is a lot of freedom in selection of proposal distribution in
Metropolis algorithm. There are some restrictions, but we don't go to
the mathematical details in this course.

~\\
Don't confuse rejection in the rejection sampling and in Metropolis
algorithm. In the rejection sampling, the rejected samples are thrown
away. In Metropolis algorithm the rejected proposals are thrown away,
but time moves on and the previous sample x(t) is also the sample
x(t+1).

~\\
When rejecting a proposal, the previous sample is repeated in the
chain, they have to be included and they are valid samples from the
distribution. For basic Metropolis, it can be shown that optimal
rejection rate is 55--77\%, so that on even the optimal case quite many
of the samples are repeated samples. However, high number of
rejections is acceptable as then the accepted proposals are on average
further away from the previous point. It is better to jump further
away 23--45\% of time than more often to jump really close. Methods for
estimating the effective sample size are useful for measuring how
effective a given chain is.

\subsection*{Transition distribution vs. proposal distribution}

Transition distribution is a property of Markov chain. In Metropolis
algorithm the transition distribution is a mixture of a proposal
distribution and a point mass in the current point. The book uses also
term jumping distribution to refer to proposal distribution.

\subsection*{Convergence}

Theoretical convergence in an infinite time is different than
practical convergence in a finite time. There is no exact moment when
chain has converged and thus it is not possible to detect when the
chain has converged (except for rare \emph{perfect sampling} methods
not discussed in BDA3). The convergence diagnostics can help to find
out if the chain is unlikely to be representative of the target
distribution. Furthermore, even if would be able to start from a
independent sample from the posterior so that chain starts from the
convergence, the mixing can be so slow that we may require very large
number of samples before the samples are representative of the target
distribution.

~\\
If starting point is selected at or near the mode, less time is needed
to reach the area of essential mass, but still the samples in the
beginning of the chain are not presentative of the true distribution
unless the starting point was somehow samples directly from the target
distribution. 

\subsection*{$\hat{R}$ ja $n_\eff$}

There are many versions of $\hat{R}$ ja $n_\eff$. Beware that some
software packages compute $\hat{R}$ using old inferior approaches.

The $\hat{R}$ and the approach to estimate effective number of samples
$n_\eff$ were updated in BDA3, and slightly updated version of this is
described in Stan 2.18+ user guide. Since then we have developed even
better $\hat{R}$, $ESS$ (effective sample size or $s_\eff$, with
change from $n$ to $S$ is due to improved consistency in the notation)
in
 \begin{itemize}
 \item Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and
   Paul-Christian Bürkner (2019). Rank-normalization, folding, and
   localization: An improved R-hat for assessing convergence of
   MCMC. arXiv preprint
   arXiv:1903.08008 \url{<http://arxiv.org/abs/1903.08008>}.
 \end{itemize}
 New $\hat{R}$, $ESS$, and Monte Carlo error estimates are available
 in RStan {\tt monitor} function in R, and in ArviZ package in Python.
 
Due to randomness in chains, $\hat{R}$ may get values slightly below 1.

Brief Guide to Stan's Warnings
\url{https://mc-stan.org/misc/warnings.html} provides summary of
available convergence diagnostics in Stan and how to interpret them.

\end{document}

%%% Local Variables: 
%%% TeX-PDF-mode: t
%%% TeX-master: t
%%% End: