forked from avehtari/BDA_course_Aalto
-
Notifications
You must be signed in to change notification settings - Fork 0
/
BDA_notes_ch11.tex
224 lines (189 loc) · 7.77 KB
/
BDA_notes_ch11.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
\documentclass[a4paper,11pt,english]{article}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{amsmath}
\usepackage{microtype}
\usepackage{url}
\urlstyle{same}
\usepackage[bookmarks=false]{hyperref}
\hypersetup{%
bookmarksopen=true,
bookmarksnumbered=true,
pdftitle={Bayesian data analysis},
pdfsubject={Comments},
pdfauthor={Aki Vehtari},
pdfkeywords={Bayesian probability theory, Bayesian inference, Bayesian data analysis},
pdfstartview={FitH -32768},
colorlinks=true,
linkcolor=black,
citecolor=black,
filecolor=black,
urlcolor=black
}
% if not draft, smaller printable area makes the paper more readable
\topmargin -4mm
\oddsidemargin 0mm
\textheight 225mm
\textwidth 160mm
%\parskip=\baselineskip
\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\Sd}{Sd}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\Bin}{Bin}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Invchi2}{Inv-\chi^2}
\DeclareMathOperator{\NInvchi2}{N-Inv-\chi^2}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\N}{N}
\DeclareMathOperator{\U}{U}
\DeclareMathOperator{\tr}{tr}
%\DeclareMathOperator{\Pr}{Pr}
\DeclareMathOperator{\trace}{trace}
\def\eff{\mathrm{eff}}
\pagestyle{empty}
\begin{document}
\thispagestyle{empty}
\section*{Bayesian data analysis -- reading instructions 11}
\smallskip
{\bf Aki Vehtari}
\smallskip
\subsection*{Chapter 11}
Outline of the chapter 11
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item Markov chain simulation: before section 11.1, pages 275-276
\item 11.1 Gibbs sampler (an example of simple MCMC method)
\item 11.2 Metropolis and Metropolis-Hastings (an example of simple MCMC method)
\item 11.3 Using Gibbs and Metropolis as building blocks (can be skipped)
\item 11.4 Inference and assessing convergence (important)
\item 11.5 Effective number of simulation draws (important)
\item 11.6 Example: hierarchical normal model (skip this)
\end{list}
Demos
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item demo11\_1: Gibbs sampling
\item demo11\_2: Metropolis sampling
\item demo11\_3: Convergence of Markov chain
\item demo11\_4: potential scale reduction $\hat{R}$
\end{list}
Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things unclear for you or things you
think might be unclear for others.
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item Markov chain
\item Markov chain Monte Carlo
\item random walk
\item starting point
\item transition distribution
\item jumping / proposal distribution
\item to converge, convergence, assessing convergence
\item stationary distribution, stationarity
\item effective number of simulations
\item Gibbs sampler
\item Metropolis sampling / algorithm
\item Metropolis-Hastings algorithm
\item acceptance / rejection rule
\item acceptance / rejection rate
%\item (irreducible, aperiodic and not-tranisient are explained in slides)
\item within-sequence correlation, serial correlation
\item warm-up / burn-in
\item to thin, thinned
\item overdispersed starting points
\item mixing
\item to diagnose convergence
\item between- and within-sequence variances
\item potential scale reduction, $\hat{R}$
\item the variance of the average of a correlated sequence
\item autocorrelation
\item variogram
\item $n_{\mathrm{eff}}$
\end{list}
\subsection*{Basics of Markov chains}
Slides by J. Virtamo for the course S-38.143 Queueing Theory have a nice
review of the fundamental terms and Finnish translations for them (in
English
\url{http://www.netlab.tkk.fi/opetus/s38143/luennot/english.shtml} and
in Finnish
\url{http://www.netlab.hut.fi/opetus/s38143/luennot/index.shtml}). See
specially the slides for the lecture 4. To prove that Metropolis
algorithm works, it is sufficient to show that chain is irreducible,
aperiodic and not transient.
\subsection*{Animations}
Nice animations with discussion
\url{http://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/}
~\\
And just the animations with more options to experiment
\url{https://chi-feng.github.io/mcmc-demo/}
\subsection*{Metropolis algorithm}
There is a lot of freedom in selection of proposal distribution in
Metropolis algorithm. There are some restrictions, but we don't go to
the mathematical details in this course.
~\\
Don't confuse rejection in the rejection sampling and in Metropolis
algorithm. In the rejection sampling, the rejected samples are thrown
away. In Metropolis algorithm the rejected proposals are thrown away,
but time moves on and the previous sample x(t) is also the sample
x(t+1).
~\\
When rejecting a proposal, the previous sample is repeated in the
chain, they have to be included and they are valid samples from the
distribution. For basic Metropolis, it can be shown that optimal
rejection rate is 55--77\%, so that on even the optimal case quite many
of the samples are repeated samples. However, high number of
rejections is acceptable as then the accepted proposals are on average
further away from the previous point. It is better to jump further
away 23--45\% of time than more often to jump really close. Methods for
estimating the effective sample size are useful for measuring how
effective a given chain is.
\subsection*{Transition distribution vs. proposal distribution}
Transition distribution is a property of Markov chain. In Metropolis
algorithm the transition distribution is a mixture of a proposal
distribution and a point mass in the current point. The book uses also
term jumping distribution to refer to proposal distribution.
\subsection*{Convergence}
Theoretical convergence in an infinite time is different than
practical convergence in a finite time. There is no exact moment when
chain has converged and thus it is not possible to detect when the
chain has converged (except for rare \emph{perfect sampling} methods
not discussed in BDA3). The convergence diagnostics can help to find
out if the chain is unlikely to be representative of the target
distribution. Furthermore, even if would be able to start from a
independent sample from the posterior so that chain starts from the
convergence, the mixing can be so slow that we may require very large
number of samples before the samples are representative of the target
distribution.
~\\
If starting point is selected at or near the mode, less time is needed
to reach the area of essential mass, but still the samples in the
beginning of the chain are not presentative of the true distribution
unless the starting point was somehow samples directly from the target
distribution.
\subsection*{$\hat{R}$ ja $n_\eff$}
There are many versions of $\hat{R}$ ja $n_\eff$. Beware that some
software packages compute $\hat{R}$ using old inferior approaches.
The $\hat{R}$ and the approach to estimate effective number of samples
$n_\eff$ were updated in BDA3, and slightly updated version of this is
described in Stan 2.18+ user guide. Since then we have developed even
better $\hat{R}$, $ESS$ (effective sample size or $s_\eff$, with
change from $n$ to $S$ is due to improved consistency in the notation)
in
\begin{itemize}
\item Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and
Paul-Christian Bürkner (2019). Rank-normalization, folding, and
localization: An improved R-hat for assessing convergence of
MCMC. arXiv preprint
arXiv:1903.08008 \url{<http://arxiv.org/abs/1903.08008>}.
\end{itemize}
New $\hat{R}$, $ESS$, and Monte Carlo error estimates are available
in RStan {\tt monitor} function in R, and in ArviZ package in Python.
Due to randomness in chains, $\hat{R}$ may get values slightly below 1.
Brief Guide to Stan's Warnings
\url{https://mc-stan.org/misc/warnings.html} provides summary of
available convergence diagnostics in Stan and how to interpret them.
\end{document}
%%% Local Variables:
%%% TeX-PDF-mode: t
%%% TeX-master: t
%%% End: