\chapter{MATLAB Exercises}
\section{Mean estimation on heterogeneous data}
In this exercise, we are going to use Monte Carlo simulations to verify the theoretical results obtained in section \ref{sec:ex_1} and provide an analysis of the estimators $\hat\theta_1$, $\hat\theta_2$, $\hat\theta_{ave}$ and $\hat\theta_{ML}$ with varying parameters.
\subsection*{Variation of $\sigma_2^2$}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./figures/appendix_a/figure_1.png}
\caption{MSE of the estimators $\hat\theta_1$, $\hat\theta_2$, $\hat\theta_{ave}$ and $\hat\theta_{ML}$ as a function of $\sigma_2^2$.}
\label{fig:mean_estimation_heterogeneous_data}
\end{figure}
Each point in the plot is an empirical estimate of the MSE obtained through Monte Carlo simulation, so it will not match the theoretical MSE exactly: it approximates it, and the approximation improves as the number of Monte Carlo experiments increases.
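As a reference for how these curves can be produced, here is a minimal MATLAB sketch of the Monte Carlo procedure. The sample sizes, the grid of $\sigma_2^2$ values and the inverse-variance form of $\hat\theta_{ML}$ are assumptions chosen here for illustration (equal sample sizes in the two slices, consistent with the $3\sigma_1^2$ threshold discussed below), not necessarily the values used to generate the figures.
\begin{verbatim}
% Monte Carlo estimate of the MSE of theta_1, theta_2, theta_ave, theta_ML
% (illustrative sketch: two slices of N samples each, true theta = 0).
theta = 0; N = 100; MC = 1e4; sigma12 = 1;
sigma2_grid = linspace(0.1, 10, 50);
mse = zeros(4, numel(sigma2_grid));      % rows: theta_1, theta_2, ave, ML
for j = 1:numel(sigma2_grid)
    s22 = sigma2_grid(j);
    err = zeros(4, MC);
    for m = 1:MC
        x1 = theta + sqrt(sigma12) * randn(N, 1);   % first slice of data
        x2 = theta + sqrt(s22)     * randn(N, 1);   % second slice of data
        t1 = mean(x1);
        t2 = mean(x2);
        tave = 0.5 * t1 + 0.5 * t2;                 % equal weights, p = 0.5
        tml  = (t1/sigma12 + t2/s22) / (1/sigma12 + 1/s22);  % inverse-variance weights
        err(:, m) = ([t1; t2; tave; tml] - theta).^2;
    end
    mse(:, j) = mean(err, 2);                       % empirical MSE
end
plot(sigma2_grid, mse);
legend('\theta_1', '\theta_2', '\theta_{ave}', '\theta_{ML}');
\end{verbatim}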
Since we are varying $\sigma_2^2$, we expect the MSE of $\hat\theta_1$ (blue line) to be constant. As noted above, some oscillation is normal because each point is itself an estimate; increasing the number of Monte Carlo experiments $MC$ makes the curve look more and more like a constant.
The MSE of $\hat\theta_{ave}$ (yellow line) increases with $\sigma^2_2$, which is reasonable since we are adding noise to the system. Since we know the variance of this estimator, we also know its theoretical MSE, from which we can see that the error grows linearly with $\sigma_2^2$.
The theoretical analysis showed that there is a special point beyond which discarding one of the two slices of data is preferable to weighting them equally. Up to this point, which equals $3\sigma_1^2$ and represents a critical noise level for the variance of the second slice, $\hat\theta_{ave}$ (yellow line), which weights both slices equally ($p=0.5$), is preferable to $\hat\theta_1$ (blue line), which completely discards the slice whose variance is increasing in the graph. Beyond this critical point it is preferable to use $\hat\theta_1$; indeed, the graph shows that its MSE stays below that of $\hat\theta_{ave}$ to the right of the threshold.
It follows that if we picked the better of these two estimators at each point, we would obtain a curve that is linear in $\sigma^2_2$ before the threshold (i.e.\ that of $\hat\theta_{ave}$) and constant after it, so overall a non-linear behavior with a sudden saturation of the error at a certain point.
The threshold can be read in two different ways depending on whether we vary $\sigma_1^2$ or $\sigma_2^2$.
\[
\begin{cases}
\sigma_2^2>3\sigma_1^2 \iff \sigma_1^2<\frac 13\sigma_2^2\quad\text{$\hat\theta_1$ better than $\hat \theta_{ave}$} \\
\sigma_2^2<3\sigma_1^2 \iff \sigma_1^2>\frac13\sigma_2^2\quad\text{$\hat \theta_{ave}$ better than $\hat \theta_1$ }
\end{cases}
\]
\[
\begin{cases}
\sigma_1^2>3\sigma_2^2 \iff \sigma_2^2<\frac 13\sigma_1^2\quad\text{$\hat\theta_2$ better than $\hat \theta_{ave}$} \\
\sigma_1^2<3\sigma_2^2 \iff \sigma_2^2>\frac13\sigma_1^2\quad\text{$\hat \theta_{ave}$ better than $\hat \theta_2$ }
\end{cases}
\]
The MSE of $\hat\theta_2$ (red line) increases linearly: we are injecting more and more noise into that slice of data, which makes the estimate harder and harder for the same $N$.
The estimator $\hat\theta_{ML}$ is the best of all the estimators over the entire range of variation, as its MSE curve (purple line) always takes the lowest values. Only asymptotically does it seem to converge to the estimator $\hat\theta_1$, which is what we would expect from the theory.
If we had no formula to support this and had to rely only on the data and the simulation, we could never be firmly convinced that the MSE of $\hat\theta_{ML}$ is always better than that of $\hat\theta_1$: the curves are too close, and the fact that the purple line stays lower over the entire range could be just chance (from experiments alone we could never be sure). In a situation like this we could:
\begin{enumerate}
\item Run experiments with different orders of magnitude for the number of Monte Carlo simulations and check whether a common trend emerges;
\item If, with the same data, the curves are very irregular and yet one curve always stays below another, this could be a clue that the same ordering also holds in theory.
\end{enumerate}
When $\sigma_1^2=\sigma_2^2$, the MSE curves of $\hat\theta_1$ and $\hat\theta_2$ take exactly the same value at that point, and the same happens for the curves of $\hat\theta_{ML}$ and $\hat\theta_{ave}$, because the weights chosen by the ML estimator coincide with those implicit in the arithmetic mean ($p=0.5$).
When $\sigma_2^2=0$, one slice of data gives us exactly the value of the unknown parameter $\theta$, so the MSE of $\hat\theta_2$ is clearly $0$. In this particular circumstance the performance of $\hat\theta_2$ equals that of $\hat\theta_{ML}$.
When $\sigma_2^2\to+\infty$, the ML estimator gives less and less weight to that slice of data as it becomes noisier, so asymptotically the performance of $\hat\theta_{ML}$ equals that of $\hat\theta_1$ and, in particular, converges to a finite value.
The two previous points show the boundaries of the performance of the ML estimator. In both cases one variance is fixed and the other either vanishes or dominates. In the first case $\sigma_1^2$ is fixed and dominates $\sigma_2^2$, so the slice of data with variance $\sigma_2^2$ becomes the most informative and the ML estimator moves in that direction. Conversely, in the second case $\sigma_1^2$ is still fixed but is dominated by $\sigma_2^2$, meaning that the second slice of data is becoming worse, so the ML estimator bases its estimates on the better slice.
The arithmetic mean cannot guarantee this behavior because it does not take the variances into account: regardless of their values, it keeps drawing information in the same way from both slices of data.
Taking the three estimators $\hat\theta_{ave}$, $\hat\theta_{1}$ and $\hat\theta_{2}$, the curve obtained by selecting the best of them at each point is not very different from the curve of $\hat\theta_{ML}$, as shown below.
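Using the \texttt{mse} matrix from the sketch above, the pointwise best-of-three curve can be built directly; this is only meant to illustrate the comparison, under the same assumptions as before.
\begin{verbatim}
% Pointwise best among theta_1, theta_2 and theta_ave (rows 1:3 of mse),
% compared with the ML curve (row 4).
mse_best = min(mse(1:3, :), [], 1);
plot(sigma2_grid, mse_best, sigma2_grid, mse(4, :));
legend('best of the three', '\theta_{ML}');
\end{verbatim}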
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_2.jpeg}
\end{figure}
Let us observe the comparison between the theoretical error (corresponding to the variance as we have previously said) and the estimated MSE.
\begin{figure}[H]
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_3.png}
\end{figure}
\textit{We have seen that more noise means a larger error in general: increasing the variance increases the error. But if we want to estimate that error via Monte Carlo simulation, is it also harder to estimate when the data are noisier?}
The answer is clearly yes: with noisier data everything looks more random, so more simulations are needed. Let us observe what happens if we increase the number of simulations by an order of magnitude.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_4.png}
\end{figure}
\subsection*{Variation of $N$}
Let us look at how the MSE varies as $N$ increases, using both a linear and a logarithmic scale for the y-axis. We also analyze the two scenarios we may encounter: $\sigma_2^2>3\sigma_1^2$ and $\sigma_2^2\le3\sigma_1^2$.
Using a logarithmic scale on both axes, the curves show a linear trend. This means that the actual trend is a power of $N$, in this case $N^{-1}$. All the curves decrease with the same law (a \textbf{scaling law}; theoretical results are often expressed in these terms). This also follows from the theory, since the theoretical error decreases as $\frac 1N$.
\begin{note}{About scaling laws}
A scaling law is
$$
y=a\cdot x^b
$$
If we used a logarithmic scale to represent this relationship we would have:
$$
\log(y)=\log(a\cdot x^b)=\log(a)+b\cdot\log(x)
$$
Since the curve decreases with unit slope as $x$ increases, we have $b=-1$, up to the proportionality coefficient $a$, which differs from curve to curve (see the sketch after this note).
\end{note}
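In practice, the exponent $b$ can be estimated from the simulated points with a least-squares fit in the log-log domain. A minimal sketch, assuming hypothetical vectors \texttt{N\_grid} and \texttt{mse\_vs\_N} holding the grid of sample sizes and the corresponding empirical MSE values:
\begin{verbatim}
% Fit log(y) = log(a) + b*log(x) by least squares (degree-1 polyfit).
% N_grid and mse_vs_N are placeholder names for the simulated points.
coeffs = polyfit(log(N_grid), log(mse_vs_N), 1);
b_hat  = coeffs(1);          % estimated scaling exponent, expected close to -1
a_hat  = exp(coeffs(2));     % estimated proportionality constant
\end{verbatim}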
\begin{note}{Comparison between scaling laws}
For this type of problem, $\frac 1N$ is the best scaling law we can have in an estimation problem with independent samples. An estimator with this scaling law can therefore be considered a good estimator, although the constant may still make the difference between estimators with the same scaling law.
Only in extremely particular cases can we have exponential scaling laws.
\end{note}
The estimator $\hat\theta_{ML}$ not only has a scaling law of $\frac 1N$, but also has a smaller constant compared to the other estimators. In fact, we can observe that its curve is lower than the others.
We notice that in the two cases, respectively to the right and to the left of the threshold on the variance that emerged from the theory, we have a different ordering of the curves of $\hat\theta_{ave}$ and $\hat \theta_1$.
\section{MMSE}
In this exercise, we are going to use Monte Carlo simulations to verify the theoretical results obtained in section \ref{sec:ex_2}. The minimum mean square error for a random variable $Y$ given a set of observations $X$ is:
\[
\hat{Y} = \E{}{Y\mid X} = \frac{\sigma^2_Y}{\sigma^2_Y + \frac{\sigma^2_W}{N}} \bar{X}
\]
While its variance is:
\[
Var[Y|X]=\frac {\sigma^2_Y\sigma^2_W}{\sigma^2_W+N\sigma^2_Y}
\]
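A minimal MATLAB sketch of the Monte Carlo comparison between $\hat Y$ and $\bar X$ that underlies the following plots; the specific parameter values are placeholders chosen for illustration:
\begin{verbatim}
% One Monte Carlo run per experiment: draw Y from the prior, then N noisy
% observations X_i = Y + W_i, and compare the MMSE estimate with the mean.
N = 10; MC = 1e5; sigmaY2 = 1; sigmaW2 = 4;
alpha = sigmaY2 / (sigmaY2 + sigmaW2 / N);     % shrinkage coefficient
errY = zeros(1, MC); errX = zeros(1, MC);
for m = 1:MC
    Y    = sqrt(sigmaY2) * randn;              % prior with mu_Y = 0
    X    = Y + sqrt(sigmaW2) * randn(N, 1);    % noisy observations
    Xbar = mean(X);
    errY(m) = (alpha * Xbar - Y)^2;            % MMSE estimator
    errX(m) = (Xbar - Y)^2;                    % arithmetic mean
end
mse_mmse = mean(errY);    % ~ sigmaY2*sigmaW2 / (sigmaW2 + N*sigmaY2)
mse_mean = mean(errX);    % ~ sigmaW2 / N
\end{verbatim}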
In table \ref{tab:mmse} we summarize the limiting behavior of $\alpha$ and of the MMSE as $\sigma^2_W$ and $\sigma^2_Y$ vary.
\begin{table}[H]
\centering
\begin{tabular}{|c|c|c|}
\hline
$\lim$ & $\alpha$ & $MMSE$ \\
\hline
$\sigma^2_Y\to+\infty$ & $1$ & $\bar X$ \\
$\sigma^2_Y\to 0$ & $0$ & $0 (\mu_Y)$ \\
$\sigma^2_W\to+\infty$ & $0$ & $0$ \\
$\sigma^2_W\to 0$ & $1$ & $\bar X$ \\
\hline
\end{tabular}
\quad
\begin{tabular}{|c|c|c|}
\hline
$\gg$ & $\sigma^2_W/N$ & $\sigma^2_Y$ \\
\hline
$\sigma^2_W/N$ & - & $\alpha \approx \frac{\sigma^2_Y}{\sigma^2_W/N}$ \\
$\sigma^2_Y$ & $\alpha \approx 1$ & - \\
\hline
\end{tabular}
\caption{Limiting behavior of $\alpha$ and of the MMSE as $\sigma^2_W$ and $\sigma^2_Y$ vary.}
\label{tab:mmse}
\end{table}
\subsection*{Variation of $\sigma^2_Y$}
Let's first consider the case where $\sigma^2_Y$ is much larger than $\frac {\sigma^2_W}N$.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./figures/appendix_a/figure_6.png}
\caption{MSE of the estimators $\hat Y$ and $\bar X$ as a function of $\sigma^2_Y$.}
\label{fig:mmse_sigma_y}
\end{figure}
As in the theoretical analysis, we can observe that when the variance of the prior distribution is much larger than the ratio between the noise variance and the number of samples, the two estimators tend to behave in the same way. In this case $\alpha\to 1$ and the MMSE estimate tends to the arithmetic mean.
Also as expected, if we had no data (or, equivalently, data independent of $Y$) but only the prior distribution, the MMSE estimate would be the mean of the prior distribution:
$$
\mathbb E[Y|X]=\mathbb E[Y]=\mu_Y
$$
in that case the MSE would be:
$$
\mathbb E[(Y-\hat Y)^2]=Var[Y]=\sigma_Y^2
$$
So we can see this result as an upper bound on the error we can obtain.
% TODO: check the figure captions
When $\sigma^2_Y \gg \frac{\sigma^2_W}N$, the prior distribution is not very informative: a high variance of the parameter means that it may lie anywhere in the domain, so it makes sense that the MMSE behaves like the arithmetic mean, since the data are the only information we have.
\begin{figure}[H]
\centering
\begin{minipage}{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_7.png}
\end{minipage}
\hfill
\begin{minipage}{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_8.png}
\end{minipage}
\caption{MSE of the estimators $\hat Y$ and $\bar X$ as a function of $\sigma^2_Y$.}
\label{fig:mmse_sigma_y_combined}
\end{figure}
Let's now consider the case where $\sigma_Y^2$ is much smaller than $\frac {\sigma^2_W}N$.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./figures/appendix_a/figure_9.png}
%\caption{MSE of the estimators $\hat Y$ and $\bar X$ as a function of $\sigma^2_Y$.}
%\label{fig:mmse_sigma_y_small}
\end{figure}
\begin{figure}[H]
\centering
\begin{minipage}{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_10.png}
\end{minipage}
\hfill
\begin{minipage}{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_11.png}
\end{minipage}
%\caption{MSE of the estimators $\hat Y$ and $\bar X$ as a function of $\sigma^2_Y$.}
%\label{fig:mmse_sigma_y_combined}
\end{figure}
\begin{enumerate}
\item The data is very noisy compared to the prior information we have. In fact, the posterior mean performs better on average than the arithmetic mean.
\item In the extreme case, when this difference is quite pronounced, we are making an estimation based solely on the prior information. In fact, the estimates made by the MMSE are practically a constant at $0$.
\item The MMSE estimator is infused with prior information. To understand whether this information leads to better results, we can compare it with the estimator that discards the data entirely, $\hat Y=0$, whose error is $\mathbb E[Y^2]=\sigma^2_Y$ since $\mu_Y=0$ (yellow dashed line added afterwards).
Noticing that the MSE always stays below this bisector and grows sublinearly, we can appreciate that using data infused with prior information is beneficial; the error will eventually converge to that of the arithmetic mean (not visible in these plots, but visible in those for $\sigma^2_Y\gg \frac{\sigma^2_W}{N}$).
\end{enumerate}
Now, let's consider the case where $\sigma^2_Y$ is comparable to $\frac {\sigma^2_W}N$.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./figures/appendix_a/figure_12.png}
%\caption{MSE of the estimators $\hat Y$ and $\bar X$ as a function of $\sigma^2_Y$.}
%\label{fig:mmse_sigma_y_small}
\end{figure}
\begin{figure}[H]
\centering
\begin{minipage}{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_13.png}
\end{minipage}
\hfill
\begin{minipage}{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_14.png}
\end{minipage}
%\caption{MSE of the estimators $\hat Y$ and $\bar X$ as a function of $\sigma^2_Y$.}
%\label{fig:mmse_sigma_y_combined}
\end{figure}
\begin{enumerate}
\item When the two quantities are of the same order, we can appreciate how the estimates made by the MMSE, compared to those made by the arithmetic mean, tend to be closer to zero. This is because the prior information used by the MMSE says that the most probable values are those around the mean (here, zero), with a certain variance, so the estimates are pulled towards them. This reasoning is confirmed by the fact that the MSE of the MMSE estimator is smaller than that of the arithmetic mean.
\item This is the more interesting case: it shows what was described above.
\item From this point of view, we can appreciate another thing: the MSE of the arithmetic mean, up to the error introduced by approximating an expected value with a sample mean, is constant. This also follows from the theory:
$$
\mathbb E\left[\left(\frac 1N\sum X_i - Y\right)^2\right]=\mathbb E_Y\left[\underbrace{\mathbb E_{X|Y}\left[\left(\frac 1N\sum X_i-y\right)^2\right]}_{\sigma_W^2/N}\right]=\frac{\sigma_W^2}{N}
$$
\end{enumerate}
% TODO: I did not fully understand this point
In general, since the inner expectation averages over the distribution of $X$ while fixing a value $Y=y$, we would expect to obtain a function of $y$. In this specific case the result does not depend on $y$, so whatever the distribution of $Y$, the error is always this value.
\begin{note}{Note}
The fact that this error does not depend on $Y$ and the fact that the posterior variance $Var[Y\mid X]$ does not depend on $X$ are unrelated facts.
\end{note}
\subsection*{Final considerations}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./figures/appendix_a/figure_21.png}
%\caption{Caption goes here}
%\label{fig:figure_20}
\end{figure}
\begin{enumerate}
\item the theoretical error matches the error derived from the simulations
\item the posterior mean error is always smaller than the error we would have from the prior information alone (no data); this becomes more and more evident as $\sigma^2_Y$ increases, while the two errors are similar when $\sigma^2_Y$ is small compared to $\frac{\sigma^2_W}{N}$
\item the MMSE error saturates to that of the arithmetic mean (likelihood estimator).
\end{enumerate}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./figures/appendix_a/figure_22.png}
\caption{We can look at the X-axis as indicating the quality (better to the right, worse to the left) of the measuring device. The graph could be improved by reducing the Y-axis range.}
%\label{fig:figure_22}
\end{figure}
\begin{enumerate}
\item If we fix $\sigma^2_Y$ and change $\sigma_W^2$ (and thus its ratio to $N$), we move from a condition ($\sigma_Y^2>\sigma^2_W/N$) where the error of the MMSE tends to that of the arithmetic mean to a condition ($\sigma_Y^2<\sigma^2_W/N$) where the error comes from using the prior information only.
\item Thus, suppose the variance of the prior is fixed, i.e.\ we know how a certain variable varies in the environment, and we have a series of measuring devices. If these devices are of \textit{low quality}, $\sigma_W^2$ is high and we will tend to rely only on the prior information. If instead we have \textit{high-quality} devices and a low $\sigma^2_W$, we will tend to rely on the data.
\item The prior distribution represents the state of nature of a variable: we cannot modify it, only represent it more accurately. What we can act on are $\sigma^2_W$ and $N$: besides working on the quality of the devices, we can work on quantity, increasing $N$ by buying more devices or making more independent measurements. The objective is to make $\frac{\sigma^2_W}{N}$ as small as possible.
\item If the ratio tends to infinity, $\alpha \to 0$ and the MMSE becomes the mean of the prior distribution, which is $0$. The MSE then approaches the error we make when using $0$ as the estimator; since the standard deviation of $Y$ is at most $1$ here, this error is at most $1$.
\end{enumerate}
In figure \ref{fig:figure_23} we show the MSE with both axes in logarithmic scale, to evaluate the scaling law of the error.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{./figures/appendix_a/figure_23.png}
\caption{MSE as a function of $N$, with both axes in logarithmic scale.}
\label{fig:figure_23}
\end{figure}
We can see that, differently from the deterministic case, the scaling law is $\frac{1}{N}$ only asymptotically, when $N$ is sufficiently large (and we can expect this, because by increasing $N$ we approach the performance of the arithmetic mean).
In the deterministic case there is a law telling us that the best error we can obtain follows $\frac{1}{N}$ as a scaling law, although this \textit{best} estimator does not always exist.
In the Bayesian case we do not have a similar rule, but from this example we understand that as $N$ goes to infinity the data become the most important ingredient: if we collect infinitely many independent data points, the prior information becomes almost irrelevant.
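This can be checked numerically from the theoretical posterior variance $Var[Y\mid X]$ given at the beginning of this section; a quick sketch with arbitrarily chosen parameter values:
\begin{verbatim}
% Theoretical Bayesian MSE versus N: it approaches sigmaW2/N for large N.
sigmaY2 = 1; sigmaW2 = 4;
N = logspace(0, 5, 20);
mse_bayes = sigmaY2 * sigmaW2 ./ (sigmaW2 + N * sigmaY2);
loglog(N, mse_bayes, N, sigmaW2 ./ N);   % the two curves merge as N grows
\end{verbatim}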
\section{Supervised Non Parametric Regression}
In this exercise, we are going to use Monte Carlo simulations to verify the theoretical results obtained in exercise \ref{sec:ex_3} and provide an analysis of the estimators $\hat f_{opt}$, $\hat f_{naive}$ and $\hat f_{knn}$ with varying parameters, in particular $h$ for the Naive Kernel and $K$ for the K-NN. We are also going to see what happens when the LLN and Locality requirements are not satisfied.
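As a reference, here is a minimal MATLAB sketch of the two estimators written from their definitions (window average for the Naive Kernel, average of the $K$ closest training points for the K-NN); the data-generation details and the parameter values are assumptions chosen for illustration, not necessarily those used in the figures.
\begin{verbatim}
% Training data: x uniform in [0,1], y = sin(2*pi*x) + noise.
N = 100; x = rand(N, 1); y = sin(2*pi*x) + 0.3*randn(N, 1);
x0 = linspace(0, 1, 200);  h = 0.1;  K = 10;      % example hyperparameters
f_naive = zeros(size(x0)); f_knn = zeros(size(x0));
for i = 1:numel(x0)
    d = abs(x - x0(i));
    in_window = d <= h;                 % Naive Kernel: points within h of x0
    if any(in_window)
        f_naive(i) = mean(y(in_window));
    else
        f_naive(i) = NaN;               % no training point in the window
    end
    [~, idx] = sort(d);                 % K-NN: the K closest training points
    f_knn(i) = mean(y(idx(1:K)));
end
plot(x0, sin(2*pi*x0), x0, f_naive, x0, f_knn);
legend('true f', 'Naive Kernel', 'K-NN');
\end{verbatim}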
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_15.png}
\caption{Naive Kernel with $h=1$ and K-NN with $K=1$.}
\label{fig:non_parametric_regression}
\end{figure}
Figure \ref*{fig:non_parametric_regression} shows the optimal function $f(x)=\sin(2\pi x)$ in green, the Naive Kernel in red and the K-NN in yellow. We can observe that:
\begin{enumerate}
\item The Naive Kernel is a flat line because we have chosen $h=1$, which is exactly equal to the interval $a$ chosen for drawing the samples. In practice, the estimate always takes into account all the points of the training set, so whatever the $x_0$ it receives, it returns the average of the training set.
\item The K-NN follows a certain shape, though we are probably influenced by the fact that we know the true regression function. Still, it does follow a trend similar to that of the true regression function.
\end{enumerate}
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_16.png}
\caption{Naive Kernel with $h=0.5$ and K-NN with $K=5$.}
\label{fig:non_parametric_regression_2}
\end{figure}
Figure \ref*{fig:non_parametric_regression_2} shows the Naive Kernel with $h=0.5$ and the K-NN with $K=5$. We can observe that:
\begin{enumerate}
\item The Naive Kernel is starting to move according to a certain shape, but it is not yet sufficient.
\item The K-NN has further reduced its error compared to the previous plot.
\end{enumerate}
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_17.png}
\caption{Naive Kernel with $h=0.1$ and K-NN with $K=10$.}
\label{fig:non_parametric_regression_3}
\end{figure}
In figure \ref*{fig:non_parametric_regression_3} we can observe that the estimation of the Naive Kernel is now very close to the true regression function.
This improvement is due to the fact that we have reduced the amplitude of the neighborhood and increased the number of samples to use.
By using a very large $h$ (as we did before with $h=1$), we were guaranteeing the LLN but not the Locality. Therefore, the function was converging to $\E{}{Y_0}$ instead of $\E{}{Y_0|X_0}$. Instead, by reducing $h$, we have gained Locality.
By contrast, using a small $K$ ($K=1$ in the first case), we had Locality but not the LLN.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_18.png}
\caption{Naive Kernel with $h=0.01$ and K-NN with $K=100$.}
\label{fig:non_parametric_regression_4}
\end{figure}
In figure \ref*{fig:non_parametric_regression_4} we can observe that:
\begin{enumerate}
\item In some points, the Naive Kernel is not able to provide an estimate because the window is too small and contains no training points. The resulting jumps also remind us of the K-NN with $K=1$.
\item In the case of the K-NN, we are in the situation seen initially for $h=1$ in the Naive Kernel, we are converging to an expected value $\E{}{Y_0}$ instead of $\E{}{Y_0|X_0}$.
\end{enumerate}
\begin{note}{Note}
The fact that every time we approximate $\E{}{Y_0}$ we obtain a value close to zero is not a coincidence. We generated the data as $y=\sin(2\pi x)+\epsilon$, and the expected value of this quantity is the sum of two terms with zero mean.
\end{note}
To verify the consistency of the estimators, we can observe the behavior of the MSE as a function of the number of samples $N$.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_19.png}
\caption{Naive Kernel and K-NN with $N=1000$.}
\label{fig:non_parametric_regression_5}
\end{figure}
In figure \ref*{fig:non_parametric_regression_5} we can observe that:
\begin{enumerate}
\item Even though we have increased the number of samples, the oscillations of the K-NN are not satisfactory. By contrast, the Naive Kernel is approaching the true regression function.
\item We may think that the Naive Kernel, which is the naive approach, works better than the K-NN.
\item In reality, the mistake made here is expecting improvements by simply increasing $N$: we must also choose $h$ and $K$ appropriately as functions of $N$ (see the sketch after this list).
\item If $K$ remains fixed and $N$ increases, we are not guaranteeing the LLN property.
\end{enumerate}
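A possible way to couple the hyperparameters to $N$ is sketched below; the rules $K=\lceil\sqrt N\rceil$ and $h=N^{-1/5}$ are common textbook choices that satisfy the two requirements ($K\to\infty$ with $K/N\to0$, and $h\to0$ with $Nh\to\infty$), not the values actually used in the figures.
\begin{verbatim}
% Hyperparameters growing/shrinking with N so that both Locality and the
% LLN are preserved asymptotically (illustrative rules only).
N = 1000;
K = ceil(sqrt(N));        % K -> inf while K/N -> 0
h = N^(-1/5);             % h -> 0  while N*h -> inf
\end{verbatim}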
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_20.png}
\caption{Naive Kernel with $h=0.01$ and K-NN with $K=100$.}
\label{fig:non_parametric_regression_6}
\end{figure}
In figure \ref*{fig:non_parametric_regression_6} we can observe that:
\begin{enumerate}
\item By increasing $K$, as expected, we have recovered the LLN property and therefore strongly attenuated the oscillations.
\item However, we notice that at the extreme points of the domain, in this case $0$ and $1$, we are losing something. This happens because at those points we have more neighbors in one direction than in the other, which makes the estimates less local and hence less accurate. This could suggest that the number of neighbors $K$ may not be appropriate.
\end{enumerate}
In figure \ref*{fig:non_parametric_regression_7} we show a good approximation of the true regression function with both estimators.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{./figures/appendix_a/figure_20.png}
\caption{Naive Kernel with $h=0.01$ and K-NN with $K=4000$, using $N=100000$ samples.}
\label{fig:non_parametric_regression_7}
\end{figure}