# Model Selection & Fitting {#modelspec}
In this chapter the theory presented so far is applied to the scraped data through the lens of the R-INLA package. The hierarchical Latent Gaussian Model (LGM) to be fitted regresses the response, the monthly rental price, on the other scraped predictors. At the top of the hierarchy, i.e. the highest level, the model places the likelihood of the data, for which a Normal distribution is specified. At the middle level sits the GMRF containing the latent components, distributed as a multivariate Normal centered at zero and with Markov properties encoded in the precision matrix. Finally, priors are specified both for the measurement error precision and for the latent field (Penalized Complexity priors), i.e. for the variance of the spatial process and the Matérn hyperparameters. The SPDE approach triangulates the spatial domain and makes it possible to pass from a discrete surface to a continuous representation of the underlying process. This requires a series of steps: building the mesh, passing the mesh to the Matérn constructor to obtain the SPDE object, and then projecting the data through the stack function. Finally the model is fitted, integrating the spatial random field, and posterior distributions for parameters and hyperparameters are evaluated and summarized.
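Operationally, the workflow can be sketched as follows. This is a minimal outline with placeholder names (`coords`, `covariates`, `range0`, `alpha_r`, `sigma0`, `alpha_s`, `x1`); the actual calls appear in the sections below.

```{r workflow-sketch, eval=FALSE}
## Minimal sketch of the INLA-SPDE workflow (placeholder arguments):
mesh <- inla.mesh.2d(loc = coords, max.edge = c(0.01, 0.02)) # 1. triangulate the domain
spde <- inla.spde2.pcmatern(mesh,                            # 2. Matern SPDE with PC priors
  prior.range = c(range0, alpha_r),
  prior.sigma = c(sigma0, alpha_s)
)
A <- inla.spde.make.A(mesh = mesh, loc = coords)             # 3. projector matrix
stk <- inla.stack(                                           # 4. bind response, effects and A
  data = list(y = y), A = list(A, 1),
  effects = list(list(site = seq_len(spde$n.spde)), covariates)
)
fit <- inla(y ~ 0 + Intercept + x1 + f(site, model = spde),  # 5. fit the LGM
  data = inla.stack.data(stk),
  control.predictor = list(A = inla.stack.A(stk))
)
```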
```{r libs, include=FALSE}
library(geoR, warn.conflicts = F, quietly = T)
library(brinla, warn.conflicts = F, quietly = T)
library(faraway, quietly = T, warn.conflicts = F)
library(kableExtra, warn.conflicts = F, quietly = T)
library(INLA, warn.conflicts = F, quietly = T)
library(inlabru, warn.conflicts = F, quietly = T)
library(rgdal, warn.conflicts = F, quietly = T)
library(ggthemes, warn.conflicts = F, quietly = T)
library(sf, warn.conflicts = F, quietly = T)
library(ggregplot, warn.conflicts = F, quietly = T)
library(ggspatial, warn.conflicts = F, quietly = T)
library(RColorBrewer, warn.conflicts = F, quietly = T)
library(ggrepel, warn.conflicts = F, quietly = T)
## the following are used below but were missing: read_csv()/dplyr verbs (tidyverse),
## plot composition (patchwork) and TeX() labels (latex2exp)
library(tidyverse, warn.conflicts = F, quietly = T)
library(patchwork, warn.conflicts = F, quietly = T)
library(latex2exp, warn.conflicts = F, quietly = T)
```
```{r loadpreprocess}
## read the final scraped data
final_data <- read_csv("data/prova_finale.csv") %>%
  # convert all character columns to factors
  mutate((across(where(is.character), ~ forcats::as_factor(.)))) %>%
  # keep only the variables needed
  dplyr::select(lat, long, condom, totlocali, altro, bagno, cucina, heating, photosnum, floor, price, sqfeet, fibra_ottica:giardino_privato) %>%
  # convert to data frame
  as.data.frame() %>%
  # drop rows with missing values
  na.omit()
# set up the spatial coordinates for the Gaussian process
coordinate <- final_data %>%
  dplyr::select(lat, long) %>%
  as.data.frame()
coordinates(coordinate) <- c("long", "lat")
proj4string(coordinate) <- CRS("+init=epsg:28992") ## this was 28992, now putting 3003
final_data2 <- final_data %>%
  dplyr::select(-lat, -long)
confini <- readOGR(dsn = "data/confini/A090101_ComuneMilano.shp", verbose = F)
```
To make the distribution of the response, i.e. the monthly price in €, approximately Normal, a $log_{10}$ transformation is applied (other transformations, such as Box-Cox [@boxcox] and Yeo-Johnson [@yeojohnson], would have Normalized the data further, but they complicate interpretability). Moreover, all the numerical covariates, e.g. _condominium_, _sqmeter_ and _photosnum_, have already been prepared in \@ref(prep) and are further centered and scaled. The locations are displayed in map \@ref(fig:ggmap) within the borders of the Municipality of Milan. First, the borders shapefile is imported from [GeoPortale Milano](https://geoportale.comune.milano.it/sit/open-data/). Its CRS is in UTM (i.e. Eastings and Northings), which differs from the scraped spatial covariates (lat and long), so the spatial entity needs to be reprojected to the new CRS. Finally the latitude and longitude points are overlaid on the borders, as in figure \@ref(fig:ggmap).
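A minimal sketch of these two preparation steps follows; the choice of numeric covariates here is illustrative, and note that the model formulas below use the natural `log()`, which affects only the scale of the coefficients, not the approximate Normality.

```{r transform-sketch, eval=FALSE}
## log-transform the response and standardize the numeric covariates
final_data %>%
  mutate(
    log_price = log10(price), # approximately Normal response
    across(c(condom, sqfeet, photosnum),
      ~ as.numeric(scale(.x)) # center and scale the predictors
    )
  ) %>%
  summary()
```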
```{r ggmap, fig.cap="Milan Real Estate data within the Municipality borders, 4 points of interest"}
confini_st <- st_read("data/confini/A090101_ComuneMilano.shp", quiet = T)
confini <- st_transform(confini_st, "+proj=longlat +datum=WGS84") %>%
st_as_sf()
poi <- tribble(
~poi, ~lat, ~long,
"Milan Dome", 45.464211, 9.191383,
"Generali Tower", 45.4783, 9.1552,
"Meazza Stadium", 45.478489, 9.122150,
"Porta Venezia", 45.47438, 9.20501
)
poi_sf <- poi %>%
st_as_sf(coords = c("long", "lat")) %>%
st_set_crs("+proj=longlat +datum=WGS84")
ggplot(final_data) +
geom_sf(data = confini, fill = "antiquewhite1") +
geom_sf(data = poi_sf, colour = "red", size = 6, alpha = .4) +
annotation_scale(location = "bl", width_hint = 0.4) +
geom_point(aes(x = long, y = lat, colour = price), size = 3, shape = 16) +
geom_text_repel(
data = poi, aes(x = long, y = lat, label = poi), fontface = "bold", box.padding = 0.5, min.segment.length = 1.8, nudge_x = .9, nudge_y = 1,
segment.curvature = -0.4, arrow = arrow(length = unit(0.015, "npc"))
) +
coord_sf() +
scale_colour_continuous(name = "Price (€)\n", type = "viridis", alpha = .8, limits = c(200, 5000)) +
annotation_north_arrow(
location = "bl", which_north = "true",
pad_x = unit(0.95, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering
) +
theme_map() +
theme(legend.position = "right", legend.box = "horizontal")
```
## Model Specification & Mesh Assessment {#modelspecandmesh}
To harness the power of INLA, a hierarchical LGM needs to be specified. Its definition starts by fitting an exponential family distribution, i.e. the Normal distribution, to the Milan real estate rental data $\mathbf{y}$:
$$
\mathbf{y}\sim \operatorname{Normal}\left(\boldsymbol\mu, \sigma_{e}^{2}\right)
$$
The linear predictor, i.e. the mean since the link function $\mathrm{g}\left(\boldsymbol\mu\right)$ is the identity, is then:
$$
\boldsymbol\eta=b_{0}+\boldsymbol x \boldsymbol\beta+\boldsymbol\xi
$$
where, following the notation in expression \@ref(eq:linearpredictor), $\boldsymbol\xi$ is the spatial random effect assigned to the function $f_1()$ for the set of covariates $\boldsymbol{z}=\left(x_{1} = lat,\, x_{2} = long\right)$. $\boldsymbol\xi$ is also the GMRF defined in \@ref(def:gmrf); as a result it is distributed as a multivariate Normal centered at 0 whose covariance function is Matérn (fig. \@ref(fig:matern)), i.e. $\boldsymbol\xi \sim N\left(\mathbf{0}, \mathbf{Q}_{\mathscr{C}}^{-1}\right)$. The precision matrix $\mathbf{Q}_{\mathscr{C}}$ (see left panel in figure \@ref(fig:precisionmat)) is the object that needs to be treated with the SPDE approach; its dimensions are $n \times n$, which in this case, once NAs are omitted and after imputation, results in $192 \times 192$.
$\boldsymbol\beta$ are the coefficients of the scraped covariates.
The latent field is then composed of $\boldsymbol{\theta}=\{\boldsymbol{\xi}, \boldsymbol{\beta}\}$. Finally, following the final expression in \@ref(GP), the LGM can be rewritten as follows,
$$
\boldsymbol{\mathbf{y}}=\boldsymbol{z} \boldsymbol{\beta}+\boldsymbol{\xi}+\boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim N\left(\mathbf{0}, \sigma^2_{\varepsilon} I_{d}\right)
$$
Vague Gaussian priors with fixed precision are assigned to the $\boldsymbol\beta$ coefficients, while for the GMRF, Penalized Complexity priors are assigned to the Matérn process $\tilde{\boldsymbol\xi}$.
Thus, following the steps and notation in sec. \@ref(LGM), the _higher_ level eq. \@ref(eq:higher) is constituted by the **likelihood**, depending on the hyperparameter $\boldsymbol\psi_1 = \{\sigma_{\varepsilon}^2\}$, i.e. the measurement error variance:
$$\pi(\boldsymbol{\mathbf{y}} \mid \boldsymbol{\theta}, \boldsymbol{\psi_1})=\prod_{i\ = 1}^{\mathbf{192}} \pi\left(y_{i} \mid \theta_{i}, \boldsymbol{\psi_1}\right)$$
At the _medium_ level eq. \@ref(eq:medium) sits the **latent field** $\pi(\boldsymbol{\theta} \mid \boldsymbol\psi_2)$, conditioned on the hyperparameters $\boldsymbol\psi_2 = \left(\sigma_{\xi}^{2}, \phi, \nu\right)$, i.e. the Matérn variance, range and smoothness:
$$ \pi(\boldsymbol{\theta} \mid \boldsymbol{\psi_2})=(2 \pi)^{-n / 2}| \boldsymbol{Q_{\mathscr{C}}(\psi_2)}|^{1 / 2} \exp \left(-\frac{1}{2} \boldsymbol{\theta}^{\prime} \boldsymbol{Q_{\mathscr{C}}(\psi_2)} \boldsymbol{\theta}\right)$$
At the lower level, priors are considered through their joint distribution, i.e. $\boldsymbol\psi =\{ \boldsymbol\psi_1, \boldsymbol\psi_2\}$. The joint posterior distribution then follows:
$$
\pi(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \mathbf{y})\propto \underbrace{\pi(\boldsymbol{\psi})}_{\text {priors}} \times \underbrace{\pi(\boldsymbol\theta \mid \boldsymbol\psi)}_{\text {GMRF}} \times \underbrace{\prod_{i=1}^{\mathbf{192}} \pi\left(y_{i} \mid \boldsymbol\theta, \boldsymbol{\psi}\right)}_{\text {likelihood }}
$$
Once again, the objectives of Bayesian inference are the posterior marginal distributions of the latent field and of the hyperparameters, given in equation \@ref(eq:finalobj), for which the Laplace approximations in \@ref(approx) are employed.
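R-INLA exposes the choice of approximation through the `control.inla` argument. A minimal sketch with placeholder `formula` and `stack` objects follows; the `"laplace"` strategy is the most accurate, `"simplified.laplace"` the default compromise.

```{r laplace-sketch, eval=FALSE}
## request the full Laplace strategy for the posterior marginals
fit <- inla(formula,
  family = "normal", data = inla.stack.data(stack),
  control.predictor = list(A = inla.stack.A(stack)),
  control.inla = list(strategy = "laplace") # alternatives: "simplified.laplace", "gaussian"
)
```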
<!-- parameter estimation and spatial prediction -->
## Building the SPDE object {#spdemodeol}
The SPDE approach requires triangulating the domain space $\mathscr{D}$, as outlined in \@ref(spdeapproach). The function `inla.mesh.2d`, together with `inla.nonconvex.hull`, can define a good triangulation when no proper boundaries are provided. Four meshes are produced, shown in figure \@ref(fig:meshes); their triangles appear smooth and close to equilateral. The extra offset prevents the model from being affected by boundary effects. $\operatorname{Mesh}_3$ reaches the highest number of vertices and seems to have the most potential for this dataset (see the short vertex-count comparison after the chunk). The critical parameters for the meshes are `max.edge=c(0.01, 0.02)` and `offset=c(0.0475, 0.0625)`, which in fact control their resolution. Further trials have shown that below a lower bound of `max.edge` of $0.01$ the function is not able to produce the triangulation.
```{r meshes, fig.cap="Mesh triangulations for the four candidate meshes; vertex counts are shown in the panel titles"}
## Build boundary information:
## (fmesher supports SpatialPolygons, but this app is not (yet) intelligent enough for that.)
boundary.loc <- coordinate
bnd <- inla.nonconvex.hull(coordinate, convex = 1, concave = 10)
boundary <- list(
list(
inla.nonconvex.hull(coordinates(boundary.loc), 0.0475),
bnd
),
inla.nonconvex.hull(coordinates(boundary.loc), 0.0625)
)
mesh1 <- inla.mesh.2d(
boundary = boundary,
max.edge = c(0.025, 0.048),
min.angle = c(30, 21),
max.n = c(48000, 16000), ## Safeguard against large meshes.
max.n.strict = c(128000, 128000), ## Don't build a huge mesh!
cutoff = 0.01, ## Filter away adjacent points.
offset = c(0.0475, 0.0625)
) ## Offset for extra boundaries, if needed.
plt_mesh1 <- ggplot(st_as_sf(coordinate)) +
gg(mesh1) +
geom_sf(colour = "red", alpha = .4) +
ggtitle(paste("Vertices: ", mesh1$n), subtitle = TeX("$Mesh_1$")) +
coord_equal() +
coord_sf() +
theme_map()
mesh2 <- inla.mesh.2d(
boundary = boundary,
max.edge = c(0.017, 0.019), ## with a wider max edge
min.angle = c(30, 21),
max.n = c(48000, 16000), ## Safeguard against large meshes.
max.n.strict = c(128000, 128000), ## Don't build a huge mesh!
cutoff = 0.01, ## Filter away adjacent points.
offset = c(0.0475, 0.0625)
) ## Offset for extra boundaries, if needed.
plt_mesh2 <- ggplot(st_as_sf(coordinate)) +
gg(mesh2) +
geom_sf(colour = "red", alpha = .4) +
ggtitle(paste("Vertices: ", mesh2$n), subtitle = TeX("$Mesh_2$")) +
coord_equal() +
coord_sf() +
theme_map()
mesh3 <- inla.mesh.2d(
boundary = boundary,
max.edge = c(0.01, 0.02),
min.angle = c(30, 21),
max.n.strict = c(128000, 128000), ## Don't build a huge mesh!
cutoff = 0.001, ## Filter away adjacent points.
offset = c(0.0475, 0.0625)
) ## Offset for extra boundaries, if needed.
plt_mesh3 <- ggplot(st_as_sf(coordinate)) +
gg(mesh3) +
geom_sf(colour = "red", alpha = .4) +
ggtitle(paste("Vertices: ", mesh3$n), subtitle = TeX("$Mesh_3$")) +
coord_equal() +
coord_sf() +
theme_map()
mesh4 <- inla.mesh.2d(
boundary = boundary,
max.edge = c(0.012, 0.014),
min.angle = c(30, 21),
max.n.strict = c(128000, 128000), ## Don't build a huge mesh!
cutoff = 0.001, ## Filter away adjacent points.
offset = c(0.0475, 0.0625)
) ## Offset for extra boundaries, if needed.
plt_mesh4 <- ggplot(st_as_sf(coordinate)) +
gg(mesh4) +
geom_sf(colour = "red", alpha = .4) +
ggtitle(paste("Vertices: ", mesh4$n), subtitle = TeX("$Mesh_4$")) +
coord_equal() +
coord_sf() +
theme_map()
plt_mesh1 + plt_mesh2 + plt_mesh3 + plt_mesh4
```
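Since the choice among the four meshes rests on the number of vertices, a quick comparison can be printed. A minimal sketch follows; the `$n` slot of an `inla.mesh` object holds its vertex count.

```{r mesh-compare, eval=FALSE}
## number of vertices per candidate mesh
data.frame(
  mesh = paste0("mesh", 1:4),
  vertices = c(mesh1$n, mesh2$n, mesh3$n, mesh4$n)
)
```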
The SPDE model object is built with the function `inla.spde2.pcmatern()`, which takes as arguments the mesh triangulation, i.e. $\operatorname{mesh}_3$, and the PC priors \@ref(priorsspec), satisfying the following probability statements: $\operatorname{Prob}(r<3060)< 0.01$, where $r$ is the spatial range deduced from \@ref(fig:semivariogram), and $\operatorname{Prob}(\sigma_{\xi}> 0.9)< 0.05$, where $\sigma_{\xi}$ is the marginal standard deviation of the field (on the log scale) \@ref(spatassess).
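These two statements map one-to-one onto the `prior.range` and `prior.sigma` arguments of `inla.spde2.pcmatern()`:
$$
\operatorname{Prob}(r < r_{0}) = \alpha_{r}, \qquad \operatorname{Prob}(\sigma_{\xi} > \sigma_{0}) = \alpha_{\sigma}
$$
with $(r_{0},\, \alpha_{r}) = (3060,\, 0.01)$ passed as `prior.range` and $(\sigma_{0},\, \alpha_{\sigma}) = (0.9,\, 0.05)$ as `prior.sigma` in the chunk below.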
As model complexity increases, e.g. when several random effects enter the linear predictor, the SPDE object can become difficult to handle. The `inla.stack` function therefore rebuilds the linear predictor as a combination of the elements of the old linear predictor and the matrix of observations. Further mathematical details about the stack can be found in @Blangiardo-Cameletti, but are beyond the scope of this analysis.
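The key ingredient of the stack is the projector matrix $A$, which maps the mesh-vertex weights onto the observation locations. A minimal sketch, assuming the `mesh3` and `coordinate` objects from above:

```{r projector-sketch, eval=FALSE}
## the projector matrix has one row per observation and one column per mesh vertex
A1 <- inla.spde.make.A(mesh = mesh3, loc = coordinate)
dim(A1) # n_observations x mesh3$n
```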
```{r spde_matern}
## SPDE object on mesh3 with the PC priors stated above
spde2 <- inla.spde2.pcmatern(mesh3,
  prior.range = c(3060, 0.01),
  prior.sigma = c(0.9, 0.05)
)
## default Matern parametrization; note that the models below use this one
spde3 <- inla.spde2.matern(mesh3)
```
## Model Selection
Since there are many covariates, there are also many possible combinations of predictors that may capture complex dynamics. However, several natively implemented INLA features can be leveraged for model comparison. Section \@ref(devbased) presents a set of tools to compare models based on deviance criteria. One of them is the DIC, which balances goodness of fit against a penalty that grows with the number of features introduced, eq. \@ref(eq:dic).
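For reference, the DIC combines the posterior mean deviance $\overline{D(\boldsymbol\theta)}$ with the effective number of parameters $p_{D}$:
$$
\mathrm{DIC} = \overline{D(\boldsymbol\theta)} + p_{D}
$$
so an added predictor is retained only when its improvement in fit offsets the increase in $p_{D}$.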
Consequently, a Forward Stepwise Selection [@guyon2003introduction] algorithm is designed around the `inla` function call. Predictors are introduced into the model one at a time, and the results are stored in a dataframe. In the end the most satisfying combination according to the DIC is singled out.
```{r model_sel, eval=F}
## add the variables to the stack
A1 <- inla.spde.make.A(mesh = mesh3, loc = coordinate)
stack1 <- inla.stack(
tag = "estimation", ## tag
data = list(price = final_data2$price), ## response
A = list(A1, 1), ## projector matrices (SPDE and fixed effects)
effects = list(
list(site = seq_len(spde3$n.spde)), ## random field index
final_data2 %>%
as.data.frame() %>%
transmute(
Intercept = 1, condom, totlocali, altro, bagno, cucina, heating,
photosnum, floor, sqfeet, fibra_ottica, videocitofono, impianto_di_allarme,
reception, porta_blindata, esposizione_esterna, cancello_elettrico, portiere_intera_giornata,
balcone, arredato, impianto_tv_centralizzato, armadio_a_muro, portiere_mezza_giornata,
caminetto, esposizione_interna, mansarda, parzialmente_arredato, taverna, idromassaggio,
terrazza, esposizione_doppia, giardino_comune, giardino_privato
) ## fixed effect covariates
)
)
covs_df <- final_data %>%
dplyr::select(-lat, -long, -altro, -price) %>%
na.omit() %>%
as.data.frame()
for (i in 1:16) {
f1 <- as.formula(paste0("log(price) ~ -1 + Intercept + f(site, model = spde3) + ", paste0(colnames(covs_df)[1:i], collapse = " + ")))
model1 <- inla(f1,
family = "normal", data = inla.stack.data(stack1),
control.predictor = list(A = inla.stack.A(stack1)),
control.compute = list(waic = TRUE, dic = TRUE)
)
model_selection <- if (i == 1) {
rbind(c(model = paste(colnames(covs_df)[1:i]), waic = model1$waic$waic, dic = model1$dic$dic))
} else {
rbind(model_selection, c(model = paste(colnames(covs_df)[1:i], collapse = " + "), waic = model1$waic$waic, dic = model1$dic$dic))
}
}
# model_selection2 =data.frame(model_selection)
# tibble(predictors = 1:15,
# waic = model_selection2$waic,
# dic = model_selection2$dic) %>%
# gather(statistic, value, -predictors) %>%
# ggplot(aes(predictors, value)) +
# geom_line(show.legend = F) +
# geom_point(show.legend = F) +
# facet_wrap(~ statistic, scales = "free")
#
model_selection[13, 3]
## the best candidate is model number 13,
## i.e. the one containing the following predictors:
## "condom + totlocali + bagno + cucina + heating + photosnum + floor + sqfeet + fibra_ottica + videocitofono + impianto_di_allarme + reception + porta_blindata"
```
```{r}
print(paste("condom + totlocali + bagno + cucina + heating + photosnum + floor + sqfeet + fibra_ottica + videocitofono + impianto_di_allarme + reception + porta_blindata", "WAIC = -7.532", "DIC =-9.001"))
```
## Parameter Estimation and Results {#fit}
The candidate statistical model is now fitted by rebuilding the stack and calling `inla` on the final formula, i.e. the selected set of predictors. Once the model has run, the conclusions for the random and fixed effects are discussed.
```{r modelfitting}
## rebuild the stack according to the forward stepwise selection
A1 <- inla.spde.make.A(mesh = mesh3, loc = coordinate)
stack_fin <- inla.stack(
tag = "estimation", ## tag
data = list(price = final_data2$price), ## response
A = list(A1, 1), ## projector matrices (SPDE and fixed effects)
effects = list(
list(site = seq_len(spde3$n.spde)), ## random field index
final_data2 %>%
as.data.frame() %>%
transmute(
Intercept = 1, condom, totlocali, altro, bagno, cucina, heating,
photosnum, floor, sqfeet, fibra_ottica, videocitofono, impianto_di_allarme,
reception, porta_blindata
) ## fixed effect covariates
)
)
model_final <- inla(log(price) ~ 0 + Intercept + condom + totlocali + altro + bagno + cucina + heating +
photosnum + floor + sqfeet + fibra_ottica + videocitofono + impianto_di_allarme +
reception + porta_blindata + f(site, model = spde3),
family = "normal", data = inla.stack.data(stack_fin),
control.predictor = list(A = inla.stack.A(stack_fin)),
control.compute = list(waic = TRUE, cpo = TRUE)
)
summ_fin <- model_final$summary.fixed %>%
tibble::rownames_to_column("coefficients") %>%
dplyr::select(-mode, -kld) %>%
mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
arrange(desc(mean))
summ_fin %>%
head(10) %>%
kbl(caption = "Summary statistics for the top 10 coefficients arranged by descending mean", booktabs = T) %>%
kable_minimal()
```
Since the model coefficients are many, i.e. `r dim(summ_fin)[1]`, the analysis concentrates on the most interesting ones. Looking at table \@ref(tab:modelfitting), where coefficients are arranged by descending mean, the posterior mean of the intercept, rescaled back from the log scale, is quite high, indicating an average monthly house price of `r paste(round(exp(summ_fin[1,2]),2),"€", sep = " ")`.
Moreover, being a "Trilocale" or a "Bilocale", as pointed out in \@ref(mvp), carries a monthly premium of `r paste(round(exp(summ_fin[2,2]),2),"€", sep = " ")` and `r paste(round(exp(summ_fin[4,2]),2),"€", sep = " ")` respectively. The standard deviations, however, are quite high.
Unsurprisingly, the covariate condominium is positively correlated with the response price, but its effect is barely tangible. This might imply that pricier apartments carry, on average, the same condominium costs as cheaper ones.
Moreover, one strongly demanded and well paid feature is a floor heating system, regardless of whether it is autonomous or centralized.
INLA outputs the mean and the standard deviation of the measurement error precision, i.e. $1 / \sigma_{\varepsilon}^{2}$; note that its interpretation differs from that of the variance. Figure \@ref(fig:summhyper) plots the marginal posterior distributions of the hyperparameters $\boldsymbol\psi$.
<!-- With a combination of `inla.tmarginal()` and `inla.emarginal()`, refer to \@ref(rinla), it is possible to compute the mean and the sd for the error, see TAB (...) below. -->
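In that spirit, a minimal sketch of how the precision marginal can be transformed into a variance marginal and summarized, assuming the `model_final` object fitted above (`inla.zmarginal()` prints summary statistics):

```{r sigma2-marginal, eval=FALSE}
## transform the marginal of the Gaussian observation precision into sigma_e^2
prec_marg <- model_final$marginals.hyperpar[["Precision for the Gaussian observations"]]
sigma2_marg <- inla.tmarginal(function(x) 1 / x, prec_marg)
inla.emarginal(function(x) x, sigma2_marg) # posterior mean of sigma_e^2
inla.zmarginal(sigma2_marg)                # full summary (mean, sd, quantiles)
```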
```{r table_hyper}
summary_hyperpar <- tibble::rownames_to_column(model_final$summary.hyperpar, "hyper-param") %>%
as_tibble() %>%
dplyr::select(-mean, -"0.975quant", -"0.5quant", -"0.025quant")
summary_hyperpar %>%
kbl(caption = "Summary statistics for Hyper Parameters", booktabs = T) %>%
kable_minimal()
```
```{r summhyper, fig.cap="Marginal Hyper Parameter distributions for each element of Psi"}
plot_sigma2 <- post_plot(model_final, quantity = "marginals.hyperpar", coef = "Precision for the Gaussian observations") +
ylab(TeX("$\\pi(\\,\\sigma^2_{e} \\,|\\, \\mathbf{y}\\,)$")) +
xlab(TeX("$\\sigma^2_{e}$"))
plot_kappa <- post_plot(model_final, quantity = "marginals.hyperpar", coef = "Theta1 for site") +
ylab(TeX("$\\pi(\\,\\kappa \\,|\\, \\mathbf{y}\\,)$")) +
xlab(TeX("$\\kappa$"))
plot_nu <- post_plot(model_final, quantity = "marginals.hyperpar", coef = "Theta2 for site") +
ylab(TeX("$\\pi(\\,\\nu \\,|\\, \\mathbf{y}\\,)$")) +
xlab(TeX("$\\nu$"))
(plot_nu + plot_kappa) / plot_sigma2
```
<!-- plot the latent field -->
## Plot GMRF
Finally, it is of interest to project the latent field onto the spatial domain. This investigation can reveal possible omitted variables, i.e. how much variance in the response the predictors fail to capture. Plot \@ref(fig:pltgmrf), computed on a grid of $150 \times 150$ points, suggests that the variance is adequately explained, while also evidencing three distinctive darker and relatively cheaper areas.
```{r pltgmrf, fig.cap="Gaussian Markov Random field of the final model projected onto the spatial field"}
A1.grid <- inla.mesh.projector(mesh3, dims = c(150, 150)) ## c(41, 41)
inla.mesh.project(A1.grid, model_final$summary.random$site) %>%
as.matrix() %>%
as.data.frame() %>%
bind_cols(
expand.grid(x = A1.grid$x, y = A1.grid$y)
) %>%
filter(!is.na(ID)) -> eta_spde
ggplot(final_data) +
# geom_sf(data = confini,fill = "antiquewhite1") +
geom_tile(data = eta_spde, aes(x = x, y = y, fill = mean)) +
geom_point(aes(x = long, y = lat, colour = price), size = 3, shape = 16) +
scale_colour_continuous(name = "Price (€)\n", type = "viridis", alpha = 1, limits = c(200, 5000)) +
labs(
x = "",
y = ""
) +
scale_fill_viridis_c(alpha = .8) +
theme_nicco()
```
## Spatial Model Criticism
On the basis of the PIT \@ref(predbase), the model does not seem to show any consistent residual trend nor failures (i.e. values that are bizarre in relation to the others): $fails = $ `r sum(model_final$cpo$failure)` outliers. The seemingly Uniform distribution in \@ref(fig:modelcpo) indicates that the model is correctly specified. This is summarized by the negative sum of the log cross-validated CPO values (the negative LPML), which yields `r -sum(log(model_final$cpo$cpo)) %>% round(3) `.
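For reference, both criteria are leave-one-out predictive quantities:
$$
\mathrm{CPO}_{i} = \pi\left(y_{i} \mid \boldsymbol{y}_{-i}\right), \qquad \mathrm{PIT}_{i} = \operatorname{Prob}\left(Y_{i} \leq y_{i} \mid \boldsymbol{y}_{-i}\right)
$$
so a well calibrated model yields approximately Uniform PIT values, while small CPO values flag observations that are poorly predicted by the rest of the data.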
```{r modelcpo, fig.cap="PIT cross-validation statistics for the final model"}
ggplot(data = NULL, aes(x = model_final$cpo$pit)) +
geom_histogram(
binwidth = 0.1,
boundary = 2,
# col="red",
fill = "lightblue",
alpha = .5
) +
geom_density(col = 5) +
labs(
x = "PIT",
y = " "
) +
theme_nicco()
```
## Spatial Prediction on a Grid
A gridded object is required to project the posterior mean onto the domain space. These operations are usually computationally expensive, since grid sizes can easily reach $10^4$ to $10^6$ points. For the purposes of spatial prediction, a $10km \times 10km$ grid with $100 \times 100$ points is considered. The intention is to chart the smoothed price surface over the prediction grid, alongside the corresponding uncertainty, i.e. the standard error.
Higher prices in \@ref(fig:predgrid) are observed near the points of interest in \@ref(fig:ggmap), whereas the north-west displays lower monthly prices. The variance is considerable across the domain and, unsurprisingly, decreases where the data are denser.
```{r predgrid, fig.cap="Predictive posterior mean (left) and standard deviation (right) on a 100 x 100 grid"}
A1.grid <- inla.mesh.projector(mesh3, dims = c(100, 100)) # 41 x 41
expand.grid(X = A1.grid$x, Y = A1.grid$y) %>%
mutate(Intercept = 1) -> grid_data
stack1_grid <- inla.stack(
tag = "grid", ## tag
data = list(price = NA), ## response
A = list(A1.grid$proj$A, 1), ## projector matrices (SPDE and fixed effects)
effects = list(
list(site = seq_len(spde3$n.spde)), ## random field index
grid_data ## covariates at grid locations
)
)
stack_all <- inla.stack(stack_fin, stack1_grid)
model_grid <- inla(log(price) ~ 0 + Intercept + X + Y + f(site, model = spde3),
family = "normal", data = inla.stack.data(stack_all),
control.predictor = list(
A = inla.stack.A(stack_all),
link = 1
),
control.compute = list(waic = TRUE),
control.mode = list(
theta = model_final$mode$theta,
restart = FALSE
),
control.results = list(
return.marginals.random = FALSE,
return.marginals.predictor = FALSE
)
)
si <- inla.stack.index(stack_all, "grid")$data
grid_data %>%
bind_cols(model_grid$summary.fitted.values[si, ]) %>%
`coordinates<-`(~ X + Y) %>%
`proj4string<-`(CRS(SRS_string = "EPSG:3003")) -> gd
mean <- gd[!is.na(gd), ] %>%
as.data.frame() %>%
ggplot() +
geom_sf(data = confini) +
geom_tile(aes(x = X, y = Y, fill = mean)) +
scale_fill_viridis_c(alpha = .8) + # , limits=c(0,5000)
# scale_colour_brewer(palette = "Greens")+
coord_fixed() +
coord_sf() +
# geom_point(data = final_data, aes(x = long, y = lat), colour = "red", alpha = .3)+
labs(
x = "",
y = ""
) +
# gg(mesh3,alpha = .2, int.color = "red") +
theme_nicco() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
sd <- gd[!is.na(gd), ] %>%
as.data.frame() %>%
ggplot() +
geom_sf(data = confini) +
geom_tile(aes(x = X, y = Y, fill = sd)) +
scale_fill_viridis_c(alpha = .8) + # , limits=c(0,5000)
# scale_colour_brewer(palette = "Greens")+
coord_fixed() +
coord_sf() +
# geom_point(data = final_data, aes(x = long, y = lat), colour = "red", alpha = .3)+
labs(
x = "",
y = ""
) +
# gg(mesh3,alpha = .2, int.color = "red") +
theme_nicco() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
mean + sd
```