---
title: 'Comparison of Local Regression in different R packages'
author: "Issam Arabi, Eric Maeder"
date: "2022-12-20"
output: html_document
bibliography: ./references.bib
link-citations: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(dplyr)
library(locpol)
library(fANCOVA)
library(KernSmooth)
library(wavethresh)
library(ggplot2)
library(patchwork)
```
## Introduction
When applying statistical methods in R, one is often faced with several built-in functions and packages that tackle the same problem. Especially when dealing with large amounts of data, it is crucial yet difficult to choose the best-suited one. Here, "best-suited" means performance in terms of both the precision of the results and the computation time.
In this context, we explore R packages that implement local polynomial regression and assess their performance by applying them, and timing their execution, on data sets of varying sizes and structures, to thoroughly assess their versatility.
## Plan
In this report, we follow the method of analysis of @deng2011density on density estimation. We start with a short introduction to local polynomial regression, followed by a description of the R packages we examine. Next, we compare the computation times of the different functions when applied to data sets of increasing sizes. Then, we measure the performance of the considered functions on data sets with more challenging structures. Afterward, we discuss the results, and lastly we summarize and conclude.
## Local Polynomial Regression
Local polynomial regression, most widely known through the Locally Estimated Scatterplot Smoothing (LOESS) method, is a powerful prediction tool when observing bounded data sets with non-trivial distributions. It is a non-parametric method, which makes it versatile and easy to use.
When observing response values $Y_1, \dots, Y_n$ at points $X_1, \dots, X_n$, local polynomial regression fits, at each evaluation point $x$, the coefficients
$$\hat{\beta} = \underset{\beta \in \mathbb{R}^{p+1}}{\mathrm{argmin}} \sum_{i=1}^n \left[Y_i - \beta_0 - \beta_1(X_i - x) - \dots - \beta_p (X_i - x)^p\right]^2 K\!\left(\frac{X_i - x}{h}\right),$$
and assigns to $x$ the fitted intercept $\hat{\beta}_0$. Here $p$ is the order of the fitted polynomial, $K$ is a kernel function (essentially a weighting function), and $h$ is a bandwidth parameter controlling how much influence distant observations have on the local fit.
As we can see, this is a weighted least squares problem: writing out the gradient and setting it to zero reduces it to solving a small linear system at each evaluation point.
The resulting local fits are then combined to obtain a smooth curve over the whole considered region. In practice, low polynomial degrees ($p < 5$, most commonly 1 or 2) are used because the computational cost grows with the degree; this is the approach we adopt in this report.
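To make the estimator concrete, here is a minimal sketch (our own illustration, not code from any of the packages compared below) that computes a degree-$p$ local polynomial estimate at a single evaluation point by weighted least squares with a Gaussian kernel:
```{r lp-sketch, eval=FALSE}
# Minimal sketch of a degree-p local polynomial estimate at a point x0,
# using weighted least squares with a Gaussian kernel of bandwidth h.
local_poly_at <- function(x0, x, y, p = 2, h = 0.25) {
  w <- dnorm((x - x0) / h)                            # kernel weights K((X_i - x0) / h)
  X <- outer(x - x0, 0:p, `^`)                        # design matrix: 1, (X_i - x0), ..., (X_i - x0)^p
  beta <- solve(t(X) %*% (w * X), t(X) %*% (w * y))   # weighted least squares solution
  beta[1]                                             # the estimate at x0 is the intercept beta_0
}

# Example: estimate a noisy sine curve over a grid of evaluation points
x <- seq(-3, 3, by = 0.05)
y <- sin(x) + rnorm(length(x), sd = 0.1)
fit <- sapply(x, function(x0) local_poly_at(x0, x, y))
```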
## Local Polynomial Regression Packages
In this section, we give a quick overview of the considered functions and their parameters.
+ **stats::loess**, @stats, according to its documentation, fits a polynomial surface determined by one or more numerical predictors, using local fitting. Its typical call is as follows:

```
loess(formula, data, weights, subset, na.action, model = FALSE,
      span = 0.75, enp.target, degree = 2,
      parametric = FALSE, drop.square = FALSE, normalize = TRUE,
      family = c("gaussian", "symmetric"),
      method = c("loess", "model.frame"),
      control = loess.control(...), ...)
```
In our case, however, we will be working with fully specified data sets, with no missing values and no prior information about the importance of individual observations, so we will only use the following:

```
loess(formula, data,
      span = 0.75, degree = 2,
      family = "gaussian",
      method = "loess")
```

where *formula* specifies the response and explanatory variables, *span* controls the size of the local neighbourhood (playing the role of a bandwidth), *degree* is self-explanatory, and *family = "gaussian"* specifies a least-squares fit.
*Span* works as follows: when given a value smaller than 1, each local fit takes into account the proportion of observations equal to the span value, with tri-cubic weighting $(1 - (\frac{\text{distance to point}}{\text{maximal distance to considered points}})^3)^3$; a small sketch of this weighting is given after this package list.
This approach avoids the instability that arises with badly spread data, where no observation falls within a fixed bandwidth around the evaluation point, at the cost of computing the distance to every data point. It also makes loess harder to compare with the other functions, as we cannot give them exactly the same inputs.
+ **locpol::locpol**, @locpol, follows the steps of the local polynomial regression described in the previous section. Its call is as follows:

```
locpol(formula, data, weig = rep(1, nrow(data)), bw = NULL, kernel = EpaK, deg = 1,
       xeval = NULL, xevalLen = 100)
```

Again, we will not use all parameters for the comparison in this report, although they offer room for versatility. We will therefore mostly use:

```
locpol(formula, data, bw = NULL, kernel = EpaK, deg = 1)
```

where *formula*, *data*, and *deg* are identical to the corresponding parameters of the loess function. The *bw* parameter is the bandwidth, and *kernel* is the kernel function, which is the Epanechnikov kernel by default but can be Gaussian, uniform, etc.
This closely follows the theoretical approach we chose.
+ **KernSmooth::locpoly**, @kernsmooth, fits a local polynomial with kernel weighting to estimate a regression function. It can also be used to estimate a density and its derivatives. We call it as follows:

```
locpoly(x, y, drv = 0L, degree, kernel = "normal",
        bandwidth, gridsize = 401L, bwdisc = 25,
        range.x, binned = FALSE, truncate = TRUE)
```

Here, *x* and *y* are the explanatory and response variables, *drv* concerns derivative estimation, *kernel* and *bandwidth* are self-explanatory, *gridsize* is the number of equidistant points over which we want to estimate the function, and the last parameters are used to speed up computations when the data is pre-processed. This approach is quite similar to locpol, and consequently to our theoretical version; the similarity in input parameters makes the two easy to compare.
+ **fANCOVA::loess.as**, @fancova, performs local polynomial regression with automatic smoothing-parameter selection, using either generalized cross-validation or AICc as the criterion. It also gives the option to select the parameter manually, which we will use for comparison. This is the usage:

```
loess.as(x, y, degree = 1, criterion = c("aicc", "gcv"),
         family = c("gaussian", "symmetric"), user.span = NULL,
         plot = FALSE, ...)
```

where *x*, *y*, *degree*, and *criterion* are self-explanatory. We will use the Gaussian family, corresponding to the least-squares approach, and *user.span* allows a manual choice of the smoothing parameter (i.e. the span value). *plot* optionally generates a visualization of the results.
#### Other more specific local regressions
+ **caret::knnreg**, @caret, and other k-nearest-neighbour approaches are a substitute for the bandwidth approach: by always taking the k nearest observations into account, they ensure that isolated points and outliers on the x-axis do not have too much impact on the results.
+ **spgwr::gwr**, @spgwr, performs geographically weighted regression, a special type of local regression for spatially ordered data that takes the distance of each observation into account when fitting a local estimate.
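To illustrate the span mechanism described above for loess, the following sketch (our own illustration, not code from the stats package, which handles some details differently) computes tri-cube weights for the $\lceil \text{span} \cdot n \rceil$ observations nearest to an evaluation point; observations outside this neighbourhood get weight zero.
```{r tricube-sketch, eval=FALSE}
# Illustration of loess-style span weighting: the ceiling(span * n) nearest
# observations to x0 receive tri-cube weights, all others receive weight zero.
tricube_weights <- function(x0, x, span = 0.75) {
  d <- abs(x - x0)                  # distances to the evaluation point
  q <- ceiling(span * length(x))    # number of observations in the local neighbourhood
  d_max <- sort(d)[q]               # distance to the q-th nearest observation
  (1 - pmin(d / d_max, 1)^3)^3      # zero outside the neighbourhood, tri-cube inside
}

# Example: weights around x0 = 0 for 101 equispaced points and span 0.3
x <- seq(-5, 5, by = 0.1)
w <- tricube_weights(0, x, span = 0.3)
```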
## Comparison of Computation times
```{r, echo = FALSE}
X4 <- seq(from = -5, to = 5, by = 0.01)  # 1001 equispaced evaluation points
d4 <- dnorm(X4)                          # standard normal density as target values
df4 <- data.frame(X4, d4)
# accumulators for the total elapsed time of each function
lo <- 0
lp <- 0
loas <- 0
loly <- 0
for (i in 1:1000){
t <- Sys.time()
loess(d4 ~ X4, data = df4, family = "gaussian", method = "loess", span = 0.075)
lo <- lo + Sys.time() - t
}
for (i in 1:1000){
t <- Sys.time()
locpol(d4 ~ X4, data = df4, bw = 0.025, kernel = gaussK, deg = 2)
lp <- lp + Sys.time() - t
}
for (i in 1:1000){
t <- Sys.time()
loess.as(df4$X4, df4$d4, degree = 2, criterion = "aicc", family = "gaussian")
loas <- loas + Sys.time() - t
}
for (i in 1:1000){
t <- Sys.time()
locpoly(df4$X4, df4$d4, degree = 2, kernel = "normal", bandwidth = 0.025, gridsize = 1001)
loly <-loly + Sys.time() - t
}
```
To compare computation times, we run each function 1000 times and report the average runtime. Below are the results.
```{r, echo = FALSE}
freq_data <- tibble(loess=lo/1000, locpol=lp/1000, loess.as=loas/1000, locpoly=loly/1000)
print(freq_data)
```
From this data, we see that loess.as takes considerably more time than the other functions, followed by locpol, then loess; the most computationally efficient function is locpoly. The gap is due to loess.as computing its optimal smoothing parameter internally, which requires running a selection algorithm such as cross-validation (or, more generally, a K-fold approach). This takes extra time but has the advantage of choosing an efficient span parameter and applying it in a single call. A fairer comparison would be to time a K-fold selection of a bandwidth or span parameter on the same data and add its execution time to that of the other functions; a sketch of such a selection is given below.
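As a rough illustration of that idea, the following sketch (our own, not part of any of the packages above; the candidate span grid and $K = 5$ are arbitrary choices) selects a loess span by K-fold cross-validation and times the selection, so that this cost could be added to the runtime of the functions that expect a pre-chosen smoothing parameter:
```{r kfold-span-sketch, eval=FALSE}
# Sketch of an external K-fold cross-validation for the loess span parameter.
select_span_kfold <- function(x, y, spans = seq(0.05, 0.95, by = 0.05), K = 5) {
  n <- length(x)
  folds <- sample(rep(1:K, length.out = n))   # random fold assignment
  cv_error <- sapply(spans, function(s) {
    fold_mse <- sapply(1:K, function(k) {
      train <- folds != k
      fit <- loess(y ~ x, data = data.frame(x = x[train], y = y[train]),
                   span = s, degree = 2, family = "gaussian",
                   control = loess.control(surface = "direct"))  # "direct" allows prediction outside the training range
      pred <- predict(fit, newdata = data.frame(x = x[!train]))
      mean((y[!train] - pred)^2, na.rm = TRUE)                   # held-out mean squared error
    })
    mean(fold_mse)
  })
  spans[which.min(cv_error)]
}

# Time the selection on the data set used above for the timing comparison
selection_time <- system.time(best_span <- select_span_kfold(df4$X4, df4$d4))
```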
## Comparison of approximation residuals
Now, we delve into comparing the fits produced by the different functions when applied to the same data sets. To do so, we use two families of data sets: one stemming from a Gaussian density with an added sinusoidal component, and one originating from the claw distribution, @wavethresh, with added noise. For both families, we look at samples of size $10^2, 10^3, 10^4$ and $10^5$. On each of those, we apply the four considered functions and calculate the $R^2$ value (coefficient of determination), representing the goodness of fit. This allows comparisons between different functions on the same data set and between different data set sizes for the same function.
Worth keeping in mind is that two of the functions take a span parameter (the proportion of data points to consider in each local fit), while the other two take a bandwidth parameter. The two groups are therefore not directly comparable, since we do not run any algorithm, such as cross-validation, to choose the bandwidth. For the span, such an algorithm is already implemented in loess.as, so we run loess with the span it selects, which puts the two span-based functions on a comparable footing.
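For reference, the span selected by loess.as can be read from the returned fit and reused in a plain loess call (an assumption on our part: loess.as returns a standard loess object, whose selected span is stored in its `pars` component):
```{r span-extraction, eval=FALSE}
# Read the span selected by loess.as and reuse it in a plain loess call
fit_as   <- loess.as(df4$X4, df4$d4, degree = 2, criterion = "aicc", family = "gaussian")
chosen   <- fit_as$pars$span
fit_same <- loess(d4 ~ X4, data = df4, span = chosen, degree = 2,
                  family = "gaussian", method = "loess")
```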
Since the locpol function cannot be applied to the largest data sets, we set the corresponding $R^2$ values to 0 in the final plot.
We output the fitted curves, a table of the $R^2$ coefficients, and lastly a plot of the coefficients of determination for each data set size.
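The $R^2$ computation is repeated for every fit in the chunks that follow; a small helper of the following form could factor it out (shown here only as a sketch, the chunks below keep the explicit computation):
```{r r2-helper, eval=FALSE}
# Coefficient of determination R^2 = 1 - SSres / SStot for observations y and fitted values
r_squared <- function(y, fitted) {
  ss_res <- sum((y - fitted)^2)
  ss_tot <- sum((y - mean(y))^2)
  1 - ss_res / ss_tot
}
```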
Let us start with the smaller datasets:
``` {r c2, echo = FALSE}
X3 <- seq(from = -5, to = 5, by= 0.1)
X4 <- seq(from = -5, to = 5, by= 0.01)
X5 <- seq(from = -5, to = 5, by= 0.001)
X6 <- seq(from = -5, to = 5, by= 0.0001)
X7 <- seq(from = -5, to = 5, by= 0.00001) # X-axis evaluation points
Y3 <- dclaw(X3) # claw distribution
n3 <- rnorm(X3, sd = 0.1) # gaussian noise
Y3 <- Y3 + n3
# applying the functions:
daf3 <- data.frame(X3,Y3)
Lo3c <- loess(Y3 ~ X3, data = daf3, family = "gaussian", method = "loess", span = 0.587) # span taken from the loess.as selection (see text)
Lpc3 <- locpol(Y3 ~ X3, data = daf3, bw = 0.25, kernel = gaussK, deg = 2)
Loasc3 <- loess.as(daf3$X3, daf3$Y3, degree = 2, criterion = "aicc", family = "gaussian") # selects its span via AICc
LoLyc3 <- locpoly(daf3$X3, daf3$Y3, degree = 2, kernel = "normal", bandwidth = 0.25, gridsize = 101) # returns only the fitted grid values
# R^2 coefficients
#Loess
SSres <- sum((Y3 - Lo3c$fitted)^2)
SStot <- sum((Y3 - mean(Y3))^2)
R2Lo3 <- 1 - SSres/SStot
#Locpol
SSres <- sum((Y3 - Lpc3$mf[1])^2)
SStot <- sum((Y3 - mean(Y3))^2)
R2Lp3 <- 1 - SSres/SStot
#Loess.as
SSres <- sum((Y3 - Loasc3$fitted)^2)
SStot <- sum((Y3 - mean(Y3))^2)
R2Loas3 <- 1 - SSres/SStot
#locpoly
SSres <- sum((Y3 - LoLyc3$y)^2)
SStot <- sum((Y3 - mean(Y3))^2)
R2Loly3 <- 1 - SSres/SStot
#plot
fitted_loc <- Lo3c$fitted
fitted_lp <- Lpc3$mf[1]
fitted_loas <- Loasc3$fitted
fitted_Loly <- LoLyc3$y
pc2 <- ggplot() + geom_point(mapping = aes(x = X3, y = fitted_loc, color = "loess")) +
geom_point(mapping = aes(x = X3, y = unlist(fitted_lp), color = "locpol")) +
geom_point(mapping = aes(x = X3, y = fitted_loas, color = "loess.as")) +
geom_point(mapping = aes(x = X3, y = fitted_Loly, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "locpol" = "red", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "locpol", "loess.as","locpoly")) +
labs(y = "fit") +
ggtitle("Claw distribution, 10^2 data points")
pc2
```
For now, we cannot see the claws emerging from the distribution, which is due to the inaccuracy induced by the small sample size; still, the locpol function produces an $R^2$ of 1, meaning it reproduces the observations perfectly. The other functions underfit slightly, leading to smaller $R^2$ values. That is not necessarily a bad outcome: with so little data, we cannot tell a priori how closely we should fit it, and a perfect fit raises concerns about overfitting.
Let us look at what changes when we consider one order of magnitude more data:
```{r c3, echo = FALSE}
## 1001
#Distribution
Y4 <- dclaw(X4)
n4 <- rnorm(X4, sd = 0.1)
Y4 <- Y4 + n4
# Application
daf4 <- data.frame(X4,Y4)
Lo4c <- loess(Y4 ~ X4, data = daf4, family = "gaussian", method = "loess", span = 0.211) # Lo3c$residuals
Lpc4 <- locpol(Y4 ~ X4, data = daf4, bw = 0.25, kernel = gaussK, deg = 2) # Lpc3$residuals
Loasc4 <- loess.as(daf4$X4, daf4$Y4, degree = 2, criterion = "aicc", family = "gaussian") # Loasc3$residuals
LoLyc4 <- locpoly(daf4$X4, daf4$Y4, degree = 2, kernel = "normal", bandwidth = 0.25, gridsize = 1001) # only gives results
# R^2 coefficients
#Loess
SSres <- sum((Y4 - Lo4c$fitted)^2)
SStot <- sum((Y4 - mean(Y4))^2)
R2Lo4 <- 1 - SSres/SStot
#Locpol
SSres <- sum((Y4 - Lpc4$mf[1])^2)
SStot <- sum((Y4 - mean(Y4))^2)
R2Lp4 <- 1 - SSres/SStot
#Loess.as
SSres <- sum((Y4 - Loasc4$fitted)^2)
SStot <- sum((Y4 - mean(Y4))^2)
R2Loas4 <- 1 - SSres/SStot
#locpoly
SSres <- sum((Y4 - LoLyc4$y)^2)
SStot <- sum((Y4 - mean(Y4))^2)
R2Loly4 <- 1 - SSres/SStot
#plot
fitted_loc <- Lo4c$fitted
fitted_lp <- Lpc4$mf[1]
fitted_loas <- Loasc4$fitted
fitted_Loly <- LoLyc4$y
pc3 <- ggplot() + geom_point(mapping = aes(x = X4, y = fitted_loc, color = "loess")) +
geom_point(mapping = aes(x = X4, y = unlist(fitted_lp), color = "locpol")) +
geom_point(mapping = aes(x = X4, y = fitted_loas, color = "loess.as")) +
geom_point(mapping = aes(x = X4, y = fitted_Loly, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "locpol" = "red", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "locpol", "loess.as","locpoly")) +
labs(y = "fit") +
ggtitle("Claw distribution, 10^3 data points") # loess is under loess.as
pc3
```
Again, the locpol function interpolates the data perfectly, and we can indeed see the claws appearing on the plot. The other three functions are still underfitting, by quite a lot this time, with $R^2$ coefficients around 0.65. Interestingly, the performance of locpoly and locpol differs substantially even though we give them the same input parameters.
Furthermore, the two functions using the algorithmically computed span from loess.as underfit quite badly. This hints that this selection method is not sensitive enough to local variations, perhaps because the effective weighting induced by the span parameter decays too slowly.
The great performance of the locpol function ends there, however: a *maxevalpts* limit prevents us from using it on larger data sets. Let us look at the last two density plots together, as they contain similar results.
```{r c45, echo = FALSE}
#distribution
Y5 <- dclaw(X5)
n5 <- rnorm(X5, sd = 0.1)
Y5 <- Y5 + n5
# Application
daf5 <- data.frame(X5,Y5)
Lo5c <- loess(Y5 ~ X5, data = daf5, family = "gaussian", method = "loess", span = 0.05) # Lo3c$residuals
#Lpc5 <- locpol(Y5 ~ X5, data = daf3, bw = 0.25, kernel = gaussK, deg = 2) # Lpc3$residuals
Loasc5 <- loess.as(daf5$X5, daf5$Y5, degree = 2, criterion = "aicc", family = "gaussian") # Loasc3$residuals
LoLyc5 <- locpoly(daf5$X5, daf5$Y5, degree = 2, kernel = "normal", bandwidth = 0.25, gridsize = 10001) # only gives results
# R^2 Coefficients
#Loess
SSres <- sum((Y5 - Lo5c$fitted)^2)
SStot <- sum((Y5 - mean(Y5))^2)
R2Lo5 <- 1 - SSres/SStot
#Loess.as
SSres <- sum((Y5 - Loasc5$fitted)^2)
SStot <- sum((Y5 - mean(Y5))^2)
R2Loas5 <- 1 - SSres/SStot
#locpoly
SSres <- sum((Y5 - LoLyc5$y)^2)
SStot <- sum((Y5 - mean(Y5))^2)
R2Loly5 <- 1 - SSres/SStot
#plot
fitted_loc <- Lo5c$fitted
fitted_loas <- Loasc5$fitted
fitted_Loly <- LoLyc5$y
pc4 <- ggplot() + geom_point(mapping = aes(x = X5, y = fitted_loc, color = "loess")) +
geom_point(mapping = aes(x = X5, y = fitted_loas, color = "loess.as")) +
geom_point(mapping = aes(x = X5, y = fitted_Loly, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "loess.as","locpoly")) +
labs(y = "fit") +
ggtitle("Claw distribution, 10^4 data points") # loess is under loess.as
# X6
# Distribution
Y6 <- dclaw(X6)
n6 <- rnorm(X6, sd = 0.1)
Y6 <- Y6 + n6
# Application
daf6 <- data.frame(X6,Y6)
Lo6c <- loess(Y6 ~ X6, data = daf6, family = "gaussian", method = "loess", span = 0.05) # Lo3c$residuals with span obtained from loess.as
Loasc6 <- loess.as(daf6$X6, daf6$Y6, degree = 2, criterion = "aicc", family = "gaussian") # Loasc3$residuals
LoLyc6 <- locpoly(daf6$X6, daf6$Y6, degree = 2, kernel = "normal", bandwidth = 0.25, gridsize = 100001) # only gives results
# R^2 Coefficients
# Loess
SSres <- sum((Y6 - Lo6c$fitted)^2)
SStot <- sum((Y6 - mean(Y6))^2)
R2Lo6 <- 1 - SSres/SStot
#loess.as
SSres <- sum((Y6 - Loasc6$fitted)^2)
SStot <- sum((Y6 - mean(Y6))^2)
R2Loas6 <- 1 - SSres/SStot
#locpoly
SSres <- sum((Y6 - LoLyc6$y)^2)
SStot <- sum((Y6 - mean(Y6))^2)
R2Loly6 <- 1 - SSres/SStot
#plot
fitted_loc2 <- Lo6c$fitted
#fitted_lp <- Lpc5$mf[1]
fitted_loas2 <- Loasc6$fitted
fitted_Loly2 <- LoLyc6$y
pc5 <- ggplot() + geom_point(mapping = aes(x = X6, y = fitted_loc2, color = "loess")) +
geom_point(mapping = aes(x = X6, y = fitted_loas2, color = "loess.as")) +
geom_point(mapping = aes(x = X6, y = fitted_Loly2, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "loess.as","locpoly")) +
labs(y = "fit") +
ggtitle("Claw distribution, 10^5 data points") # loess is under loess.as
# at this point basically, loess does the same as loess.as, loess.as is great, locpoly underfits a lot (too big bw most likely)
pc4
pc5
```
On the above two plots, we see that loess and loess.as have shaped the target claw function quite nicely, which is appropriate, since we noised the data with a mean-zero Gaussian. For reference, here is the data set used for the last computations (size $10^5$):
```{r claw, echo = FALSE}
plot(X6,Y6,main = "noised claw distribution on 10^5 points")
```
Now, let us look at the $R^2$ coefficients across the data set sizes, as a table and a plot:
```{r r2c, echo = FALSE}
tab <- matrix(c(R2Lo3, R2Lp3, R2Loas3, R2Loly3, R2Lo4, R2Lp4, R2Loas4, R2Loly4,R2Lo5, "-", R2Loas5, R2Loly5, R2Lo6, "-", R2Loas6, R2Loly6), ncol=4, byrow=TRUE)
colnames(tab) <- c('Loess','locpol','loess.as',"locpoly")
rownames(tab) <- c('10^2','10^3','10^4',"10^5")
tab <- as.table(tab)
# plotting r^2 coefficients
locr2 <- c(R2Lo3,R2Lo4,R2Lo5,R2Lo6)
lpr2 <- c(R2Lp3,R2Lp4,0,0)
loasr2 <- c(R2Loas3,R2Loas4,R2Loas5,R2Loas6)
locyr2 <- c(R2Loly3,R2Loly4,R2Loly5,R2Loly6)
X <- c(10^2,10^3,10^4,10^5)
pr2c <- ggplot() + geom_point(mapping = aes(x = log10(X), y = locr2, color = "loess")) +
geom_point(mapping = aes(x = log10(X), y = lpr2, color = "locpol")) +
geom_point(mapping = aes(x = log10(X), y = loasr2, color = "loess.as")) +
geom_point(mapping = aes(x = log10(X), y = locyr2, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "locpol" = "red", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "locpol", "loess.as","locpoly")) +
labs(y = "R2") +
ggtitle("R2 coefficients applied to claw distribution, for different functions, and data size")
tab
pr2c
```
Looking first at the table, we see that loess and loess.as behave identically when given the same parameters, so one can be exchanged for the other, results-wise, depending on whether the desired span is already known. Of course, computing this parameter takes time: leave-one-out cross-validation has $O(n^2)$ complexity, reduced to $O(Kn)$ for K-fold cross-validation, where $n$ is the sample size. For the two bandwidth-based functions, locpol seems, whenever applicable, a better choice than locpoly, as it yields better $R^2$ values altogether. Some caution is needed when interpreting this, however: a coefficient of determination equal to one means the noisy observations are interpolated perfectly, which indicates overfitting rather than good recovery of the underlying curve. In this case it is nevertheless preferable, as locpol is the only function that shaped the existing claws on the small data sets.
Looking at the plot, we see that loess and loess.as actually increase their coefficients of determination as the sample size grows. It seems a lot of data is needed to "convince" the loess.as algorithm to reduce its span parameter enough that it no longer underfits. They eventually obtain better results than locpoly, but this depends entirely on the bandwidth parameter, which we would normally tune with a selection algorithm. As an example, the tendency to underfit disappears, as expected, when the bandwidth is reduced, here by dividing it by 10:
```{r locpoly, echo = FALSE}
# refit locpoly with a 10x smaller bandwidth on the 10^4-point claw data set
LoLyc5 <- locpoly(daf5$X5, daf5$Y5, degree = 2, kernel = "normal", bandwidth = 0.025, gridsize = 10001)
# pc4 looks these variables up in the global environment when it is printed,
# so updating them here makes the re-printed plot show the new locpoly fit
fitted_loc <- Lo5c$fitted
fitted_loas <- Loasc5$fitted
fitted_Loly <- LoLyc5$y
pc4
```
Moving now to a different family of data sets, we generate a Gaussian density to which we add a small sinusoidal component, aiming for softer waves than the claw function. Here we expect the functions to perform better, as the local variations should be smoother. We noise the data identically to the claw data sets. This time, we use a smaller bandwidth for locpoly on the larger data sets: the previous bandwidth was clearly too large for them, and a higher density of points on the x-axis means more observations fall into each local neighbourhood, so this reduction is reasonable.
With this in mind, we once again look at the results, starting with $10^2$ points and moving up to $10^5$.
First, let us look at the original curve we are trying to estimate, on $10^4$ points and without noise:
```{r target, echo = FALSE}
Y5 <- dnorm(X5) # normal distribution
s5 <- 1/10*sin(4*X5) # sinusoidal component
Y5 <- Y5 + s5
plot(X5,Y5, main = "Gaussian distribution with sinusoidal component")
```
As we can see, the target is still wiggly, but its second derivative varies more slowly than in the previous example. Let us see how the different functions perform:
```{r s2, echo = FALSE}
#101:
Y3 <- dnorm(X3) # normal distribution
s3 <- 1/10*sin(4*X3) # sinusoidal noise
n3 <- rnorm(X3, sd = 0.05) # gaussian noise
Y3 <- Y3 + s3 + n3
# applying the functions:
daf3 <- data.frame(X3,Y3)
Lo3c <- loess(Y3 ~ X3, data = daf3, family = "gaussian", method = "loess", span = 0.587) # Lo3c$residuals
Lpc3 <- locpol(Y3 ~ X3, data = daf3, bw = 0.25, kernel = gaussK, deg = 2) # Lpc3$residuals
Loasc3 <- loess.as(daf3$X3, daf3$Y3, degree = 2, criterion = "aicc", family = "gaussian") # Loasc3$residuals
LoLyc3 <- locpoly(daf3$X3, daf3$Y3, degree = 2, kernel = "normal", bandwidth = 0.25, gridsize = 101) # only gives results
# R^2 coefficients
#Loess
SSres <- sum((Y3 - Lo3c$fitted)^2)
SStot <- sum((Y3 - mean(Y3))^2)
R2Lo3 <- 1 - SSres/SStot
#Locpol
SSres <- sum((Y3 - Lpc3$mf[1])^2)
SStot <- sum((Y3 - mean(Y3))^2)
R2Lp3 <- 1 - SSres/SStot
#Loess.as
SSres <- sum((Y3 - Loasc3$fitted)^2)
SStot <- sum((Y3 - mean(Y3))^2)
R2Loas3 <- 1 - SSres/SStot
#locpoly
SSres <- sum((Y3 - LoLyc3$y)^2)
SStot <- sum((Y3 - mean(Y3))^2)
R2Loly3 <- 1 - SSres/SStot
#plot
fitted_loc <- Lo3c$fitted
fitted_lp <- Lpc3$mf[1]
fitted_loas <- Loasc3$fitted
fitted_Loly <- LoLyc3$y
pc2 <- ggplot() + geom_point(mapping = aes(x = X3, y = fitted_loc, color = "loess")) +
geom_point(mapping = aes(x = X3, y = unlist(fitted_lp), color = "locpol")) +
geom_point(mapping = aes(x = X3, y = fitted_loas, color = "loess.as")) +
geom_point(mapping = aes(x = X3, y = fitted_Loly, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "locpol" = "red", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "locpol", "loess.as","locpoly")) +
labs(y = "fit") +
ggtitle("Gaussian with sinusoidal noise function, 10^2 data points")
pc2
```
On the first graph, we can see that the span-based approaches underfit the curve quite strongly, the locpoly function gives a good idea of the underlying curve, and the locpol function, while a bit too wiggly, also returns good results. This is similar to what happened in the claw distribution case.
Let us observe how the regressions evolve as the number of data points increases:
```{r ps3, echo = FALSE}
## 1001
#Distribution
Y4 <- dnorm(X4) # normal distribution
s4 <- 1/10*sin(4*X4) # sinusoidal component
n4 <- rnorm(X4, sd = 0.05) # gaussian noise
Y4 <- Y4 + s4 + n4
# Application
daf4 <- data.frame(X4,Y4)
Lo4c <- loess(Y4 ~ X4, data = daf4, family = "gaussian", method = "loess", span = 0.211) # Lo3c$residuals
Lpc4 <- locpol(Y4 ~ X4, data = daf4, bw = 0.25, kernel = gaussK, deg = 2) # Lpc3$residuals
Loasc4 <- loess.as(daf4$X4, daf4$Y4, degree = 2, criterion = "aicc", family = "gaussian") # Loasc3$residuals
LoLyc4 <- locpoly(daf4$X4, daf4$Y4, degree = 2, kernel = "normal", bandwidth = 0.25, gridsize = 1001) # only gives results
# R^2 coefficients
#Loess
SSres <- sum((Y4 - Lo4c$fitted)^2)
SStot <- sum((Y4 - mean(Y4))^2)
R2Lo4 <- 1 - SSres/SStot
#Locpol
SSres <- sum((Y4 - Lpc4$mf[1])^2)
SStot <- sum((Y4 - mean(Y4))^2)
R2Lp4 <- 1 - SSres/SStot
#Loess.as
SSres <- sum((Y4 - Loasc4$fitted)^2)
SStot <- sum((Y4 - mean(Y4))^2)
R2Loas4 <- 1 - SSres/SStot
#locpoly
SSres <- sum((Y4 - LoLyc4$y)^2)
SStot <- sum((Y4 - mean(Y4))^2)
R2Loly4 <- 1 - SSres/SStot
#plot
fitted_loc <- Lo4c$fitted
fitted_lp <- Lpc4$mf[1]
fitted_loas <- Loasc4$fitted
fitted_Loly <- LoLyc4$y
pc3 <- ggplot() + geom_point(mapping = aes(x = X4, y = fitted_loc, color = "loess")) +
geom_point(mapping = aes(x = X4, y = unlist(fitted_lp), color = "locpol")) +
geom_point(mapping = aes(x = X4, y = fitted_loas, color = "loess.as")) +
geom_point(mapping = aes(x = X4, y = fitted_Loly, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "locpol" = "red", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "locpol", "loess.as","locpoly")) +
labs(y = "fit") +
ggtitle("Gaussian with sinusoidal noise function, 10^3 data points") # loess is under loess.as
pc3
```
Here, we still give locpol and locpoly the same input parameters, but once again their behaviour differs. The locpol function tends to overfit quite badly in this case, returning a noisy, rough estimate of the underlying curve, whereas locpoly and the two span-based approaches give better results. Also noticeable is the slight underfit of the loess function, possibly because we pass it a rounded value of the span parameter rather than the exact one (a difference of order $10^{-3}$).
```{r ps4, echo = FALSE}
# 10001
#distribution
Y5 <- dnorm(X5) # normal distribution
s5 <- 1/10*sin(4*X5) # sinusoidal component
n5 <- rnorm(X5, sd = 0.05) # gaussian noise
Y5 <- Y5 + s5 + n5
# Application
daf5 <- data.frame(X5,Y5)
Lo5c <- loess(Y5 ~ X5, data = daf5, family = "gaussian", method = "loess", span = 0.05) # Lo3c$residuals
#Lpc5 <- locpol(Y5 ~ X5, data = daf3, bw = 0.25, kernel = gaussK, deg = 2) # Lpc3$residuals
Loasc5 <- loess.as(daf5$X5, daf5$Y5, degree = 2, criterion = "aicc", family = "gaussian") # Loasc3$residuals
LoLyc5 <- locpoly(daf5$X5, daf5$Y5, degree = 2, kernel = "normal", bandwidth = 0.025, gridsize = 10001) # only gives results
# R^2 Coefficients
#Loess
SSres <- sum((Y5 - Lo5c$fitted)^2)
SStot <- sum((Y5 - mean(Y5))^2)
R2Lo5 <- 1 - SSres/SStot
#Loess.as
SSres <- sum((Y5 - Loasc5$fitted)^2)
SStot <- sum((Y5 - mean(Y5))^2)
R2Loas5 <- 1 - SSres/SStot
#locpoly
SSres <- sum((Y5 - LoLyc5$y)^2)
SStot <- sum((Y5 - mean(Y5))^2)
R2Loly5 <- 1 - SSres/SStot
#plot
fitted_loc <- Lo5c$fitted
fitted_loas <- Loasc5$fitted
fitted_Loly <- LoLyc5$y
pc4 <- ggplot() + geom_point(mapping = aes(x = X5, y = fitted_loc, color = "loess")) +
geom_point(mapping = aes(x = X5, y = fitted_loas, color = "loess.as")) +
geom_point(mapping = aes(x = X5, y = fitted_Loly, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "loess.as","locpoly")) +
labs(y = "fit") +
ggtitle("Gaussian with sinusoidal noise function, 10^4 data points") # loess is under loess.as
```
```{r ps5, echo = FALSE}
# X6
# Distribution
Y6 <- dnorm(X6) # normal distribution
s6 <- 1/10*sin(4*X6) # sinusoidal component
n6 <- rnorm(X6, sd = 0.05) # gaussian noise
Y6 <- Y6 + s6 + n6
# Application
daf6 <- data.frame(X6,Y6)
Lo6c <- loess(Y6 ~ X6, data = daf6, family = "gaussian", method = "loess", span = 0.05) # Lo3c$residuals with span obtained from loess.as
Loasc6 <- loess.as(daf6$X6, daf6$Y6, degree = 2, criterion = "aicc", family = "gaussian") # Loasc3$residuals
LoLyc6 <- locpoly(daf6$X6, daf6$Y6, degree = 2, kernel = "normal", bandwidth = 0.025, gridsize = 100001) # only gives results
# R^2 Coefficients
# Loess
SSres <- sum((Y6 - Lo6c$fitted)^2)
SStot <- sum((Y6 - mean(Y6))^2)
R2Lo6 <- 1 - SSres/SStot
#loess.as
SSres <- sum((Y6 - Loasc6$fitted)^2)
SStot <- sum((Y6 - mean(Y6))^2)
R2Loas6 <- 1 - SSres/SStot
#locpoly
SSres <- sum((Y6 - LoLyc6$y)^2)
SStot <- sum((Y6 - mean(Y6))^2)
R2Loly6 <- 1 - SSres/SStot
#plot
fitted_loc2 <- Lo6c$fitted
#fitted_lp <- Lpc5$mf[1]
fitted_loas2 <- Loasc6$fitted
fitted_Loly2 <- LoLyc6$y
pc5 <- ggplot() + geom_point(mapping = aes(x = X6, y = fitted_loc2, color = "loess")) +
geom_point(mapping = aes(x = X6, y = fitted_loas2, color = "loess.as")) +
geom_point(mapping = aes(x = X6, y = fitted_Loly2, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "loess.as","locpoly")) +
labs(y = "fit") +
ggtitle("CGaussian with sinusoidal noise function, 10^5 data points") # loess is under loess.as
# at this point basically, loess does the same as loess.as, loess.as is great, locpoly underfits a lot (too big bw most likely)
# fitting into a table
tab <- matrix(c(R2Lo3, R2Lp3, R2Loas3, R2Loly3, R2Lo4, R2Lp4, R2Loas4, R2Loly4,R2Lo5, "-", R2Loas5, R2Loly5, R2Lo6, "-", R2Loas6, R2Loly6), ncol=4, byrow=TRUE)
colnames(tab) <- c('Loess','locpol','loess.as',"locpoly")
rownames(tab) <- c('10^2','10^3','10^4',"10^5")
tab <- as.table(tab)
# plotting r^2 coefficients
locr2 <- c(R2Lo3,R2Lo4,R2Lo5,R2Lo6)
lpr2 <- c(R2Lp3,R2Lp4,0,0)
loasr2 <- c(R2Loas3,R2Loas4,R2Loas5,R2Loas6)
locyr2 <- c(R2Loly3,R2Loly4,R2Loly5,R2Loly6)
X <- c(10^2,10^3,10^4,10^5)
pr2c <- ggplot() + geom_point(mapping = aes(x = log10(X), y = locr2, color = "loess")) +
geom_point(mapping = aes(x = log10(X), y = lpr2, color = "locpol")) +
geom_point(mapping = aes(x = log10(X), y = loasr2, color = "loess.as")) +
geom_point(mapping = aes(x = log10(X), y = locyr2, color = "locpoly")) +
scale_color_manual(name = "Regression Function",
values = c( "loess" = "blue", "locpol" = "red", "loess.as" = "green", "locpoly" = "purple"),
labels = c("loess", "locpol", "loess.as","locpoly")) +
labs(y = "R2") +
ggtitle("R2 coefficients applied to gaussian distribution with sinusoidal and gaussian noise, for different functions, and data size")
pc4
pc5
```
With this many data points and identically distributed noise, all three functions now give good results. The regressions become smoother as the number of data points increases, most likely because the impact of the noise averages out, a direct consequence of the law of large numbers. Finally, let us look at the evolution of the $R^2$ coefficients (even though this is a weaker indicator than inspecting the fitted curves).
```{r final results, echo = FALSE}
tab
pr2c
```
As we can see, when the number of data points increases, the span and bandwidth methods give similar results; for smaller samples, the bandwidth approach is preferable. Again, we can see a clear overfitting tendency from the locpol function: both of its coefficients of determination equal one, meaning it interpolated every point perfectly, which can be detrimental when working with noisy data.
In summary, we observed two approaches to local regression, one using a span parameter and one using a bandwidth, each represented by two functions. The span approach tends to underfit the data, especially on small data sets, but works fine with larger samples. The bandwidth approach does not necessarily have this problem, but the two considered functions behave quite differently even when given identical parameters: the locpol function is much more sensitive to small fluctuations in the data, which helps it detect patterns from relatively small samples (here, $10^2$ to $10^3$ points), but makes it overfit quite badly on noisy data. The locpoly function, on the contrary, tends to underfit on smaller samples (given the same bandwidth as locpol) but works well for bigger data sets. This is a convenient outcome overall, as locpol cannot be applied to very large samples anyway.
A remaining concern is the difference between the results of the two functions that take a bandwidth as input. Nothing in their documentation explains why their outputs differ so much, as both claim to fit a local polynomial given a kernel and a bandwidth. A possible explanation is a difference in the fitting criterion, which could be least squares, AIC-based, etc.
## General Results and Discussion
Based on the preceding parts, here are the main takeaways:
+ First, the span approach (loess and loess.as) consistently underfits small data sets, even though the hyperparameter is chosen algorithmically; both functions work well for larger data sets. This may be because the selection algorithm in loess.as does not produce optimal results, which could be checked by computing the span parameter externally. When given the same parameters, both functions produce identical results, so loess is preferable when we already have a prior idea of the span to use, whereas loess.as computes a candidate when we do not.
+ Regarding the bandwidth approach (locpol and locpoly), the two functions yield different results when given identical inputs. We did not choose the bandwidth optimally here, but both work well when given a reasonable value. Locpol is more sensitive to small variations in the data, which makes it prone to overfitting and unsuitable for larger data sets. Locpoly, on the other hand, underfits slightly when given the same bandwidth as locpol, but this can be fixed by reducing the bandwidth, as shown in the previous section. In short, locpol is preferable when the data is not too noisy and we want to detect finer deviations; locpoly is preferable otherwise.
Comparing the two approaches directly, the span approach is worse for small data sets (it tends to underfit), while both approaches give similar results on large data sets.
+ Computationally, loess.as is slower than its counterparts because it computes its hyperparameter internally. A fair comparison would run a bandwidth or span selection algorithm, such as K-fold cross-validation, and include it in the timing of the other functions to put them on even ground. Loess and locpol are faster, and locpoly is without a doubt the most efficient, partly because it only outputs the fitted (x, y) coordinates and nothing else, unlike the others.
Overall, locpoly seems to be a good choice in terms of both accuracy and time, although a bandwidth must be chosen a priori. Loess and loess.as are very similar, so the choice between them comes down to whether we already have a good span value in mind. Locpol sits at roughly the same level of performance: it is slower than locpoly and does not scale to large data sets, but it gives interesting results and is more sensitive to local variations than locpoly.
## Conclusion
When the goal is simply to apply a local regression, the locpoly function from the KernSmooth package is a great candidate, but the other functions are also decent choices thanks to their versatility (better performance for some specific inputs) and their generally accurate results.
To look further into the matter, one could explore the following ideas:
+ Run a K-fold algorithm for choosing the bandwidth/span externally, then compare the resulting run time with that of loess.as (which computes it internally).
+ Evaluate the performance of the different functions on data sets with varying x-axis density. In our analysis, the data points were evenly spaced over the interval, but uneven spacing can have an impact, especially on the bandwidth approach.
+ Check the source code of the locpol and locpoly functions to examine why their results differ.
+ Instead of the $R^2$ criterion, compute the residuals of the fitted values with respect to the function before it was noised (i.e. the underlying curve) to assess performance more formally. The resulting values would worsen as soon as the considered algorithm starts to underfit or overfit, making this a relevant complementary criterion to visually assessing the fitted curve; a minimal sketch of this idea is given below.
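As a rough sketch of this last idea (our own illustration, using mean squared error against the noiseless curve; `fit` and `x` in the commented example are hypothetical placeholders):
```{r truth-mse-sketch, eval=FALSE}
# Sketch of the proposed criterion: measure fit quality against the underlying
# noiseless curve rather than against the noisy observations.
mse_to_truth <- function(fitted, x_eval, truth_fun) {
  mean((fitted - truth_fun(x_eval))^2)
}

# Example (for a loess fit on claw data observed at points x):
# mse_to_truth(fit$fitted, x, dclaw)
```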
## References