---
title: "Classification Machine Learning Predicting Liver Disease"
author: "Dave Downing"
date: "5/6/2020"
output:
  pdf_document:
    toc: true
    toc_depth: 4
    number_sections: true
header-includes:
  - \usepackage{float}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
options(tinytex.verbose = TRUE)
```
```{r, echo=FALSE, include=FALSE}
set.seed(1147)
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
if(!require(ROCR)) install.packages("ROCR")
if(!require(PRROC)) install.packages("PRROC")
if(!require(ROSE)) install.packages("ROSE")
if(!require(xgboost)) install.packages("xgboost")
if(!require(kableExtra)) install.packages("kableExtra")
if(!require(gam)) install.packages("gam")
if(!require(splines)) install.packages("splines")
if(!require(foreach)) install.packages("foreach")
if(!require(ggthemes)) install.packages("ggthemes")
if(!require(DMwR)) install.packages("DMwR")
if(!require(caretEnsemble)) install.packages("caretEnsemble")
if(!require(reshape2)) install.packages("reshape2")
library(tidyverse)
library(caret)
library(ROCR)
library(PRROC)
library(ROSE)
library(xgboost)
library(kableExtra)
library(gam)
library(splines)
library(foreach)
library(ggthemes)
library(DMwR)
library(caretEnsemble)
library(reshape2)
load("rda/liver.rda")
```
\newpage
# Introduction
The purpose of this project is to predict the presence of liver disease from a given set of data. The data set is the Indian Liver Patient data set, made available through the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset). There are a total of 583 rows in the data set and these 11 columns:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
tibble(Columns=names(liver)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
The first 10 columns are the variables used for predicting whether or not a patient has liver disease. They include age, gender, and a list of key liver enzymes and proteins. The Dataset column denotes the outcome and indicates whether or not the patient has liver disease, with 1 indicating liver disease and 2 indicating no liver disease. In the graph below these outcomes are labeled Disease and Healthy:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
liver %>% group_by(Gender, Age_By_Decade=cut(Age,c(0,10,20,30,40,50,60,70,80,90,100),
labels = seq(10,100,10)),
Dataset=ifelse(Dataset=="1","Disease","Healthy")) %>%
ggplot(aes(Age_By_Decade, fill=Dataset)) +
geom_bar(position=position_dodge()) +
facet_wrap(~ Gender) +
ylab("Frequency")
```
This data set is imbalanced, containing 416 records of patients with liver disease and only 167 records of patients without it. The goal of this project is to see whether these variables carry predictive power, so that early signs of liver disease could be identified and better health outcomes provided for these patients. Overall accuracy is the primary metric used to judge the models, but other metrics such as AUPRC are also examined to determine how effectively the models pick up on the smaller class of patients without liver disease.
The key steps taken in pursuit of this goal were to inspect, clean, and transform the data and then run it through a series of machine learning algorithms to try to improve the accuracy of liver disease prediction. The algorithms were also combined in an ensemble to see if that would help further. Since the data set is imbalanced towards liver disease patients, four sampling methods (up-sampling, down-sampling, ROSE and SMOTE) were also applied to see whether they would improve prediction accuracy.
Ultimately the size of the data set and the imbalance towards patients with the disease made it very difficult to build a satisfactory model. Generating overall accuracy greater than simply predicting that everyone had the disease was very difficult. While some success was achieved, with some models producing an AUPRC greater than **0.85** along with much improved specificity, this came at the overall expense of the sensitivity metric, which lowered accuracy. Given the importance of detecting the disease this is not an ideal result. However, multiple models did agree on the importance of total bilirubin. Expanding the data set significantly, specifically including more patients without liver disease, and re-running the models could substantially improve the models and help determine whether total bilirubin and the other predictors can aid in screening for potential liver disease.
# Methods / Analysis
In order to avoid over-fitting our model, the data was split into a training set, **train**, and a test set, **test**. Only the training set was used to create the models. The test set was used exclusively for validation of the model results as a control for the predictions.
## Data Processing
### Duplicates
In data science duplicates are a very common problem, as they can arise for a multitude of reasons. There are sometimes issues with the computer systems populating the data, and humans can also duplicate data by accident. Examining our data we can see there is a small number of duplicates in the data set:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
liver %>% group_by(duplicated(liver)) %>% summarize(Count = n()) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
\newpage
Since these records should not be counted more than once as they can adversely affect our models, they must be removed from the data set.
```{r}
liver <- unique(liver)
```
### Check for Missing Values
It is also very common in data science to have fields with NULL values. Searching all of the columns for missing values returns the following results:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
sapply(liver, function(x) sum(is.na(x))) %>%
kable(col.names = c("Missing Values")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
The data set is in very good shape, with only the Albumin_and_Globulin_Ratio column missing any values at all, and only 4 values missing at that. Since so few records are affected, these rows will simply be removed rather than imputed, keeping the data set fully accurate without any statistical replacement.
```{r}
liver <- na.omit(liver)
```
### Data Types
As discussed above, in this data set a 1 indicates the patient does have liver disease, while a 2 denotes no disease. Converting the 2 to a 0 makes much more sense, as the outcome then becomes a binary variable.
Likewise, there is a Gender column with values of Male and Female. This variable was also encoded into a binary to run through the various machine learning algorithms: Male is set to 1 and Female is set to 0.
```{r}
liver$Dataset <- factor(ifelse(liver$Dataset==2,0,1), levels = c(1,0))
liver$Gender <- factor(ifelse(liver$Gender=="Male",1,0), levels = c(1,0))
```
\newpage
### Log Transformation
One of the other common problems in data science is that the distributions of certain variables are often highly skewed. When this occurs, a simple log transformation can be applied to reduce the range of these predictors and limit their effect on the model. Below is a plot of all of the non-binary predictor variables so that their distributions can be examined:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
d <- melt(liver)
ggplot(d,aes(x = value)) +
geom_histogram(fill="blue") +
ylab ("Frequency") +
facet_wrap(~variable,scales = "free_x")
```
The plot shows that Total_Bilirubin, Direct_Bilirubin, Alkaline_Phosphotase, Alamine_Aminotransferase and Aspartate_Aminotransferase will all benefit from a log transformation.
```{r}
livertrans <- liver
livertrans$Total_Bilirubin <- log(livertrans$Total_Bilirubin +1)
livertrans$Direct_Bilirubin <- log(livertrans$Direct_Bilirubin +1)
livertrans$Alkaline_Phosphotase <- log(livertrans$Alkaline_Phosphotase +1)
livertrans$Alamine_Aminotransferase <- log(livertrans$Alamine_Aminotransferase +1)
livertrans$Aspartate_Aminotransferase <- log(livertrans$Aspartate_Aminotransferase +1)
```
After making the change, the histograms can be re-plotted to see the impact:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
d1 <- melt(livertrans)
ggplot(d1,aes(x = value)) +
geom_histogram(fill="blue") +
ylab ("Frequency") +
facet_wrap(~variable,scales = "free_x")
```
\newpage
### Correlation
Lastly before we partition the data we can look at the correlations between all of the predictor variables to see if there are any highly correlated predictors. Here is a heat map of the correlations:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
lt <- livertrans[, (3:10)]
lt <- sapply(lt, as.numeric)
lower <- function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
upper <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
reorder_cormat <- function(cormat){
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
corr_matrix <- round(cor(lt),2)
corr_matrix <- reorder_cormat(corr_matrix)
upper_tri <- upper(corr_matrix)
melt_corr_matrix <- melt(upper_tri, na.rm = TRUE)
ggplot(melt_corr_matrix,aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 1,
size = 9, hjust = 1), axis.text.y = element_text(size = 9), axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank()) +
coord_fixed()
```
The map makes it very easy to see that total and direct bilirubin are very highly correlated. Looking at just those two values we can see how correlated they are:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
cor(lt[,c(1,2)]) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
As they are very close to being perfectly correlated, one of these variables should be removed from the algorithms. Below is the average correlation of each of these two variables with all of the others. Since too much correlation between predictors is not ideal, the variable with the lower average correlation will be kept and the other removed:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
cor <- cor(lt)
rowMeans(cor[1:2,]) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
Both are very close but direct bilirubin will be removed as a predictor from the machine learning models since it is just a touch more highly correlated with the other variables than total bilirubin.
```{r}
livertrans <- select (livertrans,-c(Direct_Bilirubin))
```
## Modeling
Now that the data has been processed for the machine learning algorithms the data can be split into a training set (80%) and a test set (20%) which will be used for validation purposes only.
```{r, message=FALSE, warning=FALSE}
set.seed(1147, sample.kind = "Rounding")
test_index <- createDataPartition(y = livertrans$Dataset, times = 1,
p = 0.2, list = FALSE)
train <- livertrans[-test_index,]
test <- livertrans[test_index,]
```
### Metrics
To determine the performance of each machine learning algorithm, the following metrics were used.
**Accuracy**
This is a measure of how many of the algorithm's predictions were correct, whether positive or negative.
**Sensitivity**
This is a measure of the ability to correctly identify those with the disease (true positives).
**Specificity**
This is a measure of the ability to correctly identify those without the disease (true negatives).
**F1 Score**
The F1 score is defined as the harmonic mean of precision and recall (sensitivity). Precision is the ratio of correctly predicted positive observations to the total predicted positive observations; said more simply, of everyone predicted to have the disease, how many actually have it? Combining the two as F1 = 2 x (precision x recall) / (precision + recall) makes this a good metric, since it accounts for both false positives and false negatives.
**AUC**
AUC stands for area under the ROC curve. It provides a measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
**AUPRC**
AUPRC stands for area under the precision-recall curve. PR curves examine the trade-off between precision (the proportion of positively labeled examples that are truly positive) and recall (sensitivity, the proportion of actual positives that are correctly classified) across thresholds. PR analysis is generally preferred to ROC analysis when there is a large imbalance in the data.
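To make these definitions concrete, below is a minimal sketch of how the first few metrics can be computed by hand from a confusion matrix. The `truth` and `pred` vectors are made up purely for illustration and are not part of the analysis; `caret::confusionMatrix()`, used throughout the models below, reports the same quantities, while AUC and AUPRC are computed with the ROCR and PRROC packages.
```{r, eval=FALSE}
# Illustrative only: hand-computed metrics on made-up predictions,
# with "1" (disease) treated as the positive class.
truth <- factor(c(1, 1, 1, 1, 0, 0, 1, 0), levels = c(1, 0))
pred  <- factor(c(1, 0, 1, 1, 0, 1, 1, 0), levels = c(1, 0))
tp <- sum(pred == "1" & truth == "1")  # true positives
fp <- sum(pred == "1" & truth == "0")  # false positives
tn <- sum(pred == "0" & truth == "0")  # true negatives
fn <- sum(pred == "0" & truth == "1")  # false negatives
accuracy    <- (tp + tn) / length(truth)
sensitivity <- tp / (tp + fn)          # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
# caret reports the same quantities in a single call:
confusionMatrix(pred, truth)
```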
### Naive Guessing Everyone
Since this particular data set is skewed toward people with the disease, the simplest baseline for predicting liver disease is to guess that every patient is positive for the disease; this serves as a control for the machine learning algorithms. The results of doing so are below:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
predictions <- factor(rep(1,times=length(test$Dataset)),levels=c(1,0))
predictions1 <- prediction(as.numeric(predictions) , test$Dataset)
cm <- confusionMatrix(predictions, test$Dataset)
auc <- performance(predictions1, "auc")
AUPRC <- performance(predictions1, "prec", "rec")
aucv <- auc@y.values[[1]]
results <- tibble(Method="Guess",
Accuracy=cm$overall["Accuracy"],
Sensitivity= cm$byClass["Sensitivity"],
Specificity= cm$byClass["Specificity"],
F1= cm$byClass["F1"],
AUC=aucv,
AUPRC=0)
results %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
Since 71% of the data set has the disease, the accuracy is of course 71%. The sensitivity is perfect because we guessed 1 for every single record. Correspondingly, the specificity is 0 since we didn't guess that a single person was free of liver disease. These will be the benchmarks for each of the machine learning models to compare against to see how they perform versus this baseline.
### Random Forest
Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest makes a class prediction, and the class with the most votes becomes the model's prediction. Due to these features and the algorithm's popularity, this was the first algorithm selected to train on the data. Ten-fold cross-validation was repeated ten times, and preprocessing was used to center and scale the data.
Here are the results:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
control <- trainControl(method="repeatedcv", number = 10, repeats=10, allowParallel = TRUE)
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
model_rf <- train(Dataset ~ .,
method = "rf",
data = train,
trControl = control,
preProcess = c("scale", "center"),
verboseIter = TRUE)
predictions <- predict(model_rf, test)
predictions1 <- prediction(as.numeric(predictions) , test$Dataset)
cm <- confusionMatrix(predictions, test$Dataset)
auc <- performance(predictions1, "auc")
AUPRC <- performance(predictions1, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = predictions[test$Dataset == 1],
scores.class1 = predictions[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="RF",
Accuracy=cm$overall["Accuracy"],
Sensitivity= cm$byClass["Sensitivity"],
Specificity= cm$byClass["Specificity"],
F1= cm$byClass["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
The random forest was able to generate an AUPRC of **0.712.** The overall accuracy, however, is actually lower than just guessing that everyone has the disease. The specificity did improve from the 0 generated by the guessing model, but it is still quite low at **0.121**.
Here are the most important variables from the random forest. Total bilirubin, as was noted in the introduction, is consistently one of the key variables across all of the models:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
imprf <- varImp(model_rf)
t <- tibble(term = rownames(imprf$importance),
importance = imprf$importance$Overall) %>%
mutate(rank = rank(-importance)) %>% arrange(desc(importance))
t %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
#### Over (Up) Sampling
A useful feature of training these models with caret is the option to add a sampling component to help control for unbalanced data sets. With over-sampling, samples from the class with fewer records are randomly duplicated so that the number of samples in each class matches. While this approach avoids losing information, it does risk over-fitting the model, so the sampling is applied within each cross-validation fold independently to get an honest estimate of model performance. The results of this approach are shown below, after a brief illustrative sketch:
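As a standalone illustration of what up-sampling does (the model itself applies `sampling = "up"` inside `trainControl` so the duplication happens within each fold), caret's `upSample()` helper can be run directly on the training data; the sketch below is illustrative only and is not part of the model pipeline.
```{r, eval=FALSE}
# Illustrative only: duplicate minority-class rows until the classes balance,
# then compare class counts before and after.
set.seed(1147)
up_train <- upSample(x = dplyr::select(train, -Dataset),
                     y = train$Dataset,
                     yname = "Dataset")
table(train$Dataset)     # original, imbalanced counts
table(up_train$Dataset)  # balanced counts after up-sampling
```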
```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
control <- trainControl(method="repeatedcv",
number = 10,
repeats=10,
allowParallel = TRUE,
sampling = "up")
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
model_rf_up <- train(Dataset ~ .,
method = "rf",
data = train,
trControl = control,
preProcess = c("scale", "center"),
verboseIter = TRUE
)
predictions <- predict(model_rf_up, test)
predictions1 <- prediction(as.numeric(predictions) , test$Dataset)
cm <- confusionMatrix(predictions, test$Dataset)
auc <- performance(predictions1, "auc")
AUPRC <- performance(predictions1, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = predictions[test$Dataset == 1],
scores.class1 = predictions[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="RF - Up",
Accuracy=cm$overall["Accuracy"],
Sensitivity= cm$byClass["Sensitivity"],
Specificity= cm$byClass["Specificity"],
F1= cm$byClass["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
And these are the most important variables from the over-sampling random forest:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
imprfu <- varImp(model_rf_up)
t1 <- tibble(term = rownames(imprfu$importance),
importance = imprfu$importance$Overall) %>%
mutate(rank = rank(-importance)) %>% arrange(desc(importance))
t1 %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
With the over-sampling approach, the overall AUPRC is less than the original random forest at **0.647.** The effect of the sampling can be clearly seen looking at the sensitivity and specificity. The specificity markedly increases from **0.121** to **0.364**, but this was at the expense of the sensitivity which correspondingly decreased from **0.877** to **0.815.**
#### Under (Down) Sampling
With under-sampling, a subset of samples from the class with more instances is randomly selected to match the number of samples in the smaller class. The main concern with under-sampling is that we potentially lose information from the removed samples. Just as with over-sampling, the sampling is applied within each cross-validation fold independently to control for this. The results of this method are shown below, after another short sketch:
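Again purely as an illustration (the model itself uses `sampling = "down"` in `trainControl`), caret's `downSample()` helper shows the effect of down-sampling on the class counts:
```{r, eval=FALSE}
# Illustrative only: randomly drop majority-class rows until the classes balance.
set.seed(1147)
down_train <- downSample(x = dplyr::select(train, -Dataset),
                         y = train$Dataset,
                         yname = "Dataset")
table(down_train$Dataset)  # both classes now at the minority-class count
```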
```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
control <- trainControl(method="repeatedcv",
number = 10,
repeats=10,
allowParallel = TRUE,
sampling = "down")
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
model_rf_down <- train(Dataset ~ .,
method = "rf",
data = train,
trControl = control,
preProcess = c("scale", "center"),
verboseIter = TRUE
)
predictions <- predict(model_rf_down, test)
predictions1 <- prediction(as.numeric(predictions) , test$Dataset)
cm <- confusionMatrix(predictions, test$Dataset)
auc <- performance(predictions1, "auc")
AUPRC <- performance(predictions1, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = predictions[test$Dataset == 1],
scores.class1 = predictions[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="RF - Down",
Accuracy=cm$overall["Accuracy"],
Sensitivity= cm$byClass["Sensitivity"],
Specificity= cm$byClass["Specificity"],
F1= cm$byClass["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
And here are the variable importance results:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
imprfd <- varImp(model_rf_down)
t2 <- tibble(term = rownames(imprfd$importance),
importance = imprfd$importance$Overall) %>%
mutate(rank = rank(-importance)) %>% arrange(desc(importance))
t2 %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
Under-sampling really shows the effect that sampling can have on an unbalanced data set, as the specificity jumps all the way up to **0.758.** Unfortunately, the sensitivity falls to **0.667**, and the AUPRC also falls considerably, to **0.599.**
#### ROSE Sampling
The ROSE sampling approach is a hybrid method: artificial, balanced samples are generated according to a smoothed bootstrap approach to help deal with unbalanced data. The results from this method are shown below, following a brief illustrative sketch:
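For illustration (the model itself uses `sampling = "rose"` in `trainControl`), the `ROSE()` function from the ROSE package can be applied directly to the training data to see the kind of balanced synthetic sample it generates; the call below is a sketch only, with an arbitrary seed.
```{r, eval=FALSE}
# Illustrative only: generate a synthetic, roughly balanced sample with ROSE
# and inspect the resulting class counts.
rose_train <- ROSE(Dataset ~ ., data = as.data.frame(train), seed = 1147)$data
table(rose_train$Dataset)
```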
```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
control <- trainControl(method="repeatedcv",
number = 10,
repeats=10,
allowParallel = TRUE,
sampling = "rose")
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
model_rf_rose <- train(Dataset ~ .,
method = "rf",
data = train,
trControl = control,
preProcess = c("scale", "center"),
verboseIter = TRUE
)
predictions <- predict(model_rf_rose, test)
predictions1 <- prediction(as.numeric(predictions) , test$Dataset)
cm <- confusionMatrix(predictions, test$Dataset)
auc <- performance(predictions1, "auc")
AUPRC <- performance(predictions1, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = predictions[test$Dataset == 1],
scores.class1 = predictions[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="RF - ROSE",
Accuracy=cm$overall["Accuracy"],
Sensitivity= cm$byClass["Sensitivity"],
Specificity= cm$byClass["Specificity"],
F1= cm$byClass["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
And here are the variable importance results, with total bilirubin moving into the top spot:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
imprfr <- varImp(model_rf_rose)
t3 <- tibble(term = rownames(imprfr$importance),
importance = imprfr$importance$Overall) %>%
mutate(rank = rank(-importance)) %>% arrange(desc(importance))
t3 %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
The ROSE sampling approach had an even stronger effect on increasing the specificity (**0.818**), but also had the strongest effect on decreasing the sensitivity, which fell to **0.543.** The overall AUPRC (**0.621**) did increase relative to the under-sampling method, but is still considerably less than the over-sampling and original random forest models.
#### SMOTE Sampling
The SMOTE sampling approach is another hybrid method. A combination of over-sampling the minority class and under-sampling the majority class is used to try to achieve better classifier performance in AUC space; the over-sampling of the minority class is done by creating synthetic minority-class examples. The results from the random forest with SMOTE sampling are shown below, after one more short sketch:
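As with the other sampling methods, a standalone sketch (using DMwR's `SMOTE()` with example, untuned `perc.over`/`perc.under` values; the model itself relies on `sampling = "smote"` in `trainControl`) shows how a more balanced training set could be generated:
```{r, eval=FALSE}
# Illustrative only: synthesize minority-class examples and under-sample the
# majority class. The percentages below are example values, not tuned settings.
set.seed(1147)
smote_train <- SMOTE(Dataset ~ ., data = as.data.frame(train),
                     perc.over = 200, perc.under = 150)
table(smote_train$Dataset)
```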
```{r, echo=FALSE, message=FALSE, warning=FALSE}
control <- trainControl(method="repeatedcv",
number = 10,
repeats=10,
allowParallel = TRUE,
sampling = "smote")
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
model_rf_smote <- train(Dataset ~ .,
method = "rf",
data = train,
trControl = control,
preProcess = c("scale", "center"),
verboseIter = TRUE
)
predictions <- predict(model_rf_smote, test)
predictions1 <- prediction(as.numeric(predictions) , test$Dataset)
cm <- confusionMatrix(predictions, test$Dataset)
auc <- performance(predictions1, "auc")
AUPRC <- performance(predictions1, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = predictions[test$Dataset == 1],
scores.class1 = predictions[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="Random Forest - SMOTE",
Accuracy=cm$overall["Accuracy"],
Sensitivity= cm$byClass["Sensitivity"],
Specificity= cm$byClass["Specificity"],
F1= cm$byClass["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
And here are the variable importance results, with bilirubin again ranked number one:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
imprfs <- varImp(model_rf_smote)
t4 <- tibble(term = rownames(imprfs$importance),
importance = imprfs$importance$Overall) %>%
mutate(rank = rank(-importance)) %>% arrange(desc(importance))
t4 %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
The SMOTE sampling approach generated the highest accuracy of the random forest approaches at **0.702**, but unfortunately the AUPRC was the second lowest at **0.610.**
### Logistic Regression
Since the outcome is binary and there are relatively few features compared to the size of the data set, logistic regression was an obvious choice given its well-known and long-standing track record on this type of problem. The following output shows its performance:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
model_glm <- train(Dataset ~ ., method = "glm", data = train)
predictions <- predict(model_glm, test)
predictions1 <- prediction(as.numeric(predictions) , test$Dataset)
cm <- confusionMatrix(predictions, test$Dataset)
auc <- performance(predictions1, "auc")
AUPRC <- performance(predictions1, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = predictions[test$Dataset == 1],
scores.class1 = predictions[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="GLM",
Accuracy=cm$overall["Accuracy"],
Sensitivity= cm$byClass["Sensitivity"],
Specificity= cm$byClass["Specificity"],
F1= cm$byClass["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
The logistic regression generates an AUPRC higher than all but the original random forest model; it did, however, beat that model in overall accuracy. A closer examination of its variable importance will be presented in the results section.
### K-Nearest Neighbors
KNN was chosen because it is also very widely recognized and its approach differs from some of the others chosen so far: the algorithm assumes that similar observations exist in close proximity to one another, so a record is classified according to its nearest neighbors. A tune grid of seq(1,100,4) was used to optimize k. Below are the results of how KNN performed on this data set:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
control <- trainControl(method ="CV",
number = 10
)
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
model_knn <- train(Dataset ~ .,
method = "knn",
data = train,
trControl = control,
preProcess = c("scale", "center"),
tuneGrid = data.frame(k = seq(1,100,4))
)
predictions <- predict(model_knn, test)
predictions1 <- prediction(as.numeric(predictions) , test$Dataset)
cm <- confusionMatrix(predictions, test$Dataset)
auc <- performance(predictions1, "auc")
AUPRC <- performance(predictions1, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = predictions[test$Dataset == 1],
scores.class1 = predictions[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="KNN",
Accuracy=cm$overall["Accuracy"],
Sensitivity= cm$byClass["Sensitivity"],
Specificity= cm$byClass["Specificity"],
F1= cm$byClass["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
The KNN model bettered the AUC of the logistic regression and moved into second place behind the original random forest in AUPRC at **0.687.** Its accuracy of **0.711** is the best of the machine learning models; however, this only ties it with naively guessing that everyone has the disease. This is because the model leans very strongly towards predicting disease, with a sensitivity of **0.975** but a specificity not much better than our naive model at **0.061.**
### An Ensemble Approach
To see if significant improvement would be possible by taking an ensemble approach, a multitude of additional models were selected and run through a function to see which performed best. LDA, QDA, Naive Bayes, SVM, Loess, Rpart, Multinom and ADABoost were selected in addition to Random Forest, XGBoost and GLM from above. The default caret tuning is applied for these models. After running the models, 25 resamples are taken to avoid over-fitting, and the 95% confidence interval of the accuracy of each model's resamples is measured. This gives a clear visual of which models performed best.
Here are the metrics for all of the models:
```{r, include=FALSE}
models <- c("glm", "rpart", "lda", "naive_bayes", "svmLinear", "gamLoess", "multinom", "qda", "rf", "adaboost", "xgbTree")
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
fits <- lapply(models, function(model){
print(model)
train(Dataset ~ ., method = model, data = train)
})
names(fits) <- models
pred <- sapply(fits, function(object)
predict(object, newdata = test))
predictions1 <- sapply(models, function(d) prediction(as.numeric(pred[,d]) , test$Dataset))
cm <- sapply(models, function(p) confusionMatrix(factor(pred[,p],levels=c(1,0)), test$Dataset))
#Multinom
auc <- performance(predictions1$multinom, "auc")
AUPRC <- performance(predictions1$multinom, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"multinom"])[test$Dataset == 1],
scores.class1 = as.numeric(pred[,"multinom"])[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="Multinom",
Accuracy=cm["overall",]$multinom["Accuracy"],
Sensitivity= cm["byClass",]$multinom["Sensitivity"],
Specificity= cm["byClass",]$multinom["Specificity"],
F1= cm["byClass",]$multinom["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
#LDA
auc <- performance(predictions1$lda, "auc")
AUPRC <- performance(predictions1$lda, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"lda"])[test$Dataset == 1],
scores.class1 = as.numeric(pred[,"lda"])[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="LDA",
Accuracy=cm["overall",]$lda["Accuracy"],
Sensitivity= cm["byClass",]$lda["Sensitivity"],
Specificity= cm["byClass",]$lda["Specificity"],
F1= cm["byClass",]$lda["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
#Adaboost
auc <- performance(predictions1$adaboost, "auc")
AUPRC <- performance(predictions1$adaboost, "prec", "rec")
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"adaboost"])[test$Dataset == 1],
scores.class1 = as.numeric(pred[,"adaboost"])[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="Adaboost",
Accuracy=cm["overall",]$adaboost["Accuracy"],
Sensitivity= cm["byClass",]$adaboost["Sensitivity"],
Specificity= cm["byClass",]$adaboost["Specificity"],
F1= cm["byClass",]$adaboost["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
#SVM
auc <- performance(predictions1$svmLinear, "auc")
AUPRC <- performance(predictions1$svmLinear, "prec", "rec")
AUPRC
aucv <- auc@y.values[[1]]
#AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"svmLinear"])[test$Dataset == 1],
# scores.class1 = as.numeric(pred[,"svmLinear"])[test$Dataset == 0],
# curve=TRUE)
results <- bind_rows(results,
tibble(Method="SvmLinear",
Accuracy=cm["overall",]$svmLinear["Accuracy"],
Sensitivity= cm["byClass",]$svmLinear["Sensitivity"],
Specificity= cm["byClass",]$svmLinear["Specificity"],
F1= cm["byClass",]$svmLinear["F1"],
AUC=aucv,
AUPRC=0))
#rpart
auc <- performance(predictions1$rpart, "auc")
AUPRC <- performance(predictions1$rpart, "prec", "rec")
AUPRC
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"rpart"])[test$Dataset == 1],
scores.class1 = as.numeric(pred[,"rpart"])[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="Rpart",
Accuracy=cm["overall",]$rpart["Accuracy"],
Sensitivity= cm["byClass",]$rpart["Sensitivity"],
Specificity= cm["byClass",]$rpart["Specificity"],
F1= cm["byClass",]$rpart["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
#gamLoess
auc <- performance(predictions1$gamLoess, "auc")
AUPRC <- performance(predictions1$gamLoess, "prec", "rec")
AUPRC
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"gamLoess"])[test$Dataset == 1],
scores.class1 = as.numeric(pred[,"gamLoess"])[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="GamLoess",
Accuracy=cm["overall",]$gamLoess["Accuracy"],
Sensitivity= cm["byClass",]$gamLoess["Sensitivity"],
Specificity= cm["byClass",]$gamLoess["Specificity"],
F1= cm["byClass",]$gamLoess["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
#qda
auc <- performance(predictions1$qda, "auc")
AUPRC <- performance(predictions1$qda, "prec", "rec")
AUPRC
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"qda"])[test$Dataset == 1],
scores.class1 = as.numeric(pred[,"qda"])[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="QDA",
Accuracy=cm["overall",]$qda["Accuracy"],
Sensitivity= cm["byClass",]$qda["Sensitivity"],
Specificity= cm["byClass",]$qda["Specificity"],
F1= cm["byClass",]$qda["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
#xgbTree
auc <- performance(predictions1$xgbTree, "auc")
AUPRC <- performance(predictions1$xgbTree, "prec", "rec")
AUPRC
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"xgbTree"])[test$Dataset == 1],
scores.class1 = as.numeric(pred[,"xgbTree"])[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="XGBTree",
Accuracy=cm["overall",]$xgbTree["Accuracy"],
Sensitivity= cm["byClass",]$xgbTree["Sensitivity"],
Specificity= cm["byClass",]$xgbTree["Specificity"],
F1= cm["byClass",]$xgbTree["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
#naive_bayes
auc <- performance(predictions1$naive_bayes, "auc")
AUPRC <- performance(predictions1$naive_bayes, "prec", "rec")
AUPRC
aucv <- auc@y.values[[1]]
AUPRCv <- pr.curve(scores.class0 = as.numeric(pred[,"naive_bayes"])[test$Dataset == 1],
scores.class1 = as.numeric(pred[,"naive_bayes"])[test$Dataset == 0],
curve=TRUE)
results <- bind_rows(results,
tibble(Method="Naive Bayes",
Accuracy=cm["overall",]$naive_bayes["Accuracy"],
Sensitivity= cm["byClass",]$naive_bayes["Sensitivity"],
Specificity= cm["byClass",]$naive_bayes["Specificity"],
F1= cm["byClass",]$naive_bayes["F1"],
AUC=aucv,
AUPRC=AUPRCv[[3]]))
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
There is some very strong performance from some of the models, with Naive Bayes and QDA producing AUPRC numbers over **0.855.** However, these improvements come at the expense of the sensitivity metric, which is not good when the focus is on detecting disease. Overall accuracy also improves somewhat, and while it would be tempting to select only the best-performing models for the ensemble, care needs to be taken to avoid over-fitting. Instead, 25 resamples from each of the models can be examined to see how the accuracy compares by method.
```{r, echo=FALSE, message=FALSE, warning=FALSE, results=FALSE}
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
resamples <- resamples(fits)
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
dotplot(resamples, metric = "Accuracy", main="Ensemble Resamples by Method")
```
Looking at the resamples, multinom does very well, with the lower end of its 95% confidence interval outperforming the medians of all of the other models' resamples.
The top six performing models based on the accuracy confidence intervals (multinom, glm, lda, rf, adaboost and svmLinear) can now be used to create the ensemble. First, the models are checked for high correlation with one another, since building an ensemble from very highly correlated models is not advised:
```{r, echo=FALSE, message=FALSE, warning=FALSE, results=FALSE}
models <- c("multinom", "glm", "lda", "rf", "adaboost", "svmLinear")
set.seed(1147, sample.kind = "Rounding") #in R 3.6 or later
fits <- lapply(models, function(model){
print(model)
train(Dataset ~ ., method = model, data = train)
})
names(fits) <- models
resamples <- resamples(fits)
```
```{r,echo=FALSE,message=FALSE, warning=FALSE}
modelCor(resamples) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center",
font_size = 10,
full_width = FALSE)
```
Since there are no concerns with high correlation, an ensemble model can be created and the results are below:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
pred <- sapply(fits, function(object)
predict(object, newdata = test))