---
title: "BT2103 Optimisation Methods in Business Analytics"
subtitle: "Machine Learning Classification"
output:
  pdf_document: default
  html_document: default
editor_options:
  markdown:
    wrap: 72
fontsize: 12pt
---
```{r , echo=TRUE, warning=FALSE}
suppressMessages(library(dplyr))
suppressMessages(library(caret))
suppressMessages(library(lessR))
suppressMessages(library(ggplot2))
suppressMessages(library(Hmisc))
suppressMessages(library(corrplot))
suppressMessages(library(mltools))
suppressMessages(library(data.table))
suppressMessages(library(cowplot))
suppressMessages(library(e1071))
suppressMessages(library(class))
suppressMessages(library(ROSE))
suppressMessages(library(ROCR))
suppressMessages(library(rpart))
suppressMessages(library(rpart.plot))
suppressMessages(library(randomForest))
suppressMessages(library(party))
suppressMessages(library(pROC))
suppressMessages(library(knitr))
suppressMessages(library(kableExtra))
```
# 1. Brief Introduction Of Data Set And Data Modeling Problem
The dataset "card.csv" has been obtained from UCI repository:
<https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients>
It contains payment information of 30,000 credit card holders obtained
from a bank in Taiwan.
## Dataset Description
ID: Unique identification number for each consumer
**Feature Attributes**
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the
individual consumer credit and his/her family (supplementary) credit
(Continuous)
SEX: Gender (1 = male; 2 = female) (Categorical)
EDUCATION: Education (0 = unknown, 1 = graduate school; 2 = university;
3 = high school; 4 = others, 5 = unknown, 6 = unknown) (Categorical)
MARRIAGE: Marital status (0 = unknown, 1 = married; 2 = single; 3 =
others) (Categorical)
AGE: Age (in years) (Continuous)
PAY_0 - PAY_6: History of past payment. Past monthly payment records
were tracked (from April 2005 to September 2005, where PAY_0 = the
repayment status in September 2005 and PAY_6 = the repayment status in
April 2005). The measurement scale for the repayment status is: -1 = pay
duly; 1 = payment delay for one month; 2 = payment delay for two months;
. . .; 8 = payment delay for eight months; 9 = payment delay for nine
months and above. (Categorical)
BILL_AMT1 - BILL_AMT6: Amount of bill statement (NT dollar) (from April
2005 to September 2005, where BILL_AMT1 = amount of bill statement in
September 2005 and BILL_AMT6 = amount of bill statement in April 2005).
(Continuous)
PAY_AMT1 - PAY_AMT6: Amount of previous payment (NT dollar), where
PAY_AMT1 = amount paid in September 2005 and PAY_AMT6 = amount paid in
April 2005. (Continuous)
**Target Variable**
default_payment_next_month: target values of whether payment is
defaulted (Yes = 1, No = 0) (Categorical)
In this report, we aim to evaluate supervised machine learning
algorithms on their ability to predict whether a customer will default
on their credit card payment. From the bank's risk-management
perspective, identifying customers who are likely to default is
important, since defaults pose a direct financial risk.
We will be using the following classification algorithms:
1) Logistic Regression
2) Support Vector Machines (SVM)
3) Naive Bayes Classification
4) Decision Tree (Conditional Inference Tree)
5) Random Forest
6) Neural Networks
```{r , echo=TRUE, warning=FALSE}
data <- read.table("card.csv",sep=",",skip=2,header=FALSE)
header <- scan("card.csv",sep=",",nlines=2,what=character())
#labeling column names
colnames(data) <- c("ID", "LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE",
"PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6",
"BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4",
"BILL_AMT5", "BILL_AMT6", "PAY_AMT1", "PAY_AMT2",
"PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
"default_payment_next_month")
#converting categorical data into factors
data$SEX <- as.factor(data$SEX)
data$EDUCATION <- as.factor(data$EDUCATION)
data$MARRIAGE <- as.factor(data$MARRIAGE)
data$PAY_0 <- as.factor(data$PAY_0)
data$PAY_2 <- as.factor(data$PAY_2)
data$PAY_3 <- as.factor(data$PAY_3)
data$PAY_4 <- as.factor(data$PAY_4)
data$PAY_5 <- as.factor(data$PAY_5)
data$PAY_6 <- as.factor(data$PAY_6)
data$default_payment_next_month <- as.factor(data$default_payment_next_month)
```
# 2. Exploratory Data Analysis
Before performing initial investigations to discover patterns and spot
anomalies or inconsistencies in our data, we first take a look at our
target variable, default_payment_next_month.
```{r , echo=TRUE, warning=FALSE,fig.height = 3, fig.width = 5}
ggplot(data, aes(default_payment_next_month,
fill = default_payment_next_month)) +
geom_bar() +
scale_fill_manual(values = c("deepskyblue", "turquoise2"), labels=c('Not Default', 'Default')) +
geom_text(aes(label=round(..count../30000,2)), stat = "count", vjust = -0.4, colour = "black", size=2)
```
As shown in the figure, we have a binary classification problem, with
values of "0" for consumers who did not default on payment and "1" for
consumers who defaulted. Out of 30,000 samples, 6,636 clients (22.1%)
have defaulted on payment.
## Data Cleaning
a. **Handling missing data**
```{r , echo=TRUE, warning=FALSE}
paste("Number Of Missing Values In Dataset:", sum(is.na(data)))
```
There are no missing values in the dataset.
b. **Categorical Feature Analysis**
Firstly, the column for the repayment status in September 2005 is wrongly
labelled as PAY_0. To be consistent with "BILL_AMT1" and "PAY_AMT1", the
"PAY_0" column will be re-labelled as "PAY_1" instead.
```{r , echo=TRUE, warning=FALSE}
colnames(data)[7] = "PAY_1"
```
Secondly, we also found unknown values for certain categorical features.
These errors in the data set can be addressed in two ways:
1. Remove the rows which contain the error
2. Replace the wrong attribute class with a correction
[MARRIAGE]{.ul}
According to the UCI repository, MARRIAGE records marital status (1 =
married; 2 = single; 3 = others). We noticed that 54 observations have a
value of "0", which does not correspond to any of the above-mentioned
categories.
We decided to remove these observations because we cannot assume that
they belong to any of the other 3 defined categories. Since the data set
contains 30,000 customer records, removing these few data points will
not severely affect the performance and evaluation of our model; the
change in counts is shown in the table below.
```{r , echo=TRUE, warning=FALSE}
marriage1<-count(data, MARRIAGE)
marriage1
data = data[!(data$MARRIAGE == 0),]
data$MARRIAGE = droplevels(data$MARRIAGE)
marriage2 <- count(data, MARRIAGE)
marriage2
```
[EDUCATION]{.ul}
According to the UCI repository, EDUCATION records education level (1 =
graduate school; 2 = university; 3 = high school; 4 = others). We
noticed that 346 observations have a value of "0", "5" or "6", which are
not among the defined categories.
We decided to remove these observations because we cannot assume that
they belong to any of the 4 defined categories. Since the data set
contains 30,000 customer records, removing these few data points will
not severely affect the performance and evaluation of our model; the
change in counts is shown in the table below.
```{r , echo=TRUE, warning=FALSE}
edu1 = count(data, EDUCATION)
edu1
data = data[!(data$EDUCATION == 0 | data$EDUCATION == 5 | data$EDUCATION == 6),]
data$EDUCATION = droplevels(data$EDUCATION)
edu2 = count(data, EDUCATION)
edu2
```
[PAY_N]{.ul}
```{r, results='asis', echo=TRUE, warning=FALSE}
x = cbind(count(data, PAY_1),count(data, PAY_2)$n,count(data, PAY_3)$n,count(data, PAY_4)$n)
colnames(x) <- c('cat','Pay_1','Pay_2','Pay_3','Pay_4')
y = cbind(count(data, PAY_5),count(data, PAY_6)$n)
colnames(y) <- c('cat','Pay_5','Pay_6')
cat('\\begin{center}')
# {c c} Creates a two column table
# Use {c | c} if you'd like a line between the tables
cat('\\begin{tabular}{ c c }')
print(knitr::kable(x, format = 'latex'))
# Separate the two columns with an &
cat('&')
print(knitr::kable(y, format = 'latex'))
cat('\\end{tabular}')
cat('\\end{center}')
```
According to the UCI repository, for PAY_N: -1 = pay duly; 1 = payment
delay for one month; 2 = payment delay for two months; . . .; 8 =
payment delay for eight months; 9 = payment delay for nine months and
above.
However, there are data points of PAY_N labelled "-2" and "0", which do
not correspond to any of the above-mentioned categories. Upon further
research, we found that the author who created the data set clarified
that "-2" means no consumption and "0" means the use of revolving
credit, as seen in
<https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset/discussion/34608>.
Therefore, there are no errors in the PAY_N columns with respect to the
categories of PAY_N, and no further action is required.
c. **Checking For Outliers (For Numerical Data)**
```{r , echo=TRUE, warning=FALSE,fig.height = 3, fig.width = 5}
process <- caret::preProcess(data[,c(2,6,13,14,15,16,17,18,19,20,21,22,23,24)], method=c("range"))
norm_scale <- predict(process,data[,c(2,6,13,14,15,16,17,18,19,20,21,22,23,24)])
boxplot1<- boxplot(norm_scale,range= 3, cex.axis=0.3, las=2)
```
Input variables may have different units, scales and magnitudes; for
this reason, before drawing a boxplot, Min-Max standardisation is
applied to scale the features to the range of 0 to 1.
The boxplot shows a significant number of outliers in the dataset.
Because of this, we choose not to remove these outliers: they may
contain valuable information that improves our model, and we do not have
concrete evidence that they are anomalous observations.
Due to the high number of outliers, we plotted the histograms of each
respective feature to understand the distribution of each feature.
```{r , echo=TRUE, warning=FALSE,fig.height = 3, fig.width = 5}
#Histogram of Amount of the given credit (NT dollar)
a = hist(data$LIMIT_BAL, main="Histogram of Amount of Given Credit (NT dollar)", cex.main = 0.5, breaks = "Sturges", labels=F, ylim = c(0,10000))
text(x = a$mids, y = a$counts, labels = a$counts, cex = 0.5, pos = 3)
```
Looking at the distribution of LIMIT_BAL, the perceived outliers do not
stray away from the observed distribution as seen from the histogram.
Hence, we decided not to remove any data points of LIMIT_BAL as they are
plausible values of the amount of the given credit (NT dollar).
```{r , echo=TRUE, warning=FALSE,fig.height = 3, fig.width = 5}
#Histograms of Bill Statements
par(mfrow=c(1,2))
a = hist(data$BILL_AMT1, main="Histogram of amount of Bill Statement (September 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,25000))
text(x = a$mids, y = a$counts, labels = a$counts, cex = 0.35, pos = 3)
b = hist(data$BILL_AMT2, main="Histogram of amount of Bill Statement (August 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,25000))
text(x = b$mids, y = b$counts, labels = b$counts, cex = 0.35, pos = 3)
par(mfrow=c(1,2))
c = hist(data$BILL_AMT3, main="Histogram of amount of Bill Statement (July 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,25000))
text(x = c$mids, y = c$counts, labels = c$counts, cex = 0.35, pos = 3)
d = hist(data$BILL_AMT4, main="Histogram of amount of Bill Statement (June 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,25000))
text(x = d$mids, y = d$counts, labels = d$counts, cex = 0.35, pos = 3)
par(mfrow=c(1,2))
e = hist(data$BILL_AMT5, main="Histogram of amount of Bill Statement (May 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,25000))
text(x = e$mids, y = e$counts, labels = e$counts, cex = 0.35, pos = 3)
f = hist(data$BILL_AMT6, main="Histogram of amount of Bill Statement (April 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,25000))
text(x = f$mids, y = f$counts, labels = f$counts, cex = 0.35, pos = 3)
```
Looking at the distribution of BILL_AMTN, the perceived outliers do not
stray from the observed distribution, as seen in the histograms. Hence,
we decided not to remove any data points of BILL_AMTN, as they are
plausible values of the amount of bill statement (NT dollar).
```{r , echo=TRUE, warning=FALSE,fig.height = 3, fig.width = 5}
#Histograms of Amount of Previous Payment
par(mfrow=c(1,2))
a = hist(data$PAY_AMT1, main="Histogram of Amount Paid (September 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,35000))
text(x = a$mids, y = a$counts, labels = a$counts, cex = 0.35, pos = 3)
b = hist(data$PAY_AMT2, main="Histogram of Amount Paid (August 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,35000))
text(x = b$mids, y = b$counts, labels = b$counts, cex = 0.35, pos = 3)
par(mfrow=c(1,2))
c = hist(data$PAY_AMT3, main="Histogram of Amount Paid (July 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,35000))
text(x = c$mids, y = c$counts, labels = c$counts, cex = 0.35, pos = 3)
d = hist(data$PAY_AMT4, main="Histogram of Amount Paid (June 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,35000))
text(x = d$mids, y = d$counts, labels = d$counts, cex = 0.35, pos = 3)
par(mfrow=c(1,2))
e = hist(data$PAY_AMT5, main="Histogram of Amount Paid (May 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,35000))
text(x = e$mids, y = e$counts, labels = e$counts, cex = 0.35, pos = 3)
f = hist(data$PAY_AMT6, main="Histogram of Amount Paid (April 2005)", cex.main = 0.4, breaks = "Sturges", labels=F, ylim = c(0,35000))
text(x = f$mids, y = f$counts, labels = f$counts, cex = 0.35, pos = 3)
```
Looking at the distribution of PAY_AMTN, the perceived outliers do not
stray from the observed distribution, as seen in the histograms. Hence,
we decided not to remove any data points of PAY_AMTN, as they are
plausible values of the amount of previous payment (NT dollar).
```{r , echo=TRUE, warning=FALSE,fig.height = 3, fig.width = 5}
#Histogram of age
f = hist(data$AGE, main="Histogram of Age Of Clients", cex.main = 0.8, breaks = "Sturges", labels=F, ylim = c(0,8000))
text(x = f$mids, y = f$counts, labels = f$counts, cex = 0.5, pos = 3)
```
No outliers were detected: the range of values shown in the histogram of
AGE consists of plausible ages in years.
d. **Correlation Between Features**
One factor that could affect machine learning classification performance
is correlation between features: classification algorithms that assume
the features are independent may perform poorly when features are
strongly correlated with each other. Reducing the number of dimensions
of the feature vectors could therefore lead to better classification
performance.
```{r , echo=TRUE, warning=FALSE,fig.height = 3, fig.width = 5}
#splitting dataset into training set(train.data) & test set(test.data)
set.seed(1234)
n = length(data$ID)
index <- 1:nrow(data)
testindex <- sample(index, trunc(n/4))
test.data <- data[testindex,]
train.data <- data[-testindex,]
#using train data, taking out default_payment_next_month and the categorical feature attributes (remove ID, SEX, MARRIAGE, PAY_N, default_payment_next_month)
train.data_2 <- cor(train.data[,-c(1,3,4,5,7,8,9,10,11,12,25)])
corr_mat <- round(train.data_2,2)
melted_corr_mat <- reshape2::melt(corr_mat)
ggplot(data = melted_corr_mat, aes(x=Var1, y=Var2,
fill=value)) +
geom_tile() +
geom_text(aes(Var2, Var1, label = value),
color = "black", size = 2) +
scale_x_discrete(guide = guide_axis(angle = 90))
```
Features that are highly correlated with one another ($\rho$ \> 0.9):
BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6
Five of these six variables will be removed in the data pre-processing
phase, as they are unlikely to add useful information for predicting the
target variable (default_payment_next_month) and may harm the
performance of the machine learning models.
# 3. Data Pre-Processing
The data pre-processing section involves transforming raw data into an
understandable and usable format.
## a. One-Hot Encoding
Categorical features must be transformed in the data pre-processing
phase since machine learning models require numeric input values.
One-hot encoding will be used for the categorical features in our data
set that have no ordinal relationship. It creates a new binary dummy
variable for each class of every categorical feature; for example,
EDUCATION is split into dummy variables EDUCATION_1 to EDUCATION_4, as
illustrated below.
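As a tiny illustration (toy values, not our data), mltools::one_hot()
expands a factor column into one 0/1 indicator column per level:
```{r , echo=TRUE, eval=FALSE}
# Toy example: a factor with levels 1, 2 and 4 expands into indicator
# columns EDUCATION_1, EDUCATION_2 and EDUCATION_4, one 1 per row.
toy <- data.table::data.table(EDUCATION = factor(c(1, 2, 4)))
mltools::one_hot(toy)
```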
```{r , echo=TRUE, warning=FALSE}
#one hot encoding for train data
train.data_without_target = train.data[-c(25)]
train.data_without_target_encoded <- one_hot(as.data.table(train.data_without_target))
train_data_encoded = cbind(train.data_without_target_encoded, train.data[c(25)])
#one hot encoding for test data
test.data_without_target <- test.data[-c(25)]
test.data_without_target_encoded <- one_hot(as.data.table(test.data_without_target))
test_data_encoded = cbind(test.data_without_target_encoded, test.data[c(25)])
```
## b. Feature Scaling
In most data sets, features are not on the same scale, yet many machine
learning models are largely Euclidean-distance-based. Without feature
scaling, features with large units dominate those with small units in
the calculation of the Euclidean distance, so features with small units
may be neglected. To prevent this, we need to scale our numerical
features.
In this project, we will use the caret library to pre-process and scale
the data with the Min-Max scaling method, which maps the data values to
the range of 0 to 1.
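Concretely, Min-Max scaling maps each value $x$ of a feature to
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)},$$
so that the smallest observed value becomes 0 and the largest becomes 1.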
```{r , echo=TRUE, warning=FALSE}
#scaling training data set
raw_train_data <- train_data_encoded
processed_train_data <- train_data_encoded
df <- preProcess(processed_train_data, method=c("range"))
processed_train_data <- predict(df, as.data.frame(processed_train_data))
#processed_train_data
#scaling test data set
raw_test_data <- test_data_encoded
processed_test_data <- test_data_encoded
df <- preProcess(processed_test_data, method=c("range"))
processed_test_data <- predict(df, as.data.frame(processed_test_data))
#processed_test_data
```
## c. Train-Test Split
The dataset is split into two parts, the training set and the test set,
in the proportion 3:1. To reduce any leakage of information into the
test set, we perform all feature selection on the training set only.
This includes testing for high correlation and applying the oversampling
and undersampling techniques.
# 4. Feature Selection
Feature selection is the process of selecting the features that
contribute most to predicting the target variable.
## a. Removal of highly correlated variables
As mentioned above, features that are highly correlated with each other
will be removed, as they are unlikely to add useful information for
predicting the target variable (default_payment_next_month) and may harm
the performance of the machine learning models. Some classification
models, such as Naive Bayes, also rely on the assumption that the
features used to predict the target variable are conditionally
independent given the class; an automated cross-check is sketched below.
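As a cross-check (a sketch, not run in this report),
caret::findCorrelation() lists the columns it would drop from the
correlation matrix corr_mat computed above at a given cutoff:
```{r , echo=TRUE, eval=FALSE}
# Columns flagged for removal at an absolute-correlation cutoff of 0.9;
# the result can be compared against the manual list below.
findCorrelation(corr_mat, cutoff = 0.9, names = TRUE)
```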
```{r , echo=TRUE, warning=FALSE}
processed_train_data.corr <- select(processed_train_data,-c('ID','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'))
processed_test_data.corr <- select(processed_test_data,-c('ID','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'))
setnames(processed_test_data.corr, old = c('PAY_1_-1','PAY_2_-1','PAY_3_-1','PAY_4_-1','PAY_5_-1','PAY_6_-1','PAY_1_-2','PAY_2_-2','PAY_3_-2','PAY_4_-2','PAY_5_-2','PAY_6_-2'),
new = c('PAY_1_n1','PAY_2_n1','PAY_3_n1','PAY_4_n1','PAY_5_n1','PAY_6_n1','PAY_1_n2','PAY_2_n2','PAY_3_n2','PAY_4_n2','PAY_5_n2','PAY_6_n2'))
setnames(processed_train_data.corr, old = c('PAY_1_-1','PAY_2_-1','PAY_3_-1','PAY_4_-1','PAY_5_-1','PAY_6_-1','PAY_1_-2','PAY_2_-2','PAY_3_-2','PAY_4_-2','PAY_5_-2','PAY_6_-2'),
new = c('PAY_1_n1','PAY_2_n1','PAY_3_n1','PAY_4_n1','PAY_5_n1','PAY_6_n1','PAY_1_n2','PAY_2_n2','PAY_3_n2','PAY_4_n2','PAY_5_n2','PAY_6_n2'))
processed_train_data_rosepca.corr <- select(processed_train_data,-c('ID','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'))
processed_test_data_rosepca.corr <- select(processed_test_data,-c('ID','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'))
setnames(processed_test_data_rosepca.corr, old = c('PAY_1_-1','PAY_2_-1','PAY_3_-1','PAY_4_-1','PAY_5_-1','PAY_6_-1','PAY_1_-2','PAY_2_-2','PAY_3_-2','PAY_4_-2','PAY_5_-2','PAY_6_-2'),
new = c('PAY_1_n1','PAY_2_n1','PAY_3_n1','PAY_4_n1','PAY_5_n1','PAY_6_n1','PAY_1_n2','PAY_2_n2','PAY_3_n2','PAY_4_n2','PAY_5_n2','PAY_6_n2'))
setnames(processed_train_data_rosepca.corr, old = c('PAY_1_-1','PAY_2_-1','PAY_3_-1','PAY_4_-1','PAY_5_-1','PAY_6_-1','PAY_1_-2','PAY_2_-2','PAY_3_-2','PAY_4_-2','PAY_5_-2','PAY_6_-2'),
new = c('PAY_1_n1','PAY_2_n1','PAY_3_n1','PAY_4_n1','PAY_5_n1','PAY_6_n1','PAY_1_n2','PAY_2_n2','PAY_3_n2','PAY_4_n2','PAY_5_n2','PAY_6_n2'))
```
## b. Principal Component Analysis (PCA)
PCA is particularly useful for dimensionality reduction when the data
are high-dimensional. Since our dataset has a large number of feature
attributes, we can use PCA to find a set of derived features ("principal
components") that explain the most variance in the data. Because PCA is
built on variance (squared deviations), we applied it only to the
continuous variables in our dataset, as the concept of squared
deviations breaks down for binary variables.
We can then choose a subset of the PCs, e.g. the first k PCs, based on
the cumulative variance explained (for example, the first 7 PCs may
account for 99% of the variance). For our project, we chose PCA as a
feature selection technique because we do not require interpretability
of the selected features.
```{r , echo=TRUE, warning=FALSE}
set.seed(1234)
pca1 <- prcomp(select(processed_train_data.corr,c('LIMIT_BAL','AGE','BILL_AMT1','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6')), center=T)
# We did not set scale. = TRUE since the variables are already scaled.
summary(pca1)
```
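The cutoff can also be read off programmatically from the standard
deviations stored by prcomp(); a small sketch (not run in this report):
```{r , echo=TRUE, eval=FALSE}
# Smallest number of leading PCs whose cumulative share of
# explained variance reaches 99%.
cum_var <- cumsum(pca1$sdev^2) / sum(pca1$sdev^2)
which(cum_var >= 0.99)[1]
```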
From the summary of the PCA, principal components 1 to 7 together
explain 99.113% of the variance in the data. Thus, we will use PCs 1 to
7 in our machine learning models.
Using the principal components derived from the training set, we project
the test set onto the same components. This way the points in both the
train and test sets lie in the same dimension space, and no knowledge
about the test set is used during training, preventing leakage.
```{r , echo=TRUE, warning=FALSE}
processed_train_data.corr$PC1 = pca1$x[,"PC1"]
processed_train_data.corr$PC2 = pca1$x[,"PC2"]
processed_train_data.corr$PC3 = pca1$x[,"PC3"]
processed_train_data.corr$PC4 = pca1$x[,"PC4"]
processed_train_data.corr$PC5 = pca1$x[,"PC5"]
processed_train_data.corr$PC6 = pca1$x[,"PC6"]
processed_train_data.corr$PC7 = pca1$x[,"PC7"]
#mapping pca 1 to 7 weights to form pca 1 to 7 in test set
pca_test= predict(pca1,processed_test_data.corr)
processed_test_data.corr = cbind(processed_test_data.corr,pca_test[,-c(8,9)])
```
## c. Oversampling and Undersampling techniques
For the target variable (default_payment_next_month) in the training
set, the data is unbalanced: 22.31881% of clients have defaulted on
payment, compared to 77.68119% who did not, as seen in the summary table
below.
```{r , echo=TRUE, warning=FALSE}
#check class distribution
processed_train_data %>%
group_by(default_payment_next_month) %>%
summarise("Percentage of Total" = 100*n()/nrow(processed_train_data ))
```
To prevent the training process from biasing towards one class over
another, over-sampling and under-sampling techniques will be used.
Over-sampling generates more observations from the minority class to
balance the dataset, while under-sampling reduces the observations from
the majority class to the same end.
Data generated by naive over-sampling contains repeated observations,
and data generated by under-sampling is deprived of potentially
important information from the original data. Either effect can distort
the measured performance of the machine learning algorithms.
To prevent this, the ROSE package generates data synthetically: rather
than duplicating rows, it draws new observations from a
smoothed-bootstrap (kernel density) neighbourhood around existing data
points.
However, it is noteworthy that synthetic samples do not formally belong
to the target distribution and hence should not be used as test samples.
```{r , echo=TRUE, warning=FALSE}
# ROSE balancing of the training set (used for dataset 2), followed by one-hot encoding and Min-Max scaling
train_data.corr <- select(train.data,-c('ID','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'))
test_data.corr <- select(test.data,-c('ID','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'))
data.rose <- ROSE(default_payment_next_month ~ .,
data = train_data.corr,
seed = 1)$data
rose_train.data_without_target.corr = data.rose[-c(19)]
rose_train.data_without_target_encoded.corr <- one_hot(as.data.table(rose_train.data_without_target.corr))
rose_train_data_encoded.corr = cbind(rose_train.data_without_target_encoded.corr, data.rose[c(19)])
rose_processed_train_data.corr <- rose_train_data_encoded.corr
df <- preProcess(rose_processed_train_data.corr, method=c("range"))
rose_processed_train_data.corr <- predict(df, as.data.frame(rose_processed_train_data.corr))
setnames(rose_processed_train_data.corr, old = c('PAY_1_-1','PAY_2_-1','PAY_3_-1','PAY_4_-1','PAY_5_-1','PAY_6_-1','PAY_1_-2','PAY_2_-2','PAY_3_-2','PAY_4_-2','PAY_5_-2','PAY_6_-2'),
new = c('PAY_1_n1','PAY_2_n1','PAY_3_n1','PAY_4_n1','PAY_5_n1','PAY_6_n1','PAY_1_n2','PAY_2_n2','PAY_3_n2','PAY_4_n2','PAY_5_n2','PAY_6_n2'))
```
PCA feature selection was applied before oversampling and undersampling.
By doing so, we project the data onto the axes along which its variance
is highest, hence leveraging the maximisation of within-data
variability. Running oversampling and undersampling when the variability
within the data is high creates synthetic samples closer to the actual
data.
```{r , echo=TRUE, warning=FALSE}
#4 preprocessing
set.seed(1234)
rose_processed_train_data_pca.corr <- processed_train_data_rosepca.corr
pca2 <- prcomp(rose_processed_train_data_pca.corr[,c(1,11,76:82)], center=T)
#We did not include scale = True since the variables are already scaled.
summary(pca2)
rose_processed_train_data_pca.corr$PC1 = pca2$x[,"PC1"]
rose_processed_train_data_pca.corr$PC2 = pca2$x[,"PC2"]
rose_processed_train_data_pca.corr$PC3 = pca2$x[,"PC3"]
rose_processed_train_data_pca.corr$PC4 = pca2$x[,"PC4"]
rose_processed_train_data_pca.corr$PC5 = pca2$x[,"PC5"]
rose_processed_train_data_pca.corr$PC6 = pca2$x[,"PC6"]
rose_processed_train_data_pca.corr$PC7 = pca2$x[,"PC7"]
#mapping pca 1 to 7 weights to form pca 1 to 7 in test set
pca_test2= predict(pca2,processed_test_data_rosepca.corr)
rose_processed_test_data_pca.corr = cbind(processed_test_data_rosepca.corr,pca_test2[,-c(8,9)])
rose_processed_train_data_pca.corr<- ROSE(default_payment_next_month ~ .,
data = rose_processed_train_data_pca.corr,
seed = 1)$data
```
```{r , echo=TRUE, warning=FALSE}
# Reference: training / test set pairs for the four datasets used below
#1 processed_train_data_rosepca.corr / processed_test_data_rosepca.corr
#2 rose_processed_train_data.corr / processed_test_data_rosepca.corr
#3 processed_train_data.corr minus c('LIMIT_BAL','AGE','BILL_AMT1','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6') / processed_test_data.corr
#4 rose_processed_train_data_pca.corr minus the same columns / rose_processed_test_data_pca.corr
```
# 5. Model Selection
We will now run a variety of machine learning models to test and
evaluate which one performs the best in predicting defaulters among
credit card customers.
The models we will be using are:
1) Logistic Regression
2) Support Vector Machines (SVM)
3) Naive Bayes Classification
4) Decision Tree (Conditional Inference Tree)
5) Random Forest
6) Neural Networks
We will train each model on versions of the training set produced by the
different preprocessing pipelines:
1) Original training set after removing highly correlated variables
2) Original training set after removing highly correlated variables +
oversampling and undersampling techniques
3) PCA after removing highly correlated variables
4) PCA after removing highly correlated variables + oversampling and
undersampling techniques
We will then evaluate each model on the test set.
## 1. Logistic Regression
Logistic regression is a classification algorithm that models the
probability of an event's success or failure. Since our outcome
variable, default_payment_next_month (Yes = 1, No = 0), is binary,
logistic regression is appropriate.
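For a feature vector $x$, the model estimates
$$P(\text{default} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}},$$
and a probability threshold on this estimate converts it into a class
label.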
To find the optimum threshold for the logistic regression
classification, we have to decide how much True Positive Rate (TPR) and
False Positive Rate (FPR) we are willing to trade off: increasing the
TPR will likely increase the FPR as well. For this dataset, we wish to
minimise the number of false negatives as far as possible, so we choose
a threshold that increases TPR while keeping FPR low.
To find the probability threshold that gives the best performance, we
looked for the point on the ROC curve that maximises TPR while
minimising FPR. For instance, we plotted the ROC curve for the test
results of dataset 4 (PCA after removing highly correlated variables +
oversampling and undersampling techniques), which gave an optimum
threshold of 0.429, as seen below.
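Concretely, the code picks the ROC point closest to the ideal corner
$(\mathrm{FPR}, \mathrm{TPR}) = (0, 1)$, i.e. the threshold minimising
$$d^2 = \mathrm{FPR}^2 + (1 - \mathrm{TPR})^2,$$
which is exactly the dist_vec computation in the chunk below.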
```{r , echo=TRUE, warning=FALSE,fig.height = 3, fig.width = 5}
set.seed(1234)
#1
#processed_train_data_rosepca.corr, processed_test_data_rosepca.corr
logistic <- glm(default_payment_next_month ~ .,
data = processed_train_data_rosepca.corr,
family = "binomial")
#summary
#summary(logistic)
#predict test data based on model
predict_reg <- predict(logistic,
processed_test_data_rosepca.corr,
type = "response")
# (ROC threshold search for this dataset omitted; identical in form to the dataset 4 code below)
prediction_data <- ifelse(predict_reg > 0.5, 1, 0)
log.cm <- table(processed_test_data_rosepca.corr$default_payment_next_month,prediction_data)
log.cm1 <- table(processed_test_data_rosepca.corr$default_payment_next_month,prediction_data)
precision <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[2])
recall <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[3])
accuracy <- (as.matrix(log.cm)[1] + as.matrix(log.cm)[4]) / 7400
avg_accuracy <- (recall+(as.matrix(log.cm)[4] /(as.matrix(log.cm)[2] + as.matrix(log.cm)[4])))/2
f1 <- 2* (precision*recall) / (precision+recall)
auc = auc(processed_test_data_rosepca.corr$default_payment_next_month, predict_reg, quiet=TRUE)
a = cbind(precision,recall,accuracy,f1,avg_accuracy, auc)
#2
#rose_processed_train_data.corr, processed_test_data_rosepca.corr
logistic <- glm(default_payment_next_month ~ .,
data = rose_processed_train_data.corr,
family = "binomial")
#summary
#summary(logistic)
#predict test data based on model
predict_reg <- predict(logistic,
processed_test_data_rosepca.corr,
type = "response")
# (ROC threshold search for this dataset omitted; identical in form to the dataset 4 code below)
prediction_data <- ifelse(predict_reg > 0.5, 1, 0)
log.cm <- table(processed_test_data_rosepca.corr$default_payment_next_month,prediction_data)
log.cm2 <- table(processed_test_data_rosepca.corr$default_payment_next_month,prediction_data)
precision <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[2])
recall <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[3])
accuracy <- (as.matrix(log.cm)[1] + as.matrix(log.cm)[4]) / 7400
avg_accuracy <- (recall+(as.matrix(log.cm)[4] /(as.matrix(log.cm)[2] + as.matrix(log.cm)[4])))/2
f1 <- 2* (precision*recall) / (precision+recall)
auc = auc(processed_test_data_rosepca.corr$default_payment_next_month, predict_reg, quiet=TRUE)
b = cbind(precision,recall,accuracy,f1,avg_accuracy, auc)
#3
#processed_train_data.corr, processed_train_data.corr,-c('LIMIT_BAL','AGE','BILL_AMT1','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6')
logistic <- glm(default_payment_next_month ~ .,
data = select(processed_train_data.corr,-c('LIMIT_BAL','AGE','BILL_AMT1','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6')),
family = "binomial")
#summary
#summary(logistic)
#predict test data based on model
predict_reg <- predict(logistic,
processed_test_data.corr,
type = "response")
# (ROC threshold search for this dataset omitted; identical in form to the dataset 4 code below)
prediction_data <- ifelse(predict_reg > 0.5, 1, 0)
log.cm <- table(processed_test_data.corr$default_payment_next_month,prediction_data)
log.cm3 <- table(processed_test_data.corr$default_payment_next_month,prediction_data)
precision <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[2])
recall <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[3])
accuracy <- (as.matrix(log.cm)[1] + as.matrix(log.cm)[4]) / 7400
avg_accuracy <- (recall+(as.matrix(log.cm)[4] /(as.matrix(log.cm)[2] + as.matrix(log.cm)[4])))/2
f1 <- 2* (precision*recall) / (precision+recall)
auc = auc(processed_test_data.corr$default_payment_next_month, predict_reg, quiet=TRUE)
c = cbind(precision,recall,accuracy,f1,avg_accuracy, auc)
#4
#rose_processed_train_data_pca.corr,-c('LIMIT_BAL','AGE','BILL_AMT1','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6')
#rose_processed_test_data_pca.corr
logistic <- glm(default_payment_next_month ~ .,
data = select(rose_processed_train_data_pca.corr,-c('LIMIT_BAL','AGE','BILL_AMT1','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6')),
family = "binomial")
#summary
#summary(logistic)
#predict test data based on model
predict_reg <- predict(logistic,
rose_processed_test_data_pca.corr,
type = "response")
prediction_data <- ifelse(predict_reg > 0.429 , 1, 0)
prediction(predict_reg,rose_processed_test_data_pca.corr$default_payment_next_month) %>%
performance(measure = "tpr", x.measure = "fpr") -> result
plotdata <- data.frame(x = result@x.values[[1]],
                       y = result@y.values[[1]],
                       p = result@alpha.values[[1]])
p <- ggplot(data = plotdata) +
geom_path(aes(x = x, y = y)) +
xlab(result@x.name) +
ylab(result@y.name) +
theme_bw() +
ggtitle("ROC Curve To Find Optimum Treshold (Dataset 4)")
dist_vec <- plotdata$x^2 + (1 - plotdata$y)^2
opt_pos <- which.min(dist_vec)
p +
geom_point(data = plotdata[opt_pos, ],
aes(x = x, y = y), col = "red") +
annotate("text",
x = plotdata[opt_pos, ]$x + 0.1,
y = plotdata[opt_pos, ]$y,
label = paste("p =", round(plotdata[opt_pos, ]$p, 3)))
log.cm <- table(rose_processed_test_data_pca.corr$default_payment_next_month,prediction_data)
log.cm4 <- table(rose_processed_test_data_pca.corr$default_payment_next_month,prediction_data)
precision <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[2])
recall <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[3])
accuracy <- (as.matrix(log.cm)[1] + as.matrix(log.cm)[4]) / 7400
avg_accuracy <- (recall+(as.matrix(log.cm)[4] /(as.matrix(log.cm)[2] + as.matrix(log.cm)[4])))/2
f1 <- 2* (precision*recall) / (precision+recall)
auc = auc(rose_processed_test_data_pca.corr$default_payment_next_month, predict_reg, quiet=TRUE)
d = cbind(precision,recall,accuracy,f1,avg_accuracy, auc)
final_log<-rbind(a,b,c,d)
rownames(final_log) = c('corr','corr+under/oversampling','corr+PCA','corr+under/oversampling+PCA')
final_log
tm<-cbind(log.cm1,log.cm2,log.cm3,log.cm4)
kable(tm,longtable =T,booktabs =T,caption ="Confusion matrices of each model")%>%
add_header_above(c(" ","corr "=2,"corr+under/oversampling"=2,"corr+PCA "=2,"corr+under/oversampling+PCA"=2))%>%
kable_styling(latex_options =c("repeat_header"))
```
## 2. Support Vector Machine (SVM)
The Support Vector Machine is a supervised machine learning algorithm
for two-group classification problems. SVM is an appropriate model for
this report since we do not need to interpret the model and merely need
the results of its classification. Furthermore, SVM works well with
high-dimensional data and can solve complex problems conveniently
through kernel functions.
The type of $svm$ used is C-classification. For a binary classification
problem such as ours, with the two classes $defaulter$ and
$non\text{-}defaulter$, the $C$ parameter comes into play when the
dataset is not linearly separable: it weights the slack variables
introduced into the linear constraints so that the model remains
feasible, with $C > 0$.
Cost ($C$) is a regularisation parameter: it tells the optimiser what to
prioritise, the margin between the data points and the separating
hyperplane or the penalty for misclassification. A large cost tells the
optimiser to minimise misclassification, since $C$ is the scalar weight
on the misclassification penalty.
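In its standard soft-margin form, the C-classification problem can be
written as
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,$$
where each slack variable $\xi_i$ measures how far observation $i$
violates its margin constraint and $C$ scales the total penalty.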
We iterated over different values of $C$ in the SVM models to find the
value that gives the best classification performance, and then used this
optimal $C$ to train on each respective dataset.
For kernel types, we iterated through the different kernels and found
that the linear kernel gives the best classification performance on this
dataset. We then used the optimal kernel type to train on each
respective dataset; a sketch of this grid search follows.
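This search can be run with e1071's built-in cross-validation; a minimal
sketch (the parameter grid below is illustrative, not the exact values
we iterated over):
```{r , echo=TRUE, eval=FALSE}
# Cross-validated grid search over cost and kernel with e1071::tune();
# the grid here is illustrative.
svm.tuned <- tune(svm, default_payment_next_month ~ .,
                  data = processed_train_data_rosepca.corr,
                  type = "C-classification",
                  ranges = list(cost = c(1, 2^2, 2^4, 2^6),
                                kernel = c("linear", "radial")))
summary(svm.tuned)  # best cost/kernel by cross-validated error
```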
```{r , echo=TRUE, warning=FALSE}
set.seed(1234)
#1 processed_train_data_rosepca.corr, processed_test_data_rosepca.corr
svm.crossmodel <- svm(default_payment_next_month ~ . ,
data=processed_train_data_rosepca.corr,
type="C-classification",
kernel="linear",
cost=1)
#svm.tuned <- tune(svm, default_payment_next_month ~ . , data=processed_train_data_rosepca.corr, type="C-classification", kernel="linear", ranges=list(cost=c(1,2^2,2^4,2^6,2^8,2^10)))
results_test_cross <- predict(svm.crossmodel, processed_test_data_rosepca.corr[,-83])
log.cm <- table(processed_test_data_rosepca.corr$default_payment_next_month,results_test_cross)
log.cm1 <- table(processed_test_data_rosepca.corr$default_payment_next_month,results_test_cross)
precision <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[2])
recall <- as.matrix(log.cm)[1] / (as.matrix(log.cm)[1] + as.matrix(log.cm)[3])
accuracy <- (as.matrix(log.cm)[1] + as.matrix(log.cm)[4]) / 7400
avg_accuracy <- (recall+(as.matrix(log.cm)[4] /(as.matrix(log.cm)[2] + as.matrix(log.cm)[4])))/2
f1 <- 2* (precision*recall) / (precision+recall)
auc <- roc(response = processed_test_data_rosepca.corr$default_payment_next_month, predictor = as.numeric(results_test_cross), quiet=TRUE)
auc = auc$auc
a = cbind(precision,recall,accuracy,f1,avg_accuracy,auc)
#2 rose_processed_train_data.corr processed_test_data_rosepca.corr
svm.crossmodel <- svm(default_payment_next_month ~ . ,
data=rose_processed_train_data.corr,
type="C-classification",
kernel="linear",
cost=1)