-
Notifications
You must be signed in to change notification settings - Fork 0
/
titanic-rmd.Rmd
488 lines (346 loc) · 16.8 KB
/
titanic-rmd.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
---
title: "Titanic Survival Prediction Pt. 1"
author: "Syarifah Rizki"
date: "2/17/2022"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r pressure, echo=FALSE, fig.cap="The RMS Titanic in port during 1912 (Source: Universal History Archive/Getty Images)", out.width = '100%'}
knitr::include_graphics("Titanic/titanic1.jpg")
```
# An overview
The infamous Titanic disaster that occurred in 1912 took the life of at least 1,500 people, both passengers and crews alike. When the ship sank.
The Titanic competition in Kaggle is a good place to start when we first learn about classification in machine learning. The purpose of this project itself is to predict whether a passenger will survived or not, based on the given information or variables. There are three data sets that we will use here, namely the *train*, *test* and *gender submission* data sets. For Titanic Survival Prediction Part 1, we will try to compare model performances from two different algorithms that we can use in classification, that is the logistic regression and k-Nearest Neighbor (kNN).
# Let's get started!
## Import files and packages
```{r, message=FALSE, warning=FALSE}
# Import packages
library(readr)
library(dplyr)
library(tidyverse)
library(tracerer) # to automatically calculate mode
library(caret) # model evaluation
library(psych) # create dummy variables
library(class) # kNN
```
```{r, message=FALSE, warning=FALSE}
# Import files
data_train <- read_csv("Titanic/train.csv")
```
```{r}
head(data_train)
```
Columns description:
* Survived: Survival. 0 = No, 1 = Yes.
* Pclass: Ticket class. A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower
* Sex: Sex
* Age: Age in years. Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
* SibSp: of siblings / spouses aboard the Titanic. The dataset defines family relations in this way:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
* parch: of parents / children aboard the Titanic. The dataset defines family relations in this way:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.
* Ticket: Ticket's number
* Fare: Passenger fare
* Cabin: Cabin's number
* Embarked: Port of embarkation.
- C = Cherbourg
- Q = Queenstown
- S = Southampton
## Data wrangling
```{r}
head(data_train)
```
```{r}
glimpse(data_train)
```
```{r}
# maximum missing values that we tolerated for each columns
nrow(data_train)*0.05
# check missing values from all columns
colSums(is.na(data_train))
```
For data wrangling, we would need to:
1. Change columns to its correct data types
* Columns that need to be converted to factor: **Survived**, **Pclass**, **Sex** and **Embarked**.
2. Drop column `Cabin` and `Name`.
* For `Cabin`, we will drop it since it has a total of missing values that is more than half of the total observations.
* In the case of the `Name` column, to be honest I think it actually stored valuable information that we can use to explore the family relations among the titanic passengers. It is a very useful variable to use when we are exploring the data sets with more advanced algorithm in classification. We could have keep it as it is but since we are using a much simpler model, that is with logistic regression and kNN, we would not use it to create our models.
```{r}
# change columns data types into the correct ones
titanic_train <- data_train %>%
mutate_at(.vars = c("Survived", "Pclass", "Sex", "Embarked"), .funs = as.factor) %>%
select(-c(Cabin, Name, Ticket))
head(titanic_train)
```
## Treatment for missing values
```{r}
# check again for columns that have missing values
colSums(is.na(titanic_train))
```
We have located the columns that have missing values and those columns consisted of **Age** and **Embarked**. Since **Embarked** is a factor, we will replace the two missing values with its mode or the value that appeared most frequently. As for the **Age** column, we can either replace the 177 missing values with the mean or median. If there's no outlier, then we'll replace them with mean but if it does, then it is better to use the median number instead.
```{r}
# check the mode for column `Embarked`
table(titanic_train$Embarked)
```
Turns out, most of the passengers in the Titanic embarked from the Southampton (S) port with 644 passengers in total.
```{r}
# replace missing values in column `Embarked` with the mode (Southampton port)
which(is.na(titanic_train$Embarked)) # locate missing values index in Embarked
titanic_train <- titanic_train %>%
mutate(Embarked = if_else(is.na(Embarked), calc_mode(Embarked), Embarked))
```
```{r}
# check outliers in Age
boxplot(titanic_train$Age)
```
Since we have plenty of outliers in the `Age` column, it is better to replace the missing values with the median.
```{r}
# replace missing values in column `Age` with the median
median(titanic_train$Age, na.rm = T) # check the median age from all passengers without taking into account the NA
titanic_train <- titanic_train %>%
mutate(Age = if_else(is.na(Age), median(Age, na.rm = T), Age))
```
```{r}
# check again whether we still have missing values or not
anyNA(titanic_train)
```
Great! All variables within the **titanic_train** dataframe is now stored in their correct data types and has no more missing values. So far, we have performed data cleansing and we're ready to go and create classification models from it.
```{r}
titanic_train
```
# Machine Learning
As we have mentioned earlier, we will create a prediction of whether a passenger will survived or not using two different classification models; the logistic regression and the k-Nearest Neighbor (kNN).
The first step in creating a prediction with both models is to decide which of the variable that we want to set as the target and the predictor variables. Now, because our aim is to predicted which passengers that will survived or not, we will use the **Survived** column as the **target variable** while the rest of the variables will be used as the predictors.
1. **Target variable**: Column `Survived`. We will also take "1" (Survived) as the positive class for our models.
2. **Predictor variable(s)**: All variables beside column `Survived`
Check whether the target variables have balanced or imbalanced class.
```{r}
prop.table(table(titanic_train$Survived))
```
The proportion for our target class seems to be quite balanced and at this stage, we won't try to downsampling / upsampling the target variable just yet.
## Logistic Regression
Since we were already provided with the separate .csv files for the training and test data set, we don't have to go through the cross-validation step anymore and we could import the **test.csv** and the **gender_submission.csv** files to our environment.
```{r, message=FALSE, warning=FALSE}
# importing test.csv and the gender_submission.csv
data_test <- read_csv("Titanic/test.csv")
gender_sub <- read_csv("Titanic/gender_submission.csv")
#left join column 'Survived' in gender_sub to data_test
data_test_merge <- merge(data_test, gender_sub, by = "PassengerId")
```
```{r}
glimpse(data_test_merge)
```
```{r}
colSums(is.na(data_test_merge))
```
```{r}
# change columns data types into the correct ones
titanic_test <- data_test_merge %>%
mutate_at(.vars = c("Survived", "Pclass", "Sex", "Embarked"), .funs = as.factor) %>%
select(-c(Cabin, Name, Ticket))
head(titanic_test)
```
For the titanic_test data set, the missing values can be located in columns **Age** and **Fare**. Since column **Fare** data type is a numeric, we'll decide whether to replace the missing values with its mean or median.
```{r}
# Check for outliers in columns 'Age' and 'Fare'
boxplot(titanic_test$Age)
boxplot(titanic_test$Fare)
```
Both columns have outliers.
```{r}
# Check the range of fare
range(titanic_test$Fare, na.rm = T)
# Check the range of passenger's age
range(titanic_test$Age, na.rm = T)
```
```{r}
titanic_test %>%
filter(Fare == 0)
```
```{r}
# Treatment for missing values in column `Age` and `Fare`
titanic_test <- titanic_test %>%
mutate(Age = if_else(is.na(Age), median(Age, na.rm = T), Age),
Fare = if_else(is.na(Fare), median(Fare, na.rm = T), Fare))
```
```{r}
anyNA(titanic_test)
```
### Create model
```{r}
titanic_train
```
```{r}
prop.table(table(titanic_train$Survived))
```
```{r}
model_lm <- glm(Survived~., data = titanic_train, family = "binomial")
summary(model_lm)
```
Several things to concluded from the summary above:
1. Column PClass, Sex, Age, SibSp and Embarked are the columns that significantly affected our target variable with p-value that has a score of less than 0.05.
3. AIC or the information loss for model_lm is 806.97 while it has a residual deviance of 784.97.
The Estimate number display the log of odds for each columns. If we want to interpreted it, we can use the *exp()* function to converted them to their odds.
```{r}
# log of odds -> odds (for each predictor variables)
data.frame(odds = exp(model_lm$coefficients))
```
### Feature selection
```{r}
# feature selection
model_all <- step(model_lm, direction = "backward", trace = F)
summary(model_all)
```
```{r}
# model evaluation with aic
model_lm$aic
model_all$aic
```
Based on the AIC, model_all has a slightly lower number compared to model_lm. Therefore, we will use model_all instead of model_sm since lower AIC number means that our model has a better performance.
### Predict
Add new variable to titanic_test that stored the probability of our prediction.
```{r}
titanic_test$Prediction.Prob <- predict(object = model_all, newdata = titanic_test, type = "response")
```
Create another variable that contains the label of survival prediction for each passengers. If a passenger has a probability of more than 0.5, we would labeled them with "1" (positive class/survived). But, if they scored less than 0.5, they would be put in the negative class ("0"/Not survived).
```{r}
titanic_test$Prediction.Label <- ifelse(titanic_test$Prediction.Prob > 0.5, "1","0")
head(titanic_test)
```
```{r}
# change Prediction.Label variable to factor
titanic_test <- titanic_test %>%
mutate(Prediction.Label = as.factor(Prediction.Label))
```
### Model evaluation
After we have done with making our prediction, we are going to evaluate our model by using the confusion matrix from package *caret*.
```{r}
# from package `caret`
lm_model_eval <- confusionMatrix(data = titanic_test$Prediction.Label,
reference = titanic_test$Survived,
positive = "1")
lm_model_eval
```
From the summary above, we can see that our model is able to correctly predict both its positive and negative classes. But unfortunately, the number of the correct prediction from the true positive class (144) is way lower than the negative class (246). It means our model is better at predicting from one class only, which is the negative class. Ideally, we want our model to learn equally well from both classes and not limited to one class only.
Although the target class is considered to be imbalanced, it turns out that our model has a quite high score both for its recall (Sensitivity) and precision (Pos Pred Value). Since our target class isn't that balanced, we wouldn't use Accuracy as the metrics for our model evaluation. Instead, we will focus on the other two metrics, namely recall and precision.
## k-Nearest Neighbor (kNN)
```{r}
glimpse(titanic_train)
```
```{r}
titanic.train <- titanic_train %>%
mutate(Sex = if_else(Sex == "male", 0, 1)) %>%
select(-Embarked)
head(titanic.train)
```
```{r}
prop.table(table(titanic.train$Survived))
```
Before we proceed with creating a kNN model, we first need to check the range of each variables. If the predictor variables have a huge gap in terms of range, then we would have to perform a normalization or scaling. This is done particularly because kNN
```{r}
summary(titanic.train)
```
Summary from titanic.train:
1. The passengers in the titanic.train data set only consisted of those from the 1st (upper) and 2nd (middle) class.
2. The age of the titanic.train passengers range from 0.42 - 80 years old with a median of 28 years old.
3. Judging from the **SibSp** column, 50% of the passengers were travelling alone without being accompanied by a family member.
```{r}
# data wrangling
# replace value "male" with 0 and "female" with 1 in column Sex
# remove other categorical data and predictions result from previous titanic_test dataset that we used for logistic regression
titanic.test <- titanic_test %>%
mutate(Sex = if_else(Sex == "male", 0, 1)) %>%
select(-c(Embarked, Prediction.Prob, Prediction.Label))
head(titanic.test)
```
Create dummy variables that have more than two levels.
```{r}
# from package `psych`
# create dummy variable from column `Pclass` in titanic.train
Pclass_train <- as.data.frame(dummy.code(titanic.train$Pclass))
# create dummy variable from column `Pclass` in titanic.test
Pclass_test <- as.data.frame(dummy.code(titanic.test$Pclass))
```
Tutorials on handling factor variables that have more than two levels: https://quantdev.ssri.psu.edu/sites/qdev/files/kNN_tutorial.html
```{r, message=FALSE, warning=FALSE}
# rename columns from Pclass_train
Pclass_train <- Pclass_train %>%
rename("Pclass.Up" = "1",
"Pclass.Mid" = "2",
"Pclass.Low" = "3")
# rename columns from Pclass_test
Pclass_test <- Pclass_test %>%
rename("Pclass.Up" = "1",
"Pclass.Mid" = "2",
"Pclass.Low" = "3")
head(Pclass_train)
```
```{r}
# combined dummy variables to the titanic.train and titanic.test dataset
train_data <- cbind(titanic.train, Pclass_train)
test_data <- cbind(titanic.test, Pclass_test)
# drop the original Pclass column
train_data1 <- train_data[, -3]
test_data1 <- test_data[, -2]
head(train_data1)
```
```{r}
# filter out predictor variables
titanic_train_x <- train_data1 %>% select_if(is.numeric)
titanic_test_x <- test_data1 %>% select_if(is.numeric)
# filter out target variables
titanic_train_y <- train_data1[,"Survived"]
titanic_test_y <- test_data1[,"Survived"]
```
```{r}
head(titanic_train_x)
head(titanic_train_y)
```
```{r}
# scaling
titanic_train_xs <- scale(titanic_train_x)
titanic_test_xs <- scale(titanic_test_x,
center = attr(titanic_train_xs,"scaled:center"), # center = mean
scale = attr(titanic_train_xs, "scaled:scale")) # scale = standard deviation
```
```{r}
head(titanic_train_xs)
head(titanic_test_xs)
```
### Predict
```{r}
# find optimum k
sqrt(nrow(titanic_train_xs))
```
When we have an even number for target class, we should use an odd number for its optimum k and vice versa.
* Total of target class: 2 *(even number)*
* K: 29 *(odd number)*
```{r}
titanic_pred <- knn(train = titanic_train_xs,
test = titanic_test_xs,
cl = titanic_train_y,
k = 29)
```
```{r}
titanic_pred
```
```{r}
knn_model_eval <- confusionMatrix(data = as.factor(titanic_pred), reference = titanic_test_y, positive = "1")
knn_model_eval
```
We'll measure how well our kNN model was able to learn and correctly predict the target variables by using the confusion matrix model evaluation, just like when we did for logistic regression. Based on the three above metrics, kNN has a quite good performance with a recall of 79.61%, a precision of 85.21% and 87.56$ of accuracy. Now we'll compare the performance from both models between the logistic regression and the kNN.
```{r}
# comparing the model performances between knn and logistic regression
# model evaluation when we use logistic regression
lm_model_eval
```
# Conclusion
We have gone through the process of creating a model with logistic regression and k-Nearest Neighbor (kNN) to find out which of the two models that can better predict whether a Titanic passenger will survived or not. As it turns out, the model that was created with logistic regression has a better result in predicting the survival rate of a passenger in comparison to the one that we made with kNN. Since the target variables and the TN (true negative) / TF (true false) class is imbalanced, we chose to use both recall and precision as our model evaluation metrics. The model with logistic regression scored higher than the kNN in both metrics. It has a recall number of 94.74% and 87.8% for precision while kNN has 79.61% for its recall and 85.21% for precision.