ch18.Rmd

---
title: "資料分析實戰"
author: "郭耀仁"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```

## 簡介

在這份報告裡面我下載 [Kaggle](https://www.kaggle.com/) 的 Titanic 訓練與測試進行資料整理，探索性分析與機器學習的分類模型。

## 資料讀取

我使用 `read.csv()` 函數讀入資料集。

```{r}
titanic <- read.csv("https://storage.googleapis.com/r_rookies/kaggle_titanic_train.csv")
```

## 資料框外觀

我使用 `str()` 函數得知這個資料有 891 個觀測值與 12 個變數。

```{r}
str(titanic)
```

## 描述性統計與資料清理

我使用 `summary()` 函數進行描述性統計。

```{r}
summary(titanic)
```

我發現這個資料的 `Age` 變數有 177 個遺漏值，我決定只留下完整的觀測值訓練。而 `Embarked` 有兩個空值，我決定以 S 填補。

```{r}
titanic <- titanic[complete.cases(titanic), ]
titanic$Survived <- factor(titanic$Survived)
titanic$Embarked <- as.character(titanic$Embarked)
titanic$Embarked[titanic$Embarked == ""] <- "S"
titanic$Embarked <- factor(titanic$Embarked)
```

## 探索性分析

我利用 `ggplot2` 與 `plotly` 套件來作圖。

```{r message = FALSE}
library(ggplot2)
library(plotly)
```

```{r}
# 性別
ggplot_bar_sex <- ggplot(titanic, aes(x = Sex, y = Survived, fill = Sex)) + geom_bar(stat = "identity")
ggplot_bar_sex_plotly <- ggplotly(ggplot_bar_sex)
ggplot_bar_sex_plotly

# Pclass
ggplot_bar_pclass <- ggplot(titanic, aes(x = factor(Pclass), y = Survived, fill = factor(Pclass))) + geom_bar(stat = "identity", width = .7)
ggplot_bar_pclass_plotly <- ggplotly(ggplot_bar_pclass)
ggplot_bar_pclass_plotly
```

## 建立一個分類模型

我利用 `randomForest()` 函數建立一個隨機森林分類模型來預測 `Survived` 變數。

```{r}
# 切分訓練與測試資料
set.seed(87)
n <- nrow(titanic)
shuffled_titanic <- titanic[sample(n), ]
train_indices <- 1:round(0.7 * n)
train <- shuffled_titanic[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test <- shuffled_titanic[test_indices, ]

# 建立分類器
library(randomForest)
rf_clf <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, ntree = 100)

# 計算 accuracy
prediction <- predict(rf_clf, test[, c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")])
confusion_matrix <- table(test$Survived, prediction)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
```

## 探索沒有答案的上傳資料

- `Age` 有 86 個遺漏值
- `Fare` 有 1 個遺漏值
- 上傳資料不能刪除觀測值

```{r}
url <- "https://storage.googleapis.com/py_ds_basic/kaggle_titanic_test.csv"
to_predict <- read.csv(url)
summary(to_predict)
```

## 填補遺漏值

- `Fare` 用平均值填滿。
- `Age` 依照 `Pclass` 的平均年齡填滿

```{r}
library(dplyr)
library(magrittr)

# Fare
fare_mean <- mean(to_predict$Fare, na.rm = TRUE)
to_predict$Fare[is.na(to_predict$Fare)] <- fare_mean

# Age
mean_age_by_Pclass <- to_predict %>%
  group_by(Pclass) %>%
  summarise(mean_age = round(mean(Age, na.rm = TRUE)))
filter_1 <- is.na(to_predict$Age) & to_predict$Pclass == 1
filter_2 <- is.na(to_predict$Age) & to_predict$Pclass == 2
filter_3 <- is.na(to_predict$Age) & to_predict$Pclass == 3
mean_age_by_Pclass
to_predict[filter_1, ]$Age <- 41
to_predict[filter_2, ]$Age <- 29
to_predict[filter_3, ]$Age <- 24

# Summary after imputation
summary(to_predict)
```

## 準備上傳

```{r}
predicted <- predict(rf_clf, newdata = to_predict[, c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")])
to_submit <- data.frame(to_predict[, "PassengerId"], predicted)
names(to_submit) <- c("PassengerId", "Survived")
head(to_submit, n = 10)
write.csv(to_submit, file = "to_submit.csv", row.names = FALSE)
```

![Kaggle Submission](kaggle_submission.png)