-
Notifications
You must be signed in to change notification settings - Fork 2
/
29-adults.Rmd
55 lines (41 loc) · 1.54 KB
/
29-adults.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
# Case study - The adults dataset.
## Introduction
The adult data set is another famous one from the **UCI - machine learning repository**.
The idea is to predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database.
Load the libraries
```{r message=FALSE, warning=FALSE}
library(stringr)
```
## Import the data
```{r message=FALSE, warning=FALSE}
df <- read_csv("dataset/adult.csv")
glimpse(df)
```
## Tidy the data
Let's check the level of missing data
```{r}
map_dbl(df, function(.x) sum(is.na(.x)))
```
No missing data! That's great news.
Before we change the <chr> variables into factors, let's see what type of levels we have.
```{r}
df %>% select_if(is.character) %>% map_if(is.character, unique)
```
Allright, so maybe there were no NA, but there are quite a few "?"
The "?" should probably be replaced with NAs.
```{r message=FALSE, warning=FALSE}
df <- read_csv("dataset/adult.csv", na = c("NA", "?"))
# Let's redo a check on the NA now
map_int(df, function(.x) sum(is.na(.x)))
```
Let's now rework the column names to better fit our naming conventions
```{r}
colnames(df) <- c("age", "working_class", "final_weight", "education", "education_num", "marital_status",
"occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "above_50k")
```
```{r}
df2 <- df %>% mutate_if(is_character, as.factor)
levels(df2$working_class)
summary(df2)
```