02_PracticalDataConcerns_challenge.R
# Practical concerns: Review and prepare data for model building
# Goal: Identify segments of high and/or upward trending coronary heart disease prevalence
# Load packages ####
library(dplyr)
# Import data ####
# 5 years of seriatim data on heart attack rates by state, year, sex, age, and race
# Data represent a derivative of this dataset: https://www.kaggle.com/mazharkarimi/heart-disease-and-stroke-prevention/metadata
# Original dataset and its derivatives are protected by a database license and a content license, included in this repository's documents
heartattack <- readRDS("handson_challenges/heartdiseasedataset_modified.RDS")
# Look at the data ####
head(heartattack)
str(heartattack)
# Challenge ####
## 1) Deal with outliers and/or missing values
# a) Find the number of missing values in each field
# YOUR CODE HERE ####
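# A minimal sketch for 1a, run on a small made-up data frame so it works on its own.
# The column names (state, age_group, heartattack_flag) are hypothetical;
# substitute the actual fields reported by str(heartattack).
toy_na <- data.frame(
  state = c("OH", "TX", NA, "CA"),
  age_group = c("45-64", NA, "65+", NA),
  heartattack_flag = c(0, 1, 0, 1)
)
# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy_na))
na_counts
# On the real data the same idea would be: colSums(is.na(heartattack))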
# b) Are the missing values correlated with any other field? Consider using the table() function for cross-tabulating two categorical variables.
# YOUR CODE HERE ####
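# One possible approach to 1b: cross-tabulate a missingness indicator against
# another categorical field. The toy columns (sex, race) are assumptions made
# for illustration, not the confirmed field names of the challenge dataset.
toy_xtab <- data.frame(
  sex = c("F", "F", "M", "M", "M", "F"),
  race = c("A", NA, "B", NA, NA, "A")
)
# Rows: whether race is missing; columns: sex
miss_by_sex <- table(race_missing = is.na(toy_xtab$race), sex = toy_xtab$sex)
miss_by_sex
# Column proportions make uneven missingness across groups easier to spot
prop.table(miss_by_sex, margin = 2)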
# c) Impute the missing values.
# YOUR CODE HERE ####
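# A simple imputation sketch for 1c, assuming median imputation for a numeric
# field and mode imputation for a categorical one. This is only one option; if
# missingness turns out to be informative, a separate "Unknown" level may be a
# better choice. The toy columns (age, race) are hypothetical.
toy_imp <- data.frame(
  age = c(50, NA, 70, 60),
  race = c("A", NA, "B", "A"),
  stringsAsFactors = FALSE
)
# Numeric field: replace NA with the median of the observed values
toy_imp$age[is.na(toy_imp$age)] <- median(toy_imp$age, na.rm = TRUE)
# Categorical field: replace NA with the most frequent observed value
# (table() drops NA by default)
race_mode <- names(which.max(table(toy_imp$race)))
toy_imp$race[is.na(toy_imp$race)] <- race_mode
toy_imp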
# 2) Derive a new variable for "geographic region" of the USA to reduce the dimensionality of that field.
# YOUR CODE HERE ####
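# A sketch of one way to derive a region field, using the state.abb and
# state.region vectors that ship with base R's datasets package. It assumes the
# state field holds two-letter abbreviations (use state.name instead if it holds
# full names); the toy column name "state" is hypothetical. Values that are not
# among the 50 states, such as "DC", map to NA and would need separate handling.
toy_reg <- data.frame(state = c("OH", "TX", "CA", "NY"), stringsAsFactors = FALSE)
# Named lookup vector: abbreviation -> region label
region_lookup <- setNames(as.character(state.region), state.abb)
toy_reg$region <- unname(region_lookup[toy_reg$state])
toy_reg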
# 3) Partition off a 30% holdout subset
# YOUR CODE HERE ####
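# A minimal sketch of a random 70/30 split using base R's sample(). The toy data
# frame and the seed value are arbitrary choices for illustration.
set.seed(42)  # fix the seed so the partition is reproducible
toy_split <- data.frame(id = 1:10, y = rep(c(0, 1), 5))
# Draw 30% of the row indices without replacement for the holdout set
holdout_idx <- sample(nrow(toy_split), size = round(0.3 * nrow(toy_split)))
holdout <- toy_split[holdout_idx, ]
train <- toy_split[-holdout_idx, ]
nrow(train)
nrow(holdout)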
# 4) Fit a logistic regression to estimate heart attack odds given region, year, age, sex, and race
# YOUR CODE HERE ####
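# A sketch of a logistic regression fit with glm(), run on simulated data so it
# is self-contained. The response name (event) and the predictors shown are
# assumptions; the real model would also include year and race and would
# typically be fitted on the training partition only.
set.seed(1)
n_toy <- 200
toy_fit <- data.frame(
  region = factor(sample(c("Northeast", "South", "West"), n_toy, replace = TRUE)),
  age = sample(40:80, n_toy, replace = TRUE),
  sex = factor(sample(c("F", "M"), n_toy, replace = TRUE))
)
# Simulate a binary outcome whose log-odds rise with age
toy_fit$event <- rbinom(n_toy, 1, plogis(-4 + 0.05 * toy_fit$age))
# family = binomial gives the logit link by default
logit_fit <- glm(event ~ region + age + sex, data = toy_fit, family = binomial)
summary(logit_fit)
# Exponentiated coefficients can be read as odds ratios
exp(coef(logit_fit))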
# 5) Use the vif() function from the "car" package to check for multicollinearity in the model fitted in step 4
# YOUR CODE HERE ####
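# A sketch of a multicollinearity check with car::vif(), assuming the "car"
# package is installed (the call is skipped otherwise). The simulated data and
# the names used here are hypothetical. When a term has more than one degree of
# freedom, as a multi-level factor does, vif() reports generalized VIFs in a
# matrix. Cut-offs around 5 or 10 are often quoted as rules of thumb rather than
# strict tests.
set.seed(2)
n_vif <- 200
toy_vif <- data.frame(
  region = factor(sample(c("Northeast", "South", "West"), n_vif, replace = TRUE)),
  age = sample(40:80, n_vif, replace = TRUE),
  sex = factor(sample(c("F", "M"), n_vif, replace = TRUE))
)
toy_vif$event <- rbinom(n_vif, 1, plogis(-4 + 0.05 * toy_vif$age))
vif_fit <- glm(event ~ region + age + sex, data = toy_vif, family = binomial)
vif_values <- NULL
if (requireNamespace("car", quietly = TRUE)) {
  vif_values <- car::vif(vif_fit)  # one row per model term
  print(vif_values)
}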