-
Notifications
You must be signed in to change notification settings - Fork 4
/
0_MLPS_R_intro.Rmd
166 lines (120 loc) · 8.62 KB
/
0_MLPS_R_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "0_MLPS_R_introduction"
author: "Zhe Zhang (TA - Heinz CMU PhD)"
date: "1/19/2017"
output:
html_document:
css: ~/Dropbox/avenir-white.css
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## Recommended R Software
I recommend the use of RStudio program for most of your R coding. Download at <https://www.rstudio.com/>, which is free for personal use. All packages to use in R will be discussed below or in later cheatsheets.
RStudio is great because it combines several important pieces of your coding environment in one place.
* Suggestion: I prefer to have my coding on one half of the screen and the "Console" output -- where I can debug and test code -- on the other half of the screen. Plots, images, and help files then come up occasionally. This isn't the default. To adjust this, go to RStudio's `Preferences > Pane` Layout.
This document is being made in an RMarkdown journal format, which allows easy integration of (1) R code, (2) images, and (3) free text. I *highly suggest* if you're using R for assignments to use RMarkdown journals, which make it easier for both you and the graders.
* Learn more at <http://rmarkdown.rstudio.com/lesson-1.html>. Please take a few moments on this helpful tutorial so that you can write your projects and HW effectively.
* A useful reference sheet: <https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf>.
## R Basics
This is a brief overview and set of brief tutorials for people who may be coming from another programming language or new to R in general.
I personally am a fan of the work done by the RStudio development team, including Hadley Wickham. A great reference for beginning with R for Machine Learning for Problem Solving is Hadley's book, *R for Data Science*, co-authored with Garrett Grolemund. It is available in a easy-to-read online format at: <http://r4ds.had.co.nz/introduction.html>. I will draw on this and copy some particular examples here and suggest you consult this as a key reference.
The R4DS book uses the `tidyverse` packages. These are popular packages for their intuitive and consistent grammar (and are quite fast for most uses too). These are the packages I primarily use. A few key packages from the `tidyverse` are:
* `ggplot2` package (visualization/visualisation)
+ visualization is a key for any machine learning work. It helps understand the data so that we can get a better sense of how to approach the problem and potential issues to address in the data.
+ visualization also is very important for displaying information to others and for your own understanding in the future.
+ *discussed further in the cheatsheet for Lecture 2*
* `dplyr` package (dataset manipulation and transformation)
+ in this class, we will mostly use provided rectangular datasets. Each time, you will want to perform several adjustments on these datasets. `dplyr` makes these adjustments with a consistent and simple syntax. Below are the key actions:
+ `summarise` (getting summary statistics)
+ `group_by` (aggregating/summarizing the data differently for particular subgroups of the data)
+ `mutate` (creating new columns of data)
+ `select` (removing or keeping particular columns in the dataset)
+ `arrange` (sorting the rows of the dataset, or sorting by each subgroup)
+ `filter` (keeping or removing certain rows of the data)
+ *many other commands that are occasionally useful can be found at the manual*:
<https://cran.r-project.org/web/packages/dplyr/dplyr.pdf>
<https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html>
* `tidyr` package (cleaning and re-arranging the dataset) (more on this later)
* `stringr` package (a clean syntax to deal with strings and text data)
## Key R Commands to Start With
To start, I link to specific chapters in the R4DS book. Here's my recommended reading order and a few snippets of code:
**Overview of R ML programming** <http://r4ds.had.co.nz/explore-intro.html>
See the below image from the R4DS book. For this class, it's important that you feel comfortable with each of the steps below. The book covers these topics in more depth.
![](http://r4ds.had.co.nz/diagrams/data-science-explore.png)
```{r, eval=F}
## Importing Data
df_csv <- read.csv(file = "some_csv_file_in_same_folder.csv", stringsAsFactors = FALSE)
# this will import string columns as categorical factor variables,
# which is useful to aware of and may want to be avoided
df_csv <- read.csv(file = "some_csv_file_in_same_folder.csv", stringsAsFactors = TRUE)
# using the tidyverse readr package (stringAsFactors = FALSE default)
library(tidyverse)
df_csv <- read_csv(file = "some_csv_file_in_same_folder.csv")
df_text <- read_table(file = "some_tab_delimited_file.txt")
# very helpful to read the documentation
?read_csv
?read_table
## Investigating Datasets
head(mtcars)
summary(mtcars)
```
*RStudio tips*: use tab completion to help with typos and speed; use tabs inside functions to help remember what inputs they need; use tab to cycle through similarly named functions to see what's available
**Dealing with objects basics** <http://r4ds.had.co.nz/workflow-basics.html>
```{r}
## Creating Your Own Objects
my_value <- 10
my_vector <- c(1, 2, 3, 4, 5)
my_string <- "Hello, World!"
# all three of the below are identical
# each calls the seq() function
my_sequence <- seq(from = 1, to = 10)
my_sequence <- seq(1, 10)
my_sequence <- seq(10)
# we can call the seq() function with a different options
seq(1, 10, by = 2)
seq(1, 10, length.out = 5)
# almost all outputs and objects (including regressions, models, etc)
# can be saved as objects for future use in R
```
For *naming*, I recommend the use of underscores between words for variables. Some people prefer `camelCaseForVariables` too, where all subsequent words are capitalized.
**Visualization basics (3.1 - 3.6)** <http://r4ds.had.co.nz/data-visualisation.html>
This is quite important to feel comfortable with because it is the important first step of MLPS and helps you feel connected with the data. I emphasize the practice here. We cover this in more detail in the `2_MPLS_R_visualization` cheatsheet.
**Data transformation (all sections)** <http://r4ds.had.co.nz/transform.html>
While this will be covered in more in-depth a future cheatsheet (`3_MLPS_R_data_preparation`), it's important to get a feel for these important functions early on. I recommend you start with the simple examples here:
<https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html>
As you begin to play around with more dataset manipulation, perhaps for you project, I highly encourage a closer look at the many useful functions of `dplyr` and `tidyr` packages. `dplyr` has lots of options for manipulating the data and `tidyr` is really helpful in (1) changing the shape of your dataset and (2) filling in missing holes and generally, dealing with "messy" datasets.
**Programming R scripts** <http://r4ds.had.co.nz/workflow-scripts.html>
Lastly and importantly, it's important to treat R as a programming language. Get comfortable writing your own functions, loops, and scripts so that you can repeat your commands and tweak them as you go. Take a look at the following brief sections:
* <http://r4ds.had.co.nz/program-intro.html>
* <http://r4ds.had.co.nz/pipes.html>
* <http://r4ds.had.co.nz/functions.html>
I particularly recommend looking pipes as you may not have seen it before and works very well with the logical verbs of the `dplyr` package.
```{r}
## Some programming examples assuming knowledge of dplyr
library(dplyr)
library(nycflights13)
# notice how piping one functions result to the next makes
# the code read very cleanly and logically
# I recommend trying this to get coherant chunks of code
# in one place. It's possible to over-do it on piping though
united_EWR_flights <- flights %>%
filter(origin == "EWR", carrier == "UA") %>%
group_by(dest, month) %>%
summarise(num_flights = n(),
avg_distance = mean(distance)) %>%
arrange(desc(num_flights))
united_EWR_flights %>% head(n = 20)
```
```{r}
# a basic (not useful) for loop
for (i in 1:nrow(mtcars)) {
if (i %% 5 == 0) {
print(i)
}
}
```
**Lists** <http://r4ds.had.co.nz/lists.html>
Lastly, sometimes you may come across a function that outputs lists. Or, you may want to store data or model objects in something other than a dataframe. Lists can also serve a purpose of a dictionary, which are objects seen in other languages.
To do so, use **lists**. Lists are quite flexible in R, each list item can be named and can be any object in R. Read more about lists if they come up and you may find that lists can be quite helpful.