-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathoutline.qmd
251 lines (176 loc) · 9.68 KB
/
outline.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: Outline for modules
date-modified: 'today'
date-format: long
format:
html:
footer: "CC BY 4.0 John R Little"
license: CC BY
---
## Introduction to R - Refactoring the Workshop
Refactor existing workshop for smaller video segments and flipped presentation
1. Download software & run locally
- [R](https://cran.r-project.org/) & [RStudio](https://rstudio.com/products/rstudio/download/) are not the same thing
- [RStudio keyboard Shortcuts](https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts)
- Cloud Options
- https://rstudio.cloud
- https://cmgr.oit.duke.edu/containers
- R v Python
- R is a *data first* (i.e. analysis) programming language
- Python is a general programming language
- Are there libraries/packages relevant to your work?
- Is there a supportive community?
- [Rfun](https://rfun.library.duke.edu/)
- [R Community](https://community.rstudio.com/)
- [R Ladies](https://rladies.org/) \| Locally, [R Ladies, RTP](https://www.meetup.com/rladies-rtp/)
- [TidyTuesday](https://github.com/rfordatascience/tidytuesday) weekly practice sessions (+ David Robinson's [recorded live feed of TT on YouTube](https://www.youtube.com/watch?v=NY0-IFet5AM&list=PL19ev-r1GBwkuyiwnxoHTRC8TTqP8OEi8) )
- ML leans towards Python, [or does it](https://www.tidymodels.org/learn/)?
- Programmer/Coder v Analyst
- [how to choose a programming language](https://medium.com/better-programming/how-to-choose-a-programming-language-for-a-project-7c7a3e5a4de6)
- [how to choose a religion](https://www.wikihow.com/Find-the-Right-Religion-for-You)
2. An RStudio project & reproducibility
- **You are your most frequent collaborator** separated by time
- **A simple test**: Identify specific computational steps from a six-month old project?
- **A simple goal**: Reproduce your computation on a different computer
- Initial **Reproducibility in a nutshell**
- Do everything with a script
- Avoid point & click
- Use relative paths
- Write your code to run on any similar environment
- Read more: [Initial steps toward reproducible research.](https://kbroman.org/steps2rr/) Karl Broman
- **RStudio Projects enable *Reproducibility***
1. **Relative files paths**
- `read_csv("data_raw/raw_data.csv")`
- ProTip: `..` to move up one level in the directory structure
- Avoid absolute paths
- avoid `setwd()`
- e.g. `setwd("d:/rfiles/myrproject")`
2. ***Restart R and run all chunks***
- avoid: `rm(list=ls())`
3. R Markdown & **literate coding**
- A script integrates code and natural language
- Explain and describe your analysis within your workflow
- Render reports in multiple formats
- Notebooks, slide decks, web pages, dashboards, e-books, journal articles
4. **File structure** matters
- [*Practice of Reproducible Research*](http://www.practicereproducibleresearch.org/) by Kitzes, Turek, Deniz
<!-- -->
EXAMPLE File Structure...
project_name (folder)
|-- project_name.Rproj
|-- README.md
|-- license.txt
| data_raw
| |-- raw_data.csv
| |-- README.txt
| data_clean
| code_source
| |--data_cleaning.Rmd
| |--analysis.Rmd
| images
| reports_results
3. Get Data & Code Repository
- Access your own data file (e.g. CSV)
- [Download](https://github.com/libjohn/intro2r-code) & Expand a GitHub repository
- Click on \*.Rproj
4. Tour of the RStudio environment
- Create a blank project
- Console \| Files / Packages / Help \| Environment \| Script Editor
- R Markdown
- Switch to your other project (from Section 2)
- [Keyboard Shortcuts](https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts)
5. Tidyverse & other library packages
- Packages extend the functionality of base R into your domain
| Practice | Frequency | Command |
|----------|-----------|---------------------------------|
| Install | once | `install.packages("tidyverse")` |
| Load | each time | `library(tidyverse)` |
- [**Tidyverse**](https://tidyverse.org):
1. an opinionated collection of packages with consistent web-based documentation and a supportive community
2. a Meta-package that loads 8 helper packages and installs many consistent utilities
| Name | Purpose |
|---------|-------------------------------------------|
| readr | importing CSV data |
| dplyr | transforming data |
| ggplot2 | visualizing |
| tibble | rectangular grid / data frame |
| tidyr | pivot |
| forcats | categorical data / factors |
| stringr | string data / manipulate natural language |
| purrr | iteration |
- **Other package repositories**
- [MetaCRAN](https://www.r-pkg.org/)
- [CRAN](https://cran.r-project.org/web/packages/)
- [Tidyverse Packages](https://www.tidyverse.org/packages/)
- [BioConductor](https://www.bioconductor.org/packages/release/BiocViews.html#___Software)
- [ROpenSci](https://ropensci.org/)
6. Assignment `<-` & pipe `%>%` and `|>`
7. R, RStudio IDE
- Base R, in the console
- A big calculator
- RStudio & Tidyverse
8. [Quarto](https://quarto.org/docs/get-started/) - coding notebooks and publishing outputs
9. [`dplyr`](https://dplyr.tidyverse.org/) package
- "A grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges"
- `mutate()` adds new variables
- `select()` pick variables / columns
- `filter()` subset data by row
- `summarise()` reduces multiple values into a summary
- `count()` a special case of `summarize()` to tally occurrences
- `arrange()` sort rows
- [RStudio Keyboard Shortcuts](https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts)
10. [`tidyr`](https://tidyr.tidyverse.org/) package
- Make messy data into **tidy data**
- Every variable is a column
- Every row is an observation
- Every cell is a single value
- i.e. **pivots**
11. dplyr revisited
- People who like `pivot_longer()` also like `dplyr::left_joint()`
12. Exploratory Data Analysis (EDA): `ggplot2()` & `skimr`
- `skimr::skim()` from library[(skimr)](https://docs.ropensci.org/skimr/)
- ggplot2(): a brief overview of visualization
13. [`ggplot2()`](https://ggplot2.tidyverse.org/): an introduction to the grammar of graphics, & interactive plots via `plotly`
------------------------------------------------------------------------
## Future modules
20. More on Quarto
21. Interactivity with Quarto and Observable JS
22. Large Data
23. Version Control: git and GitHub
## Quick Start - Demonstration
1. Make a data folder
2. Drag fav.csv into the data folder
3. Make existing folder and RStudio project
4. Open an R Markdown Notebook
5. `library(tidyverses)` plus other libraries
6. IMPORT data
- See Also [*RStudio data import wizard*](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio)
- ATTACH data
7. EDA: Visualize `ggplot(data = starwars, aes(hair_color)) + geom_bar()`
8. EDA: `skimr::skim(starwars)`
9. EDA: summary(fav_rating)
10. `left_join(starwars, fivethirtyeight)`
11. Transform data: five dplyr verbs ...
- `filter`, `select`, `arrange`, `mutate`
- `count` / `group_by` & `summarize`
12. Interactive visualization: `ggplotly`
13. linear regression / models (quick syntax introduction)
14. Reports: notebooks, slides, dashboards, word document, PDF, book, etc.
## Resources
- [RStudio Primers](https://rstudio.cloud/learn/primers)
- [DataQuest.io](https://www.dataquest.io/data-science-courses-directory/)
- [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday) practice
- [Rfun](https://rfun.library.duke.edu/) - **R we having fun yet‽**
- [ReMaster the Tidyverse](https://github.com/rstudio-education/remaster-the-tidyverse)
- Update of [Master the Tidyverse](https://github.com/rstudio/master-the-tidyverse)
- [*R for Data Science*](https://r4ds.had.co.nz/) by Wickham and Grolemund
- [Introduction to Data Science](https://introds.org/) by Çetinkaya-Rundel
- [*Hands-On programming*](https://rstudio-education.github.io/hopr/) by Grolemund
- [*ggplot2: Elegant Graphics for Data Analysis*](https://ggplot2-book.org/) by Wickham
- [*Interactive web-based data visualization with R, plotly, and shiny*](https://plotly-r.com/) by Sievert
- [*Data Visualization A practical introduction*](https://socviz.co/makeplot.html#makeplot) by Healy
- [*Text Mining with R*](https://www.tidytextmining.com/) by Silge & Robinson
- [Shiny](https://shiny.rstudio.com/)
- [*Statistical Inference via Data Science: A ModernDive into R and the Tidyverse*](https://moderndive.com/) by Ismay and Kim
- [Tidymodels](https://www.tidymodels.org/) *packages for modeling and machine learning*
- [*Tidy Models with R*](https://www.tmwr.org/) by Kuhn and Silge