---
title: "CABLAB R Workshop Series -- Introduction to R"
date: "11/21/2023"
author:
  - name: "**Content creator: Steven Martinez**"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    df_print: paged
    css: !expr here::here("/Users/tuh20985/Desktop/CABLAB R Workshops/misc/style_bootcamp_sm.css")
knit: (function(inputFile, encoding) {
    out_dir <- './';
    rmarkdown::render(inputFile,
                      encoding=encoding,
                      output_file=file.path(dirname(inputFile), out_dir, 'index.html')) })
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error = FALSE)
```
```{r establish root directory, include=FALSE}
require(knitr)
opts_knit$set(root.dir = '/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/')
```
# Description
This workshop will provide an introduction to R!
R is a popular programming language that many researchers use for organizing data, visualizing data, and carrying out statistical analyses.
By the end of this workshop series, my hope is that you will feel comfortable enough to work independently in R!
```{r Intro Video, echo = FALSE}
vembedr::embed_url("https://www.youtube.com/watch?v=HluANRwPyNo")
```
[**What people think coding is versus what it actually is**]
## Outline
| Session | Description |
|---|---|
| **Pre-Workshop**: Downloading R and RStudio Software | **R**: https://ftp.osuosl.org/pub/cran/ **RStudio**: https://posit.co/download/rstudio-desktop/ |
| **Week 1: Intro to R** | Learning how to navigate the software: R, RStudio, and creating scripts in R Markdown. We will also work on installing and loading packages. |
| **Week 2: Working Directories** | Learn how to navigate working directories and read data into R |
| **Week 3: Subsetting** | Understand how to access rows and columns and filter observations |
| **Week 4: If Else statements** | Using the ifelse() function to create new columns |
| **Week 5: Intro to For Loops** | Learning the structure and application of For loops in R |
| **Week 6: Pivoting data from wide to long and long to wide** | Understanding the differences between data in wide-format and long-format |
| **Week 7: Merging data frames** | Merging two data frames together |
| **Week 8: Data cleaning** | Learning how to apply previously learned functions toward cleaning a raw dataset |
| **Week 9: Analyzing Data w/ Categorical Independent Variables** | Conducting statistical analyses with categorical predictors |
| **Week 10: Analyzing Data w/ Continuous Independent Variables** | Conducting statistical analyses with continuous and categorical predictors |
| **Week 11: Visualizing data: Intro to ggplot** | Learn how to create ggplot visualizations and customize plots |
| **Final Project?** | TBD |
| **Conclusion** | Closing and general notes |
# Are you ready to start learning R?!
![](R memes/Learning_R.jpeg){width=80%}
# Pre-Workshop: Downloading R and RStudio Software
Before the workshop, we'll need to download R and RStudio. Throughout the workshop, we'll be working in RStudio, which will allow us to write code in R. So let's make sure we have both R and RStudio installed before we begin!
1. Download **R** from a **CRAN mirror**, which hosts the R programming language that we will be using in RStudio. https://cran.r-project.org/
2. Download **RStudio**, which is the main software that we will be using to work with R. https://posit.co/download/rstudio-desktop/
3. Download the **CABLAB-R-Workshop-Series** folder from the CABLAB R Workshop Series GitHub page (https://github.com/steventmartinez/CABLAB-R-Workshop-Series) by pressing the green **Code** button and downloading the ZIP folder. This is the folder containing all the files we will be working with for the purposes of this workshop.
4. Open up a new R Markdown document by clicking File > New File > R Markdown. **First time R users will be asked to download packages once they open up an R Markdown file. Click “Yes” to downloading those packages!**
![](images/zip folder.png){width=100%}
# Week 1: Intro to R
## Opening a new R Markdown File
To get things started, open RStudio. Then, let's open a new R Markdown document by clicking File > New File > R Markdown...
**First time R users will be asked to download packages once they open up an R Markdown file. Click "Yes" to downloading those packages!**
![](images/R Markdown.png){width=100%}
This should produce a dialogue box where you can enter the name of the script and your name before selecting OK.
![](images/R Markdown script.png){width=40%}
Next, let's clear out all of the default text that appears in a new R Markdown document, which I have highlighted below:
![](images/R Markdown erase.png){width=70%}
## Intro to R Markdown
In a typical coding script, every line must contain code that the language can interpret. If you want to leave yourself a note, you have to put a hash mark (#) in front of it so that the program knows to “ignore this line”, which can get a bit annoying. An R Markdown script does the same things as a typical coding script, but it's more user friendly.
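For instance, in plain R code a hash mark behaves as follows. This is a minimal illustration with arbitrary numbers, and the object name `total` is only a placeholder:

```{r}
# Everything after a hash mark is a comment, so R skips this whole line
total <- 3 * 4  # a comment can also follow code on the same line
total           # typing an object's name prints its value
```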
With R Markdown, any code that you would like R to interpret belongs in the coding chunk as illustrated below!
![](images/R Markdown insert code.png){width=35%}
If we want to leave notes, we don’t have to “comment it out”. We can just write long-winded narration that can help others understand why we coded what we coded and what that code does.
That’s because a typical script will interpret any text as a command, unless the text is marked by a hash mark (#). An R Markdown script only interprets things as code when we tell it to, and we tell it what is code by creating a chunk. Chunks are marked by three backticks (```) followed by a {r} and, on another line, three more backticks.
![](images/R Markdown blah blah blah.png){width=35%}
A typical script can’t make sense of this, though; we need an R Markdown script to do it. You might be thinking that manually separating code from non-code seems like extra work, and it is a little bit, but it can also be a lot more convenient because, by default, RStudio displays the output of any given chunk directly beneath that chunk. By output, we just mean the result of whatever calculation or command you asked R to run.
R Markdown grants us greater control over what we see and when we see it. To demonstrate, let’s create a new chunk in our markdown document and enter the code shown below so that you can follow along with the next bit:
```{r add 2 plus 2}
2 + 2
```
With a typical script, if we want to know the output of a line we ran a while ago, we either have to rerun it or scroll through the console to find it. With R Markdown, we can minimize entire chunks and their output by using the minimization button [![Minimization Arrow](images/ChunkMin.png)] on the left side of the window.
If we want to hide output, we can use the expand/collapse button [![Minimize Command](images/OutputMin.png)] on the right side of the output window.
We can choose exactly what we want to run using the "*Run*" command [![Run Command](images/ChunkRun.png)] in the upper right corner of the chunk.
Also of note, the down-facing arrow (second icon in the upper right corner of the code block) will tell R "Run all of the blocks of code that I have before this block" [![Run All Chunks Command](images/ChunkRunAll.png)]. It can be helpful if you make a mistake and don't want to manually rerun all of the previous blocks one by one to get back to where you were. It also makes your code very easy for other people to run. They can quite literally do it with the click of a button!
If we click the cog icon in the same tray, we can access the output options and manipulate where output appears and what it looks like, but that's beyond the scope of this review [![Settings Command](images/ChunkSettings.png)].
## What's a "Package"?
Packages in R are synonymous with libraries in other languages. They are more or less convenient short-cuts or functions someone else already programmed to save us some work. Somebody else already figured out a very quick way to compute a function so now we don’t have to! We just use their tools to do it.
## Installing packages
Most R packages are hosted in a central repository called CRAN (the Comprehensive R Archive Network), so even though thousands of people are working on these things independently, you don’t need to leave R to find them. Before a package can be used, it must be installed, and you can do that pretty simply:
```{r install packages, eval = F}
install.packages("PACKAGENAME")
```
If you need to update a package, you can just re-run the above code. If you’re using R Studio, you can also see a list of your packages and their associated descriptions in the ‘Packages’ Tab of your Viewer Window.
![**Packages tab of viewer window where one can visualize previously installed packages**](images/R Packages.png){width=50%}
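If you'd rather not reinstall packages you already have, one possible pattern (a sketch, not part of the original workshop materials) is to check which packages are missing first. The package names below are stand-ins that ship with R, so nothing is actually installed when this runs; swap in the packages you need. The object names `wanted`, `is_installed`, and `to_install` are placeholders.

```{r}
# Stand-in package names; replace these with the packages you want
wanted <- c("stats", "utils")

# requireNamespace() returns TRUE if a package is installed and can be loaded
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of packages that are not installed yet
to_install <- wanted[!is_installed]

# Install whatever is missing (nothing happens if everything is already there)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```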
## Loading packages
Just because we’ve installed a package doesn’t mean we can use it yet. We need to tell R “We want access to the functions this package has during this session” by calling it with the library() command.
```{r load package via library, eval = F}
library(PACKAGENAME)
```
Notice that we drop the quotation marks now. We just specify the (case-sensitive) package name and it lets R know we are planning on using it this session.
You might be wondering why we need to take this extra step. Sometimes different packages use the same function names, so having more than one of those active at the same time could confuse R (when this does happen, R will usually tell you). Packages also take up memory once they are loaded, so having ALL of your packages loaded at once might leave your computer running extremely slowly. It’s the same for most languages.
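One way to sidestep name clashes, offered here as a small aside, is to name the package explicitly with the package::function() notation. The example uses the stats package, which comes with R, and made-up numbers; `spread` is just a placeholder name.

```{r}
# package::function() tells R exactly which package's function to use
spread <- stats::sd(c(2, 4, 6))
spread

# search() lists the packages that are currently attached in this session
search()
```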
If we ever want to explore the functions contained within a package in conjunction with examples, we can either go to the R documentation website or type ‘??PackageName’ into the Console, which will then populate the Help Tab of the Viewer Window with information on the package.
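A few other base R ways of reaching the documentation are sketched below, using the built-in stats package and its sd() function as examples. The chunk is not evaluated here because help pages open interactively.

```{r eval=FALSE}
?sd                                # open the help page for a single function
help(package = "stats")            # list the documentation available for a package
help.search("standard deviation")  # search help pages for a phrase
head(ls("package:stats"))          # peek at the names of functions in an attached package
```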
Let's try installing and loading a few packages for practice. Let's install and load the following packages in R: naniar, report, tidyverse, dplyr, Matrix, lme4, lmerTest, and ggplot2.
## Week 1 Exercise: Installing and Loading Packages
```{r Week 1 Exercise, code="'\n\n\n\n'", results=F}
```
```{r Week 1 Exercise - hidden, eval=T, include=F}
install.packages("naniar", repos = "http://cran.us.r-project.org")
install.packages("report", repos = "http://cran.us.r-project.org")
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("Matrix", repos = "http://cran.us.r-project.org")
install.packages("lme4", repos = "http://cran.us.r-project.org")
install.packages("lmerTest", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
library(naniar)
library(report)
library(tidyverse)
library(dplyr)
library(Matrix)
library(lme4)
library(lmerTest)
library(ggplot2)
```
[Click for solution](https://github.com/steventmartinez/CABLAB-R-Workshop-Series/blob/main/exercise_solutions/week1_exercise.R)
## Week 1 Assignment: Install and Load "swirl" library and complete "Module 1: Basic Building Blocks"
Swirl is a really cool package in R that teaches you R programming and data science interactively, at your own pace, and right in the R console! For our first assignment, I think swirl explains some fundamental concepts in a better way than I can, so let's tackle the **"R Programming: The basics of programming in R"** course and complete **Module 1: Basic Building Blocks** in swirl.
Some of it will make sense, and some of it won't (and that's okay!), but I think swirl does a pretty good job of orienting people to how basic operations in R work, and I think this is especially helpful before we start working with any actual data.
Let's give this a try and we can talk through any problems people ran into during our next workshop. I've attached some screenshots below demonstrating how to install and load swirl().
![](images/R swirl 1.png){width=70%}
![](images/R swirl 2.png){width=70%}
# Week 2: Working Directories
## Working Directories in R: What is a Working Directory?
Hopefully swirl() has helped you feel a bit more comfortable in navigating R. Today we will focus on working with directories in R.
A working directory is a fancy term that refers to the default location where R will look for files you want to load and where it will put any files you save. Like any other language or program, R needs to be told where the data that we’d like to work with is located on our computer. It doesn’t just know automatically.
Below we'll use the getwd() command to check where your current working directory is.
Using the list.files() command will show you what files exist in your current working directory.
```{r}
getwd() #get your current working directory
list.files() #Use list.files() to check the contents of your working directory
```
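list.files() can also look inside a folder other than the working directory and filter by file name. The sketch below builds a throwaway folder with empty files so that it runs on any computer; the folder and file names are made up for the demonstration.

```{r}
# Create a temporary folder with three empty files (for demonstration only)
demo_dir <- tempfile("wd_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("a.csv", "b.csv", "notes.txt")))

# list.files() accepts a folder to look in, so we don't have to change directories
list.files(demo_dir)

# The pattern argument keeps only matching file names; here, those ending in .csv
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
csv_files
```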
## Working Directories in R: Specifying your Working Directory
To work with our data, we have to tell R where the files are located. To keep this simple, we can store the filepath in a new variable so we aren’t writing it out multiple times. Filepaths differ depending on whether you are using Windows or a Mac. If you're on a Windows computer, your filepath will likely start with your "C:/" drive. If you're on a Mac, your filepath will likely start with a forward slash "/". If you're not sure of your path, RStudio makes it relatively easy to find it.
You can press Tab when your cursor is just to the right of the slash (inside the quotation marks) to see a list of the directories on your computer.
```{r Setting Working Directory Example, eval=FALSE,include=TRUE, message=FALSE, warning=FALSE, error=FALSE}
# For Windows
Path <- "C:/"
# For Mac
Path <- "/"
```
Here’s an example of what you should see:
![**An example of R’s Tab-Controlled drop-down menus**](images/R tab.png){width=30%}
Pressing Tab again will enter a directory and show me its contents. From there, I can keep hitting Tab until I get to the directory, or folder, that contains the files I want to work with. I can then save this filepath, which is just what we call a string (i.e., text rather than a number), as an object named Path. We do so by placing the object's name on the left of an equals sign (=) or an arrow (<-) and the value that the object takes on the right.
Below, let's assign the filepath where our CABLAB R Workshop Series folder exists to an object called "Path".
```{r Assign working directory to object called "Path"}
# For Windows
Path <- "C:/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"
# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"
```
This format of assigning a value to an object is really important and we’ll keep coming back to it throughout this tutorial!
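To make the assignment pattern concrete, here is a minimal sketch. The folder "/Users/yourname/Desktop" is a hypothetical placeholder rather than a real location, and the object names are made up for illustration.

```{r}
# Both lines store the same string in an object; <- is the more common style in R
path_arrow <- "/Users/yourname/Desktop/"
path_equal = "/Users/yourname/Desktop/"
identical(path_arrow, path_equal)

# A filepath is just text, so R stores it as a character string
class(path_arrow)

# file.path() joins the pieces of a path with forward slashes
example_file <- file.path("/Users/yourname/Desktop", "datasets", "frightnight_practice.csv")
example_file

# file.exists() reports whether a path points to something real on this computer
file.exists(example_file)
```

file.exists() is a quick way to catch typos in a path before trying to read a file from it.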
## Intro to "Fright Night" dataset
For the purposes of this project, we are going to work with the Fright Night dataset! The Fright Night project took place in 2021 at the Eastern State Penitentiary's annual "Halloween Nights" haunted house event in Philadelphia. 116 participants completed a haunted house tour as part of a research study assessing the relationship between threat and memory.
Specifically, we explored 2 main research questions:
**1)** How does naturalistic threat affect memory accuracy?
**2)** Does naturalistic threat affect the way in which we communicate our memories?
![](images/Halloween Nights.png){width=60%}
Participants toured four haunted house segments (Delirium, Take 13, Machine Shop, and Crypt) that included low-threat and high-threat segments. Delirium and Take 13 were low-threat segments, whereas Machine Shop and Crypt were high-threat segments.
To assess memory accuracy, we focused on temporal memory accuracy specifically. Temporal memory refers to memory for the order in which events occur. To measure temporal memory within our study, we focused on accuracy on the recency discrimination task that participants completed for each haunted house segment. As part of the recency discrimination task, participants were shown pairs of trial-unique events within each haunted house segment and asked to select which event came first. In this way, we can determine the accuracy of people's temporal memory for the order of the events they experienced.
![](images/fright night study design.png){width=70%}
![](images/recency discrimination example.png){width=70%}
![](images/recency discrimination full.png){width=70%}
To assess communication styles during memory recall, we focused on the free recall memory task, where we asked participants to freely recall their memory for each haunted house segment. We fed the free recall memory transcripts into a natural language processing instrument called the Linguistic Inquiry and Word Count (LIWC) software. LIWC calculates the percentage of words in a given text that belong to linguistic categories that have been shown to index psychosocial constructs. In the example attached below, you can see the percentage of words that contribute to a linguistic category called "Authenticity", which is thought to reflect perceived honesty and genuineness, and the percentage of words that belong to a linguistic category called "Analytical Thinking", which is thought to reflect formal or logical thinking.
![](images/LIWC example.png){width=70%}
There were also 3 experimental conditions: Control, Share, and Test.
**Control condition**: Participants were instructed to tour the haunted house segment as they normally would.
**Share condition**: Participants were instructed to tour the haunted house segment in anticipation of an opportunity to post about their experience on social media afterwards.
**Test condition**: Participants were instructed to tour the haunted house segment in anticipation of being tested on their knowledge of the haunted house segment afterwards.
For the first two segments (Delirium and Take 13), all participants toured the segment in the Control condition. However, in the last two segments (Crypt and Machine Shop), some participants toured the segments in the Control condition, other participants toured Machine Shop in the Share condition and Crypt in the Test condition, while other participants toured Machine Shop in the Test condition and Crypt in the Share condition.
After completing the haunted house tour, participants were assessed at two time points: immediately afterwards and again 1 week later. During the Immediate assessments, participants completed a recency discrimination task and freely recalled their memory for 1 low-threat and 1 high-threat haunted house segment. During the one-week delay assessments, participants completed a recency discrimination task and freely recalled their memory for *all* haunted house segments. Check out the study design below as well as the vignette illustrating when the three experimental conditions (i.e., Control, Share, and Test) took place throughout the haunted house tour.
![](images/fright night study design conditions.png){width=60%}
Now that we have a better idea about the study design, we can finally start working with some data!
The dataset that we start off working with for the purposes of the workshop is titled **frightnight_practice.csv**.
## What is a "data frame"?
Before we load in the data, I want to highlight a little terminology. The data that we work with in R will usually be contained within what we call a ‘data frame’. A data frame represents the same thing that a spreadsheet represents in Excel. It contains many cells that are arranged into columns (which have names) and rows (which may or may not have names).
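To see what that looks like in practice, here is a tiny data frame built by hand. The values are made up purely for illustration and are not from the Fright Night data; `toy_df` is a placeholder name.

```{r}
# Each argument to data.frame() becomes a named column
toy_df <- data.frame(
  PID = c(1, 1, 2, 2),
  Section = c("A", "B", "A", "B"),
  Fear.rating = c(3, 4, 2, 5)
)

toy_df         # print the whole data frame
nrow(toy_df)   # number of rows (observations)
ncol(toy_df)   # number of columns (variables)
names(toy_df)  # column names
str(toy_df)    # the type of data stored in each column
```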
## How do I load data into R?
There are many ways to load data into R, and they all depend on what format the data is in. R can handle data from .csv, .xlsx, .txt, .html, .json, SPSS, Stata, and SAS files, among others. R also has its own data format (.RDA, .Rdata). With the exception of .RDA, .csv is often the cleanest means of reading in data. We won’t cover the other formats, but they are fairly exhaustively covered [in this tutorial](https://www.datacamp.com/tutorial/r-data-import-tutorial).
Before reading in our fright night practice data CSV file, we need to use the setwd() function to tell R where to look for our CSV file. Let's use the Path object that we created earlier to set our working directory to where the frightnight_practice.csv file is located on our computer.
In the most basic sense, we can load our fright night practice CSV file using the read.csv() function like this:
```{r setting working directory}
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file
```
The setwd() command accepts our Path variable and tells R where to look for our .csv file. The read.csv() command actually loads in the data. If done correctly, we should see our R Environment populate with a dataframe labeled df.
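An alternative that avoids changing the working directory, sketched here with a temporary file rather than the workshop data, is to hand read.csv() a full filepath built with file.path(). The file name, object names, and values are made up for the demonstration.

```{r}
# Write a small made-up data frame to a temporary folder so the example runs anywhere
demo_folder <- tempdir()
demo_file <- file.path(demo_folder, "toy_example.csv")
write.csv(data.frame(PID = c(1, 2), Score = c(0.5, 0.75)), demo_file, row.names = FALSE)

# Read it back in using the full path; no setwd() needed
toy_read <- read.csv(file = demo_file)
toy_read
```

This can be handy in R Markdown, where a setwd() call inside a chunk typically only applies to that chunk.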
![](images/df environment.png){width=70%}
The image above shows the Environment window. Since we're all using the same dataset, the number of observations and variables should be the same as in the picture. Here, you can think of observations as "rows" and variables as "columns". If you click on df in the Environment, it will open in a new tab of your Source window (the same window you are likely writing your script in) where you can view it. We can also look at the data in our markdown file by entering the head() command from base R, which will show us the first few rows:
```{r eval = FALSE}
head(df) #will show you a subset of rows within the Data Frame
View(df) #will open up the full data frame like you would in Excel
```
Amazing! Now we have hundreds of columns of data, like we should. We might also notice that the first column is PID, which refers to each participant's ID. You'll see that each participant has 6 rows. Remember that there were two stages of assessment: 1) immediately after the haunted house tour; and 2) after a 1-week delay. Participants were tested on 2 of the 4 haunted house segments during the Immediate stage, and on all 4 haunted house segments during the One-Week Delay stage. As a result, every participant should have 6 rows.
**PID** column -- The participant IDs.
**Section** column -- The name of each haunted house segment.
**Stage** column -- Whether the assessment was immediately afterwards or 1 week later.
**Condition** column -- Participants completed the haunted house segment in the Baseline, Share, or Test condition.
**Fear.rating** column -- Participants were also asked to rate how fearful they found each haunted house segment immediately afterwards.
**TOAccuracy** column -- Participants' accuracy score on the recency discrimination task for each haunted house segment.
**Recall** column -- Participants' free recall for each haunted house segment.
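One way to check a structure like "six rows per participant" is to count the rows for each ID with table(). The sketch below uses a made-up data frame with two rows per participant; with the workshop data, the same idea would be table(df$PID).

```{r}
# Made-up example: three participants with two rows each
toy_long <- data.frame(
  PID = c(1, 1, 2, 2, 3, 3),
  Stage = c("first", "second", "first", "second", "first", "second")
)

# table() counts how many times each value appears in a column
rows_per_pid <- table(toy_long$PID)
rows_per_pid

# all() confirms whether every participant has the expected number of rows
all(rows_per_pid == 2)
```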
## Week 2 Exercise: Working Directories
**1)** Read in the frightnight_wide_exercise.csv CSV file and store it in an object called "df_wide"
**2)** Print out the first few rows using the head() function
**3)** Open up the df_wide dataframe by using the View() function **OR** by clicking on the df_wide dataframe in the global environment
```{r Week 2 Exercise, code="'\n\n\n\n'", results=F}
```
```{r Week 2 Exercise - hidden, eval=F, include=F}
# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"
#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
#Read in the df_wide CSV file
df_wide <- read.csv(file = "frightnight_wide_exercise.csv")
#head()
head(df_wide)
#View()
View(df_wide)
```
[Click for solution](https://github.com/steventmartinez/CABLAB-R-Workshop-Series/blob/main/exercise_solutions/week2_exercise.R)
## Week 2 Assignment: Working Directories
There will be no week 2 assignment :)
# Week 3: Subsetting data
For the Week 3 workshop, let's read in the frightnight_practice CSV file:
```{r}
# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"
#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file
```
By looking at the dataframe, we can see that we aren’t working with a perfectly clean dataset: some of the rows have missing data! And we don't really need all of the columns in the dataframe to do the analyses that we're interested in doing.
So how do we access rows? How do we access columns? And how can we check what data is missing? Learning how to access specific elements of a data frame is an extremely important part of learning R!
dataframe$column will print out all the rows in that column. Let's print out all the participant IDs that exist in the data frame.
```{r}
df$PID
```
What if we want to see a specific row? Let’s say row 2 within the PID column? To reference a specific row in a given column, I can add brackets and the number of that row behind it:
The code below will print out the second row in the PID column.
```{r}
df$PID[2]
```
However, we can also index the column using its relative position. Knowing that the PID column is the first column, I can use bracket notation. Bracket notation is super helpful once you understand its structure. It helps me to think of it as [rows, columns]. Any number that appears before the comma will access rows, and any number that appears after the comma will access columns.
By including the name of the data frame before the bracket notation, we can pull certain rows and columns from that data frame.
```{r}
df[1,] # print the first row across all columns
df[,2] # print column 2
df[1,2] # print the first row in column 2
```
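The same [rows, columns] structure also accepts several positions at once, or column names in place of numbers. Here is a small sketch on a made-up data frame (the values are illustrative only, and `toy` is a placeholder name):

```{r}
toy <- data.frame(
  PID = c(1, 1, 2, 2),
  Section = c("A", "B", "A", "B"),
  Fear.rating = c(3, 4, 2, 5)
)

toy[2:3, ]               # rows 2 through 3, all columns
toy[, c(1, 3)]           # all rows, columns 1 and 3
toy[c(1, 4), "Section"]  # rows 1 and 4 of the column named "Section"
```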
Now that we know how to access rows and columns, let's talk about subsetting! Subsetting is a technique for filtering rows or columns in a given data frame.
## Conditional Subsetting
Let's say we only cared about participants' experiences for the Infirmary section of the haunted house. In order to do this, let's talk about how operators work in R.
```{r}
#print TRUE or FALSE for whether each row in the Section column equals "Infirmary"
df$Section == "Infirmary"
```
Notice the two equals signs (==). A single equals sign (=) assigns a value to an object, but a double equals sign compares two values. R has several comparison operators like this (==, !=, >, <, >=, <=), and many other languages use the same ones. With ==, if the two values are equal, R produces TRUE; if not, FALSE. A value that can only be TRUE or FALSE is called a logical (or boolean) value. When we tell R to compare the value on the right with this column, it goes through each row of the column, compares the column value, and notes whether the condition is TRUE or FALSE.
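A few comparison operators in action, using made-up vectors rather than the workshop data (`fear` and `section` are placeholder names):

```{r}
fear <- c(2, 5, 3, 5)             # made-up ratings
section <- c("A", "B", "A", "C")  # made-up labels

fear == 5                 # equal to
fear != 5                 # not equal to
fear >= 3                 # greater than or equal to
section %in% c("A", "C")  # matches any value in a set

# TRUE counts as 1 and FALSE as 0, so sum() counts how many elements meet a condition
sum(fear == 5)
```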
## Subsetting rows and columns using bracket notation
So, we could theoretically plug just about any conditional statement in our subset approaches and subset the data as we wish:
We can subset specific rows that we care about using bracket notation.
We can also subset specific columns using bracket notation.
Let's use bracket notation to subset the rows that belong to participant 1001 and store these rows in a new data frame called "df_1001".
Let's also use bracket notation to subset the PID, Section, Stage, and Recall columns and store these columns in a new data frame called "df_sub"
```{r}
# The nrow() command outputs how many rows the data frame has
# We're doing this to show that both approaches yield the same result
#Subsetting rows using bracket notation
df_1001 <- df[df$PID == "1001",]
nrow(df_1001)
#Subsetting columns using bracket notation
cols <- c("PID", "Section", "Stage", "Recall") #create a vector of column names that we want to subset
df_sub <- df[, cols] #use bracket notation to pull the columns that we included in the "cols" vector from the df data frame.
```
Understanding the structure of bracket notation [rows, columns] is super important and this structure will be used to carry out more complicated functions that we'll talk about later on in the workshop series!
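Because rows go before the comma and columns go after it, both can be subset in a single step. Here is a minimal sketch using a small made-up data frame (`toy_df` and its values are hypothetical, chosen only to mirror a few of the column names used above).

```{r}
# Hypothetical toy data frame, for illustration only
toy_df <- data.frame(
  PID = c(1001, 1001, 1002),
  Section = c("Infirmary", "Asylum", "Infirmary"),
  Recall = c(1, 0, 1)
)

# Rows go before the comma, columns go after it
toy_sub <- toy_df[toy_df$PID == 1001, c("PID", "Section")]
print(toy_sub)

# dim() reports the number of rows, then the number of columns
dim(toy_sub)
```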
## Subsetting rows and columns using the subset() function
As mentioned above, we can also use the subset() function to subset rows and columns.
We will first use the subset() function to subset specific rows.
We will also use the subset() function to subset specific columns.
Let's use the subset() function to subset the rows that belong to participant 1001 and store these rows in a new data frame called "df_1001".
Let's also use the subset() function to subset the following columns: PID, Section, Stage, Recall and store these columns in a new data frame called "df_sub"
```{r}
#Subsetting rows using the subset() function
df_1001 <- subset(df, PID == "1001")
nrow(df_1001)
#Subsetting columns using the subset() function
df_sub <- subset(df, select=c(PID, Section, Stage, Recall))
```
Many people find the subset() function a little more readable than bracket notation, but it's totally okay to use whichever makes the most sense to you. For the examples above, both approaches produce the same result, just in slightly different ways.
## Subsetting rows based on multiple conditions
What if, rather than subsetting based on one condition (i.e., rows that belong to participant 1001), we wanted to subset based on multiple conditions?
We can take advantage of OR ( **|** ) and AND ( **&** ) operators using the subset() function.
Below, we will be subsetting all rows where the assessment is based on the Infirmary **OR** Asylum haunted house segments.
```{r}
df_multiple_conditions <- subset(df, Section == "Infirmary" | Section == "Asylum")
```
Here, we are telling R to subset all rows where Section is equal to Infirmary OR Asylum. As you can tell, leveraging the OR ( **|** ) and AND ( **&** ) operators within the subset() function can be especially powerful.
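The example above uses OR. As a rough sketch of how AND differs, here is the same idea on a small made-up data frame. The name `toy_and` and the Stage values are hypothetical and may not match the labels in the real data set.

```{r}
# Hypothetical toy data frame, for illustration only
toy_and <- data.frame(
  Section = c("Infirmary", "Infirmary", "Asylum", "Asylum"),
  Stage = c("Immediate", "Delay", "Immediate", "Delay")
)

# & keeps a row only when BOTH conditions are TRUE
toy_and_sub <- subset(toy_and, Section == "Infirmary" & Stage == "Delay")
nrow(toy_and_sub)

# | keeps a row when AT LEAST ONE condition is TRUE
toy_or_sub <- subset(toy_and, Section == "Infirmary" | Stage == "Delay")
nrow(toy_or_sub)
```

With AND, only the one row that is both Infirmary and Delay survives; with OR, three of the four rows do.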
## Week 3 Exercise: Subsetting data
**1)** Create a new data frame called "df2" and subset the following columns from the df data frame: PID, Section, Stage, Fear.rating, and TOAccuracy.
**2)** Do this using bracket notation
**3)** Repeat this using the subset() function.
```{r Week 3 Exercise, code="'\n\n\n\n'", results=F}
```
```{r Week 3 Exercise - hidden, eval=T, include=F}
#Create a new data frame called "df2" and subset the following columns from the df data frame: PID, Section, Stage, Fear.rating, and TOAccuracy
# Approach 1: Bracket notation
cols <- c("PID", "Section", "Stage", "Fear.rating", "TOAccuracy") #create a vector of column names that we want to subset
df2 <- df[, cols] #use bracket notation to pull the columns that we included in the "cols" vector from the df data frame.
#Approach 2: subset() function
df2 <- subset(df, select=c(PID, Section, Stage, Fear.rating, TOAccuracy))
```
[Click for solution](https://github.com/steventmartinez/CABLAB-R-Workshop-Series/blob/main/exercise_solutions/week3_exercise.R)
## Missing data
What if we wanted to see which rows had missing values (e.g., NA) or not? What if, for whatever reason, some participants were not able to complete the temporal memory accuracy assessment for the haunted house events?
We can use the is.na() function to determine which rows have missing values in the Temporal Accuracy (TOAccuracy) column.
```{r}
is.na(df$TOAccuracy)
```
This will produce a vector of TRUEs and FALSEs with one value for each row in the data frame, because each individual TRUE or FALSE tells us whether that row meets the condition we defined. If we see a FALSE in the first position, we know that the first row does NOT have a missing value. If we see a TRUE in the second position, we know that the second row does NOT have a Temporal Memory accuracy score (i.e., the value is missing).
But how can we create a data frame that does not have any missing data (i.e., rows that are blank or have an 'NA' in it)?
Here, we can use bracket notation to create a new data frame called "df_complete" that only includes rows that are NOT missing a value in the TOAccuracy column of the df data frame. Putting an exclamation point in front of the is.na() function is our way of telling R that we want the **inverse** of what is.na() returns: every TRUE becomes FALSE and every FALSE becomes TRUE!
The exclamation point is not specific to is.na(); it can be placed in front of any expression that returns TRUE or FALSE values.
```{r}
#is.na() function
df_complete <- df[!is.na(df$TOAccuracy),]
```
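To make the effect of the exclamation point concrete, here is a minimal sketch on a made-up vector (`demo_scores` is hypothetical and is not taken from the data set).

```{r}
# Hypothetical toy vector with one missing value
demo_scores <- c(0.5, NA, 0.8)

is.na(demo_scores)   # TRUE where the value is missing
!is.na(demo_scores)  # the inverse: TRUE where the value is present

# Keep only the values that are present
demo_kept <- demo_scores[!is.na(demo_scores)]
print(demo_kept)

# ! flips any TRUE/FALSE result, not just the output of is.na()
demo_flipped <- !(c(1, 5, 10) > 4)
print(demo_flipped)
```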
What if, instead of removing rows that have a missing value in **ONE** column, we wanted to remove any rows that have a missing value in **ANY** column?
Rather than using the is.na() function, I personally like to use the complete.cases() function for situations like this.
```{r}
df_complete <- df[complete.cases(df), ]
```
Here, we are again using bracket notation to tell R to keep only the rows of the df data frame that have no missing values in ANY column. The remaining complete rows are stored in a new data frame called "df_complete".
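The difference between the two approaches is easiest to see side by side. Below is a minimal sketch on a made-up data frame (`toy_na` and its values are hypothetical) where one row is missing a Recall value and another is missing a TOAccuracy value.

```{r}
# Hypothetical toy data frame with missing values in two different columns
toy_na <- data.frame(
  PID = c(1001, 1002, 1003),
  Recall = c(1, NA, 0),
  TOAccuracy = c(0.5, 0.7, NA)
)

# is.na() on one column only checks TOAccuracy, so the row missing Recall is kept
toy_one_col <- toy_na[!is.na(toy_na$TOAccuracy), ]
nrow(toy_one_col)

# complete.cases() checks every column, so only the fully complete row is kept
toy_all_cols <- toy_na[complete.cases(toy_na), ]
nrow(toy_all_cols)
```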
## Week 3 Assignment: Subsetting data
For this week's assignment, let's continue focusing on subsetting in R.
**1)** Read in the frightnight_practice.csv dataset
**2)** Create a new data frame and subset the following columns from the df data frame: PID, Section, Stage, Recall, TOAccuracy
**3)** From this new data frame, subset only the rows that contain TOAccuracy scores greater than .40
```{r Week 3 Assignment, code="'\n\n\n\n'", results=F}
```
```{r Week 3 Assignment - hidden, eval=T, include=F}
#1) Read in the frightnight_practice CSV file
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file
#Subset the following columns: PID, Section, Stage, Recall, TOAccuracy
df_clean <- subset(df, select=c(PID, Section, Stage, Recall, TOAccuracy))
#Create a subset of the data that only contains TOAccuracy scores greater than .40
df_clean <- subset(df_clean, TOAccuracy > .40)
```
# Week 4: If Else statements
For the Week 4 workshop, let's read in the frightnight_practice CSV file
```{r}
# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"
#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file
```
## If else statements
Let's use an If Else statement to create a new column that represents whether a Section was a high threat or low threat section.
The ifelse() function takes three arguments: ifelse(condition, value if TRUE, value if FALSE). In the code below, we create a new column with dataframe$name_of_new_column <- ifelse(...). If a cell in the Section column is equal to "Infirmary", R inserts a value of "Low" in the new Threat column for that row; else, it inserts a value of "High" to represent high threat.
```{r}
df$Threat <- ifelse(df$Section == "Infirmary", "Low", "High")
```
However, Infirmary wasn't the only Low threat section! We need to find a way to use the ifelse() function to tell R: if the Section is equal to Infirmary OR Asylum, assign a value of "Low Threat"; else, assign a value of "High Threat".
We can test more than one condition using the OR (i.e., |) operator or the AND (i.e., &) operator.
The "|" operator means OR in R. Using the "|" operator allows you to include multiple conditions, and a row counts as a match if at least one of them is TRUE.
```{r}
df$Threat <- ifelse(df$Section == "Infirmary" | df$Section == "Asylum", "Low Threat", "High Threat")
```
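Here, if Section == "Infirmary" OR if Section == "Asylum", assign a "Low Threat" value in the Threat column. Else, assign a "High Threat" value. Note that this overwrites the "Low" and "High" values we assigned to the Threat column in the previous chunk.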
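A quick way to check that an ifelse() call did what we intended is to cross-tabulate the original values against the new labels. Here is a minimal sketch on a made-up vector; `demo_section` is hypothetical, and the spelling of the section names in the real data set may differ.

```{r}
# Hypothetical toy vector of section names, for illustration only
demo_section <- c("Infirmary", "Asylum", "Ghostly Grounds", "Devil's Den")

# Same OR logic as above, applied to the toy vector
demo_threat <- ifelse(demo_section == "Infirmary" | demo_section == "Asylum",
                      "Low Threat", "High Threat")
print(demo_threat)

# table() shows which label each section received
table(demo_section, demo_threat)
```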
## Re-organizing the position of columns
So we just created this Threat column. Any time you create a new column, it appears at the end of the data frame.
What if we wanted to organize our columns in a certain order?
We can do this in multiple ways:
Approach 1: Re-organize multiple columns in a data frame using the subset() function.
Approach 2: Re-organize one specific column in a data frame using the relocate() function from the dplyr package.
```{r}
#Approach 1: Re-organize multiple columns in a data frame
df_reorganized <- subset(df, select=c(PID, Stage, Section, Group, Threat, Recall)) #if we want to include all the columns, this may take a while...
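#Approach 2: Re-organize one specific column in a data frame
library(dplyr) #relocate() and the %>% pipe come from the dplyr package (loading it again is harmless if it is already loaded)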
df_reorganized <- df %>% relocate(Threat, .after = Group) #Can relocate columns *after* a certain column
df_reorganized <- df %>% relocate(Threat, .before = Recall) #Can relocate columns *before* a certain column
```
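If dplyr is not available, a similar result can be sketched in base R by building the new column order as a vector of names. This is only one possible approach, shown on a made-up one-row data frame (`toy_cols` is hypothetical); setdiff() is used so the remaining columns do not have to be typed out.

```{r}
# Hypothetical toy data frame, for illustration only
toy_cols <- data.frame(PID = 1001, Group = "A", Recall = 1, Threat = "Low")

# List the columns we want first, then append every other column in its existing order
first_cols <- c("PID", "Group", "Threat")
new_order <- c(first_cols, setdiff(names(toy_cols), first_cols))

# Use bracket notation to pull the columns in the new order
toy_reordered <- toy_cols[, new_order]
names(toy_reordered)
```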
## Week 4 Exercise: If Else statements
Given that Eastern State Penitentiary updates its haunted house segments every year, let's clarify which year haunted house segments were introduced. Infirmary and Ghostly Grounds were introduced in 2019, whereas Asylum and Devil's Den are newer segments and were introduced in 2021.
**1)** Use the ifelse() function to create a new column called Year, where, if the Section was equal to Infirmary or Ghostly Grounds, assign a value of "2019", else, assign a value of "2021".
```{r Week 4 Exercise, code="'\n\n\n\n'", results=F}
```
```{r Week 4 Exercise - hidden, eval=T, include=F}
df$Year <- ifelse(df$Section == "Infirmary" | df$Section == "GhostlyGrounds", "2019", "2021")
```
[Click for solution](https://github.com/steventmartinez/CABLAB-R-Workshop-Series/blob/main/exercise_solutions/week4_exercise.R)
## Week 4: More advanced ifelse statements
For the purposes of this example, let's subset a data frame with the following columns: PID, Section, Stage, Condition, TOAccuracy.
Using the ifelse() function, we're going to categorize Temporal Memory Accuracy performance in 3 groups: High, Medium, or Low.
Let's make a new column called "MemoryStrength" where a Temporal Memory Accuracy score less than or equal to .3 is "Low", any Temporal Memory Accuracy score between .3 and .7 is "Medium", and a Temporal Memory Accuracy score greater than or equal to .7 is "High".
```{r ifelse three conditions}
#Subset a data frame with the following columns: PID, Section, Stage, Condition, TOAccuracy.
df_memory <- subset(df, select=c(PID, Section, Stage, Condition, TOAccuracy))
#a Temporal Memory Accuracy score less than or equal to .3 is "Low"
df_memory$MemoryStrength <- ifelse(df_memory$TOAccuracy <= .3, "Low", NA)
#any Temporal Memory Accuracy score between .3 and .7 is "Medium"
df_memory$MemoryStrength <- ifelse(df_memory$TOAccuracy > .3 & df_memory$TOAccuracy < .7, "Medium", df_memory$MemoryStrength)
#a Temporal Memory Accuracy score greater than or equal to .7 is "High".
df_memory$MemoryStrength <- ifelse(df_memory$TOAccuracy >= .7, "High", df_memory$MemoryStrength)
```
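The same three-way split can also be written as a single nested ifelse() call, where the "else" part of the first ifelse() is itself another ifelse(). This is an alternative sketch on a made-up vector (`demo_acc` is hypothetical), not a replacement for the step-by-step version above.

```{r}
# Hypothetical toy accuracy scores, including the boundary values and a missing value
demo_acc <- c(0.2, 0.3, 0.5, 0.7, 0.9, NA)

# If the score is <= .3, label it "Low"; otherwise check whether it is < .7
demo_strength <- ifelse(demo_acc <= .3, "Low",
                        ifelse(demo_acc < .7, "Medium", "High"))
print(demo_strength)
```

Scores of exactly .3 end up as "Low" and scores of exactly .7 end up as "High", matching the criteria above, and a missing score stays NA.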
## Week 4 Assignment: If Else statements
**1)** Read in the frightnight_practice.csv file
**2)** Create a new data frame and subset the following columns from the df data frame: PID, Section, Stage, Group, Recall, WordCount
**3)** We need to categorize Word Count during free recall in 3 groups: Long, Medium, or Short.
**4)** Use the ifelse() function to create a new column called "RecallLength" that meets the following criteria: Word count less than or equal to 40 is "Short", word count in between 40 and 60 is "Medium", and word count greater than or equal to 60 is "Long".
```{r Week 4 Assignment, code="'\n\n\n\n'", results=F}
```
```{r Week 4 Assignment - hidden, eval=T, include=F}
#1) Read in the frightnight_practice CSV file
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file
#Subset a data frame with the following columns: PID, Section, Stage, Group, Recall, WordCount
df_recall <- subset(df, select=c(PID, Section, Stage, Group, Recall, WordCount))
#word count less than or equal to 40 is "Short"
df_recall$RecallLength <- ifelse(df_recall$WordCount <= 40, "Short", NA)
#word count in between 40 and 60 is "Medium"
df_recall$RecallLength <- ifelse(df_recall$WordCount > 40 & df_recall$WordCount < 60, "Medium", df_recall$RecallLength)
#word count greater than or equal to 60 is "Long"
df_recall$RecallLength <-ifelse(df_recall$WordCount >= 60, "Long", df_recall$RecallLength)
```
# Week 5: Intro to For Loops
For the Week 5 workshop, let's read in the frightnight_practice CSV file. Before we start working with actual data, we'll work with some general examples first.
```{r}
# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"
#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file
```
A for-loop is one of the main control-flow constructs of the R programming language. It is used to iterate over a collection of objects, such as a vector, a list, a matrix, or a dataframe, and apply the same set of operations on each item of a given data structure.
Below, let's walk through the general structure of a for loop and run a quick example of a for loop that will loop through and print a sequence of numbers.
```{r introduction to For loop, eval = FALSE}
# -- For Loop general expression ---
for (variable in sequence) {
  expression
}
# --- Using a for loop on an array of numbers ---
for (i in 1:10) {
  print(i)
}
```
As you can see, i represents a temporary variable that iterates through each value in the 1:10 sequence.
Given that we are using the print() function to print i, the output should print the 1:10 sequence.
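A for loop can do more than print. As a minimal sketch, the loop below adds each value of the 1:10 sequence to a running total; the second loop shows that the sequence does not have to be numbers. The names `demo_total` and `demo_section_name` are hypothetical and used only for illustration.

```{r}
# Start the running total at 0
demo_total <- 0

# On each pass through the loop, add the current value of i to the total
for (i in 1:10) {
  demo_total <- demo_total + i
}
print(demo_total)

# The temporary variable can also step through a character vector
for (demo_section_name in c("Infirmary", "Asylum")) {
  print(demo_section_name)
}
```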
## If Else statements in For Loops
Before we continue with for loops, let's do a quick refresher on "if else" statements because they are integral to for loops.
Last week, we went through how to use the ifelse() function to do "if else" statements, which we can do pretty concisely. However, using an "if else" statement within a for loop is a bit different.
```{r eval = FALSE}
#General structure of if statement
if (condition) {
  expression
} else {
  expression
}
```
Next, let's go through some additional examples to get a better idea of how these "if else" statements actually work!
```{r}
# --- Example of if statement ---
team_A <- 3 # Number of goals scored by Team A
team_B <- 1 # Number of goals scored by Team B
if (team_A > team_B) {
  print("Team A wins")
}
# --- Example of if statement with the else statement explicitly mentioned ---
team_A <- 1 # Number of goals scored by Team A
team_B <- 3 # Number of goals scored by Team B
if (team_A > team_B) {
  print("Team A will make the playoffs")
} else {
  print("Team B will make the playoffs")
}
```
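The examples above only cover two outcomes. When there are three or more, an "else if" can be placed between the if and the else. Here is a minimal sketch that also handles a tie; the `demo_` names are hypothetical, and the result is stored in an object before printing so it is easy to inspect.

```{r}
# --- Example of if / else if / else with three possible outcomes ---
demo_team_A <- 2 # Number of goals scored by Team A
demo_team_B <- 2 # Number of goals scored by Team B

if (demo_team_A > demo_team_B) {
  demo_result <- "Team A wins"
} else if (demo_team_A < demo_team_B) {
  demo_result <- "Team B wins"
} else {
  demo_result <- "It's a tie"
}
print(demo_result)
```

R checks the conditions from top to bottom and runs only the first block whose condition is TRUE; the final else block runs when none of them are.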
So far so good. Next, let's wrap these if else statements in a for loop, which makes them especially powerful.
```{r}
#Create a vector that includes the numbers ranging from 1 to 10.
x2 <- 1:10
#For loop where, if i = 1, print "The if condition is TRUE", else, print "The if condition is FALSE"
for (i in 1:length(x2)) {
  if (x2[i] == 1) {
    print("The if condition is TRUE") }
  else {
    print("The if condition is FALSE")
  }
}
```
Let's break this code down in some more detail.
**1)** for (i in 1:length(x2)) { --- "i" is a temporary variable that stores the current position in the range of the for loop. In this case, we are telling R that we want "i" to represent each position within the length of the x2 vector, starting at 1 and going up to 10. "i" will iterate across each of these values (1-10). The { indicates the start of the for loop.
**2)** if (x2[i] == 1) { --- This if statement is saying: if the value at position i of x2 is equal to 1. Add a new { to indicate the start of the if statement.
**3)** print("The if condition is TRUE") } --- print this message if the if statement is true, and add a bracket } to signify that it's the end of the if statement.
**4)** else { --- add a new bracket { to signify that it's the start of the else statement.
**5)** print("The if condition is FALSE") } --- print this message if the if statement is false, and add a bracket } to signify that it's the end of the else statement.
**6)** } --- add a last } to indicate the end of the for loop!
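To connect this back to last week, the sketch below stores the loop's output in a vector instead of printing it, and then compares it with the vectorized ifelse() function. It uses seq_along(), a base R function that generates the positions 1 to the length of a vector. The `demo_` names and the "TRUE branch"/"FALSE branch" labels are hypothetical.

```{r}
# Hypothetical toy vector, for illustration only
demo_x <- 1:10

# Create an empty character vector to hold one result per position
demo_loop_result <- character(length(demo_x))

# seq_along(demo_x) gives the positions 1, 2, ..., 10
for (i in seq_along(demo_x)) {
  if (demo_x[i] == 1) {
    demo_loop_result[i] <- "TRUE branch"
  } else {
    demo_loop_result[i] <- "FALSE branch"
  }
}

# ifelse() applies the same check to every element in one call
demo_vec_result <- ifelse(demo_x == 1, "TRUE branch", "FALSE branch")

# Both approaches should give the same answer
identical(demo_loop_result, demo_vec_result)
```

For a simple check like this one, ifelse() is shorter; the loop form becomes more useful when each step needs to do several things.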
## Week 5 Exercise: Intro to For Loops
**1)** Create a vector that includes the following letters: "A", "B", "C", "D", "E", "F"
**2)** Create a for loop where, if the current value is equal to "A", print "This value represents A", else, print "This value does not represent A"
```{r Week 5 Exercise, code="'\n\n\n\n'", results=F}
```
```{r Week 5 Exercise - hidden, eval=T, include=F}
#Create a vector that includes the following letters: "A", "B", "C", "D", "E", "F"
x3 <- c("A", "B", "C", "D", "E", "F")