---
title: "Final Project"
---
Open RStudio.
Open a new R script in R and **save it as** `final_project_LastFirst.R` (where Last and First are your last and first names).
Be careful about capitalization, the order of last and first name, and using `_` instead of `-`.
At the top of your script, write the following (**with appropriate changes**):
```
# Final Project
# Name: Laura Fontanesi
# Date: 24 May 2022
```
The final project is divided into two parts:
- Part 1: Reading, cleaning, and understanding the data
- Part 2: Fitting statistical models to the data and interpreting the results
Deadline (for both parts): **3rd of June**
# Final project description
In your final project, you will analyze data from a random dot motion task, a classical paradigm in perceptual decision making research, published in 2017 by Nathan Evans and Scott Brown; [link to the paper](https://link.springer.com/article/10.3758/s13423-016-1135-1).
In this task, participants were required to decide whether a cloud of 40 moving white dots on a black background was moving towards the top-left or top-right of the screen. Eighty-five participants were randomly assigned to one of **six groups** formed by the factorial combination of two factor variables: block type (fixed time vs. fixed trial) and feedback type (low, medium, high). All participants completed 24 blocks in total.
In the fixed trial condition, there were 40 trials in each block, while in the fixed time condition each block lasted 1 min.
Participants in the low feedback detail condition received only the trial-by-trial feedback on correctness. Participants in the medium feedback condition received the same trial-by-trial feedback, plus information on their average accuracy and response time every 200 trials. Participants in the high feedback condition received the trial-by-trial feedback and the feedback every 200 trials, plus extra guidance on how they could change their speed–accuracy tradeoff to achieve an optimal reward rate.
In sum, participants were randomly assigned to the following conditions (you should use these labels in the parsed dataset):
- FixedTrial_LowInformation
- FixedTrial_MediumInformation
- FixedTrial_HighInformation
- FixedTime_LowInformation
- FixedTime_MediumInformation
- FixedTime_HighInformation
**Note**: In this data set, each subject participated in only one condition, so each condition can be obtained from the file name, according to the following labels:
- High information feedback (Optim)
- Medium information feedback (Info)
- Low information feedback (Norm)
- Fixed trial (Trial)
- Fixed time (Time)
For example, the condition of '10014_Optim_Time_Friday 29th of May 2015 11/40/00 AM.txt' corresponds to FixedTime_HighInformation.
## Description of variables in the dataset:
- blkNum: This variable specifies the block number.
- trlNum: This variable specifies the trial number within the block.
- coherentDots: This variable specifies how many dots are moving in the same direction (are coherent).
- numberofDots: This variable specifies how many dots there are.
- percentCoherence: This variable specifies the percentage of dots that are moving in the same direction.
- winningDirection: This variable specifies the dominant direction of the dots ('left' or 'right').
- response: This variable specifies which direction the participant selected ('left' or 'right').
- correct: This variable specifies whether the participant's response was correct or not. Its values can be 0 (false), 1 (true), and 2 ('tooFast' or 'tooSlow').
- eventCount: Ignore.
- averageFrameRate: Ignore.
- RT: This variable specifies the response time in milliseconds.
# Part 1
## Part 1.1: Loading the data
You can find the data at the OSF depository: https://osf.io/8592y/ (only download the `Data` folder). Or on dropbox: https://www.dropbox.com/sh/pz275wapxekjh6z/AAAsLcgz6hAWLSg8cm5rrdVPa?dl=0
Note that each participant's data is in a separate file. We therefore need to import the files and merge them into a single dataset. Instead of writing out all the data paths yourself (which would take forever), you can use the following function: https://www.math.ucla.edu/~anderson/rw1001/library/base/html/list.files.html
Here is my example code on how to import the files. You only need to adapt the path to the folder on your system where you downloaded the raw data.
```{r echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)

# define the data folder
data_folder = '~/Dropbox/teaching/r-course22/data/final_project/'

# define the list of files
list_files = paste(data_folder, list.files(path=data_folder), sep="")

# apply the read_delim function to each element in the list
data_list = lapply(list_files, function(x) read_delim(x, delim='\t'))

# loop through all the participants to add the participant number as a column,
# as well as the condition extracted from the file name, since we have no other
# way to know which condition each participant was in
data = NULL
participant = 0
for (n in 1:length(data_list)) {
  participant = participant + 1
  cond_list = strsplit(list_files[n], "_")
  condition = paste(cond_list[[1]][3], "_", cond_list[[1]][4], sep="")
  data = rbind(data, cbind(participant=participant, condition=condition, data_list[[n]]))
}
```
```{r}
head(data)
glimpse(data)
```
As you can see, there are a few things to fix to make the dataset look a bit cleaner. The rest is up to you: the final dataset should consist of the following columns:
- participant
- block_number
- trial_number
- condition (use the labels in the description above)
- percent_coherence
- dots_direction
- response
- accuracy
- rt (converted to seconds, i.e., divided by 1000)
Remove any columns that are not in this list, and rename the remaining columns to match the labels above.
**Hint:** Note that the `accuracy` column corresponds to the `correct` column in the original dataset. The recoding of the conditions should follow this scheme (a sketch follows the list):
- Norm_Trial = "FixedTrial_LowInformation"
- Info_Trial = "FixedTrial_MediumInformation"
- Optim_Trial = "FixedTrial_HighInformation"
- Norm_Time = "FixedTime_LowInformation"
- Info_Time = "FixedTime_MediumInformation"
- Optim_Time = "FixedTime_HighInformation"
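A minimal sketch of this cleaning step, using `rename`, `mutate` with `recode`, and `select`. It assumes the `condition` column from the import code contains strings like 'Optim_Time' (check yours first: the `strsplit` call above depends on the number of underscores in your folder path):
```
data = data %>%
  rename(block_number = blkNum,
         trial_number = trlNum,
         percent_coherence = percentCoherence,
         dots_direction = winningDirection,
         accuracy = correct,
         rt = RT) %>%
  mutate(rt = rt / 1000,  # milliseconds to seconds
         condition = recode(condition,
                            Norm_Trial = "FixedTrial_LowInformation",
                            Info_Trial = "FixedTrial_MediumInformation",
                            Optim_Trial = "FixedTrial_HighInformation",
                            Norm_Time = "FixedTime_LowInformation",
                            Info_Time = "FixedTime_MediumInformation",
                            Optim_Time = "FixedTime_HighInformation")) %>%
  select(participant, block_number, trial_number, condition,
         percent_coherence, dots_direction, response, accuracy, rt)
```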
## Part 1.2: Describing the data
- How many participants are in the dataset?
**Hint:** Try summarizing the data using `n_distinct(participant)`.
- How many participants were there in each condition?
**Hint:** First group the data by `condition` and then summarize using `n_distinct(participant)`.
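For instance, a minimal sketch of both counts (assuming the cleaned dataset from Part 1.1):
```
# number of participants overall
summarize(data, n_participants = n_distinct(participant))

# number of participants per condition
summarize(group_by(data, condition), n_participants = n_distinct(participant))
```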
- Did the participants in the fixed time conditions complete more or fewer trials than the participants in the fixed trial conditions?
**Hint:** First, you can create a new column called `block_type` with the following recoding of the `condition` column:
- FixedTrial_LowInformation = "FixedTrial"
- FixedTrial_MediumInformation = "FixedTrial"
- FixedTrial_HighInformation = "FixedTrial"
- FixedTime_LowInformation = "FixedTime"
- FixedTime_MediumInformation = "FixedTime"
- FixedTime_HighInformation = "FixedTime"
Then, you can calculate the number of trials per participant (e.g., using `group_by(data, participant)` and then summarizing with the count function `n()`). You should obtain a tibble with as many rows as the number of participants and a column containing the number of trials per participant.
Then, you need to add the information of which condition each participant was assigned to to the previous output (of the `summarize` function). An easy way is to use the `distinct` function (https://dplyr.tidyverse.org/reference/distinct.html), specifically `distinct(data, participant, condition)`, which will output a similar tibble as before (i.e., with the same number of rows: one per participant) with the name of the condition each participant was assigned to.
Now you can join the two datasets; the result should look like this:
```
# A tibble: 85 x 3
participant n_trials block_type
<dbl> <int> <chr>
1 1 911 FixedTime
2 2 960 FixedTrial
3 3 753 FixedTime
4 4 960 FixedTrial
...
```
Finally, you can group by `block_type` and calculate the mean number of trials at each level (time vs. trial), using again `group_by` and `summarize`. A sketch of the whole workflow is given below.
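A possible sketch of this workflow (the names `trials_per_participant`, `participant_block_type`, and `trial_counts` are just examples):
```
# one way to derive block_type from the recoded condition labels
data = mutate(data,
              block_type = if_else(str_detect(condition, "FixedTime"),
                                   "FixedTime", "FixedTrial"))

# number of trials per participant
trials_per_participant = summarize(group_by(data, participant), n_trials = n())

# which block type each participant was in
participant_block_type = distinct(data, participant, block_type)

# join and compare the two levels
trial_counts = full_join(trials_per_participant, participant_block_type, by = "participant")
summarize(group_by(trial_counts, block_type), mean_n_trials = mean(n_trials))
```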
- Calculate a summary that includes the average and SD of accuracy and response time per condition (6 in total) and use this to describe the overall performance across participants in these conditions (i.e., which conditions produced higher accuracy, which produced faster responses? Do accuracy and speed always trade off?).
- Calculate a summary that includes the average accuracy and average response time per participant, as well as the percentage of trials below 150 ms (too-fast trials) and above 5000 ms (too-slow trials). Are there any participants with more than 10% fast or slow trials?
**Hint:** Create two new columns in the dataset, called `slow_trials` and `fast_trials` using `case_when`: You can assign values of 1 in `slow_trials` when `rt >= 5` and 0 otherwise; and assign values of 1 in `fast_trials` when `rt <= .15` and 0 otherwise. Now you have two columns that, when their average is calculated per participant and multiplied by 100, give you the percentage of slow and fast trials for each participant.
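Putting this hint together, a possible sketch (the column name `meanACC` is reused in the Part 1.3 hint below):
```
# flag too-slow and too-fast trials
data = mutate(data,
              slow_trials = case_when(rt >= 5 ~ 1, TRUE ~ 0),
              fast_trials = case_when(rt <= .15 ~ 1, TRUE ~ 0))

# per-participant summary
participant_summary = summarize(group_by(data, participant),
                                meanACC = mean(accuracy),
                                meanRT = mean(rt),
                                percent_slow = mean(slow_trials) * 100,
                                percent_fast = mean(fast_trials) * 100)

# participants with more than 10% too-fast or too-slow trials
filter(participant_summary, percent_slow > 10 | percent_fast > 10)
```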
## Part 1.3: Exclude participants
Exclude from the dataset the participants that have less than 60% accuracy. Trials with a response time below 150 ms or above 5000 ms should also be excluded. Afterwards, check that the accuracy variable only contains 0 and 1.
**Hint:** Create a vector by using the `filter` function on the output of the last question in Part 1.2 and extract the participants that have a mean accuracy lower than .6, like so:
```
participants_to_exclude = filter(..., meanACC < .6)$participant
```
Since you are likely going to have many participants, it's a bit of a hassle to write `filter(data, participant != X)` for every participant number X. Therefore, you can build a loop that iterates through the participants to exclude and filters them out one by one, like so:
```
for (n in 1:length(participants_to_exclude)) {
data = data %>%
filter(participant != participants_to_exclude[n])
}
```
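An equivalent, shorter alternative (optional): the `%in%` operator drops all excluded participants in a single `filter` call, and the response-time cutoffs can be applied in the same step (remember that `rt` is now in seconds):
```
data = filter(data,
              !(participant %in% participants_to_exclude),
              rt >= .15, rt <= 5)
```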
## Part 1.4: Data visualization
Now it's time to visualize the data set.
- First, we want to have a look at how the accuracy and response time evolve across the blocks. Create two separate plots, one for response times and one for accuracy, where 6 lines (one per condition) show the development of accuracy/response times across the block numbers. This should more or less recreate Figure 2 from the paper: https://link.springer.com/article/10.3758/s13423-016-1135-1/figures/2 (excluding the top panel). What trends do you observe?
**Hint:** First, calculate the mean response times and accuracy per condition and block number using again `group_by` and `summarize`. Then, use the `geom_line()` function with the summarized data:
```
ggplot(data = ..., mapping = aes(x = block_number, y = ..., color=condition)) +
geom_line()
```
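For example, a possible sketch for the response-time plot (the accuracy plot is analogous; `block_summary` is just an example name):
```
block_summary = summarize(group_by(data, condition, block_number), mean_rt = mean(rt))

ggplot(data = block_summary, mapping = aes(x = block_number, y = mean_rt, color = condition)) +
  geom_line()
```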
- Second, we want to make two separate plots, one for accuracy and one for response times, showing the average performance in the 6 conditions. The condition should be plotted on the x-axis and the performance (either accuracy or rt) on the y-axis. You can choose whether you would like bar plots or point plots. Add error bars representing confidence intervals.
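**Hint:** One possible sketch for the response-time version, computing 95% confidence intervals as mean ± 1.96 standard errors (a normal approximation; other methods for the confidence intervals are fine too):
```
condition_summary = summarize(group_by(data, condition),
                              mean_rt = mean(rt),
                              se_rt = sd(rt) / sqrt(n()))

ggplot(condition_summary, aes(x = condition, y = mean_rt)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean_rt - 1.96 * se_rt,
                    ymax = mean_rt + 1.96 * se_rt), width = .2)
```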
## Part 1.5: Calculate the reward rate (not required)
In the paper, they define the reward rate as:
$$\text{reward rate} = \frac{PC}{MRT + ITI + FDT + (1 - PC) \cdot ET}$$
where MRT and PC refer to the mean correct response time and probability of a correct response, ITI is the inter-trial interval (i.e., 100 ms), FDT is the feedback display time (i.e., 300 ms), and ET is the error timeout (i.e., 500 ms).
By calculating the MRT and PC per block per participant (i.e., create a summary using `participant` and `block_number` as grouping variables), you can calculate the `reward_rate` by filling in the rest of the equation above:
```
### group data to calculate mean correct response times
grouped_data_pos = group_by(filter(data, accuracy > 0), participant, block_number)
### group data to calculate mean accuracy
grouped_data = group_by(data, participant, block_number)
### join the two
data_summary = full_join(summarise(grouped_data_pos, MRT = mean(rt)),
summarise(grouped_data, PC = mean(accuracy)))
### add condition variable for plotting
data_summary = full_join(data_summary,
distinct(data, participant, block_number, condition))
### and now calculate the reward rate as in the equation
data_summary = mutate(data_summary,
reward_rate = PC/(...))
```
When you are done, you can plot the reward rate as you did in Part 1.4 for response times and accuracy (so recreate the full Figure 2 from the paper), and also plot the mean reward rate per condition.
# Part 2: Statistical analyses
## Part 2.1
We now want to run two separate ANOVA models to test the three main effects (block number, block type, and feedback type) on response times and accuracy (if you calculated the reward rate, you have the option of also running the ANOVA on the reward rates).
This includes 3 steps:
- aggregate the data, so that you have mean response time and accuracy for each participant, block number, block type, and feedback type.
- run the ANOVAs on the aggregate data
- interpret the results
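As a sketch of the first two steps (it assumes a `feedback_type` column derived from the condition labels, analogously to `block_type`; here `block_number` enters as a numeric predictor, but you could also treat it as a factor):
```
# assumption: derive feedback_type from the recoded condition labels
data = mutate(data,
              feedback_type = case_when(str_detect(condition, "Low") ~ "LowInformation",
                                        str_detect(condition, "Medium") ~ "MediumInformation",
                                        str_detect(condition, "High") ~ "HighInformation"))

# step 1: aggregate to one mean per participant, block, block type, and feedback type
aggregated = summarize(group_by(data, participant, block_number, block_type, feedback_type),
                       mean_rt = mean(rt),
                       mean_accuracy = mean(accuracy))

# step 2: ANOVA on the aggregated response times (repeat with mean_accuracy)
anova_rt = aov(mean_rt ~ block_number + block_type + feedback_type, data = aggregated)
summary(anova_rt)
```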
## Part 2.2
Finally, we want to run a multilevel regression to test the three main effects (block number, block type, and feedback type) on response times, **without** aggregating the data. This means that you will have repeated measurements for each participant, block number, block type, and feedback type, and that you have to include `participant` as a grouping/hierarchical variable in your multilevel model.
This includes 2 steps:
- run the full model (random slopes and intercepts) on the full dataset; see the sketch below
- interpret the results
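A sketch of this model, assuming the `lme4` package (use whichever multilevel package was covered in class). Note that `block_type` and `feedback_type` vary between participants, so `block_number` is the only predictor that can get a random slope over `participant`:
```
library(lme4)

# random intercepts per participant and random slopes for block_number
mlm_rt = lmer(rt ~ block_number + block_type + feedback_type +
                (1 + block_number | participant),
              data = data)
summary(mlm_rt)
```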