-
Notifications
You must be signed in to change notification settings - Fork 0
/
CBASSED50_demo.qmd
342 lines (264 loc) · 11.1 KB
/
CBASSED50_demo.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
---
title: "CBASSED50 Demo"
author: "Luigi Colin, Yulia Iakovleva, Christian R Voolstra"
format: html
editor: visual
---
# About CBASSED50
CBASSED50 allows you to process CBASS data. To learn more about CBASSED50 see <https://aslopubs.onlinelibrary.wiley.com/doi/10.1002/lom3.10555>.
# Get Started
## Install CBASSED50
You need to execute this chunk only once to get the `CBASSED50` package installed. If there is a new release, please run to update package components.
```{r install-cbassed50, message=FALSE}
if(!require(devtools)){
install.packages("devtools")
}
devtools::install_github("reefgenomics/[email protected]", force=TRUE)
```
## Load Packages
```{r library-packages, message=FALSE}
library(dplyr)
library(tidyr)
library(ggplot2)
library(readxl)
library(rstudioapi)
library(RColorBrewer)
library(CBASSED50)
```
# Define Data Source
Here you can choose to either run a provided "Internal Dataset" or your own "Custom Dataset". To become familiar with the input file format and check that everything is running smoothly, we suggest to run the "Internal Dataset" first (see below). The internal dataset is also provided in the [`examples`](https://github.com/reefgenomics/CBASSED50/tree/main/examples) folder together with the output files.
## Internal Dataset
The first time you may want to run the code with the dataset that is already provided with the `CBASSED50` package.
If you want to specify your own dataset, go further to the next chunk.
To load the internal dataset, run the following chunk of code:
```{r load-data}
# To load internal dataset that is provided with the R package
data("cbass_dataset")
output_prefix <- "demo"
# Make sure that dataset is loaded and display several first rows
head(cbass_dataset)
```
If you want to know more about our internal dataset, you can call the R documentation:
```{r about-dataset}
?cbass_dataset
```
## Custom Dataset
### Requirements for Custom Data
Your data must contain mandatory columns to pass pre-processing and validation steps.
To retrieve the list of mandatory data columns, call the internal `mandatory_columns` function:
```{r get-mandatory-columns}
mandatory_columns()
```
Your data shouldn't contain any missing data, otherwise the row with missing values will be discarded.
### Specify Project Directory and Working Environment for Custom Data
If you want to load and analyze your data, run the following chunk of the code.
You don't need to execute the chunk below if you use the internal `cbass_dataset`.
Note, the `selectFile` function from the `rstudioapi` package works only in interactive execution and doesn't work on rendering.
```{r specify-working-directory}
# Get the input file path
input_data_path <- selectFile(
caption = "Select XLSX or CSV Input File")
# Read data based on file format
cbass_dataset <- read_data(input_data_path)
# To specify the prefix for output files
output_prefix <- tools::file_path_sans_ext(input_data_path)
rlog::log_info(paste("Your current directory is", getwd()))
rlog::log_info(paste("Your input filename is", basename(input_data_path)))
rlog::log_info(paste("The output files will be written into", output_prefix))
```
# Analyze
## Preprocess and Validate Data
Make your data tidy and validate:
```{r process-and-validate-cbass-dataset}
cbass_dataset <- preprocess_dataset(cbass_dataset)
validate_cbass_dataset(cbass_dataset)
```
## Explore ED5s, ED50s, and ED95s
First you need to decide which grouping property you want to use. For example, you want to group by all combinations of values that come from `Site` , `Condition`, `Species`, and `Timepoint` columns as one merged grouping property.
⚠️ Note, you should never use `Genotype` as a grouping property because this column is used as a `curveid` argument for `drm` modeling (see more about it [here](https://doseresponse.github.io/drc/reference/drm.html)).
Create models:
```{r fititng-dose-responce-models, warning=FALSE}
grouping_properties <- c("Site", "Condition", "Species", "Timepoint")
drm_formula <- "PAM ~ Temperature"
models <- fit_drms(cbass_dataset, grouping_properties, drm_formula, is_curveid = TRUE)
```
Get ED5s, ED50s, and ED95s from models:
```{r get-eds}
eds <- get_all_ed_by_grouping_property(models)
cbass_dataset <- define_grouping_property(cbass_dataset, grouping_properties) %>%
mutate(GroupingProperty = paste(GroupingProperty, Genotype, sep = "_"))
eds_df <-
left_join(eds, cbass_dataset, by = "GroupingProperty") %>%
select(names(eds), all_of(grouping_properties)) %>%
distinct()
head(eds_df)
write.csv(eds_df,
paste(output_prefix, "EDsdf.csv", sep = '_'),
row.names = FALSE)
```
ED5s, ED50s, and ED95s Boxplots:
You can choose colorblind-friendly palettes with `display.brewer.all(colorblindFriendly = T)`.
```{r display-ed50}
eds_boxplot <- eds_df %>% ggplot(
aes(x = Species, y = ED50, color = Condition)) +
geom_boxplot() +
stat_summary(
fun = mean,
geom = "text",
aes(label = round(after_stat(y), 2)), show.legend = F,
position = position_dodge(width = 0.75),
vjust = -1
) +
facet_grid(~ Site) +
ylab("ED50s - Temperatures [C°]")+
scale_color_brewer(palette = "Set2")
ggsave(
paste(output_prefix, "EDs_boxplot.pdf", sep = '_'),
eds_boxplot, width = 16, height = 9)
eds_boxplot
```
## Temperature Response Curve
Before predicting PAM values for plotting, let's explore if everything is fine with each genotype of the dataset:
```{r check-curves-without-curveid, warning=FALSE}
exploratory_curve <-
ggplot(data = cbass_dataset,
aes(
x = Temperature,
y = PAM,
# You can play around with the group value (e.g., Species, Site, Condition)
group = GroupingProperty,
color = Genotype)) +
geom_smooth(
method = drc::drm,
method.args = list(
fct = drc::LL.3()),
se = FALSE,
size = 0.7
) +
geom_point(size = 1.5) +
facet_grid(Species ~ Site ~ Condition) +
scale_color_brewer(palette = "Paired") # Colorblind-friendly palette
ggsave(
paste(output_prefix, "prelim_temprespcurve.pdf", sep = '_'),
exploratory_curve, width = 16, height = 9)
exploratory_curve
```
Predict PAM values for assayed temperature range:
```{r predict-PAM-values, warning=FALSE}
# First fit models without curveid
models <- fit_drms(cbass_dataset, grouping_properties, drm_formula, is_curveid = FALSE)
# The default number of values for range of temperatures is 100
temp_ranges <- define_temperature_ranges(cbass_dataset$Temperature, n=100)
predictions <- get_predicted_pam_values(models, temp_ranges)
```
You may get a warning `NaNs produced`. This can happen if PAM values at a higher temperature exceed PAM values at a lower temperature (the model assumes decreasing PAM values with increasing temperatures).
Pre-process data for visualization:
```{r preprocess-pam-predictions}
predictions_df <-
left_join(predictions,
define_grouping_property(cbass_dataset, grouping_properties) %>%
select(c(all_of(grouping_properties), GroupingProperty)),
by = "GroupingProperty",
relationship = "many-to-many") %>%
distinct()
```
Get ED5s, ED50s, and ED95s summary statistics for groupings:
```{r get-ed50-means}
summary_eds_df <- eds_df %>%
group_by(Site, Condition, Species, Timepoint) %>%
summarise(Mean_ED5 = mean(ED5),
SD_ED5 = sd(ED5),
SE_ED5 = sd(ED5) / sqrt(n()),
Conf_Int_5 = qt(0.975, df = n() - 1) * SE_ED5,
Mean_ED50 = mean(ED50),
SD_ED50 = sd(ED50),
SE_ED50 = sd(ED50) / sqrt(n()),
Conf_Int_50 = qt(0.975, df = n() - 1) * SE_ED50,
Mean_ED95 = mean(ED95),
SD_ED95 = sd(ED95),
SE_ED95 = sd(ED95) / sqrt(n()),
# The value 0.975 corresponds to the upper tail probability
# for a two-tailed t-distribution with a 95%
Conf_Int_95 = qt(0.975, df = n() - 1) * SE_ED95) %>%
mutate(across(c(Mean_ED50, SD_ED50, SE_ED50,
Mean_ED5, SD_ED5, SE_ED5,
Mean_ED95, SD_ED95, SE_ED95,
Conf_Int_5,Conf_Int_50,Conf_Int_95), ~round(., 2)))
summary_eds_df
write.csv(
summary_eds_df,
paste(output_prefix, "summaryEDs_df.csv", sep = '_'),
row.names = FALSE)
```
Join predictions and ED5s, ED50s, and ED95s summary data:
```{r join-predicitons-and-mean-ed50}
result_df <- predictions_df %>%
left_join(summary_eds_df, by = c("Site", "Condition", "Species", "Timepoint"))
```
Plot ED5s, ED50s, and ED95s curves for groupings:
```{r plot-temperature-responce-curve}
tempresp_curve <- ggplot(result_df,
aes(x = Temperature,
y = PredictedPAM,
group = GroupingProperty,
# You can customize the group here
color = Condition)) +
geom_line() +
geom_ribbon(aes(ymin = Upper,
ymax = Lower,
fill = Condition),
alpha = 0.2,
linetype = "dashed") +
geom_segment(aes(x = Mean_ED5,
y = 0,
xend = Mean_ED5,
yend = max(Upper)),
linetype = 3) +
geom_text(mapping=aes(x = Mean_ED5,
y = max(Upper) + 0.12,
label = round(Mean_ED5, 2)),
size = 3, angle = 90, check_overlap = T) +
geom_segment(aes(x = Mean_ED50,
y = 0,
xend = Mean_ED50,
yend = max(Upper)),
linetype = 3) +
geom_text(mapping=aes(x = Mean_ED50,
y = max(Upper) + 0.12,
label = round(Mean_ED50, 2)),
size = 3, angle = 90, check_overlap = T) +
geom_segment(aes(x = Mean_ED95,
y = 0,
xend = Mean_ED95,
yend = max(Upper)),
linetype = 3) +
geom_text(mapping=aes(x = Mean_ED95,
y = max(Upper) + 0.12,
label = round(Mean_ED95, 2)),
size = 3, angle = 90, check_overlap = T) +
facet_grid(Species ~ Site ~ Condition) +
# To add the real PAM and compare with predicted values
geom_point(data = cbass_dataset,
aes(x = Temperature,
y = PAM)) +
xlab("Temperature [C°]")+
scale_y_continuous(expand = c(0, 0.2))
ggsave(
paste(output_prefix, "temprespcurve.pdf", sep = '_'),
tempresp_curve,
width = 16, height = 9)
tempresp_curve
```
Curves display the predicted PAM values, the 95% confidence intervals, and mean ED5s, ED50s, and ED95s for groupings (vertical line).
Now you are ready to interpret your results! :)
# Explore the Output
By the end of this demo you will have 4 output files:
- `demo_EDs_boxplot.pdf`
- `demo_prelim_temprespcurve.pdf`
- `demo_temprespcurve.pdf`
- `demo_EDs_df.csv`
- `demo_summaryEDs_df.csv`
If you choose to use a custom dataset, please note that the output filenames will start with the input filename, rather than `demo_`. This feature ensures that the output files are easily identifiable and associated with the specific input data provided.
# Cite Us
If you use this software, please cite it as below.
> Yulia Iakovleva & Christian R Voolstra. (2023). CBASSED50: R package to process CBASS-derived PAM data. Zenodo. <https://doi.org/10.5281/zenodo.8370644>.