-
Notifications
You must be signed in to change notification settings - Fork 0
/
CopeModSelection.Rmd
235 lines (179 loc) · 13.5 KB
/
CopeModSelection.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
---
title: "Copepod Model Selection"
author: "Sarah Gaichas"
date: "`r Sys.Date()`"
output:
html_document:
code_fold: hide
link-citations: yes
csl: "canadian-journal-of-fisheries-and-aquatic-sciences.csl"
bibliography: zoopindex.bib
urlcolor: blue
---
## VAST model selection
Copepod categories based on discussion with ToR 1 subgroup:
* Calanus finmarchicus, a.k.a. Large copeopds SOE (used in small-large index)
* Large copepods ALL: Calanus finmarchicus, Metridia lucens, Calanus minor, Eucalanus spp., Calanus spp.
* Small copepods ALL: Centropages typicus, Pseudocalanus spp., Temora longicornis, Centropages hamatus, Paracalanus parvus, Acartia spp., Clausocalanus arcuicornis, Acartia longiremis, Clausocalanus furcatus, Temora stylifera, Temora spp., Tortanus discaudatus, Paracalanus spp.
* Small copeopods SOE (used in small-large index): Centropages typicus, Pseudocalanus spp., Temora longicornis, Centropages hamatus
Lets do the same model selection as previously. Two stages, first looking at spatial, spatio-temporal random effects and the second looking at covariates. Selection is done for spring and fall models for:
* Calanus finmarchicus (calfin_100m3) = "calfin" in table below,
* Large copepods (calfin_100m3, mlucens_100m3, calminor_100m3, euc_100m3, calspp_100m3) = "lgcopeALL",
* Small copepods (all) (ctyp_100m3, pseudo_100m3, tlong_100m3, cham_100m3, para_100m3, acarspp_100m3, clauso, acarlong_100m3, fur_100m3, ost_100m3, temspp_100m3, tort_100m3, paraspp_100m3) = "smallcopeALL" and
* Small copepods (SOE) (ctyp_100m3, pseudo_100m3, tlong_100m3, cham_100m3) = "smallcopeSOE".
Model selection script is
https://github.com/NOAA-EDAB/zooplanktonindex/blob/main/VASTscripts/VASTunivariate_zoopindex_modselection.R
Models were run using REML and without bias correction.
Two different observation models were applied. The default VAST index standardization (`purpose = "index2"` in `make_settings`) uses a Gamma distribution for positive catches and an alternative "Poisson-link delta-model" using log-link for numbers-density and log-link for biomass per number (`ObsModel= c(2,1)`).
We applied the default observation model to Calanus finmarchicus (calfin_100m3) and Large copepods (calfin_100m3 + mlucens_100m3 + calminor_100m3 + euc_100m3 + calspp_100m3) datasets.
The default was used for index standardization of stomach contents data for pelagic and benthic forage indices. It is intended for continuous data, which includes biomass data and "numbers standardized to a fixed area" (see section starting at line 239 in the VAST user manual [here](https://github.com/James-Thorson-NOAA/VAST/blob/main/manual/VAST_model_structure.pdf). I am interpreting zooplankton abundance per 100 cubic meters as numbers standardized to a fixed area (volume) in applying the Gamma observation model.
For data where there are some years where the species is present in all (or 0) samples, estimating the probability of encounter fails (or at least, VAST won't let you try). In these cases, the options are to treat intercepts representing temporal variability as random effects (by setting `RhoConfig` `Beta` or `Epsilon` entries to something other than 0), or to use a different link function.
We intend for our indices to potentially be used in assessment (though as a covariate rather than an index, so maybe I'm being too strict), so the recommendation is to "minimize covariance in the estimated index by excluding any temporal correlation on model components (i.e., the intercept is a fixed effect in each year, and the spatio-temporal term is independent in each year)" [@thorson_guidance_2019].
All of the small copepods datasets had at least one year where our small copepods groupings were encountered at all stations. None had years with 0 encounters.
Therefore, we used a different link function, the Poisson-link fixing encounter probability=1 for any year where all samples encounter the species. We kept all other settings for index standardization the same, but set (`ObsModel= c(2,4)`).
### Stage 1 results
```{r}
# from each output folder in pyindex,
outdir <- here::here("pyindex_modsel1")
moddirs <- list.dirs(outdir)
moddirs <- moddirs[-1]
# keep folder name
modnames <- list.dirs(outdir, full.names = FALSE)
# function to apply extracting info
getmodinfo <- function(d.name){
# read settings
modpath <- stringr::str_split(d.name, "/", simplify = TRUE)
modname <- modpath[length(modpath)]
settings <- read.table(file.path(d.name, "settings.txt"), comment.char = "",
fill = TRUE, header = FALSE)
n_x <- as.numeric(as.character(settings[(which(settings[,1]=="$n_x")+1),2]))
grid_size_km <- as.numeric(as.character(settings[(which(settings[,1]=="$grid_size_km")+1),2]))
max_cells <- as.numeric(as.character(settings[(which(settings[,1]=="$max_cells")+1),2]))
use_anisotropy <- as.character(settings[(which(settings[,1]=="$use_anisotropy")+1),2])
fine_scale <- as.character(settings[(which(settings[,1]=="$fine_scale")+1),2])
bias.correct <- as.character(settings[(which(settings[,1]=="$bias.correct")+1),2])
#FieldConfig
if(settings[(which(settings[,1]=="$FieldConfig")+1),1]=="Component_1"){
omega1 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+2),2])
omega2 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+3),1])
epsilon1 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+4),2])
epsilon2 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+5),1])
beta1 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+6),2])
beta2 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+7),1])
}
if(settings[(which(settings[,1]=="$FieldConfig")+1),1]=="Omega1"){
omega1 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+3),1])
omega2 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+4),1])
epsilon1 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+3),2])
epsilon2 <- as.character(settings[(which(settings[,1]=="$FieldConfig")+4),2])
beta1 <- "IID"
beta2 <- "IID"
}
#RhoConfig
rho_beta1 <- as.numeric(as.character(settings[(which(settings[,1]=="$RhoConfig")+3),1]))
rho_beta2 <- as.numeric(as.character(settings[(which(settings[,1]=="$RhoConfig")+3),2]))
rho_epsilon1 <- as.numeric(as.character(settings[(which(settings[,1]=="$RhoConfig")+4),1]))
rho_epsilon2 <- as.numeric(as.character(settings[(which(settings[,1]=="$RhoConfig")+4),2]))
# read parameter estimates, object is called parameter_Estimates
if(file.exists(file.path(d.name, "parameter_estimates.RData"))) {
load(file.path(d.name, "parameter_estimates.RData"))
AIC <- parameter_estimates$AIC[1]
converged <- parameter_estimates$Convergence_check[1]
fixedcoeff <- unname(parameter_estimates$number_of_coefficients[2])
randomcoeff <- unname(parameter_estimates$number_of_coefficients[3])
}else if(file.exists(file.path(d.name, "parameter_estimates.txt"))){
reptext <- readLines(file.path(d.name, "parameter_estimates.txt"))
AIC <- as.double(reptext[grep(reptext, pattern = "AIC")+1])
converged <- reptext[grep(reptext, pattern = "Convergence_check")+1]
fixedcoeff <- as.integer(stringr::str_split(reptext[grep(reptext, pattern = "number_of_coefficients")+2],
boundary("word"))[[1]][2])
randomcoeff <- as.integer(stringr::str_split(reptext[grep(reptext, pattern = "number_of_coefficients")+2],
boundary("word"))[[1]][3])
}else{
AIC <- NA_real_
converged <- NA_character_
fixedcoeff <- NA_integer_
randomcoeff <- NA_integer_
}
#index <- read.csv(file.path(d.name, "Index.csv"))
# return model attributes as a dataframe
out <- data.frame(modname = modname,
n_x = n_x,
grid_size_km = grid_size_km,
max_cells = max_cells,
use_anisotropy = use_anisotropy,
fine_scale = fine_scale,
bias.correct = bias.correct,
omega1 = omega1,
omega2 = omega2,
epsilon1 = epsilon1,
epsilon2 = epsilon2,
beta1 = beta1,
beta2 = beta2,
rho_epsilon1 = rho_epsilon1,
rho_epsilon2 = rho_epsilon2,
rho_beta1 = rho_beta1,
rho_beta2 = rho_beta2,
AIC = AIC,
converged = converged,
fixedcoeff = fixedcoeff,
randomcoeff = randomcoeff#,
#index = index
)
return(out)
}
modcompare <- purrr::map_dfr(moddirs, getmodinfo)
```
Models compared using REML are identified by model name ("modname" in Table Sb\@ref(tab:modsel1)) which always includes all prey aggregated, season ("all" for annual models of months 1-12, "fall" for models of months 7-12, and "spring" for models of months 1-6), number of knots (500 for all models), and which fixed and random spatial and spatio-temporal effects were included in which linear predictor (1 or 2). The names for model options and associated VAST model settings are:
Model selection 1 (spatial, spatio-temporal effects, no covariates) options (following @ng_predator_2021) and naming:
* All models set Use_REML = TRUE in fit_model specifications.
* Modeled effects, model name suffix, and VAST settings by model:
1. "_alleffectson" = Spatial and spatio-temporal random effects plus anisotropy in both linear predictors; FieldConfig default (all IID)
1. "_noaniso" = Spatial and spatio-temporal random effects in both linear predictors with anisotopy turned off; FieldConfig default (all IID) and `use_anisotopy = FALSE`
1. "_noomeps2" = Spatial and spatio-temporal random effects plus anisotropy only in linear predictor 1; FieldConfig = 0 for Omega2, Epsilon2
1. "_noomeps2_noaniso" = Spatial and spatio-temporal random effects only in linear predictor 1 with anisotopy turned off; FieldConfig = 0 for Omega2, Epsilon2 and `use_anisotopy = FALSE`
1. "_noomeps2_noeps1" = Spatial random effects plus anisotropy only in linear predictor 1; FieldConfig = 0 for Omega2, Epsilon2, Epsilon1
1. "_noomeps2_noeps1_noaniso" = Spatial random effects only in linear predictor 1 with anisotopy turned off; FieldConfig = 0 for Omega2, Epsilon2, Epsilon1 and `use_anisotopy = FALSE`
1. "_noomeps12" = Anisotropy, but no spatial or spatio-temporal random effects in either linear predictor; FieldConfig = 0 for both Omega, Epsilon
1. "_noomeps12_noaniso" = No spatial or spatio-temporal random effects in either linear predictor; FieldConfig = 0 for both Omega, Epsilon and `use_anisotopy = FALSE`
```{r modsel1}
modselect <- modcompare |>
dplyr::mutate(season = dplyr::case_when(stringr::str_detect(modname, "_fall_") ~ "Fall",
stringr::str_detect(modname, "spring") ~ "Spring",
stringr::str_detect(modname, "_all_") ~ "Annual",
TRUE ~ as.character(NA))) |>
dplyr::mutate(converged2 = dplyr::case_when(stringr::str_detect(converged, "no evidence") ~ "likely",
stringr::str_detect(converged, "is likely not") ~ "unlikely",
TRUE ~ as.character(NA))) |>
dplyr::mutate(copegroup = stringr::str_extract(modname, "[^_]+")) |>
#dplyr::mutate(modname = str_extract(modname, '(?<=allagg_).*')) |>
dplyr::group_by(copegroup, season) |>
dplyr::mutate(deltaAIC = AIC-min(AIC)) |>
dplyr::select(copegroup, modname, season, deltaAIC, fixedcoeff,
randomcoeff, use_anisotropy,
omega1, omega2, epsilon1, epsilon2,
beta1, beta2, AIC, converged2) |>
dplyr::arrange(copegroup, season, AIC)
# DT::datatable(modselect, rownames = FALSE,
# options= list(pageLength = 25, scrollX = TRUE),
# caption = "Comparison of delta AIC values using Restricted Maxiumum Likelihood (REML) for alternative fixed and random effects model structures. See text for model descriptions.")
flextable::flextable(modselect |>
dplyr::select(-c(use_anisotropy,
omega1, omega2, epsilon1, epsilon2,
beta1, beta2))) |>
flextable::set_header_labels(copegroup = "Copepod group",
modname = "Model name",
season = "Season",
deltaAIC = "dAIC",
fixedcoeff = "N fixed",
randomcoeff = "N random",
converged2 = "Convergence") |>
flextable::set_caption("Comparison of delta AIC (dAIC) values using Restricted Maxiumum Likelihood (REML) for alternative fixed and random effects model structures, with number of fixed (N fixed) and random (N random) coefficients. See text for model descriptions associated with each model name.") |>
flextable::fontsize(size = 9, part = "all") |>
flextable::colformat_double(digits = 2) |>
flextable::set_table_properties(layout = "autofit", width = 1)
```
Using REML, models including spatial and spatio-temporal random effects as well as anisotropy were best supported by the data. This was true for the spring dataset and the fall dataset for calfin, lgcopeALL, smallcopeALL, and smallcopeSOE.
Full outputs from stage 1 model selection are posted to google drive [here](https://drive.google.com/drive/folders/1szQzBbn-bQosPK4AqKRbfbduGGMM-BDG?usp=drive_link).
We haven't done stage 2 yet; thoughts on what might be appropriate covariates (if any) are welcome.
## References