-
Notifications
You must be signed in to change notification settings - Fork 24
/
Copy pathREADME.Rmd
541 lines (417 loc) · 24.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
---
output: github_document
bibliography: inst/REFERENCES.bib
---
# modleR: a workflow for ecological niche models
[![:registry status badge](https://andreasancheztapia.r-universe.dev/badges/:registry)](https://andreasancheztapia.r-universe.dev)
<!-- badges: start -->
[![R-CMD-check](https://github.com/Model-R/modleR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/Model-R/modleR/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
<img src="./vignettes/modleR.png" align="right" alt="" width="120" />
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
eval = TRUE,
cache = FALSE,
fig.path = "man/figures/README-"
)
```
__modleR__ is a workflow based on package __dismo__ [@hijmans_dismo_2017], designed to automatize some of the common steps when performing ecological niche models. Given the occurrence records and a set of environmental predictors, it prepares the data by cleaning for duplicates, removing occurrences with no environmental information and applying some geographic and environmental filters. It executes crossvalidation or bootstrap procedures, then it performs ecological niche models using several algorithms, some of which are already implemented in the `dismo` package, and others come from other packages in the R environment, such as glm, Support Vector Machines and Random Forests.
# Citation
Andrea Sánchez-Tapia, Sara Ribeiro Mortara, Diogo Souza Bezerra Rocha, Felipe Sodré Mendes Barros, Guilherme Gall, Marinez Ferreira de Siqueira. modleR: a modular workflow to perform ecological niche modeling in R. https://www.biorxiv.org/content/10.1101/2020.04.01.021105v1
# Installing
Currently __modleR__ can be installed from GitHub:
```{r install, eval = F}
# Without vignette
remotes::install_github("Model-R/modleR", build = TRUE)
# With vignette
remotes::install_github("Model-R/modleR",
build = TRUE,
dependencies = TRUE,
build_opts = c("--no-resave-data", "--no-manual"),
build_vignettes = TRUE)
```
__Note regarding vignette building__: the default parameters in `build_opts`
include `--no-build-vignettes`. In theory, removing this will include the
vignette on the installation but we have found that `build_vignettes = TRUE` is
also necessary. During installation, R may ask to install or update some
packages. If any of these return an error you can install them apart by running
`install.packages()` and retry. When building the vignette, package __rJava__
and a JDK will be needed. Also, make sure that the maxent.jar file is available
and in the `java` folder of package __dismo__. Please download it [here](
http://www.cs.princeton.edu/~schapire/maxent/). Vignette building may take a
while during installation.
Packages __kuenm__ and __maxnet__ should be installed from GitHub:
```r
remotes::install_github("marlonecobos/kuenm")
remotes::install_github("mrmaxent/maxnet")
```
# The workflow
The workflow consists of mainly four functions that should be used sequentially.
```{r workflow, echo = FALSE}
knitr::include_graphics("vignettes/fig01_workflow.jpg", dpi = 150)
```
1. Setup: `setup_sdmdata()` prepares and cleans the data, samples the pseudoabsences, and organizes the experimental design (bootstrap, crossvalidation or repeated crossvalidation). It creates a metadata file with details for the current round and a sdmdata file with the data used for modeling
2. Model fitting and projecting: `do_any()` makes the ENM for one algorithm and partition; optionally, `do_many()` calls `do_any()` to fit multiple algorithms
3. Partition joining: `final_model()` joins the partition models into a model per species per algorithm
4. Ensemble: `ensemble_model()` joins the different models per algorithm into an ensemble model (algorithmic consensus) using several methods.
## Folder structure created by this package
__modleR__ writes the outputs in the hard disk, according to the following folder structure:
```bash
models_dir
├── projection1
│ ├── data_setup
│ ├── partitions
│ ├── final_models
│ └── ensemble_models
└── projection2
├── data_setup
├── partitions
├── final_models
└── ensemble_models
```
+ We define a _partition_ as the individual modeling round (one training and test data set and one algorithm)
+ We define the _final models_ as joining together the partitions and obtaining __one model per species per algorithm__
+ _Ensemble_ models join together the results obtained by different algorithms [@araujo_ensemble_2007]
+ When projecting models into the present, the projection folder is called `present`, other projections will be named after their environmental variables
+ You can set `models_dir` wherever you want in the hard disk, but if you do not modify the default value, it will create the output under the working directory (its default value is `./models`, where the period points to the working directory)
+ The _names_ of the `final` and `ensemble` folders can be modified, but __the nested subfolder structure will remain the same__. If you change `final_models` default value (`"final_model"`) you will need to include the new value when calling `ensemble_model()` (`final_dir = "[new name]"`), to indicate the function where to look for models. This partial flexibility allows for experimenting with final model and ensemble construction (by runnning final or ensemble twice in different output folders, for example).
## The example dataset
__modleR__ comes with example data, a list called `example_occs` with
occurrence data for four species, and predictor variables called
`example_vars`.
```{r load, echo=FALSE, eval=TRUE, include = F}
devtools::load_all() # for development
```
```{r library, echo=TRUE, eval=FALSE}
library(modleR)
```
```{r species, echo = TRUE, eval = TRUE}
str(example_occs)
species <- names(example_occs)
species
```
```{r dataset, fig.width = 5, fig.height = 5, fig.cap = "Figure 1. The example dataset: predictor variables and occurrence for four species.", eval = TRUE, fig.show='hold'}
library(sp)
par(mfrow = c(2, 2), mar = c(2, 2, 3, 1))
for (i in 1:length(example_occs)) {
plot(!is.na(example_vars[[1]]),
legend = FALSE,
main = species[i],
col = c("white", "#00A08A"))
points(lat ~ lon, data = example_occs[[i]], pch = 19)
}
par(mfrow = c(1, 1))
```
We will filter the `example_occs` file to select only the data for the first species:
```{r occs, message = FALSE, eval = TRUE}
occs <- example_occs[[1]]
```
## Cleaning and setting up the data: `setup_sdmdata()`
The first step of the workflow is to setup the data, that is, to partition it
according to each project needs, to sample background pseudoabsences and to
apply some data cleaning procedures, as well as some filters. This is done by
function `setup_sdmdata()`
`setup_sdmdata()` has a large number of parameters:
```{r args_setup_sdmdata, eval = TRUE}
args(setup_sdmdata)
```
+ `species_name` is the name of the species to model
+ `occurrences` is the data frame with occurrences, lat and lon are the names of the columns for latitude and longitude, respectively. If they are already named `lat` and `lon` they need not be specified.
+ `predictors`: is the rasterStack of the environmental variables
There are a couple options for data cleaning:
+ `clean_dupl` will delete exact duplicates in the occurrence data
+ `clean_nas` will delete any occurrence with no environmental data in the predictor set
+ `clean_uni` will leave only one occurrence per pixel
The function also sets up different experimental designs:
+ `partition_type` can be either bootstrap or k-fold crossvalidation
+ `boot_n` and `cv_n` perform repeated bootstraps and repeated k-fold crossvalidation, respectively
+ `boot_proportion` sets the proportion of data to be sampled as training set (defaults to 0.8)
+ `cv_partitions` sets the number of partitions in the k-fold crossvalidations (defaults to 3) but overwrites part when n < 10, setting part to the number of occurrence records (a jacknife partition).
Pseudoabsence sampling is performed by function has also some options:
+ `real_absences` can be used to specify a set of user-defined absences, with species name, lat and lon columns
+ `geo_filt` will eliminate records that are at less than `geo_filt_dist` between them, in order to control for spatial autocorrelation
+ `buffer_type`: can build a distance buffer around the occurrence points, by taking either the maximal, median or mean distance between points. It can also take a user-defined shapefile as the area for pseudoabsence sampling
+ `env_filter` calculates the euclidean distance and removes the closest areas in the environmental space from the sampling of pseudoabsences
Pseudoabsence points will be sampled (using `dismo::randomPoints()`) _within_ the buffer and outside the environmental filter, in order to control for the area accessible to the species (M in the BAM diagram).
+ `seed`: for reproducibility purposes
```{r sdmdata1sp, eval = TRUE}
test_folder <- "~/modleR_test"
sdmdata_1sp <- setup_sdmdata(species_name = species[1],
occurrences = occs,
predictors = example_vars,
models_dir = test_folder,
partition_type = "crossvalidation",
cv_partitions = 5,
cv_n = 1,
seed = 512,
buffer_type = "mean",
png_sdmdata = TRUE,
n_back = 500,
clean_dupl = TRUE,
clean_uni = TRUE,
clean_nas = TRUE,
geo_filt = FALSE,
geo_filt_dist = 10,
select_variables = TRUE,
sample_proportion = 0.5,
cutoff = 0.7)
```
+ The function will return a `sdmdata` data frame, with the groups for training and test in bootstrap or crossvalidation, a `pa` vector that marks presences and absences, and the environmental dataset. This same data frame will be written in the hard disk, as `sdmdata.txt`
+ It will also write a `metadata.txt` with the parameters of the latest modeling round. If there has been a cleaning step, it will show different values in the "original.n" and "final.n" columns.
+ __NOTE:__ `setup_sdmdata` will check if there's a prior folder structure and `sdmdata.txt` and `metadata.txt` files, in order to avoid repeating the data partitioning.
+ If a call to the function encounters previously written metadata, it will check if the current round has the same parameters and skip the data partitioning. A message will be displayed:
`#> metadata file found, checking metadata`
`#> same metadata, no need to run data partition`
+ If a previous metadata file is found but it has different metadata (i.e. there is an inconsistency between the existing metadata and the current parameters), it will run the function with the current parameters.
## Fitting a model per partition: `do_any()` and `do_many()`
Functions `do_any()` and `do_many()` create a *model per partition, per algorithm*.
The difference between these functions that `do_any()` performs modeling for one
individual algorithm at a time, that can be chosen by using parameter `algorithm`,
while `do_many()` can select multiple algorithms, with TRUE or FALSE statements (just as BIOMOD2 functions do).
The available algorithms are:
+ `"bioclim"`, `"maxent"`, `"mahal"`, `"domain"`, as implemented in __dismo__ package [@hijmans_dismo_2017],
+ Support Vector Machines (SVM), as implemented by packages __kernlab__ [`svmk` @karatzoglou_kernlab_2004] and __e1071__ [`svme` @meyer_e1071_2017],
+ GLM from base R, here implemented with a stepwise selection approach
+ Random Forests [from package __randomForest__ @liaw_classification_2002]
+ Boosted regression trees (BRT) as implemented by `gbm.step()` function in __dismo__ package [@hastie_elements_2001; @elith_working_2009].
Details for the implementation of each model can be accessed in the documentation of the function.
Here you can see the differences between the parameters of both functions. `do_many()` calls several instances of `do_any()` Sometimes you may only want to call `do_many()` but for better control and parallelization by algorithm it may be better to call `do_any()` individually.
```{r args_do_any_do_many, eval = TRUE}
args(do_any)
args(do_many)
```
Calling `do_many()` and setting `bioclim = TRUE` is therefore equivalent to call `do_any()` and set `algorithm = "bioclim"`.
```{r do_any_eval, echo = TRUE, eval = TRUE, results= "hide", message=F, warning= FALSE}
sp_maxnet <- do_any(species_name = species[1],
algorithm = "maxnet",
predictors = example_vars,
models_dir = test_folder,
png_partitions = TRUE,
write_bin_cut = FALSE,
equalize = TRUE,
write_rda = TRUE)
```
The resulting object is a table with the performance metrics, but the actual output is written on disk
```{r sp_maxnet}
sp_maxnet
```
The following lines call for bioclim, GLM, random forests, BRT, svme (from package __e1071__), and smvk (from package __kernlab__)
```{r do_many_show, echo = TRUE, eval = TRUE, results = 'hide', message = FALSE, warning = FALSE}
many <- do_many(species_name = species[1],
predictors = example_vars,
models_dir = test_folder,
png_partitions = TRUE,
write_bin_cut = FALSE,
write_rda = TRUE,
bioclim = TRUE,
domain = FALSE,
glm = TRUE,
svmk = TRUE,
svme = TRUE,
maxent = FALSE,
maxnet = TRUE,
rf = TRUE,
mahal = FALSE,
brt = TRUE,
equalize = TRUE)
```
In addition:
+ `mask`: will crop and mask the partition models into a ShapeFile
+ `png_partitions` will create a png file of the output
At the end of a modeling round, the partition folder containts:
+ A `.tif` file for each partition, continuous, binary and cut by the threshold that maximizes its TSS (TSSmax). Its name will indicate the algorithm, the type of model (cont, bin or cut), the name of the species, the run and partition.
+ Figures in `.png` to explore the results readily, without reloading them into R or opening them in a SIG program. The creation of these figures can be controlled with the `png_partitions` parameter.
+ A `.txt` table with the evaluation data for each partition: `evaluate_[Species name ]_[partition number]_[algorithm].txt`. These files will be read by the `final_model()` function, to generate the final model per species.
+ A file called `sdmdata.txt` with the data used for each partition
+ A file called `metadata.txt` with the metadata of the current modeling round.
+ An optional `.png` image of the data (controlled by parameter `png_sdmdata = TRUE`)
## Joining partitions: `final_model()`
There are many ways to create a final model per algorithm per species. `final_model()` follows the following logic:
```{r finalfig, echo = FALSE}
knitr::include_graphics("vignettes/fig05_finalmodel.png", dpi = 150)
```
+ The partitions that will be joined can be the raw, uncut models, or the binary models from the previous step, they form a `raster::rasterStack()` object.
+ The means for the raw models can be calculated (`raw_mean`)
+ From `raw_mean`, a binary model can be obtained by cutting it by the mean threshold that maximizes the selected performance metric for each partition (`bin_th_par`), this is `raw_mean_th`. From this, values above the threshold can be revovered (`raw_mean_cut`).
+ In the case of binary models, since they have already been transformed into binary, a mean can be calculated (`bin_mean`). This `bin_mean` reflects the consensus between partitions, and its scale is categorical.
+ From `bin_mean`, a specific consensus level can be chosen (i.e. how many of the models predict an area, `consensus_level`) and the resulting binary model can be built (`bin_consensus`). The parameter `consensus_level` allows to set this level of consensus (defaults to 0.5: majority consensus approach).
+ NOTE: The final models can be done using a subset of the algorithms avaliable on the hard disk, using the parameter `algorithms`. If left unspecified, all algorithms listed in the `evaluate` files will be used.
```{r final_model, eval = TRUE}
args(final_model)
```
```{r final, echo = TRUE, eval = TRUE, results = "hide", message=F}
final_model(species_name = species[1],
algorithms = NULL, #if null it will take all the algorithms in disk
models_dir = test_folder,
which_models = c("raw_mean",
"bin_mean",
"bin_consensus"),
consensus_level = 0.5,
uncertainty = TRUE,
overwrite = TRUE)
```
`final_model()` creates a .tif file for each final.model (one per algorithm) under the specified folder (default: `final_models`)
The `raw_mean` final models for each algorithm are these:
```{r plot_final, echo = F, fig.width = 7, fig.height = 6, eval = TRUE}
final.folder <- list.files(test_folder,
recursive = TRUE,
pattern = "final_models",
include.dirs = TRUE,
full.names = TRUE)
final_mods <- list.files(final.folder,
full.names = TRUE,
pattern = "raw_mean.+tif$",
recursive = TRUE)
final_models <- raster::stack(final_mods)
names(final_models) <- sapply(strsplit(names(final_models),
paste0(species[1], '_')),
function(x) x[2])
plot(final_models)
```
## Algorithmic consensus with `ensemble_model()`
The fourth step of the workflow is joining the models for each algorithm into a final ensemble model. `ensemble_model()` calculates the mean, standard deviation, minimum and maximum values of the final models and saves them under the folder specified by `ensemble_dir`. It can also create these models by a consensus rule (what proportion of final models predict a presence in each pixel, 0.5 is a majority rule, 0.3 would be 30% of the models).
`ensemble_model()` uses a `which_final` parameter -analog to `which_model` in `final_model()` to specify which final model(s) (Figure 2) should be assembled together (the default is a mean of the raw continuous models: `which_final = c("raw_mean")`).
```{r ensemble_model, eval = TRUE, message=F, warning=F}
args(ensemble_model)
ens <- ensemble_model(species_name = species[1],
occurrences = occs,
performance_metric = "pROC",
which_ensemble = c("average",
"best",
"frequency",
"weighted_average",
"median",
"pca",
"consensus"),
consensus_level = 0.5,
which_final = "raw_mean",
models_dir = test_folder,
overwrite = TRUE) #argument from writeRaster
```
```{r ensplot, fig.width = 7, fig.height = 6}
plot(ens)
```
# Workflows with multiple species
Our `example_occs` dataset has data for four species.
An option to do the several models is to use a `for` loop
```{r forloop, eval = FALSE}
args(do_many)
args(setup_sdmdata)
for (i in 1:length(example_occs)) {
sp <- species[i]
occs <- example_occs[[i]]
setup_sdmdata(species_name = sp,
models_dir = "~/modleR_test/forlooptest",
occurrences = occs,
predictors = example_vars,
buffer_type = "distance",
dist_buf = 4,
write_buffer = TRUE,
clean_dupl = TRUE,
clean_nas = TRUE,
clean_uni = TRUE,
png_sdmdata = TRUE,
n_back = 1000,
partition_type = "bootstrap",
boot_n = 5,
boot_proportion = 0.7
)
}
for (i in 1:length(example_occs)) {
sp <- species[i]
do_many(species_name = sp,
predictors = example_vars,
models_dir = "~/modleR_test/forlooptest",
png_partitions = TRUE,
bioclim = TRUE,
maxnet = FALSE,
rf = TRUE,
svmk = TRUE,
svme = TRUE,
brt = TRUE,
glm = TRUE,
domain = FALSE,
mahal = FALSE,
equalize = TRUE,
write_bin_cut = TRUE)
}
for (i in 1:length(example_occs)) {
sp <- species[i]
final_model(species_name = sp,
consensus_level = 0.5,
models_dir = "~/modleR_test/forlooptest",
which_models = c("raw_mean",
"bin_mean",
"bin_consensus"),
uncertainty = TRUE,
overwrite = TRUE)
}
for (i in 1:length(example_occs)) {
sp <- species[i]
occs <- example_occs[[i]]
ensemble_model(species_name = sp,
occurrences = occs,
which_final = "bin_consensus",
png_ensemble = TRUE,
models_dir = "~/modleR_test/forlooptest")
}
```
Another option is to use the `purrr` package [@henry_purrr_2017].
```{r purrr example, eval = FALSE}
library(purrr)
example_occs %>% purrr::map2(.x = .,
.y = as.list(names(.)),
~ setup_sdmdata(species_name = .y,
occurrences = .x,
partition_type = "bootstrap",
boot_n = 5,
boot_proportion = 0.7,
clean_nas = TRUE,
clean_dupl = TRUE,
clean_uni = TRUE,
buffer_type = "distance",
dist_buf = 4,
predictors = example_vars,
models_dir = "~/modleR_test/temp_purrr",
n_back = 1000))
species %>%
as.list(.) %>%
purrr::map(~ do_many(species_name = .,
predictors = example_vars,
models_dir = "~/modleR_test/temp_purrr",
bioclim = TRUE,
maxnet = FALSE,
rf = TRUE,
svme = TRUE,
svmk = TRUE,
domain = FALSE,
glm = TRUE,
mahal = FALSE,
brt = TRUE,
equalize = TRUE))
```
```{r purrr_final, eval = FALSE}
species %>%
as.list(.) %>%
purrr::map(~ final_model(species_name = .,
consensus_level = 0.5,
models_dir = "~/modleR_test/temp_purrr",
which_models = c("raw_mean",
"bin_mean",
"bin_consensus"),
overwrite = TRUE))
```
```{r purrr_ensemble, eval = FALSE}
example_occs %>% purrr::map2(.x = .,
.y = as.list(names(.)),
~ ensemble_model(species_name = .y,
occurrences = .x,
which_final = "raw_mean",
png_ensemble = TRUE,
models_dir = "~/modleR_test/temp_purrr",
overwrite = TRUE))
```
These workflows can also be paralellized by species or species algorithms
# References