diff --git a/.Rbuildignore b/.Rbuildignore index dbf0144..2dfa66d 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -2,3 +2,5 @@ ^\.Rproj\.user$ /develop/ ^\.github$ +^doc$ +^Meta$ diff --git a/.gitignore b/.gitignore index 0942889..8a1f74c 100644 --- a/.gitignore +++ b/.gitignore @@ -11,4 +11,10 @@ *.grd *.gri test.r +<<<<<<< HEAD +dylan_train +======= inst/doc +>>>>>>> aaef9acd866713c677bf5699831e99702140c015 +/doc/ +/Meta/ diff --git a/vignettes/Get_Started_1.Rmd b/vignettes/Get_Started_1.Rmd deleted file mode 100644 index 0bef68c..0000000 --- a/vignettes/Get_Started_1.Rmd +++ /dev/null @@ -1,171 +0,0 @@ ---- -title: "1. Get started with rbms - phenology" -author: "Reto Schmucki (UKCEH)" -date: "28 November 2019" -output: - pdf_document: default -subtitle: From counts to flight curve ---- ---------- -In this tutorial, we show how to fit the flight curve function on butterfly counts recorded on a weekly base. We use the R functions implemented in the `rbms` package and the data bundle within the same package. The data are genuine Butterfly Monitoring Scheme counts with transect visit dates. The flight curve computation is based on spline fitted on the counts collected across multiple sites and standardised to sum to 1 (area under the curve is one). - -##### 1. load package and data included in the package - -```{r data} -library(rbms) -data(m_visit) -data(m_count) -``` - -Here, the visit and count data are both packaged in data.table format but can also be provided as data.frame. The function converts them into data.table because this format is more efficient for handling large data sets. While the input format can vary, **the header names need to be consistent**, and some columns are essential for the functions to work. - -Visit data represent the visit date at which each site was visited, and butterflies were monitored. If no butterfly was observed during a visit, the abundance for that specific visit would be set to zero [0]. This allows to subset positive non-zero counts from the count data set, which result in smaller objects to handle. The visit data may contain many columns, but only two are essential for the function. - - -1) **SITE_ID** (can be numbers of characters and treated as a non-numeric factor) -2) **DATE** (by default, the format is ā€œ%Y-%m-%dā€ (e.g., 2019-11-28]). If a different format needs to be specified, use the argument DateFormat) - -```{r data_visit, echo = FALSE} -m_visit -``` - -Count data must be provided in columns with specific headers; more column can be provided, but rbms only use the following four: - -1) **SITE_ID** -2) **DATE** -3) **SPECIES** -4) **COUNT** - -```{r data_count, echo = FALSE} -m_count -``` - -##### **2.** organise the data to cover the time period and monitoring season of the BMS - -With the visit and count data, we need to merge them into a new data.table object that covers the entire time-series of interest. This time-series is structured with the start and end date of the monitoring season and define on a weekly or a daily base (resolution of the flight curve). In a first step, we initialise the time-series with day-week-month-year information. - -**2.1.** First, we initialise a time-series with day-week-month-year information. - -```{r init_timeseries} -ts_date <- rbms::ts_dwmy_table(InitYear = 2000, LastYear = 2003, WeekDay1 = 'monday') -``` - -**2.2.** Ddd the monitoring season to the time-series - -We add the monitoring season to the new time-series, providing the StartMonth and EndMonth arguments. The definition of the monitoring season can be refined with more arguments (StartDay, EndDay). We also define the resolution of the time-series (weekly or daily), where TimeUnit = 'w' will compute the flight curve based on weekly counts. In the alternative, 'd' is used for building a daily based flight curve. The ANCHOR argument adds zeros (0) before and after the monitoring season. This ensures that the flight curve starts and end at zero. - -```{r monitoring_season} -ts_season <- rbms::ts_monit_season(ts_date, StartMonth = 4, EndMonth = 9, StartDay = 1, EndDay = NULL, CompltSeason = TRUE, Anchor = TRUE, AnchorLength = 2, AnchorLag = 2, TimeUnit = 'w') -``` -> NOTE: for species with overwintering adult and early counts, having an Anchor set to zero might sound wrong, and we are currently working on finding an alternative for these cases to represent the flight curve of those species better. - - -**2.3.** Add site visits to the time-series -After the monitoring season is defined for a specific time-period and monitoring season, we use the `ts_monit_site()` function to expand and inform the time-series with the site visits. For this, we use the visit data and link it with the time series contained in ts_season object. - -```{r visit_season} -ts_season_visit <- rbms::ts_monit_site(m_visit, ts_season) -``` - -**2.4.** Add observed count - -The observed counts are then added to the time-series contained in the `data.table` object, using the count and selecting the species of interest with the argument `sp`. Here we use the count recorded for species "2" (species names can also be a string of characters). - -```{r count_visit} -ts_season_count <- rbms::ts_monit_count_site(ts_season_visit, m_count, sp = 2) -``` - -The resulting `data.table` object contains zeros and positive counts recorded along each BMS transects for species `2`, over the entire time-series, but only within the focal monitoring season. This time-series contains the week or days of the counts. When the counts are missing because the site was not visited, the count is informed as NA. - -```{r count_visit_obj, echo=FALSE} -ts_season_count -``` - -##### **3.** Compute the yearly flight curve for the data - -With the object constructed above, we can compute the flight curve for each year with sufficient data. The `flight_curve` function assumes that the annual phenology shares the same shape across sites. - -The objective of this function is to extract this shape that we call the flight curve. Here, we want to impose some minimal threshold to control the quality of the data used to inform this model. This is done by setting the Minimum number of visits MinNbrVisit, the minimum number of occurrences MinOccur and the number of sites MinNbrSite to use to fit the model. - -These values can influence the model, and its sensitivity will depend on the species and the data set. Still, as a minimum requirement, MinOccur should be set >= 2, the MinNbrVisit > MinOccur and MinNbrSite >= 5. These thresholds affect the data that inform your model and the resulting flight curve. When higher values are chosen, fewer sites will be available and if insufficient, the thresholds need to be revised. - - -```{r flight_curve} - ts_flight_curve <- rbms::flight_curve(ts_season_count, NbrSample = 300, MinVisit = 5, MinOccur = 3, MinNbrSite = 5, MaxTrial = 4, GamFamily = 'nb', SpeedGam = FALSE, CompltSeason = TRUE, SelectYear = NULL, TimeUnit = 'w') -``` -> NOTE: for the `flight_curve` function, you will also have to define some parameters for the distribution for the GAM model as well as the maximum number of times to try to fit the model and the number of samples to use. The later will take a random sample from the data set if it contains more site than the number specified. - -From the flight_curve() function, we can retrieve a list of 3 objects: - -- pheno that contain the standardised phenology curve derived by fitting a GAM model, with a cubic spline to the count data; -- model that contains the result of the fitted GAM model; -- data the data used to fit the GAM model. - -We can now extract the pheno object, a data.frame that contains the shape of the annual flight curves, standardised to sum to 1. The flight-curves contained in the pheno object can be visualised with the following line of codes. - -Note for purpose of demonstration, let's imagine that we could not fit a flight curve for year 2002 and the flight curve returned "NA" for that year 2002. Our code needs to be aware of the missing year. - -```{r fc, fig.cap='Flight curve.', tidy=FALSE} -## Extract phenology part -pheno <- ts_flight_curve$pheno - -## add the line of the first year -yr <- unique(pheno[order(M_YEAR),][!is.na(NM), as.numeric(as.character(M_YEAR))]) - -if("trimWEEKNO" %in% names(pheno)){ - plot(unique(pheno[M_YEAR == yr[1], .(trimWEEKNO, NM)]), type = 'l', ylim = c(0, max(pheno[!is.na(NM), ][, NM])), xlab = 'Monitoring Week', ylab = 'Relative Abundance') -} else { - plot(unique(pheno[M_YEAR == yr[1], .(trimDAYNO, NM)]), type = 'l', ylim = c(0, max(pheno[!is.na(NM), ][, NM])), xlab = 'Monitoring Day', ylab = 'Relative Abundance') - -} -## add individual curves for additional years -if(length(yr) > 1) { -i <- 2 - for(y in yr[-1]){ - if("trimWEEKNO" %in% names(pheno)){ - points(unique(pheno[M_YEAR == y , .(trimWEEKNO, NM)]), type = 'l', col = i) - } else { - points(unique(pheno[M_YEAR == y, .(trimDAYNO, NM)]), type = 'l', col = i) - } - i <- i + 1 - } -} - -## add legend -legend('topright', legend = c(yr), col = c(seq_along(c(yr))), lty = 1, bty = 'n') -``` - -The code above produces a flight curve for each of the four years. Now, let's imagine that the model could not fit a flight curve for year 2002 and the flight curve returned "NA" for that year 2002. Our code needs to be aware of the missing year and skip that year. - -```{r fc_withNA, fig.cap='Flight curve without (2002=NA).', tidy=FALSE} -## Extract phenology part -pheno <- pheno[M_YEAR == 2002, NM := NA] - -## add the line of the first year -yr <- unique(pheno[order(M_YEAR),][!is.na(NM), as.numeric(as.character(M_YEAR))]) - -if(length(yr) > 0){ # need at least one year without "NA" for the flight curve - -if("trimWEEKNO" %in% names(pheno)){ - plot(unique(pheno[M_YEAR == yr[1], .(trimWEEKNO, NM)]), type = 'l', ylim = c(0, max(pheno[!is.na(NM), ][, NM])), xlab = 'Monitoring Week', ylab = 'Relative Abundance') -} else { - plot(unique(pheno[M_YEAR == yr[1], .(trimDAYNO, NM)]), type = 'l', ylim = c(0, max(pheno[!is.na(NM), ][, NM])), xlab = 'Monitoring Day', ylab = 'Relative Abundance') - -} -## add individual curves for additional years -if(length(yr) > 1) { -i <- 2 - for(y in yr[-1]){ - if("trimWEEKNO" %in% names(pheno)){ - points(unique(pheno[M_YEAR == y , .(trimWEEKNO, NM)]), type = 'l', col = i) - } else { - points(unique(pheno[M_YEAR == y, .(trimDAYNO, NM)]), type = 'l', col = i) - } - i <- i + 1 - } -} -} - -## add legend -legend('topright', legend = c(yr), col = c(seq_along(c(yr))), lty = 1, bty = 'n') -``` \ No newline at end of file diff --git a/vignettes/Get_Started_1.pdf b/vignettes/Get_Started_1.pdf deleted file mode 100644 index 2221119..0000000 Binary files a/vignettes/Get_Started_1.pdf and /dev/null differ diff --git a/vignettes/Get_Started_2.Rmd b/vignettes/Get_Started_2.Rmd deleted file mode 100644 index afc5048..0000000 --- a/vignettes/Get_Started_2.Rmd +++ /dev/null @@ -1,191 +0,0 @@ ---- -title: 2. Get started with rbms - collate index -subtitle: Inputting missing counts -author: Reto Schmucki (UKCEH) -output: - pdf_document: default -date: 28 November 2019 ---- ---------- - -From the flight curve and the observed counts, we can derive expected values for weeks or days where a site has not been monitored. Together, observed and imputed counts are used to compute abundance indices across sites. Site indices are then used to calculate annual collated indices. - -> see 1. Get started with `rbms` to compute the flight curve object used below (Step 1, 2 and 3) - -```{r data} -library(data.table) -library(rbms) -data(m_visit) -data(m_count) -ts_date <- rbms::ts_dwmy_table(InitYear = 2000, LastYear = 2003, WeekDay1 = 'monday') -ts_season <- rbms::ts_monit_season(ts_date, StartMonth = 4, EndMonth = 9, StartDay = 1, EndDay = NULL, CompltSeason = TRUE, Anchor = TRUE, AnchorLength = 2, AnchorLag = 2, TimeUnit = 'w') -ts_season_visit <- rbms::ts_monit_site(m_visit, ts_season) -ts_season_count <- rbms::ts_monit_count_site(ts_season_visit, m_count, sp = 2) -ts_flight_curve <- rbms::flight_curve(ts_season_count, NbrSample = 300, MinVisit = 5, MinOccur = 3, MinNbrSite = 5, MaxTrial = 4, GamFamily = 'nb', SpeedGam = FALSE, CompltSeason = TRUE, SelectYear = NULL, TimeUnit = 'w') -``` - -##### **4.** Impute predicted counts for missing monitoring dates - -The `impute_count()` function uses the count data generated from the `ts_season_count()` function and the flight curves (ts_flight_curve$pheno) obtained from the `flight_curve()` function. The `impute_count()` function looks for the phenology available (using the nearest year) to estimate and input missing values; the extend of the search for nearest phenology can be limited by setting the YearLimit parameter. By default, this is not restricted and will look over all years available. Like other rbms functions, the imputation can be made on a weekly or daily basis ('w' or 'd'). - -```{r, impute} -## extract phenology data from the ts_fligh_curve list -pheno <- ts_flight_curve$pheno -## note: extraction of the pheno part is not necessary any more, as this is now done internally if not yet extract. Nevertheless the process still work and remains backward compatible. -## ts_flight_curve = ts_flight_curve would also work - -impt_counts <- rbms::impute_count(ts_season_count=ts_season_count, ts_flight_curve=pheno, YearLimit= NULL, TimeUnit='w') -``` - -The `impute_count()` function produces a data.table that contains the original COUNT values, a series of IMPUTED_COUNT over monitoring season, TOTAL_COUNT per site and year, TOTAL_NM (e.g. the proportion of the flight curve covered by the visits) and SINDEX. The SINDEX is the site index that corresponds to the sum of both observed and imputed counts over the sampling season. If the flight curve of a specific year is missing, the `impute_count()` function uses the nearest phenology found. If none is available within the limit of years set by the YearLimit parameter, the function will return no SINDEX for that specific year. - - -```{r, impute_year} -impt_counts_1year <- rbms::impute_count(ts_season_count=ts_season_count, ts_flight_curve=pheno[M_YEAR != 2001, ], YearLimit= 1, TimeUnit='w') - -impt_counts_0year <- rbms::impute_count(ts_season_count=ts_season_count, ts_flight_curve=pheno[M_YEAR != 2001, ], YearLimit= 0, TimeUnit='w') -``` - - -##### **5.** Site and collated indices - -From the imputed count, the site index can be calculated for each site or with a filter that will only keep sites that have been monitored at least a certain proportion of the flight curve. In this example, we set the threshold to 10%, using the MinFC parameter. - -```{r, sindex, message=FALSE, warning=FALSE} -sindex <- rbms::site_index(butterfly_count = impt_counts, MinFC = 0.10) -``` - -With the site indices, annual collated indices can be estimated by fitting a Generalised Linear Model (GLM), where sites and years are modelled as factors. Here we also use the proportion of the flight curve sampled by the observation as a weight for the GLM. Finally, we remove all sites where the species was not observed, setting the parameter rm_zero = TRUE will facilitates model fit. - -```{r, collated_index} -co_index <- collated_index(data = sindex, s_sp = 2, sindex_value = "SINDEX", glm_weights = TRUE, rm_zero = TRUE) -``` - -The collated index computed by the `collated_index()` function can be interpreted as the mean total butterfly count expected on a BMS transect in a given year. - -```{r, collated_index_print, echo = FALSE} -co_index -``` - -The index can be transformed to a log(10) scale - -```{r log_ind} -co_index <- co_index$col_index -co_index_b <- co_index[COL_INDEX > 0.0001 & COL_INDEX < 100000, ] -co_index_logInd <- co_index_b[BOOTi == 0, .(M_YEAR, COL_INDEX)][, log(COL_INDEX)/log(10), by = M_YEAR][, mean_logInd := mean(V1)] - -## merge the mean log index with the full bootstrap dataset -data.table::setnames(co_index_logInd, "V1", "logInd"); setkey(co_index_logInd, M_YEAR); setkey(co_index_b, M_YEAR) -co_index_b <- merge(co_index_b, co_index_logInd, all.x = TRUE) -``` - -The log scaled indices can then be plotted with the following code; here we represent the data with time-series average being centred to two (i.e. `log10(100)`). - -```{r, ind_fig, fig.cap='Collated index.', tidy=FALSE} -col_pal <- c("cyan4", "orange", "orangered2") - -## compute the metric used for the graph of the Collated Log-Index centred around 2 (observed, bootstrap sample, credible confidence interval, linear trend) -b1 <- data.table(M_YEAR = co_index_b$M_YEAR, LCI = 2 + co_index_b$logInd - co_index_b$mean_logInd) -b2 <- data.table(M_YEAR = co_index_b[BOOTi == 0, M_YEAR], LCI= 2 + co_index_b[BOOTi == 0, logInd] - co_index_b[BOOTi == 0, mean_logInd]) - -lm_mod <- try(lm(LCI ~ M_YEAR, data = b2), silent=TRUE) - -plot(b1, col = adjustcolor( "cyan4", alpha.f = 0.2), - xlab = "year", ylab = expression('log '['(10)']*' Collated Index'), - xaxt="n", type = 'n') - -axis(1, at = b2$M_YEAR) - -points(b2[!is.na(LCI),], type = 'l', lty=2, col="grey70") -points(b2, type = 'l', lwd=1.3, col = col_pal[1]) -points(b2[!is.na(LCI),], col= col_pal[1], pch=19) -abline(h=2, lty=2) - -if (class(lm_mod)[1] != "try-error"){ - points(b2$M_YEAR, as.numeric(predict(lm_mod, newdata = b2, type = "response")), - type = 'l', col='maroon4', lwd = 1.5, lty = 1) - } -``` - - -##### **6.** Bootstrap confidence interval - -The confidence interval around the collated indices can be computed from a bootstrap sample, where *n* site indices are randomly resampled (with replacement) *k* time to produce a distribution of the annual collated indices. From this distribution, we can derive the confidence intervals around the collated indices. -We first define *k* bootstrap sample using the function boot_sample(). Here we set *k* to 200 samples, but for a reliable confidence interval, *k* should be at least 1000. For reproducibility, we use set.seed() to generate a repeatable random sample - - -```{r, boot} -set.seed(218795) -bootsample <- rbms::boot_sample(sindex, boot_n = 200) -``` - -Using the `collated_index()` function in a loop, with bootstrap samples informing the argument `boot_ind`, we can now compute *k* collated indies over the entire time-series. - -```{r, collated_index_boot, message=FALSE, warning=FALSE} -co_index <- list() - -## for progression bar, uncomment the following -## pb <- txtProgressBar(min = 0, max = dim(bootsample$boot_ind)[1], initial = 0, char = "*", style = 3) - -for(i in c(0,seq_len(dim(bootsample$boot_ind)[1]))){ - - co_index[[i+1]] <- rbms::collated_index(data = sindex, s_sp = 2, sindex_value = "SINDEX", bootID=i, boot_ind= bootsample, glm_weights=TRUE, rm_zero=TRUE) - -## for progression bar, uncomment the following -## setTxtProgressBar(pb, i) - -} - -## collate and append all the result in a data.table object -co_index <- rbindlist(lapply(co_index, FUN = "[[","col_index")) -``` - -Annual log indices, as well as their average, are then computed from the original sample. Similarly, we compute annual log indices for each bootstrap sample. - -```{r log_ind_boot} -co_index_b <- co_index[COL_INDEX > 0.0001 & COL_INDEX < 100000, ] -co_index_logInd <- co_index_b[BOOTi == 0, .(M_YEAR, COL_INDEX)][, log(COL_INDEX)/log(10), by = M_YEAR][, mean_logInd := mean(V1)] - -## merge the mean log index with the full bootstrap dataset -data.table::setnames(co_index_logInd, "V1", "logInd"); setkey(co_index_logInd, M_YEAR); setkey(co_index_b, M_YEAR) -co_index_b <- merge(co_index_b, co_index_logInd, all.x = TRUE) - -data.table::setkey(co_index_b, BOOTi, M_YEAR) -co_index_b[ , boot_logInd := log(COL_INDEX)/log(10)] -``` - -From the bootstrap samples, we can derive a 95% Confidence Interval, using the corresponding percentiles (i.e., 0.025 and 0.975). - -```{r ind_fig_CI, fig.cap='Collated index with 95% CI.', tidy=FALSE} - -b1 <- data.table(M_YEAR = co_index_b$M_YEAR, LCI = 2 + co_index_b$boot_logInd - co_index_b$mean_logInd) -b2 <- data.table(M_YEAR = co_index_b[BOOTi == 0, M_YEAR], LCI= 2 + co_index_b[BOOTi == 0, logInd] - co_index_b[BOOTi == 0, mean_logInd]) -b5 <- b1[co_index_b$BOOTi != 0, quantile(LCI, 0.025, na.rm = TRUE), by = M_YEAR] -b6 <- b1[co_index_b$BOOTi != 0, quantile(LCI, 0.975, na.rm = TRUE), by = M_YEAR] -lm_mod <- try(lm(LCI ~ M_YEAR, data = b2), silent=TRUE) - -## define graph axis limits and color scheme -yl <- c(floor(min(b5$V1, na.rm=TRUE)), ceiling(max(b6$V1, na.rm=TRUE))) - -col_pal <- c("cyan4", "orange", "orangered2") - -## draw the plot for the selected species -plot(b1, ylim = yl, col = adjustcolor( "cyan4", alpha.f = 0.2), - xlab = "year", ylab = expression('log '['(10)']*' Collated Index'), - xaxt="n", type = 'n') - -axis(1, at = b2$M_YEAR) - -segments(x0 = as.numeric(unlist(b5[,1])), y0 = as.numeric(unlist(b5[,2])), - x1 = as.numeric(unlist(b6[,1])), y1 = as.numeric(unlist(b6[,2])), - col = col_pal[2], lwd = 2) -points(b2[!is.na(LCI),], type = 'l', lty=2, col="grey70") -points(b2, type = 'l', lwd=1.3, col = col_pal[1]) -points(b2[!is.na(LCI),], col= col_pal[1], pch=19) -abline(h=2, lty=2) - -if (class(lm_mod)[1] != "try-error"){ - points(b2$M_YEAR, as.numeric(predict(lm_mod, newdata = b2, type = "response")), - type = 'l', col='maroon4', lwd = 1.5, lty = 1) - } -``` \ No newline at end of file diff --git a/vignettes/Get_Started_2.pdf b/vignettes/Get_Started_2.pdf deleted file mode 100644 index 605c444..0000000 Binary files a/vignettes/Get_Started_2.pdf and /dev/null differ diff --git a/vignettes/vignette.Rmd b/vignettes/vignette.Rmd new file mode 100644 index 0000000..9c7f18e --- /dev/null +++ b/vignettes/vignette.Rmd @@ -0,0 +1,348 @@ +--- +title: "Calculating phenology and collated indices with rbms" +author: "Reto Schmucki (UKCEH) & Dylan Carbone" +date: "23rd of July 2024" +vignette: > + %\VignetteIndexEntry{vignette} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +subtitle: From counts to flight curve to collated indices +--- + +```{r, include = FALSE} +options(repos = c(CRAN = "https://cloud.r-project.org/")) +``` + +The `rbms` package allows users to derive species phenology and estimate abundances of butterfly species. In this tutorial, we use R functions implemented in the `rbms` package to produce the following key outputs: + +1. **Compute a flight curve**: A spline fitted curve, representing the activity of the species within the monitoring season. The flight curve computation is based on a spline fitted on the counts collected across multiple sites and standardised to sum to 1 (area under the curve is one). +2. **Site indices and annual collated indices**: The yearly abundance at each site, and a model estimate of yearly abundance across all sites. + +Pre-requisite steps include the construction of a time series and imputation of counts from the flight curve when abundance data is missing. + +The data used in this vignette are bundled within the package and are from genuine Butterfly Monitoring Scheme surveys. + +## 1. Setup + +First, install and load required packages. +```{r data} +# Install required packages +install.packages("remotes") # To install R packages from github +install.packages("ggplot2") # to generate plots in this tutorial + +# Install rbms from github +# remotes::install_github("https://github.com/RetoSchmucki/rbms.git") + +# Load rbms and ggplot2 +library(rbms) +library(ggplot2) +``` + +Then load the bundled butterfly monitoring scheme data bundled with the rbms package. + +```{r} +data(m_visit) +data(m_count) +``` + +**The header names of the visit and count data need to be consistent, and all columns present in the data loaded for this tutorial need to be present in your data for the functions to work.** + +Visit data represents the date at which each monitoring site was visited. If no butterfly was observed during a visit, the abundance for that specific visit would be set to zero. This allows to subset positive non-zero counts from the count data set, which result in smaller objects to handle. The visit data may contain many columns, but only two are essential for the function. + +*NOTE: Visit and count data are provided in data.table format, and likewise data.table syntax is used for many of the data processing steps in this tutorial. Whilst other formats and packages are acceptable, it is recommended for larger datasets that you continue to use data.table, as it is efficient for handling large data sets* + +1) **SITE_ID** (can be numbers of characters and treated as a non-numeric factor) +2) **DATE** (by default, the format is ā€œ%Y-%m-%dā€ (e.g., 2019-11-28]). If a different format needs to be specified, use the argument DateFormat) + +```{r data_visit, echo = FALSE} +m_visit +``` + +Count data must be provided in columns with specific headers; more column can be provided, but rbms only use the following four: + +1) **SITE_ID** +2) **DATE** +3) **SPECIES** +4) **COUNT** + +```{r data_count, echo = FALSE} +m_count +``` + +##### **2.** organise the data to cover the time period and monitoring season of the BMS + +Using the visit and count data, we need to create a new data.table object representing a time series. We initialise the time-series, specifying the start and end year of the monitoring season, with entries at a weekly or daily basis (the resolution of the flight curve). In a first step, we initialise the time-series with an initial year with the InitYear argument, and end year with the LastYear argument. We also specify the first day of the week with the WeekDay1 argument as either 'sunday' or 'monday'. + +**2.1.** First, we initialise a time-series with day-week-month-year information. + +```{r init_timeseries} +ts_date <- rbms::ts_dwmy_table(InitYear = 2000, LastYear = 2003, WeekDay1 = 'monday') +``` + +**2.2.** Add the monitoring season to the time-series + +We build upon the time series with the monitoring season, providing the StartMonth and EndMonth arguments. The definition of the monitoring season can be refined with more arguments of the start and end date (StartDay, EndDay). We also define the resolution of the time-series as weekly using the Timeunit argument (`w` or `d` for weekly or daily records). The ANCHOR argument adds zeros before and after the monitoring season, to ensure that the flight curve starts and ends at zero. + +*NOTE: For species with overwintering adult and early counts, having an Anchor set to zero might sound wrong, and we are currently working on finding an alternative for these cases to represent the flight curve of those species better.* + +```{r monitoring_season} +ts_season <- rbms::ts_monit_season(ts_date, StartMonth = 4, EndMonth = 9, StartDay = 1, EndDay = NULL, CompltSeason = TRUE, Anchor = TRUE, AnchorLength = 2, AnchorLag = 2, TimeUnit = 'w') +``` + +**2.3.** Add site visits to the time-series + +Now that the monitoring season has been defined, we use the `ts_monit_site()` function to integrate the time-series with the site visits. We use the visit data and link it with the time series contained in ts_season object. + +```{r visit_season} +ts_season_visit <- rbms::ts_monit_site(ts_season, m_visit) +``` + +**2.4.** Add observed count + +The time-series is next integrated with the count data. Here we must specify the focus species in the sp argument, providing either an integer (we have used 2 here), or a string. + +```{r count_visit} +ts_season_count <- rbms::ts_monit_count_site(ts_season_visit, m_count, sp = 2) +``` + +The resulting `data.table` object contains zeros and positive counts recorded along each BMS transects for species 2, over the entire time-series, but only within the focal monitoring season. When the counts are missing because the site was not visited, the count is informed as NA. + +```{r count_visit_obj, echo=FALSE} +print(ts_season_count) +``` + +##### **3.** Compute the yearly flight curve for the data + +With the time season data prepared and saved under ts_season_count, can compute the flight curve, representing the species phenology. + +we can compute the flight curve for each year by using the `flight_curve` function. First we want to impose some minimal threshold to control the quality of the data used to inform this model. This is done by setting the Minimum number of visits (MinNbrVisit), the minimum number of occurrences (MinOccur) and the number of sites (MinNbrSite) to use to fit the model. These values can influence the model, and its sensitivity will depend on the species and the dataset. Still, as a minimum requirement, MinOccur should be set >= 2, the MinNbrVisit > MinOccur and MinNbrSite >= 5. These thresholds affect the data that inform your model and the resulting flight curve. When higher values are chosen, fewer sites will be available and if insufficient, the thresholds need to be revised. + +*NOTE: The `flight_curve` function assumes that species phenology is the same across sites. The sample size of the data must be large enough to calculate the flight_curve* + +*You will also have to define some parameters for the distribution for the GAM model as well as the maximum number of times to try to fit the model and the number of samples to use. The later will take a random sample from the data set if it contains more sites than the number specified.* + +```{r flight_curve} + ts_flight_curve <- rbms::flight_curve(ts_season_count, NbrSample = 300, MinVisit = 5, MinOccur = 3, MinNbrSite = 5, MaxTrial = 4, GamFamily = 'nb', SpeedGam = FALSE, CompltSeason = TRUE, SelectYear = NULL, TimeUnit = 'w') +``` + +We are given a list of 3 elements as an output of the `flight_curve()` function: + +- pheno: The standardised phenology curve derived by fitting a GAM model, with a cubic spline to the count data; +- model: The result of the fitted GAM model; +- data: The data used to fit the GAM model. + +We can now extract the pheno object, a data.frame that contains the shape of the annual flight curves, standardised to sum to 1. The flight-curves contained in the pheno object can be visualised with the following line of codes. + +```{r fc, fig.cap='Flight curve.', tidy=FALSE} +## Extract the phenology output +pheno <- ts_flight_curve$pheno + +# Determine if 'trimWEEKNO' exists and set the appropriate x-axis variable +x_var <- if("trimWEEKNO" %in% names(pheno)){"trimWEEKNO"} else {"trimDAYNO"} + +# Create the ggplot +ggplot(pheno, aes(x = .data[[x_var]], y = NM, color = factor(M_YEAR))) + + geom_line() + + scale_color_discrete(name = "Year") + + labs(x = ifelse(x_var == "trimWEEKNO", "Monitoring Week", "Monitoring Day"), + y = "Relative Abundance") + + theme_minimal() + + theme(legend.position = "top") +``` + +##### **4.** Impute predicted counts for missing monitoring dates + +From the flight curve and the observed counts, we can derive expected values for weeks or days where a site has not been monitored. Together, observed and imputed counts are used to compute abundance indices across sites. Site indices are then used to calculate annual collated indices. + +The `impute_count()` function uses the time season data prepared and saved under ts_season_count, and the flight curves in the pheno output of the `flight_curve()` function. The `impute_count()` function looks for the phenology available (using the nearest year) to estimate and input missing values. The extent of the search for nearest phenology can be limited by setting the YearLimit argument. By default, this is not restricted and will look over all years available. + +```{r, impute} +# Impute the count estimates +# Note that the pheno element is extracted automatically from the ts_flight_curve object +impt_counts <- rbms::impute_count(ts_season_count=ts_season_count, ts_flight_curve=ts_flight_curve, YearLimit= NULL, TimeUnit='w') +``` + +The `impute_count()` function produces a data.table that includes the following columns: +COUNT: The original count values. +IMPUTED_COUNT: An estimate of the count in the event that there is no recording of count (I.E., the original count is NA). The imputed count is derived from the flight curve. If the original count is present, the imputed count is set as equal to the original count. +TOTAL_COUNT: The sum of the Imputed count within the site and year. +SINDEX: The sum of the imputed counts over the sampling season. If the flight curve of a specific year is missing, the `impute_count()` function uses the nearest phenology found. If none are available within the limit of years set by the YearLimit argument, the function will return no SINDEX for that specific year. +TOTAL_NM: The proportion of the flight curve covered by the visits. This is the total count divided by the site index. + +```{r, impute_year} +impt_counts_1year <- rbms::impute_count(ts_season_count=ts_season_count, ts_flight_curve=pheno[M_YEAR != 2001, ], YearLimit= 1, TimeUnit='w') + +impt_counts_0year <- rbms::impute_count(ts_season_count=ts_season_count, ts_flight_curve=pheno[M_YEAR != 2001, ], YearLimit= 0, TimeUnit='w') +``` + +##### **5.** Site and collated indices + +From the imputed counts, the `site_index` function can be used to calculate site index for each year and site. Only sites that exceed a threshold for the proportional representation of the flight curve are included. This threshold is set using the MinFC argument, which In this example is 0.10. + +```{r, sindex, message=FALSE, warning=FALSE} +sindex <- rbms::site_index(butterfly_count = impt_counts, MinFC = 0.10) +``` + +From the site indices, estimates of the annual collated indices are made using the `collated_index()` function. Collated indices represents the mean total butterfly count expected on a BMS transect in a given year. The function fits a Generalised Linear Model (GLM) with site and years included as factorial independent variables. The proportion of the flight curve is included as a GLM weight. When the argument rm_zero is set to TRUE, all sites where the species are not observed are filtered, speeding up the fit of the GLM without altering the output. + +```{r, collated_index} +co_index <- collated_index(data = sindex, s_sp = 2, sindex_value = "SINDEX", glm_weights = TRUE, rm_zero = TRUE) + +co_index +``` + +The index can be transformed to a log(10) scale. + +```{r log_ind} +co_index <- co_index$col_index +co_index_b <- co_index[COL_INDEX > 0.0001 & COL_INDEX < 100000, ] +co_index_logInd <- co_index_b[BOOTi == 0, .(M_YEAR, COL_INDEX)][, log(COL_INDEX)/log(10), by = M_YEAR][, mean_logInd := mean(V1)] + +## merge the mean log index with the full bootstrap dataset +data.table::setnames(co_index_logInd, "V1", "logInd"); data.table::setkey(co_index_logInd, M_YEAR); data.table::setkey(co_index_b, M_YEAR) +co_index_b <- merge(co_index_b, co_index_logInd, all.x = TRUE) +``` + +The log scaled indices can be plotted. Here, we represent the data with time-series average being centered to two (i.e. `log10(100)`) as the standard for British time series reporting. + +```{r, ind_fig, fig.cap='Collated index.', tidy=FALSE} +col_pal <- c("cyan4", "orange", "orangered2") + +## compute the metric used for the graph of the Collated Log-Index centred around 2 (observed, bootstrap sample, credible confidence interval, linear trend) +b1 <- data.table::data.table(M_YEAR = co_index_b$M_YEAR, LCI = 2 + co_index_b$logInd - co_index_b$mean_logInd) +b2 <- data.table::data.table(M_YEAR = co_index_b[BOOTi == 0, M_YEAR], LCI= 2 + co_index_b[BOOTi == 0, logInd] - co_index_b[BOOTi == 0, mean_logInd]) + +lm_mod <- try(lm(LCI ~ M_YEAR, data = b2), silent=TRUE) + +# Create the initial plot with ggplot +p <- ggplot(b2, aes(x = M_YEAR)) + + # Adding the shaded area with white fill + geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf), + fill = "white", alpha = 0.2) + + # Adding the lines and points for non-missing LCI values + geom_line(data = subset(b2, !is.na(LCI)), aes(y = LCI), linetype = "dashed", color = "grey70") + + geom_line(aes(y = LCI), size = 1.3, color = col_pal[1]) + + geom_point(data = subset(b2, !is.na(LCI)), aes(y = LCI), color = col_pal[1], shape = 19) + + # Adding the horizontal line + geom_hline(yintercept = 2, linetype = "dashed") + + # Adding labels and customizing axes + labs(x = "year", y = expression('log '[10]*' Collated Index')) + + scale_x_continuous(breaks = b2$M_YEAR) + + theme_minimal() + + # Making the y-axis appear as a black line + theme(axis.line.y = element_line(color = "black")) + +# Adding the regression line if lm_mod is not a try-error +if (class(lm_mod)[1] != "try-error") { + p <- p + geom_line(aes(y = as.numeric(predict(lm_mod, newdata = b2, type = "response"))), color = "maroon4", size = 1.5, linetype = "solid") +} + +# show the plot +print(p) +``` + +##### **6.** Bootstrap confidence interval + +the `boot_sample()` function can be used to compute a confidence interval around the collated indices. A number of sites are sampled (randomely, with resampling) a number of times, specified by the argument, boot_n. This produces a distribution of the annual collated indices, from which we can can derive the confidence intervals around the collated indices. In this example, we set boot_n. to 200 samples, but for a reliable confidence interval, k should be at least 1000. For reproducibility, we use set.seed() to generate a repeatable random sample + +*NOTE: You should not take large numbers of bootstraps from a dataset with a small number of sites, as resampling sites will not introduce new variance.* + +```{r, boot} +set.seed(218795) +bootsample <- rbms::boot_sample(sindex, boot_n = 200) +``` + +Using the `collated_index()` function in a loop, with bootstrap samples informing the argument `boot_ind`, we can now compute the boot_n number of collated indices over the entire time-series. + +```{r, collated_index_boot, message=FALSE, warning=FALSE} +co_index <- list() + +for(i in c(0,seq_len(dim(bootsample$boot_ind)[1]))){ + + co_index[[i+1]] <- rbms::collated_index(data = sindex, s_sp = 2, sindex_value = "SINDEX", bootID=i, boot_ind= bootsample, glm_weights=TRUE, rm_zero=TRUE) + +} + +# collate and append the results +co_index <- data.table::rbindlist(lapply(co_index, FUN = "[[","col_index")) +``` + +Annual log indices, as well as their average, are then computed from the original sample. To derive confidence intervals, we compute annual log indices for each bootstrap sample. + +```{r log_ind_boot} +co_index_b <- co_index[COL_INDEX > 0.0001 & COL_INDEX < 100000, ] +co_index_logInd <- co_index_b[BOOTi == 0, .(M_YEAR, COL_INDEX)][, log(COL_INDEX)/log(10), by = M_YEAR][, mean_logInd := mean(V1)] + +## merge the mean log index with the full bootstrap dataset +data.table::setnames(co_index_logInd, "V1", "logInd"); data.table::setkey(co_index_logInd, M_YEAR); data.table::setkey(co_index_b, M_YEAR) +co_index_b <- merge(co_index_b, co_index_logInd, all.x = TRUE) + +data.table::setkey(co_index_b, BOOTi, M_YEAR) +co_index_b[ , boot_logInd := log(COL_INDEX)/log(10)] +``` + +From the bootstrap samples, we can derive a 95% Confidence Interval, using the corresponding percentiles (i.e., 0.025 and 0.975). + +```{r ind_fig_CI, fig.cap='Collated index with 95% CI.', tidy=FALSE} + +b1 <- data.table::data.table(M_YEAR = co_index_b$M_YEAR, LCI = 2 + co_index_b$boot_logInd - co_index_b$mean_logInd) +b2 <- data.table::data.table(M_YEAR = co_index_b[BOOTi == 0, M_YEAR], LCI= 2 + co_index_b[BOOTi == 0, logInd] - co_index_b[BOOTi == 0, mean_logInd]) +b5 <- b1[co_index_b$BOOTi != 0, quantile(LCI, 0.025, na.rm = TRUE), by = M_YEAR] +b6 <- b1[co_index_b$BOOTi != 0, quantile(LCI, 0.975, na.rm = TRUE), by = M_YEAR] +lm_mod <- try(lm(LCI ~ M_YEAR, data = b2), silent=TRUE) + +## define graph axis limits and color scheme +yl <- c(floor(min(b5$V1, na.rm=TRUE)), ceiling(max(b6$V1, na.rm=TRUE))) + +col_pal <- c("cyan4", "orange", "orangered2") + +# Convert b5 and b6 to data frames for easier handling in ggplot +b5_df <- data.frame(x0 = as.numeric(unlist(b5[,1])), y0 = as.numeric(unlist(b5[,2]))) +b6_df <- data.frame(x1 = as.numeric(unlist(b6[,1])), y1 = as.numeric(unlist(b6[,2]))) + +# Combine b5 and b6 into a single data frame +segments_df <- cbind(b5_df, b6_df) + +library(ggplot2) + +# Convert b5 and b6 to data frames for easier handling in ggplot +b5_df <- data.frame(x0 = as.numeric(unlist(b5[,1])), y0 = as.numeric(unlist(b5[,2]))) +b6_df <- data.frame(x1 = as.numeric(unlist(b6[,1])), y1 = as.numeric(unlist(b6[,2]))) + +# Combine b5 and b6 into a single data frame +segments_df <- cbind(b5_df, b6_df) + +# Create the initial plot with ggplot +p <- ggplot(b2, aes(x = M_YEAR)) + + # Adding the shaded area with white fill + geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf), + fill = "white", alpha = 0.2) + + # Adding the segments + geom_segment(data = segments_df, aes(x = x0, y = y0, xend = x1, yend = y1), + color = col_pal[2], size = 2) + + # Adding the lines and points for non-missing LCI values + geom_line(data = subset(b2, !is.na(LCI)), aes(y = LCI), linetype = "dashed", color = "grey70") + + geom_line(aes(y = LCI), size = 1.3, color = col_pal[1]) + + geom_point(data = subset(b2, !is.na(LCI)), aes(y = LCI), color = col_pal[1], shape = 19) + + # Adding the horizontal line + geom_hline(yintercept = 2, linetype = "dashed") + + # Adding labels and customizing axes + labs(x = "year", y = expression('log '[10]*' Collated Index')) + + scale_x_continuous(breaks = b2$M_YEAR) + + coord_cartesian(ylim = yl) + + theme_minimal() + + # Making the y-axis appear as a black line and setting the background color to white + theme( + panel.background = element_rect(fill = "white", color = "black"), + axis.line.y = element_line(color = "black") + ) + +# Adding the regression line if lm_mod is not a try-error +if (class(lm_mod)[1] != "try-error") { + p <- p + geom_line(aes(y = as.numeric(predict(lm_mod, newdata = b2, type = "response"))), color = "maroon4", size = 1.5, linetype = "solid") +} + +# Print the plot +print(p) + +```