gRasslands
is part of the SUSALPS research
project, which is researching the
sustainable use of alpine and pre-alpine grasslands in Southern Germany.
A key indicator of ecosystem health is the floral biodiversity, which
can be measured through indices like the species number or the shannon
and simpson index. The goal of this project was to estimate floral
diversity with the help of Sentinel-2 reflectances between 2022 and
2023. This package provides the functions and code used to train a
random forest model with the extracted reflectances at the sample plot
locations. Species inventories sampled in the same plots served as
response variable. Parts of the data are also available through this
package.
You can install the development version of gRasslands via devtools
:
devtools::install_github("Siedrid/gRasslands")
In this section you find documentation how your data needs to be preprocessed to train your random forest model. This is shown by means of Sentinel-2 surface reflectances extracted at 60 plot locations in Lower Franconia, Bavaria. The Sentinel-2 scenes were atmospherically corrected using the MAJA processor. For some areas Sentinel-2 L2A products (MAJA preprocessed) are freely distributed on the THEIA portal. Of course, also the Sentinel-2 L2A products distributed via the Copernicus Browser or the Google Earth Engine are suitable for this analysis.
library(gRasslands)
summary(refl.df)
#> plot_names B2 B3 B4
#> Length:11248 Min. :0.000 Min. :0.000 Min. :0.000
#> Class :character 1st Qu.:0.019 1st Qu.:0.052 1st Qu.:0.021
#> Mode :character Median :0.028 Median :0.061 Median :0.044
#> Mean :0.033 Mean :0.064 Mean :0.055
#> 3rd Qu.:0.046 3rd Qu.:0.074 3rd Qu.:0.084
#> Max. :0.180 Max. :0.215 Max. :0.242
#> NA's :8446 NA's :8446 NA's :8446
#> B5 B6 B7 B8
#> Min. :0.009 Min. :0.038 Min. :0.051 Min. :0.006
#> 1st Qu.:0.093 1st Qu.:0.229 1st Qu.:0.263 1st Qu.:0.283
#> Median :0.111 Median :0.276 Median :0.321 Median :0.340
#> Mean :0.117 Mean :0.285 Mean :0.340 Mean :0.357
#> 3rd Qu.:0.140 3rd Qu.:0.336 3rd Qu.:0.408 3rd Qu.:0.421
#> Max. :0.281 Max. :0.506 Max. :0.675 Max. :0.682
#> NA's :8449 NA's :8449 NA's :8449 NA's :8446
#> B8A B11 B12 dat
#> Min. :0.050 Min. :0.006 Min. :0.005 Min. :2022-01-06
#> 1st Qu.:0.307 1st Qu.:0.191 1st Qu.:0.083 1st Qu.:2022-06-16
#> Median :0.366 Median :0.225 Median :0.106 Median :2022-10-23
#> Mean :0.382 Mean :0.245 Mean :0.127 Mean :2022-12-07
#> 3rd Qu.:0.447 3rd Qu.:0.293 3rd Qu.:0.166 3rd Qu.:2023-06-19
#> Max. :0.712 Max. :0.459 Max. :0.287 Max. :2023-12-05
#> NA's :8449 NA's :8449 NA's :8449
Reflectances at the plot locations and in each Sentinel-2 band from 2022
to 2023 were already extracted and are stored in the refl.df
data
frame. The refl.df
data frame has a column for each Sentinel-2 band,
and two additional columns for the plot names and date of the Sentinel-2
acquisition. You can find the code to create this data frame in
data-raw/refl.df.R
. The functions used in the refl.df.R script are
also provided within this package. In the following the refl.df
data
frame is brought into the right form to be processed in the random
forest model.
int.ts <- interpolate.ts(refl.df, plot.column = "plot_names") # interpolate missing values
int.ts <- na.omit(int.ts)
monthly_min.df <- comp_monthly(int.ts, date.column = "dat", stat = "min") # composite to monthly minimum reflectances
monthly_min.df.piv <- pivot.df(monthly_min.df) # pivot the data into wide table
For the compositing, the monthly minimum reflectances were used, to
create an evenly distributed time series. We also found out, that the
minimum reflectance fits the best for the biodiversity modeling. In the
next step we take a look at the response variable: the alpha-diversity
indices. Species inventories were collected in the same 60 plots in May
2022 and April 2023. Both datasets are processed together in step II.
With get_cover
, the cloud cover, i.e. the percentage of the masked
area of one acquisition inside the study area is returned. With
plt.band_composite
, an RGB band composite of the study area can be
plotted and the plot locations added.
summary(div.df)
#> plot_names shannon simpson specn
#> Length:59 Min. :0.02819 Min. :0.01083 Min. :10.12
#> Class :character 1st Qu.:0.81288 1st Qu.:0.30712 1st Qu.:20.68
#> Mode :character Median :1.74052 Median :0.46006 Median :31.15
#> Mean :1.81174 Mean :0.49094 Mean :30.57
#> 3rd Qu.:2.78330 3rd Qu.:0.68315 3rd Qu.:43.05
#> Max. :3.93927 Max. :0.99736 Max. :49.37
#> X Y
#> Min. :677759 Min. :5528198
#> 1st Qu.:678610 1st Qu.:5528541
#> Median :679250 Median :5528797
#> Mean :680548 Mean :5529222
#> 3rd Qu.:684102 3rd Qu.:5529788
#> Max. :685066 Max. :5530580
study_area <- get_study_area(div.df, "X", "Y")
div.sf <- div2sf(div.df, x.column = "X", y.column = "Y", epsg.code = 25832, write = F)
acq <- get_acquisitions("2022", "04", "E:/Grasslands_Biodiv/Data/SatData/")[1]
cloud_cover <- get_cover(acq, study_area) # calculate cloud cover
#> Cloud Cover is 8.53%
plt.band_composite(acq, bands = c("B4", "B3", "B2"), study_area, df = div.df, add.plots = T)
#> NULL
In the eastern part of the study area, an area of almost 10% is masked.
The function get_cover
can also be used to filter suitable
acquisitions. The alpha diversity indices calculated and provided in the
div.df
data frame are the species number, shannon and simpson index.
Many studies have shown, that the species number is the best response
variable, therefore this alpha-diversity index will be used in the
following random forest model. The code to calculate these indices is
provided in the data-raw
folder in the div.df.R script. The values for
the diversity indices were randomized in the end for data privacy
reasons. That’s why we have to expect rather low accuracies for the
model.
For the training, only the minimum reflectances from the months March to September are used. The winter months are influenced by clouds and snow and are limited by less plant growth/cover, which could potentially impact our results negatively.
s = 91
biodiv_ind = "specn"
m.nowinter <- c(3:9)
data_frame.nowinter <- RF_predictors(monthly_min.df.piv, m.nowinter) # use only months from March to September
rf_data <- preprocess_rf_data(data_frame.nowinter, div.df, biodiv_ind) # merge reflectance and alpha diversity dataframe
train_index <- get_train_index(rf_data, s = s) # split samples into training and testing (70:30)
forest <- RF(rf_data, train_index = train_index, s = s) # train Random Forest
#> Lade nötiges Paket: ggplot2
#> Lade nötiges Paket: lattice
print(forest)
#> Random Forest
#>
#> 43 samples
#> 140 predictors
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 5 times)
#> Summary of sample sizes: 38, 38, 39, 39, 40, 38, ...
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 2 12.35664 0.2148993 10.89479
#> 25 12.77095 0.2098879 11.39549
#> 48 12.83464 0.2309922 11.44717
#> 71 12.93177 0.2119190 11.55964
#> 94 12.89543 0.2291998 11.52141
#> 117 12.87095 0.2243601 11.47428
#> 140 12.97108 0.2226513 11.54687
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 2.
In the output of the forest
variable, it is summarized that 43 samples
were used for the training (i.e. 70% of the dataset) and 140 predictors
(i.e. 10 bands x 2 years x 7 months). Per default a cross-validation
with 10 folds and 5 repeats is used. The forest with the highest R2 and
the lowest RMSE is returned in the end. Training and testing results are
visualized in a scatter plot with the actual species number on the
x-axis and the predicted species number on the y-axis. Further
statistics can be summarized in a csv file with the function write.RF
.
This function is especially usefull, when testing different compositing
methods, and month combinations or running the model multiple times with
different seeds.
RF.summary(forest, rf_data, div.df, train_index, "specn", plot_labels = F) # returns scatter plot
#write.RF("no winter", "specn", forest, 10, csv.path)
The R2 is given for the training and testing split. The testing split
was not used to train the random forest model. With plot_labels = T
,
the points are labeled according to their plot names.
plt.varimp(forest)
plt.varimp
is an important function to evaluate the predictors
according to their band, year and month. With the original data, the
SWIR bands and the bands in the visible domain are the most important
Sentinel-2 bands (A). March is by far the most important month in the
prediction (D).
For the spatial prediction, all variables, that trained the random
forest, need to be stacked to a spatial Raster, on which the species
number can be predicted. In the case of the random forest trained with
all summer months, the monthly minimum raster composites of all
acquisitions from March until September 2022 and 2023 respectively need
to be calculated first. Due to the limited storage capacity, these
raster composites can’t be part of this package. The code to calculate
these raster composites is provided in the data-raw
folder in the
comp.S2.bands.R
script. On request, we can make these composites
available. The code to calculate spatial prediction maps is provided in
the following.
# select months March to September as predictors
comp_path <- "E:/Grasslands_BioDiv/Data/S2_min_composites"
fls <- list_comp_months(comp_path, m.nowinter)
min.brick <- stack_S2_months(fls)
s2_pred <- terra::predict(min.brick, model = forest, na.rm = T)
comp_path
is the path to the directory, where the monthly raster
composites are stored. After creating a list of these raster composites,
the rasters are stacked with the terra package and then transformed into
a brick object. With predict
, the random forest model forest
is
applied to the brick. To mask non-grasslands, we used the High
Resolution Grassland Layer, provided by
Copernicus.
We used the 2018 product with a 10m spatial resolution:
# Mask non-Grasslands with Copernicus Grassland Layer
grass.mask.path <- "E:/Grasslands_BioDiv/Data/Copernicus_Grassland/GRA_2018_010m_03035_V1_0.tif"
grass.mask <- terra::rast(grass.mask.path)
s2_pred.masked <- mask.grasslands(s2_pred, grass.mask)
#> |---------|---------|---------|---------|========================================= |---------|---------|---------|---------|=========================================
plt.diversity(s2_pred.masked, biodiv_ind = "specn")
In the following, the entire workflow of the analysis is visualized. This package was designed to encourage a similar analysis at grassland sites, where species inventories are available. A valuable database for such inventories and environmental parameters is also the Biodiversity Exploratories Information System: https://www.bexis.uni-jena.de/ This is an important step towards a broader understanding of grassland sites, how to manage them and protect their valuable ecosystem services.
Laura Obrecht: [email protected]
Dr. Sophie Reinermann: [email protected]