---
title: "Semivariogram Clouds and Plots"
author: "Michael McManus, US EPA/ORD"
date: "01/08/2025"
output:
  html_document: default
  pdf_document: default
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Outline
The first dataset we are using is from the data paper <https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/14-1345.1> by Stallings et al. We will use salinity data collected at 80 sites in the Gulf of Mexico from the Big Bend down to the Springs Coast of Florida. The code chunks below cover the first bullet using the salinity data; for the remaining three bullets we will use the James River Spatial Stream Network (SSN) object. The James River SSN contains data from VDEQ's spatially balanced stream survey (ProbMon), the USFS's National Stream Internet (NSI), which has flowlines conditioned for SSN analysis, and EPA's Stream Catchment dataset (StreamCat), which can be joined to the NSI by a common COMID. Our exploratory spatial data analysis here will only use Euclidean distance.

* Reading and Writing Spatial Data
* Summarizing Distances
* Semivariogram Clouds and Semivariogram Plots
* Randomization of Semivariograms
## Libraries
```{r, libraries}
library(tidyverse)
library(lubridate)
library(sf)
library(mapview)
library(leaflet)
library(leafpop) # for popups in mapview
library(gstat) # for semivariograms
library(lattice) # for random semivariogram plots
library(spmodel) # for spatial modeling; has empirical semivariogram (esv) function
library(scales) # comma instead of scientific notation
library(plotly) # interactive plots
library(SSN2) # for spatial stream network (SSN) modeling
library(janitor) # clean_names function
```
## Reading in Spatial Data
This chunk takes a data frame with geographic coordinates and converts it to a simple features (sf) object that uses the projected coordinate reference system (CRS) of Albers Equal Area, which has units of meters. The output from this code chunk is a map in the Viewer tab showing the monitoring stations (points) and the seagrass polygons.
```{r reading_data, echo=FALSE}
hab_1 <- read_csv("data/BBSG_2009-2010_habitat_and_trawl-tow_data.csv")
glimpse(hab_1)
# See Long and Lat in decimal degrees
coords <- data.frame(longitude = hab_1$Long, latitude = hab_1$Lat)
head(coords)
# For background on Coordinate Reference Systems (CRS), see the Marc Weber workshop listed in the Resources section
hab_sf1 <- st_as_sf(hab_1, coords = c("Long", "Lat"), crs = 4269) %>% st_transform(5070)
class(hab_sf1)
# read in shapefile
bigbend_seagrass <- st_read("data/bigbend_grass.shp")
class(bigbend_seagrass)
# check whether the two sf objects have the same CRS
st_crs(hab_sf1) == st_crs(bigbend_seagrass)
bigbend_seagrass <- bigbend_seagrass %>% st_transform(5070)
st_crs(hab_sf1) == st_crs(bigbend_seagrass)
mapview(hab_sf1) + mapview(bigbend_seagrass)
```
## Writing sf Objects to Shapefiles and Geopackage Geodatabase
An sf object can be written out as an ESRI shapefile. If you have several sf objects, such as points for sites and polygons of seagrass areas, you can keep them together by writing them out as a geopackage geodatabase. The output from this code will appear in File Explorer in your working directory and will include the shapefile and geopackage written to that location.
Note that geopackage is the default spatial format for QGIS. The R package SSNbler <https://cran.r-project.org/package=SSNbler> can be used with QGIS, or ArcGIS Pro, to build a spatial stream network (SSN) object.
```{r geopackage, eval=FALSE}
# Writing a single shapefile to the working directory
st_write(hab_sf1,
dsn = "hab1.shp",
driver = "ESRI Shapefile",
append = FALSE)
# This will put the sf layer into the geodatabase
st_write(hab_sf1, dsn = file.path(getwd(), "bbsg_v1_2024_05_06.gpkg"), layer = "habitat", driver = "GPKG", quiet = FALSE)
# We can check on what layers are in the geodatabase.
st_layers("bbsg_v1_2024_05_06.gpkg")
# Now we can put in the polygons.
st_write(bigbend_seagrass, dsn = file.path(getwd(), "bbsg_v1_2024_05_06.gpkg"), layer = "bigbend_seagrass", driver = "GPKG", quiet = FALSE, delete_layer=TRUE)
st_layers("bbsg_v1_2024_05_06.gpkg")
# Note: if we were to modify hab_sf1 in R, then to replace the existing layer you have to specify both the delete_layer = TRUE and append = TRUE arguments so that the old layer is deleted and the new one is written in its place, as shown in the commented-out code below.
# st_write(hab_sf1, dsn = file.path(getwd(), "bbsg_v1_2024_05_06.gpkg"), layer = "habitat", delete_layer = TRUE, append = TRUE, driver = "GPKG", quiet = FALSE)
```
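To confirm the geopackage round-trips, here is a minimal sketch of reading a layer back in, assuming the layer name written above:
```{r read_gpkg, eval=FALSE}
# read the habitat points back out of the geopackage by layer name
hab_check <- st_read("bbsg_v1_2024_05_06.gpkg", layer = "habitat")
class(hab_check)
```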
## Import SSN Object & Extract Data Frame
We will now start using the James SSN. It is easier to explore the data frame of an SSN object by extracting it with `ssn_get_data()`. Note below that DFobs is both an sf object and a data frame, so we can readily use it to make an interactive map.
```{r ssn_import_extract}
j_ssn1a <- SSN2::ssn_import("ssn_object/James_071024_pluspreds.ssn", predpts = "sites")
class(j_ssn1a)
DFobs <- SSN2::ssn_get_data(j_ssn1a)
class(DFobs)
mapview(DFobs)
names(DFobs)
DFobs <- clean_names(DFobs)
names(DFobs)
# now including specified pop-ups
mapview(DFobs, zcol = "vscivcpmi", cex = "vscivcpmi", alpha.regions = .8, legend = TRUE, popup = popupTable(DFobs, zcol = c("station_id_2", "vscivcpmi")))
```
## Distance Summary
As we will be looking at spatial variation in vsci, or semivariance, as a function of distance, it is important to get a distance summary of the stations.
We use the st_distance function to create a pairwise distance matrix among all the sites. Then, because a Euclidean distance matrix is symmetrical, we extract half of that matrix, convert it to a vector, and get a summary of the distances: min, Q1, median, mean, Q3, and max. Knowing the maximum distance is important because the `variogram()` function uses that distance to set its default cutoff. The cutoff is the spatial separation distance up to which pairs of points are included in semivariance estimates; the default is one-third of the length of the diagonal of the bounding box spanning the data, and that diagonal is approximately the maximum distance. Our default cutoff will be 299,805/3 = 99,935 meters, or ~100 km.
```{r distance}
distmat_obs <- st_distance(DFobs)
dim(distmat_obs) # dimensions of the pairwise distance matrix
# the matrix is symmetric, so keep only the lower triangle (each pair of sites once)
rdistmat1_obs <- as.vector(distmat_obs[lower.tri(distmat_obs)])
round(summary(rdistmat1_obs))
```
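Before moving on, we can check that default cutoff ourselves. This is a minimal sketch, using only the sf functions already loaded, that computes the bounding-box diagonal and divides it by three.
```{r default_cutoff}
# gstat's default cutoff is one-third of the bounding-box diagonal
bb <- st_bbox(DFobs)
diag_length <- sqrt((bb[["xmax"]] - bb[["xmin"]])^2 + (bb[["ymax"]] - bb[["ymin"]])^2)
diag_length     # bounding-box diagonal in meters, ~ the maximum separation distance
diag_length / 3 # default cutoff used by variogram()
```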
## Clouds and Semivariograms
We will start with a semivariogram cloud to examine spatial variation in vsci. Specifically, just as we had calculated all pairwise distances between sites, we will calculate all pairwise semivariances in vsci, and plot those semivariances as a function of their distances.
A semivariogram cloud presents a smear of points, but by binning those points into distance classes, set by the width argument, and averaging the semivariances within each class, we can make a semivariogram plot. With the semivariogram plot we are asking: do stations near each other have similar vsci scores? Is the semivariance, gamma, small at smaller distances, and does it increase at larger distances?
We want the data to be unimodal and approximately symmetric for a semivariogram cloud and plot, so we do some data exploration.
```{r vsci_histogram}
ggplot(DFobs, aes(vscivcpmi)) + geom_histogram()
# histogram of vscivcpmi
# the log transformation does not help, so stick with the original data
ggplot(DFobs, aes(log(vscivcpmi))) + geom_histogram()
summary(DFobs$vscivcpmi)
```
In this semivariogram cloud, we see that points nearer each other, on the left-hand part of the x-axis, have small semivariances, and as we move right to points further apart, the semivariances increase.
We can view the semivariogram cloud as a data frame and sort on gamma, the semivariance, to find that our maximum gamma, 1303.6, occurs at a pair of sites about 60 km apart. But what are the two stations having that maximum gamma, where are those sites, and what are their vsci values?
```{r semivar_cloud}
# Semivariogram Cloud
z_sci_cloud = variogram(vscivcpmi ~ 1, DFobs, cloud = TRUE)
class(z_sci_cloud)
plot(z_sci_cloud) # default plot from gstat
# quite a smear of points
z1_cloud <- as.data.frame(z_sci_cloud)
names(z1_cloud)
View(z1_cloud)
# use ggplot2 and plotly to make cloud interactive
zcloud_plotly <- ggplot(z1_cloud, aes(dist, gamma)) + geom_point() + labs(x = "Euclidean Distance (m)", y = "Semivariance", title = "VSCI Semivariogram Cloud") + scale_x_continuous(labels=comma)
ggplotly(zcloud_plotly)
# https://r-spatial.github.io/gstat/reference/plot.variogramCloud.html
view(DFobs)
```
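Rather than scrolling through the viewer, we can also pull out the pair with the maximum gamma programmatically. This is a minimal sketch; it assumes the gstat cloud columns left and right hold the row numbers of the paired sites in DFobs.
```{r max_gamma_pair}
# sort the cloud data frame by gamma and keep the largest pair
z1_max <- z1_cloud %>% arrange(desc(gamma)) %>% slice(1)
z1_max
# look up the paired sites by their row numbers in DFobs
DFobs[c(z1_max$left, z1_max$right), c("station_id_2", "vscivcpmi")]
```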
## Hand Calculate Gamma
Based on the row numbers of the data frames, we find the vsci values for the two stations that produced the maximum gamma.
```{r calculate_gamma}
# 27 and 99 are the row numbers in the data frame DFobs, so we can get the vsci measurements and hand calculate the gamma we saw in the cloud.
# From DFobs we get
# 27, site 2-BLB002.04, has a vscivcpmi of 79.7
# 99, site 2-PWT001.97, has a vscivcpmi of 28.7
# gamma = 1/2(z1 - z2)^2 = 1/2(79.7 - 28.7)^2 = 1/2(51^2) = 1/2(2601)
# gamma = 1300.5
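# a minimal check of the hand calculation in R (vsci values copied from DFobs above)
z1 <- 79.7
z2 <- 28.7
0.5 * (z1 - z2)^2 # 1300.5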
```
To deal with the smear of data from the cloud, we are going to bin the data by a distance class, or lag, so we can calculate the average distance in each bin and its corresponding average gamma, or semivariance. Now that we are working with averages, you can see why we want the data to be unimodal and symmetric.
Note the smaller scale now on the y-axis. How are these 15 points, with their semivariances and distances, related to the cloud we saw earlier? When we view the semivariogram plot data frame, z_sci_sv1, we can see the number of pairs of points and the average distance and average gamma corresponding to each point in the plot.
```{r semivariogram_plot}
# Semivariogram Plot
# accept the defaults
z_sci_sv1 = variogram(vscivcpmi ~1, DFobs)
plot(z_sci_sv1, ylab = "VSCI Semivariance")
plot(z_sci_sv1, plot.numbers = TRUE) # shows the number of pairs in each binned point, which we also see in the corresponding semivariogram object
View(z_sci_sv1)
```
## Specify Cutoff and Width Arguments
We now change the cutoff and width defaults. Width specifies the distance class, or bin or lag size, over which we average the distances and semivariances we had in the cloud. The specifications below mean we look at the points from the cloud in 10 km increments out to 80 km, so this semivariogram plot will have 8 points plotted.
```{r change_width}
z_sci_sv2 = variogram(vscivcpmi ~1, DFobs, cutoff = 80000, width = 10000)
plot(z_sci_sv2, ylab = "VSCI Semivariance")
View(z_sci_sv2)
# now view the cloud data frame and filter for distances (dist) of 0-10,000 m
View(z1_cloud)
# 307 cloud points had their distances and gammas averaged to make the first point in the semivariogram plot
filter(z_sci_cloud, dist <= 10000) %>%
summarize(
np = dplyr::n(),
mean_dist = mean(dist),
mean_gamma = mean(gamma)
)
# these results match the z_sci_sv2 results
```
## Random Semivariograms
Up to now we have calculated an empirical semivariogram based on the VSCI data collected at the ProbMon stations. We now want to compare that empirical semivariogram to randomized semivariograms. This comparison lets us evaluate whether we have evidence of spatial dependence (spatial autocorrelation) versus spatial independence.
I downloaded Chapter 8 code from Applied Spatial Data Analysis with R, <https://asdar-book.org/>, for generating random variograms. An initial empirical, or sample, variogram is calculated and plotted for vsci, and that object is saved. Then vsci values are randomly sampled and assigned to the coordinates so that a random semivariogram can be calculated. That is done 100 times, and the grey lines of the random semivariograms are plotted with the blue line of the empirical semivariogram. The semivariogram for vsci shows suggestive evidence of spatial autocorrelation compared to randomized semivariograms, which illustrate the assumption of spatial independence.
```{r random_semivar}
print(xyplot(gamma ~ dist, z_sci_sv1, pch = 3, type = 'b', lwd = 2, col = 'darkblue',
panel = function(x, y, ...) {
for (i in 1:100) {
DFobs$random = sample(DFobs$vscivcpmi)
v = variogram(random ~ 1, DFobs)
llines(v$dist, v$gamma, col = 'grey')
}
panel.xyplot(x, y, ...)
},
ylim = c(0, 160), xlab = 'distance', ylab = 'VSCI semivariance'
))
```
## Robust Semivariogram
Calculating a robust semivariogram lets us assess if outliers could be affecting the spatial pattern.
```{r robust_semivar}
# use the Cressie-Hawkins estimator to create a robust variogram
z_sci_sv3 = variogram(vscivcpmi ~ 1, DFobs, cressie = TRUE)
plot(z_sci_sv3, cressie = TRUE, ylab = "VSCI Robust Semivariance")
# change ylim below in random semivariogram based on this plot
print(xyplot(gamma ~ dist, z_sci_sv3, pch = 3, type = 'b', lwd = 2, col = 'darkblue',
panel = function(x, y, ...) {
for (i in 1:100) {
DFobs$random = sample(DFobs$vscivcpmi)
v = variogram(random ~ 1, DFobs, cressie = TRUE)
llines(v$dist, v$gamma, col = 'grey')
}
panel.xyplot(x, y, ...)
},
ylim = c(0, 160), xlab = 'distance', ylab = 'VSCI robust semivariance'
))
```
There was no marked difference in the first two bins between the classical and the Cressie-Hawkins semivariances; the default and robust semivariograms often do not differ dramatically, as is the case here. Note that the R package spmodel, <https://cran.r-project.org/package=spmodel>, calculates and plots both types of semivariograms.
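For comparison, here is a minimal sketch of the empirical semivariogram via spmodel's esv() function, plotted with ggplot2; the output column names dist and gamma are an assumption, so check ?esv before relying on them.
```{r spmodel_esv, eval=FALSE}
# sketch: empirical semivariogram with spmodel's esv(); assumes columns dist and gamma
ev <- esv(vscivcpmi ~ 1, data = DFobs)
ev
ggplot(ev, aes(dist, gamma)) +
  geom_point() +
  labs(x = "Euclidean Distance (m)", y = "Semivariance",
       title = "VSCI Empirical Semivariogram (spmodel)") +
  scale_x_continuous(labels = comma)
```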
## After ESDA
The follow-up to exploratory spatial data analysis is spatial modeling, so check out the workshop by Mike Dumelle and Ryan Hill. For extra credit, analyze vsci as a function of elevation, first with a non-spatial, simple linear regression, and then with a spatial model based on Euclidean distance (a sketch follows the Canion et al. reference below). However, to use the stream network distances we also have, we will need the SSN2 R package. An SSN analysis handles three kinds of distances: Euclidean, flow-connected (when monitoring sites have an upstream to downstream relationship), and flow-unconnected (when monitoring sites are on different branches of the network and share a confluence). An alternative analysis to consider is random forest regression kriging, which is available via the splmRF function in the spmodel package. The paper by Canion et al. (2019) is the best example I know of showing the benefits of random forest regression kriging over a plain random forest analysis (see Table 3).
Canion, A., McCloud, L. and Dobberfuhl, D., 2019. Predictive modeling of elevated groundwater nitrate in a karstic spring-contributing area using random forests and regression-kriging. Environmental Earth Sciences, 78(9), p.271.
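Below is a hedged sketch of that extra-credit comparison; the elevation column name elev is hypothetical, so substitute the elevation variable actually present in DFobs.
```{r extra_credit, eval=FALSE}
# non-spatial vs. spatial (Euclidean) model of vsci as a function of elevation
# NOTE: the column name `elev` is hypothetical; replace it with the elevation
# variable actually present in DFobs
m_lm <- lm(vscivcpmi ~ elev, data = DFobs)
summary(m_lm)
m_splm <- splm(vscivcpmi ~ elev, data = DFobs, spcov_type = "exponential")
summary(m_splm)
```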
This blog describes a limitation of tree-based predictive analytics.
<https://freerangestats.info/blog/2016/12/10/extrapolation#:~:text=Extrapolation%20is%20tough%20for%20trees!%20At%20a%20glance:%20Tree-based%20predictive>
## Resources: Workshops, Websites, and Books
Some spatial data analysis resources, each with a brief description followed by the link.
Geospatial Data Science in R website, semivariogram modeling.
<https://zia207.github.io/geospatial-r-github.io/semivariogram-modeling.html>
Class notes from Professor Dixon at Iowa State University on semivariograms.
<https://pdixon.stat.iastate.edu/stat406/notes/part%203b-4.pdf>
Link to a free pdf of "Spatial Statistical Data Analysis for GIS Users" by Konstantin Krivoruchko at ESRI.
<https://community.esri.com/t5/arcgis-geostatistical-analyst/quot-spatial-statistical-data-analysis-for-gis-users-quot/td-p/394418>
Article on spmodel R package.
Dumelle, M., Higham, M. and Ver Hoef, J.M., 2023. spmodel: Spatial statistical modeling and prediction in R. PLoS ONE, 18(3), p.e0282524.
<https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0282524>
Article on SSN2 R package.
<https://joss.theoj.org/papers/10.21105/joss.06389.pdf>
An Introduction to 'SSNbler': Assembling Spatial Stream Network (SSN) Objects in R.
<https://pet221.github.io/SSNbler/articles/introduction.html>
Book on Spatial Linear Models for Environmental Data by Dale Zimmerman and Jay M. Ver Hoef.
<https://www.routledge.com/Spatial-Linear-Models-for-Environmental-Data/Zimmerman-VerHoef/p/book/9780367183349>
Recent workshop by EPA Statistician Mike Dumelle and EPA Aquatic Biologist Ryan Hill on Spatial Analysis and Statistical Modeling with R and spmodel.
<https://usepa.github.io/spworkshop.sfs24/>
Using geopackage to save/store spatial data.
<https://mapping-in-r-workshop.ryanpeek.org/02_import_export_gpkg>
Spatial Data and Analysis in R Workshop by EPA Geographer Marc Weber.
<https://mhweber.github.io/R-User-Group-Spatial-Workshop-2021/>
## Disclaimer
The United States Environmental Protection Agency (EPA) project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.