collocation_notebook_MODIS_VIIRS_in-situ.Rmd

---
title: "Triple Collocation of VIIRS, Aqua, and Buoy Matchups"
output:
  html_notebook:
    toc: True
    theme: united
---

This notebook describes the steps necessary to conduct a triple collocation
study of SST values derived from (a) Aqua - MODIS, (b) Suomi_NPP - VIIRS and
(c) in situ observations form moored and drifting buoys.

```{r prep_workspace, include=FALSE}
# Load necessary packages
library(dplyr)
library(ggplot2)
library(mapview)
library(futile.logger)
library(lubridate)
library(tidyr)

ggplot <- function(...) {ggplot2::ggplot(...) + theme_bw()}

# Source own functions
# Define direwctory where functions are for each operating system

if (Sys.info()["sysname"] == 'Windows') {
  fun.dir <- 'D:/matchups/r-projects/Matchup_R_Scripts/Functions/'
} else if (Sys.info()["sysname"] == 'Linux') {
  fun.dir <- '/home/ckk/Projects/Matchup_R_Scripts/Functions/'
}

fun.file <- paste0(fun.dir, 'common_functions.R')

source(file = fun.file,
  local = FALSE, echo = FALSE, verbose = FALSE)
rm(fun.dir, fun.file)

```


# Building a Dataset for Triple Collocation

In a previous step, we read in matchups for Aqua and VIIRS and stored them in
binary (.Rdata) objects. For Aqua, we modified the regular expression to list 
files so that only matchups from 2010 onwards were read. The first step is to
restore those objects. Note that the name of the stored binary objects (.Rdata)
contain the date in which they were created. Be sure to edit the names of the
objects to ensure that the most recent versions are being used.  

```{r load_VIIRS_AQUA_objects, include=FALSE}

# Load .Rdata object for VIIRS matchups resulting from script 01_read_matchups.R
# We assume... TO DO

if (Sys.info()["sysname"] == 'Windows') {
  # Windows directory for VIIRS matchups
  win.dir <- 'D:/matchups/r-projects/R_MUDB/viirs/v641ao/mia/results/objects/'
  # Name of .Rdata file (update as needed)
  win.file <- 'VIIRS_Suomi_NPP_MIA_L2GEN_ALL_Class_6.4.1_ao_2016_11_23_with_ancillary.Rdata'
  viirs.file <- paste0(win.dir, win.file)
  if (!file.exists(viirs.file)) {
    stop('Input file does not exist')
  } else {
    load(viirs.file, verbose = TRUE)
  }
  rm(win.dir, win.file, viirs.file)
} else if (Sys.info()["sysname"] == 'Linux') {
  # Linux directory for VIIRS matchups
  linux.dir <- '~/Projects/Matchup_R_Scripts/Results/objects/'
  # Name of .Rdata file (update as needed)
  linux.file <- 'VIIRS_Suomi_NPP_MIA_L2GEN_ALL_Class_6.4.1_ao_2016_11_22_with_ancillary.Rdata'
  viirs.file <- paste0(linux.dir, linux.file)
  if (!file.exists(viirs.file)) {
    stop('Input file does not exist')
  } else {
    load(viirs.file, verbose = TRUE)
  }
  rm(linux.dir, linux.file, viirs.file)
}

# Create 'VIIRS' object so it has a shorter name
VIIRS <- VIIRS_Suomi_NPP_MIA_L2GEN_ALL_Class_6.4.1_ao_2016_11_22
rm(VIIRS_Suomi_NPP_MIA_L2GEN_ALL_Class_6.4.1_ao_2016_11_22)

# Load .Rdata object for AQUA matchups resulting from script 01_read_matchups.R

if (Sys.info()["sysname"] == 'Windows') {
  # Windows directory for AQUA matchups
  win.dir <- 'D:/matchups/r-projects/R_MUDB/modis/aqua/gsfc/l6cv6/results/objects/'
  # Name of .Rdata file (update as needed)
  win.file <- 'MODIS_Aqua_GSFC_ALL_Class_6.4.1_ao_2016_11_23_with_ancillary.Rdata'
  viirs.file <- paste0(win.dir, win.file)
  if (!file.exists(viirs.file)) {
    stop('Input file does not exist')
  } else {
    load(viirs.file, verbose = TRUE)
  }
  rm(win.dir, win.file, viirs.file)
} else if (Sys.info()["sysname"] == 'Linux') {
  # Linux directory for VIIRS matchups
  linux.dir <- '~/Projects/Matchup_R_Scripts/Results/objects/'
  # Name of .Rdata file (update as needed)
  #linux.file <- 'MODIS_Aqua_GSFC_ALL_Class_6.4.1_ao_2016_11_26_with_ancillary.Rdata'
  linux.file <- 'MODIS_Aqua_GSFC_ALL_Class_6.4.1_ao_2017_02_23_with_ancillary.Rdata'
  aqua.file <- paste0(linux.dir, linux.file)
  if (!file.exists(aqua.file)) {
    stop('Input file does not exist')
  } else {
    load(aqua.file, verbose = TRUE)
  }
  rm(linux.dir, linux.file, aqua.file)
}

# Create 'AQUA' object so it has a shorter name
AQUA <- MODIS_Aqua_GSFC_ALL_Class_6.4.1_ao_2017_02_23
rm(MODIS_Aqua_GSFC_ALL_Class_6.4.1_ao_2017_02_23)

```

The restored VIIRS matchups span the period between `r min(VIIRS$buoy.timedate)` and `r max(VIIRS$buoy.timedate)`.The original VIIRS dataset had `r nrow(VIIRS)` matchups.
We also read Aqua matchups for the period between `r min(AQUA$buoy.timedate)`
and `r max(AQUA$buoy.timedate)`. The number of Aqua matchups was `r nrow(AQUA)`.

These restored objects have many more variables than we need.
The Aqua matchups have `r ncol(AQUA)` variables, and the VIIRS matchups have
`r ncol(VIIRS)` columns. Therefore, a second step eliminates many variables
or columns that are not needed for the collocation study.

```{r filter_variables, include = FALSE}

# Define variables that will be retained for collocation study for each sensor
# and matchup format

modis.colloc.vars <- keep_matchup_variables_collocation(matchup.format = 'GSFC',
  sensor = 'MODIS', ancillary.data = TRUE)

viirs.colloc.vars <- keep_matchup_variables_collocation(matchup.format = 'MIA_L2GEN',
  sensor = 'VIIRS', ancillary.data = TRUE)

AQUA.trimmed <- dplyr::select(AQUA, one_of(modis.colloc.vars))
VIIRS.trimmed <- dplyr::select(VIIRS, one_of(viirs.colloc.vars))

rm(modis.colloc.vars, viirs.colloc.vars)

```

After the reduction of unnecessary variables, the trimmed Aqua matchups
have `r ncol(AQUA.trimmed)` variables, and the trimmed VIIRS matchups have
`r ncol(VIIRS.trimmed)` columns. 

The third preparatory step is to convert the VIIRS BTs and SSTs from degrees 
Kelvin to Celsius, whereas Aqua BTs, and SSTs already are expressed in Celsius.
For consistency we convert VIIRS values to degrees Celsius from degrees Kelvin.

```{r convert_VIIRS_K_to_C, echo = FALSE}

# Convert VIIRS BTs and SSTs from degrees Kelvin to Celsius

# Identify columns of 'VIIRS.trimmed' that need to be converted from K to C
pattern <- paste("^cen", "^med", "^max", "^min", sep = "|")
tt1 <- which((grepl(pattern, names(VIIRS.trimmed))))

# Convert selected columns from K to C
VIIRS.trimmed[tt1] <- apply(VIIRS.trimmed[tt1], 2, FUN = degK.to.degC)

rm(pattern, tt1)

```

To confirm that the conversion from Kelvin to Celsius was successful, we output
the summary of the 'cen.sst' column before and after the conversion.

```{r verify_K_to_C, echo = FALSE}
cat('BEFORE K to C conversion\n')
summary(VIIRS$cen.sst)
cat('AFTER K to C conversion\n')
summary(VIIRS.trimmed$cen.sst)

```

Because VIIRS and Aqua SSTs are calculated as 'skin' values, we neeed to add back 0.17 degC
to each of these SSTs to make them compatible with the buoys' 'bulk' SST.

```{r skin_to_bulk}

VIIRS.trimmed$cen.sst <- VIIRS.trimmed$cen.sst + 0.17
AQUA.trimmed$cen.sst <- AQUA.trimmed$cen.sst + 0.17

```

The final preparatory step before joining the VIIRS and Aqua matchups is to add
a suffix to the variables of each data set indicating the source of the variable.
This allows us to see the source of a variable when we join For example, variable
'cen.sst' will be re named to 'cen.sst.vrs' if it corresponds to VIIRS and 'cen.sst.aqu'
if it comes from the Aqua - MODIS matchups. We do not change the names of columns 'buoy.id'
and 'buoy.date', as those are the common variables in both datasets that will be used
to join them in the next section.

```{r rename_columns, echo=FALSE}

# Change column names of VIIRS and AQUA objects so that when we join
# the source of each variable (VIIRS or AQUA) is clear.

colnames(AQUA.trimmed) <- paste0(colnames(AQUA.trimmed), ".aqu")
colnames(VIIRS.trimmed) <- paste0(colnames(VIIRS.trimmed), ".vrs")

# We do not change the names of columns 'buoy.id' and 'buoy.date'
# because they are used as common keys for joining VIIRS and AQUA data.

tt1 <- colnames(AQUA.trimmed)
tt2 <- which(tt1 %in% "buoy.date.aqu")
colnames(AQUA.trimmed)[tt2] <- "buoy.date"
tt3 <- which(tt1 %in% "buoy.id.aqu")
colnames(AQUA.trimmed)[tt3] <- "buoy.id"

tt1 <- colnames(VIIRS.trimmed)
tt2 <- which(tt1 %in% "buoy.date.vrs")
colnames(VIIRS.trimmed)[tt2] <- "buoy.date"
tt3 <- which(tt1 %in% "buoy.id.vrs")
colnames(VIIRS.trimmed)[tt3] <- "buoy.id"

rm(tt1,tt2,tt3)

# Output a few variables to confirm the renaming
colnames(VIIRS.trimmed[1:10])
colnames(AQUA.trimmed[1:10])

```

## Joining VIIRS and Aqua Matchups

A collocation study requires data from at least three sources that are relatively 
coincident in space and time. We require that each collocated record include SST
measurements from all three techniques collected within at most 60 minutes and one 
kilometer from each other. We have two SST measurements derived from satellite-based
infrared radiometers (Suomi Npp - VIIRS and Aqua - MODIS). Since the buoy data are
common elements to both VIIRS and Aqua matchups, we use them as a common key to 
join Aqua and VIIRS matchups. 

```{r join_VIIRS_and_AQUA, echo=TRUE}

# Convert timedates from POSIXct objects to characters for the dplyr joining
# dplyr does not accept POSIXct variables for joining

VIIRS.trimmed$buoy.timedate.vrs <- as.character(VIIRS.trimmed$buoy.timedate.vrs)
VIIRS.trimmed$sat.timedate.vrs <- as.character(VIIRS.trimmed$sat.timedate.vrs)
AQUA.trimmed$buoy.timedate.aqu <- as.character(AQUA.trimmed$buoy.timedate.aqu)
AQUA.trimmed$sat.timedate.aqu <- as.character(AQUA.trimmed$sat.timedate.aqu)

# Join VIIRS and AQUA into one object (orig_j)

orig_j <- dplyr::inner_join(VIIRS.trimmed, AQUA.trimmed,
  by = c("buoy.date", "buoy.id"))

# This line excludes non-exact matches and we lose too many potentially useful matchups
#orig_j <- dplyr::inner_join(AQUA, VIIRS, by = c("buoy.pftime", "buoy.lon", "buoy.lat", "buoy.id"))

```

```{r dates_back_to_POSIX, echo=TRUE}

# After join, we convert back the timedate objects
# from character into POSIXct objects so they can be used in time calculations

tt1 <- orig_j %>% dplyr::select(buoy.timedate.aqu)
tt2 <- as.vector(tt1[[1]])
tt3 <- lubridate::ymd_hms(tt2)
orig_j$buoy.timedate.aqu <- tt3

tt1 <- orig_j %>% dplyr::select(sat.timedate.aqu)
tt2 <- as.vector(tt1[[1]])
tt3 <- lubridate::ymd_hms(tt2)
orig_j$sat.timedate.aqu <- tt3

tt1 <- orig_j %>% dplyr::select(buoy.timedate.vrs)
tt2 <- as.vector(tt1[[1]])
tt3 <- lubridate::ymd_hms(tt2)
orig_j$buoy.timedate.vrs <- tt3

tt1 <- orig_j %>% dplyr::select(sat.timedate.vrs)
tt2 <- as.vector(tt1[[1]])
tt3 <- lubridate::ymd_hms(tt2)
orig_j$sat.timedate.vrs <- tt3

rm(tt1,tt2,tt3)
```

## Filtering VIIRS and Aqua Matchups

At this point, we have `r nrow(orig_j)` joined matchups. However, these matchups
do not have enough spatial or temporal coincidence to meet our requirements stated 
above or for a triple collocation error analysis. We did not join by the buoy's 
timedate or latitude/longitude because the 'dplyr::inner_join' function only matches
if key values are *exactly* the same in both datasets, thereby reducing the number of
potentially useful matchups. We instead apply a series of filtering steps to ensure the
conditions stated above for a triple collocation error analysis. We also filter matchups
so we only have good quality retrievals to minimize the chances of errors in satellite
SST retrievals due to cloud contamination.

```{r filtering_orig_j, echo=FALSE}

# Filter collocated data by solz, platform, time difference

orig_filtered <- dplyr::tbl_df(orig_j) %>%
  # Only nightime retrievals (solar zenith angle >= 90 degrees)
  dplyr::filter(solz.aqu >= 90 & solz.vrs >= 90) %>%
  # Allow only buoy SST measurements (not ships or radiometers) 
  dplyr::filter(insitu.platform.aqu %in% c("DriftingBuoy","MooredBuoy")) %>%
  # Eliminate SST qualities for AQUA and VIIRS less than 0 (best quality) 
  dplyr::filter(qsst.aqu == 0 & qsst.vrs == 0) %>%
  # VIIRS and AQUA sat times must be less or equal than 1hr (3600 secs) from each other
  dplyr::filter(abs(sat.timedate.aqu - sat.timedate.vrs) <= 3600) %>%
  # Buoy retrieval times for VIIRS and AQUA must be less than 1 min apart
  # The buoy times should coincide but may have roundoff differences
  dplyr::filter(abs(buoy.timedate.aqu - buoy.timedate.vrs) <= 60)

# Allow only retrievals for which distance between the buoy positions
# listed in VIIRS and AQUA are less than 100 meters

uu1 <- cbind(orig_filtered$buoy.lon.vrs, orig_filtered$buoy.lat.vrs)
uu2 <- cbind(orig_filtered$buoy.lon.aqu, orig_filtered$buoy.lat.aqu)
uu3 <- geosphere::distVincentyEllipsoid(uu1, uu2)
cat('Distance between buoy positions in VIIRS and AQUA\n')
summary(uu3)
uu4 <- which(uu3 < 100) # Identify distances < 100 m
orig_filtered <- orig_filtered[uu4, ] # Filter out distances >= 100 m
rm(uu1,uu2,uu3,uu4)

# Explore distances between satellite and buoy positions for VIIRS
uu1 <- cbind(orig_filtered$buoy.lon.vrs, orig_filtered$buoy.lat.vrs)
uu2 <- cbind(orig_filtered$sat.lon.vrs, orig_filtered$sat.lat.vrs)
uu3 <- geosphere::distVincentyEllipsoid(uu1, uu2)
cat('Distance between sat and buoy positions for VIIRS\n')
summary(uu3)
rm(uu1,uu2,uu3)

# Explore distances between satellite and buoy positions for AQUA
uu1 <- cbind(orig_filtered$buoy.lon.aqu, orig_filtered$buoy.lat.aqu)
uu2 <- cbind(orig_filtered$sat.lon.aqu, orig_filtered$sat.lat.aqu)
uu3 <- geosphere::distVincentyEllipsoid(uu1, uu2)
cat('Distance between sat and buoy positions for AQUA\n')
summary(uu3)
rm(uu1,uu2,uu3)

# Explore distances between VIIRS and AQUA pixel positions
uu1 <- cbind(orig_filtered$sat.lon.vrs, orig_filtered$sat.lat.vrs)
uu2 <- cbind(orig_filtered$sat.lon.aqu, orig_filtered$sat.lat.aqu)
uu3 <- geosphere::distVincentyEllipsoid(uu1, uu2)
cat('Distance between satellite pixel positions for AQUA and VIIRS\n')
summary(uu3)
rm(uu1,uu2,uu3)

```

After applying the various filters we are left with `r nrow(orig_j)` matchups. 
These matchups now have enough spatial and temporal coincidence for us to apply
the collocation techniques.

## Spatial distribution of collocated matchups

Now that we a collocated database with enough temporal and spatial similarity to
be used in collocation, we explore the nature of this database. Here we explore
the geospatial distribution collocated matchups. We show that collocated matchups
exist all over the world with enough incidence in different regions for us to interpret
the results of a global collocation experiment without bias.

A standard Mercator map with points for collocated matchups.
```{r points_on_map}
library(maps)
library(mapdata)

Cairo::CairoSVG(file = 'matchups_map.ps',
         width = 9, height = 6, onefile = TRUE,
         bg = "transparent", pointsize = 18)

maps::map("world",
  col = "gray70",
  fill = TRUE)

points(orig_filtered$buoy.lon.aqu, orig_filtered$buoy.lat.aqu,
  pch = '.', col = 'tomato')
box()
dev.off()

```

The same plots as above just using **ggplot** instead.
```{r points_on_map_ggmap}
# Plot location of collocated data 

library(ggmap)
mp <- NULL
mapWorld <- borders("world", colour="gray50", fill="gray70") # create a layer of borders
mp <- ggplot() +   mapWorld
mp <- mp + ggplot2::labs(x = NULL, y = NULL)

# Now layer the matchup locations on top
mp <- mp + ggplot2::geom_point(data = orig_filtered,
  aes(x = buoy.lon.aqu, y = buoy.lat.aqu),
  color = "tomato", size = 0.7, alpha = 0.2) +
#ggplot2::ggsave(filename = 'point_distribution.ps', device = 'ps',
       #width = 8, height = 6, units = 'in') +
ggplot2::scale_x_continuous(breaks = seq(from = -180, to = 180, by = 60)) +
ggplot2::scale_y_continuous(breaks = seq(from = -90, to = 90, by = 30))

plot(mp)

```

This is the standard Mercator projection of a map with hexagonal bins overlayed showing
the density of points per bin.

```{r map_with_hexbinning}
# Create base map
require(ggmap)

mp <- NULL
mapWorld <- borders("world", colour = "gray50", fill = "gray70") # create a layer of borders
mp <- ggplot() +  mapWorld +
  ggplot2::scale_x_continuous(breaks = seq(from = -180, to = 180, by = 60)) +
  ggplot2::scale_y_continuous(breaks = seq(from = -90, to = 90, by = 30)) +
  coord_fixed(ratio = 1)

# WARNING: Be careful as this worked with ggplot2 2.1.0 on Linux
# Try to work on different machines with varying versions of ggplot2
# Now layer the buoys on top with hexagonal binning
mp <- mp + 
  ggplot2::stat_binhex(data = orig_filtered,
    aes(x = buoy.lon.aqu, # Use hexagonal binning command and give x and y input
    y = buoy.lat.aqu,
    fill = cut(..count.., c(0, 100, 500, 1000, 1500, Inf))), # Divides matchups per bin into discrete chunks and colors likewise
    binwidth = c(10, 10)) +
  mapWorld + labs(x = NULL, y = NULL) +
  #scale_fill_hue('value') + # Standard colors with discrete chunking
  scale_fill_brewer(palette = 'YlOrRd') + # Change colors to Yellow, Orange, and Red - many diff 
  guides(fill = guide_legend(title = "N of matchups"))

ggplot2::ggsave(filename = 'point_distribution.ps', device = 'ps',
       width = 8, height = 6, units = 'in')
mp
```

An interactive map.
```{r interactive_map, echo=TRUE}
matchup.coords <- as.matrix(cbind(lon = orig_filtered[, 'buoy.lon.aqu'],
  lat = orig_filtered[, 'buoy.lat.aqu']))

crs.string <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"

pts <- sp::SpatialPoints(coords = matchup.coords,
  proj4string = sp::CRS(crs.string))

pts <- as.data.frame(pts)
pts <- data.frame(pts, orig_filtered$insitu.platform.aqu, orig_filtered$buoy.id)
sp::coordinates(pts) <- ~ buoy.lon.aqu + buoy.lat.aqu
sp::proj4string(pts) <- sp::CRS(crs.string)

mapview::mapview(pts, alpha = .2, cex = 1)
```


## Temporal distribution of collocated matchups

In order to interpret the global collocation properly, we must understand the temporal 
distribution of matchups. Algorithm changes will effect the accuracy of SST retrievals
and SST accuracy is dependent on seasonal atmospheres (hence the recomputation of algorithms
coefficients on a seasonal bases). We show that there are more matchups during the North American
summer months than the winter months. This finding allows us to interpret the results of
the global collocation with the appropriate understanding that the global standard deviation
of random errors will be biased towards the standard deviation of errors in summer months.

```{r heatmap_of_matchups}

yy <- lubridate::year(orig_filtered$buoy.timedate.vrs)     # Year
mm <- lubridate::month(orig_filtered$buoy.timedate.vrs)    # Month

tt1 <- cbind(yy, mm)

tt2 <- dplyr::tbl_df(tt1) %>%
  dplyr::group_by(yy, mm) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(yy, mm)

yy.vals <- sort(unique(yy))
mm.vals <- sort(unique(mm))
tt3 <- expand.grid(mm = mm.vals, yy = yy.vals)

tt4 <- dplyr::left_join(tt3, tt2) %>%
  dplyr::select(yy, mm, n)

brks <- pretty(tt4$n, n = 5)
#brks <- floor(quantile(tt4$n, probs = seq(0, 1, 0.25), na.rm = TRUE))

tt4$n <- cut(tt4$n, breaks = brks)
tt4$mm <- ordered(tt4$mm, labels = month.abb)
tt4$yy <- ordered(tt4$yy)

hm <- ggplot2::ggplot(data = tt4, aes(mm, yy)) +
  ggplot2::geom_tile(aes(fill = n), colour = "white") +
  ggplot2::scale_fill_brewer(type = "seq", palette = 'YlOrRd', direction = 1) +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  ggplot2::ggsave("Temporal_heatmap.ps", device = 'ps',
       width = 8, height = 6, units = 'in')
hm
```

## Association between all SST sources

A key assumption in collocation is that the random errors must be uncorrelated. Because
MODIS and VIIRS are two very similar instruments that use similar retrieval techniques (i.e.
both are infrared radiometers) and view extraordinarily similar atmospheres (see high temporal
and spatial similarity explored above), we hypothesize that Aqua and VIIRS retrieval random
errors are highly correlated. We attempt to explore the validity of no correlation assumption
within our data. We start by exploring just the correlation between retrievals in each of the three sensors.

```{r viirs_aqua_sst, echo=FALSE}

ccc <- RColorBrewer::brewer.pal(n = 8, name = 'YlOrRd')
ddd <- orig_filtered$cen.sst.aqu - orig_filtered$cen.sst.vrs

ggplot(data = orig_filtered, mapping = aes(x = cen.sst.aqu, y = ddd)) +
  stat_bin2d(binwidth = c(0.2, 0.1)) +
  scale_fill_gradientn(colours = ccc) +
  geom_abline(intercept = 0, slope = 0, color = 'steelblue') +
  lims(x = c(-2, 32), y = c(-3,3)) +
  ggplot2::labs(title = "VIIRS and AQUA SSTs", x = 'AQUA SST', y = 'VIIRS SST minus AQUA SST') +
  ggplot2::ggsave(filename = 'AQUA_and_VIIRS.ps', device = 'ps',
       width = 8, height = 6, units = 'in') +
  theme(text = element_text(size = 16))

fit_aqua_viirs_sst <- lm(cen.sst.aqu ~ cen.sst.vrs, data = orig_filtered)

# Compute robust standard deviation

RSD_aqua_viirs <- IQR(fit_aqua_viirs_sst$residuals) / 1.348

```

```{r viirs_buoy_sst, echo=FALSE}
ccc <- RColorBrewer::brewer.pal(n = 8, name = 'YlOrRd')
ddd <- orig_filtered$cen.sst.vrs - orig_filtered$buoy.sst.vrs

ggplot(data = orig_filtered, mapping = aes(x = cen.sst.vrs, y = ddd)) +
  stat_bin2d(binwidth = c(0.2, 0.1)) +
  scale_fill_gradientn(colours = ccc) +
  geom_abline(intercept = 0, slope = 0, color = 'steelblue') +
  lims(x = c(-2, 32), y = c(-3,3)) +
  ggplot2::labs(title = "VIIRS and Buoy SSTs", x = 'VIIRS SST', y = 'VIIRS SST minus Buoy SST') +
  ggplot2::ggsave(filename = 'VIIRS_and_Buoy.ps', device = 'ps',
       width = 8, height = 6, units = 'in') +
  theme(text = element_text(size = 16))

fit_viirs_buoy_sst <- lm(cen.sst.vrs ~ buoy.sst.vrs, data = orig_filtered)

# Compute robust standard deviation

RSD_viirs_buoy <- IQR(fit_viirs_buoy_sst$residuals) / 1.348
```

```{r aqua_buoy_sst, echo=FALSE}
ccc <- RColorBrewer::brewer.pal(n = 8, name = 'YlOrRd')
ddd <- orig_filtered$cen.sst.aqu - orig_filtered$buoy.sst.vrs

ggplot(data = orig_filtered, mapping = aes(x = cen.sst.aqu, y = ddd)) +
  stat_bin2d(binwidth = c(0.2, 0.1)) +
  scale_fill_gradientn(colours = ccc) +
  geom_abline(intercept = 0, slope = 0, color = 'steelblue') +
  lims(x = c(-2, 32), y = c(-3,3)) +
  ggplot2::labs(title = "AQUA and Buoy SSTs", x = 'AQUA SST', y = 'AQUA SST minus Buoy SST') +
  ggplot2::ggsave(filename = 'AQUA_and_Buoy.ps', device = 'ps',
       width = 8, height = 6, units = 'in') +
  theme(text = element_text(size = 16))

fit_aqua_buoy_sst <- lm(cen.sst.aqu ~ buoy.sst.vrs, data = orig_filtered)

# Compute robust standard deviation

RSD_aqua_buoy <- IQR(fit_aqua_buoy_sst$residuals) / 1.348
```

The correlation between SST sources is extraordinarily high. However, this still does 
not tell us what the correlation between random errors in retrievals may be. We attempt
to estimate this parameter by finding the correlation between Aqua minus buoy SSTs and
VIIRS - buoy SSTs. Because of the validity of buoy SSTs, we can assume the buoy
SST to be the true value of the measured SST variable. Aqua minus buoy SST and VIIRS
minus buoy SST are then approximations for the random error. We find the correlation
in the approximate random error.

## Correlation between Aqua and VIIRS Errors
```{r aqua_viirs_errors}

tt1 <- orig_filtered$cen.sst.aqu - orig_filtered$buoy.sst.vrs
tt2 <- orig_filtered$cen.sst.vrs - orig_filtered$buoy.sst.vrs
ccc <- RColorBrewer::brewer.pal(n = 9, name = 'YlOrRd')[3:9]

errors <- ggplot(mapping = aes(x = tt1, y = tt2)) +
  stat_bin2d(binwidth = c(0.1, 0.1)) +
  scale_fill_gradientn(colours = ccc) +
  geom_abline(intercept = 0, slope = 1, color = 'steelblue') +
  lims(x = c(-3, 3), y = c(-3,3)) +
  ggplot2::labs(title = "AQUA and VIIRS SST Errors", x = 'AQUA SST minus Buoy SST', y = 'VIIRS SST minus Buoy SST') +
  ggplot2::ggsave(filename = 'AQUA_and_VIIRS_errors.ps', device = 'ps',
       width = 8, height = 8, units = 'in') +
  theme(text = element_text(size = 16))

fit_aqua_viirs_errors <- lm(tt1 ~ tt2)

# Compute robust standard deviation

RSD_aqua_viirs_errors <- IQR(fit_aqua_viirs_errors$residuals) / 1.348

```

We show that the correlation between Aqua and VIIRS errors is extraordinarily high, 
almost `r cor(tt1, tt2)`, for two sources whose errors should be uncorrelated for collocation to work 
correctly. We now conduct the collocation experiment and explore our results, knowing that
Aqua and VIIRS errors are correlated.

# Error estimation using triple collocation

## Global experiment: all matchups in collocated DB (CDB)

We find that the collocation estimated standard deviations do not agree with previous
experiments to quantify the random error. We determine correlation between random errors of
Aqua and VIIRS to be the cause for this apparent discrepancy.

```{r collocation_functions, include=FALSE}

compute_bias <- function(source1, source2, source3) {
  if (!is.null(nrow(source1)) |
    !is.null(nrow(source2)) |
    !is.null(nrow(source3)) |
    length(source1) <= 1 |
    length(source2) <= 1 |
    length(source3) <= 1) {
    stop("Only vectors can be passed as arguments to the calculate_bias function.")
  }

  bias_source1source3 <- mean(source1) - mean(source3) #Bias between source1 and source2
  
  bias_source2source3 <- mean(source2) - mean(source3) #Bias between source2 and source3
  
  # Return a dataframe of biases that can re-scale each observation type
  
  biases <- c(bias_source1source3, bias_source2source3)
  names(biases) <- c("bias_src1src3", "bias_src2src3")
  
  return(biases)
}   # End of compute_bias


compute_variance <- function(source1, source2, source3) {
  if (!is.null(nrow(source1)) |
    !is.null(nrow(source2)) |
    !is.null(nrow(source3)) |
    length(source1) <= 1 |
    length(source2) <= 1 |
    length(source3) <= 1) {
    stop("Only vectors can be passed as arguments to the calculate_variance function")
  }
  
  # Call bias function to re-scale measurements
  biases <- compute_bias(source1, source2, source3)
  
  #Scale sources to one source to compute variance
  source1_scaled <- source1 - biases['bias_src1src3']
  source2_scaled <- source2 - biases['bias_src2src3']
  
  # Variance of errors in observation types
  variance_source1 <- mean((source1_scaled - source2_scaled) * (source1_scaled - source3))              # Variance of source1
  variance_source2 <- mean((source1_scaled - source2_scaled) * (source2_scaled - source3))              # Variance of source2
  variance_source3 <- mean((source1_scaled - source3) * (source2_scaled - source3))                     # Variance of source3
  
  # Take all the individual variances and return them as a dataframe
  variances <- data.frame(variance_source1, variance_source2, variance_source3)
  colnames(variances) <- c("var_src1", "var_src2", "var_src3")
  return(variances)
} # end of compute_variance

```

```{r run_collocation_global}

# Make sure the last argument to compute_variance corresponds to the buoy SSTs

variances_global <- compute_variance(orig_filtered$cen.sst.aqu,
  orig_filtered$cen.sst.vrs,
  orig_filtered$buoy.sst.vrs)

biases_global <- compute_bias(orig_filtered$cen.sst.aqu,
  orig_filtered$cen.sst.vrs,
  orig_filtered$buoy.sst.vrs)

sd_global <- round(sqrt(abs(variances_global)), 3)

sd_global

```

## Impact of Correlation in Random Errors on Triple Collocation Standard Deviations

We explore the quantity of misprediction in collocation standard deviations when errors
are known to be correlated. We create three datasets that are similar to the two
satellite and one buoy dataset we are using for collocation. The datasets contain
60,000 measurements, range from 0 C to 30 C, and have standard deviations of 
0.4 for Aqua and VIIRS and 0.25 for buoy. We then systematically change the
random error correlation in these datasets. We plot the ratio of predicted standard
deviation to the correct standard deviation with the varying values of correlation.
Note that we keep the standard deviation of the random error the same throughout the
experiments.

```{r  colloc_corr_assumptions_sim, echo=FALSE}

# Create an empirical curve for the over/under-estimation of variances with varying correlations

# Set seed so findings are reproducible
set.seed(2)

nobs <- 65000 # Number of observations
ntrials <- 100.00 # Number of trials
corr_values <- seq(from = 0, to = .99, by = .01) # Increments of correlation to test

# Truth variable
truthx <- runif(n = nobs, min = 0, max = 30)

# Biases
alpha1 <- 0
alpha2 <- 0
alpha3 <- 0

# SDs for each observation source
sd_aq <- .4
sd_vi <- .4 # sd_aq should roughly equal sd_vi
sd_bu <- .25

# Dataframe to store predicted variances per correlation
preds_case <- as.data.frame(matrix(999, ncol = 3, nrow = length(corr_values)))
colnames(preds_case) <- c("meas1", "meas2", "meas3")
rownames(preds_case) <- as.character(corr_values)

for (i in 1:length(corr_values)) {
  # Create correlation matrix
  R <- matrix(cbind(1.0,corr_values[i],0.0,  corr_values[i],1.0,0.0,  0.0,0.0,1.0),nrow=3)
  U <- t(chol(R))
  
  # Create dataframe to store predicted variances
  preds_trials <- as.data.frame(matrix(999, ncol = 3, nrow = ntrials))
  colnames(preds_trials) <- c("Meas1", "Meas2", "Meas3")
  rownames(preds_trials) <- as.character(seq(from = 1, to = ntrials, by = 1))
  
  for (trials in seq(from = 1, to = ntrials, by = 1)) {
    if (trials %in% c(1, 50, 100)) {
      cat('Trial', trials, 'of correlation value', corr_values[i], '\n')
    }
    # Create uncorrelated errors
    noi1 <- rnorm(nobs, 0, sd_aq)
    noi2 <- rnorm(nobs, 0, sd_vi)
    noi3 <- rnorm(nobs, 0, sd_bu)
    
    # Correlate the uncorrelated errors
    random.normal <- t(cbind(noi1, noi2, noi3))
    X <- U %*% random.normal
    newX <- t(X)
    raw <- as.data.frame(newX)
    
    # Correlated errors
    epsi1 <- raw[, 1]
    epsi2 <- raw[, 2]
    epsi3 <- raw[, 3]
    
    # Create 3 datasets with biases and correlated errors
    meas1 <- truthx + alpha1 + epsi1
    meas2 <- truthx + alpha2 + epsi2
    meas3 <- truthx + alpha3 + epsi3
    
    # Compute predicted variances
    variances <- round(sqrt(abs(compute_variance(meas1, meas2, meas3))), 3)
    
    # Store predicted variances in a dataframe
    preds_trials[trials, ] <- variances
  }
  # Take the median of each dataset's predicted variance and store in a dataframe
  cat('Taking the median of all trials for correlation value', corr_values[i], '\n')
  #if (corrs == .29) {
  #  preds_case[30, 1] <- median(preds_trials[, 1])
  #  preds_case[30, 2] <- median(preds_trials[, 2])
  #  preds_case[30, 3] <- median(preds_trials[, 3])
  #} else if(corrs == .58) {
  #  preds_case[59, 1] <- median(preds_trials[, 1])
  #  preds_case[59, 2] <- median(preds_trials[, 2])
  #  preds_case[59, 3] <- median(preds_trials[, 3])
  #} else {
  preds_case[i, 1] <- median(preds_trials[, 1])
  preds_case[i, 2] <- median(preds_trials[, 2])
  preds_case[i, 3] <- median(preds_trials[, 3])
  #}
}

# Prep dataframe for plotting
prop1 <- preds_case[, 1] / preds_case[1, 1]
prop2 <- preds_case[, 2] / preds_case[1, 2]
prop3 <- preds_case[, 3] / preds_case[1, 3]

# Linear models fitting the proportions as a function of correlation
fit_prop1 <- lm(prop1 ~ corr_values)
fit_prop2 <- lm(prop2 ~ corr_values)
fit_prop3 <- lm(prop3 ~ corr_values)

int1 <- fit_prop1$coefficients[1]
slo1 <- fit_prop1$coefficients[2]

int2 <- fit_prop2$coefficients[1]
slo2 <- fit_prop2$coefficients[2]

# Plot proportions
corr_sim <- ggplot2::ggplot() +
  geom_line(aes(corr_values, prop1), size = 1.5, colour = 'blue') +
  geom_line(aes(corr_values, prop2), size = 1.5, colour = 'red') +
  geom_line(aes(corr_values, prop3), size = 1.5, colour = 'green') +
  ylim(c(0, 2)) + 
  labs(x = "Correlation between AQUA - Buoy and VIIRS - Buoy SSTs", y = "Proportion of Actual Variance Predicted") +
  ggsave("Correlation_sim.ps", device = 'ps',
    width = 8, height = 8, units = 'in')

```

## Correcting Collocation Estimates for Correlation in Errors of Different SST Sources

By having an estimation of the correlation of random errors between Aqua and VIIRS (0.7)
and knowledge of how correlation effects collocation-estimated standard deviations, we can
compute standard deviations on our correlated dataset and correct by dividing by the factor 
of predicted standard deviation to actual standard deviation determined from our above numerical analysis.

```{r correcting_glob_experi_for_corr}

corrections <- c(prop1[71], prop2[71], prop3[71]) # Correlation between aqua and viirs errors ~ 0.7 - get correction values for 0.7
sd_global_corrected <- sd_global / corrections # Correct sds
sd_global_corrected
```


# Analysis and further experiments