thesis-publication.Rmd

---
title: "Assessing Phosphorus Sources with Synoptic Sampling in the Surface Waters of a Mixed-Use, Montane Watershed"
author: "Austin W. Pearce"
date: "Documentation last updated on September 18, 2017"
output: 
  tufte::tufte_handout
---
# Intro

I consider myself somewhat new to R. I've only been coding for a year. And most of the coding I know is for basic statistics and making simple data visualizations. However, in an effort to better document my thesis data for the next person who might find it, below is some documentation and code for the synoptic sampling data I collected while a master's student at BYU from 2015 - 2017 with Dr. Neil Hansen.

This notebook differs from the other (`thesis.Rmd`) in that graphics here have been revised to reflect simpler, cleaner styles.

# Setup
Most of the packages required for all code below can be installed in one fell swoop by installing the `tidyverse` package. GIS shapefiles were imported using `rgdal`. The `viridis` package provides pleasing color palettes that are also colorblind friendly. `gridExtra` allows multiple ggplots to be arranged side-by-side.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo    = FALSE,
                      message = FALSE,
                      warning = FALSE)
library(lubridate)
library(tufte)
library(tidyverse)
library(rgdal)
library(gridExtra)
library(cowplot)

theme_set(
  theme_bw(base_family = "serif",
           base_size = 14) +
    theme(plot.background = NULL,
          plot.margin = margin(5, 5, 0, 0, "pt"),
          panel.grid = element_blank(),
          axis.line = element_line(color = "black"),
          axis.text = element_text(color = "black"),
          axis.ticks = element_line(color = "black"),
          legend.title.align = 0.5,
          legend.key.height = unit(x = 2, units = "pt"),
          legend.justification=c(1,1), legend.position=c(1,1),
          legend.background = element_blank()
    )
)

lab.lps <- expression('Streamflow (L s'^-1*')')
```

# The Watershed Map

To effectively orient the reader who views these spatial data, below is code to produce a basemap of the Wallsburg watershed. This basemap is simplified to only include the watershed boundary and the three major tributaries. Values for *Latitude* and *Longitude* are not included on the axis to maintain simplicity, and because the coordinates of the sites do not really matter in the context of this study.

*Note: streams were read in as separate shapefiles because I had issues plotting when all streams were in one shapefile.*

```{r build map, include=FALSE}
# Shapefiles are in WGS 84 lat-long projection
# Main Creek (mc), Spring Creek (sc), and so on
bound <- readOGR('input/shapefiles', 'boundary')
mc    <- readOGR('input/shapefiles', 'maincreek')
sc    <- readOGR('input/shapefiles', 'springcreek')
lhc   <- readOGR('input/shapefiles', 'littlehobble')
lhclf <- readOGR('input/shapefiles', 'littlehobble2')

# build ggplot object for outline of watershed
bound <- geom_polygon(data = bound, aes(x = long, y = lat),
                      col = "gray50", fill = NA)

# build ggplot objects for the streams
sc    <- geom_path(data = sc,    aes(long, lat), col = "#cccccc")
mc    <- geom_path(data = mc,    aes(long, lat), col = "#cccccc")
lhc   <- geom_path(data = lhc,   aes(long, lat), col = "#cccccc")
lhclf <- geom_path(data = lhclf, aes(long, lat), col = "#cccccc")

# A ggplot basemap of Wallsburg watershed with streams without boundary
map.wallsburg <- ggplot() +
  sc + mc + lhc + lhclf +
  coord_quickmap() + 
  labs(x = NULL,
       y = NULL) +
  theme_tufte() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_text(size = 10),
        panel.grid = element_blank(),
        legend.text = element_text(size = 10),
        strip.text = element_text(size = 10),
        legend.position = 'top')

#remove other objects in order to maintain neat R environment
remove(bound, lhc, lhclf, mc, sc)

# Save as RDS for quicker loading next time.
write_rds(map.wallsburg, path = 'input/wallsburg_map_tufte.rds')
```

If the user wants, the previous code can be skipped once the RDS file is created. In this way, the user can skip reading and plotting the shapefiles each time the notebook is run. But it's optional given that the user may want to reproduce the map each time to ensure accuracy in the code.

```{r display-map, fig.width=4, fig.height=5}
map.wallsburg = readRDS('input/wallsburg_map_tufte.rds')
map.wallsburg
```

# Historic Discharge at Outlet
The historical data on the discharge of the Wallsburg Watershed shows clearly the annual flux of water from snowmelt, and helped to determine the sampling dates. Below is the code to generate that plot.

```{r historic-annual-hydrograph, include = FALSE}
# Get data
# separate data for month averaging
outlet <- read_csv('input/outlet.csv') %>% 
  dplyr::filter(. , TP < 1)

outlet$DOY <- lubridate::yday(outlet$DATE)

outlet_dates <- separate(outlet, col = DATE, into = c("Y", "M", "D"))

outlet_dates$M <- month.abb[as.numeric(outlet_dates$M)]

# not necessary but helpful 
outlet_dates$LITERS <- outlet_dates$CFS*28.3168 # convert cfs to L/s

#reorder month names from alphabetic
outlet_dates$M <- factor(outlet_dates$M, levels = month.abb)
```

```{r median-flow-annual}
# median flows for each month from 1984 - 2014
# title = "Median Monthly Discharge for 30 Years: 1984 - 2014"

flow.med <- aggregate(LITERS ~ M, data = outlet_dates, median)

flow.median <- ggplot(data = flow.med,
                     aes(x = M,
                         y = LITERS,
                         group = 1)) +
  geom_line() +
  geom_point(size = 2) +
  scale_y_continuous(breaks = seq(0, 2000, 250)) +
  labs(x = "Month", y = expression('Median Discharge (L s'^-1*')'))

flow.median

ggsave('output/pub/median-discharge.png',
       width = 7, height = 3, dpi = 1200)
```


Some looking at this might note that the volume of water is relatively little; this is a stream in the western United States, after all, a semi-arid region.

# Streamflow
The streamflow data for this analysis can be found in two CSV files. The `master.csv` file contains the streamflow data as measured in the seven sampling campaigns. The `flowmeans.csv` contains the data as averaged between the two years, such that the *Peak* streamflow value for MC01, for example, is the average streamflow at MC01 based on the measurement from May 2015 and May 2016. In a way, this averaging helps to smooth out year-to-year variability and simplify analysis. More on data collection can be read in the thesis document `thesis.pdf`.

```{r flow-data, include=FALSE}
# All streamflow (flow) data
flow = read_csv("input/master.csv", na = 'no_access') %>%
  filter(., site != 'MC04') %>%
  select(., id:ls,-cfs) %>%
  drop_na(.)

# streamflow data simplified into the three hydrologic periods:
# 'Rise', 'Peak', & 'Base'
flow_means = read_csv("input/flowmeans.csv")

# This variable reorders the three hydrologic periods into a chronological
# order in terms of the annual hydrograph of the watershed
flow_means$hydros = factor(flow_means$hydro,
                           levels = c('Rise', 'Peak', 'Base'))

# the next data frame provides only the dry sites, which is important to show
# during the baseflow 2015 campaign
flow_dry = filter(flow, ls == 0)

flow_201507 = filter(flow, date == '2015-07-29')
```

As described in the thesis section on traditional synoptic sampling approach, the baseflow sampling campaign on 2015 provided certain insights into the P loading patterns of the watershed. The streamflows are as follows in the plot below, starting with the baseflow streamflow of 2015 and then the streamflows in each of the averaged hydrologic periods.

```{r baseflow-2015, fig.width=4, fig.height=5}

# Baseflow 2015
plot.flow201507 = map.wallsburg +
  geom_point(data = flow_201507,
             aes(x = long, y = lat, size = ls), # fill inside of point
             pch = 16, alpha = 0.1, col = 'black') +
  geom_point(data = flow_201507,
             aes(x = long, y = lat, size = ls),
             col = 'black', pch = 21) +
  geom_point(data = flow_dry,
             aes(x = long, y = lat, shape = as.factor(ls)),
             size = 2) +
  scale_size_area(name = 'Streamflow\n(L/s)', max_size = 12) +
  scale_shape_discrete(name = '', labels = c('Dry'))

plot.flow201507

ggsave('output/pub/base-flow.png',
       width = 5, height = 7, dpi = 1200)
```


```{r flow-over-time}

# bubble plot for each sampling campaign
plt.flow = map.wallsburg +
  geom_point(data = flow_means,
             aes(x = long, y = lat, size = ls), # fill
             pch = 16, alpha = 0.1, col = "black") +
  geom_point(data = flow_means,
             aes(x = long, y = lat, size = ls),
             col = 'black', pch = 21) +
  facet_wrap(~ hydros,
             labeller = labeller(.multi_line = FALSE)) +
  scale_size_area(breaks = c(0, 50, 100, 200, 300),
                  max_size = 12, name = "Streamlow (L/s)") +
  scale_shape_discrete(name = '',
                       labels = c('No Access'))
plt.flow

ggsave('output/tufte/all-flow.png',
       width = 12, height = 6, dpi = 1200)
```

# Concentrations
The concentration data for this analysis can be found in two CSV files. The `master.csv` file contains the concentration data as measured in the seven sampling campaigns. The `concmeans.csv` contains the data as averaged between the two years, such that the *Peak* concentration value for MC01, for example, is the average concentration at MC01 based on the measurement from May 2015 and May 2016. In a way, this averaging helps to smooth out year-to-year variability and simplify analysis. More on data collection can be read in the thesis document `thesis.pdf`.

```{r concentration-data, include=FALSE}
conc = read_csv('input/master.csv', na = c('dry', 'no_access', 'lost')) %>%
  filter(., site != 'MC04') %>%
  select(., id:dop,-cfs,-ls) %>%
  gather(., 'fraction', 'conc', 10:14) %>%
  drop_na(.)

conc_mean = read_csv('input/concmeans.csv', na = 'dry') %>%
  filter(., site != 'MC04') %>%
  gather(., 'fraction', 'conc', 8:12) %>%
  drop_na(.)

conc_mean_dry = read_csv('input/concmeans.csv') %>%
  filter(., site != 'MC04') %>%
  gather(., 'fraction', 'conc', 8:12) %>%
  dplyr::filter(., conc == 'dry')

# Create variable that helps reorder P Fractions
conc_mean$fractions = factor(conc_mean$fraction,
                              levels = c('TP','PP','TDP','DRP','DOP'))
conc_mean$hydros = factor(conc_mean$hydro,
                           levels = c('Rise', 'Peak', 'Base'))

conc_mean_tp = filter(conc_mean, fraction == 'TP')
```

The concentrations are as follows in the plots below, starting with the baseflow streamflow of 2015, then the concentrations in each of the averaged hydrologic periods, and then the concentrations of each fraction in each hydrologic period.

```{r concentrations}

conc_base = filter(conc, date == '2015-07-29')

conc_base_tp = filter(conc_base, fraction == 'tp')

conc_dry = read_csv('input/concmeans.csv') %>%
  filter(., site != 'MC04') %>%
  gather(., 'fraction', 'conc', 8:12) %>%
  filter(., conc == 'dry')

# TP Only
plt.base_conc = map.wallsburg +
  geom_point(data = conc_base_tp, 
             aes(long, lat, size = conc),
             pch = 21, col = 'black') +
  geom_point(data = conc_base_tp,
             aes(long, lat, size = conc),
             pch = 16, col = "black", alpha = 0.1) +
  geom_point(data = flow_dry,
             aes(x = long, y = lat, shape = as.factor(ls)),
             size = 2) +
  scale_size_area(breaks = c(0, 0.05, 0.1, 0.25),
                  max_size = 10,
                  name = 'Total P\n(mg/L)') +
  scale_shape_discrete(name = '', labels = c('Dry'))

plt.base_conc

ggsave('output/tufte/base-conc.png',
       width = 5, height = 7, dpi = 1200)
```

```{r mean-TP}
# Mean TP overtime
plt.conc_tp = map.wallsburg +
  geom_point(data = conc_mean_tp, 
             aes(long, lat, size = conc),
             col = 'black', pch = 21) +
  geom_point(data = conc_mean_tp,
             aes(long, lat, size = conc),
             pch = 16, col = "black", alpha = 0.1) +
  scale_size_area(breaks = c(0, 0.05, 0.1, 0.2, 0.3),
                  max_size = 10,
                  name = 'Total P (mg/L)') +
  facet_wrap(~ hydros, nrow = 1, labeller = labeller(.multi_line = FALSE))

plt.conc_tp

ggsave('output/tufte/all-conc.png',
       width = 12, height = 6, dpi = 1200)
```

```{r 5P-conc, fig.height=8, fig.width=5}
# ggplot for concentration as a function of fraction and date
plt.conc = map.wallsburg +
  geom_point(data = conc_mean,
             aes(long, lat, size = conc),
             col = 'black', pch = 21) +
  geom_point(data = conc_mean,
             aes(long, lat, size = conc), # fill
             pch = 16, col = "black", alpha = 0.1) +
  scale_size_area(breaks = c(0, .01, 0.05, 0.1, 0.2),
                  max_size = 8,
                  name = 'Concentration\n(mg/L)' ) +
  facet_grid(hydros ~ fractions,
             labeller = labeller(.multi_line = FALSE)) +
  theme(legend.position = 'left')
plt.conc

ggsave('output/tufte/frac-conc.png',
       width = 10, height = 6, dpi = 1200)
```

# Loads

```{r load-data}
# import data
load = read_csv('input/master_load.csv',
                 na = c('no_access', 'lost')) %>%
  filter(., site != 'MC04') %>%
  select(., id:DOP, -cfs, -ls) %>% 
  gather(., 'fraction', 'load', 10:14)

load_mean = read_csv('input/loadmeans.csv') %>% 
  filter(., site != 'MC04') %>% 
  gather(., 'fraction', 'load', 8:12)

load_mean$fractions = factor(load_mean$fraction,
                              levels = c('TP','PP','TDP','DRP','DOP'))
load_mean$hydros = factor(load_mean$hydro,
                           levels = c('Rise','Peak','Base'))

load_mean_tp = filter(load_mean, fraction == 'TP')

load_base = filter(load, date == '2015-07-29')

load_base_tp = filter(load_base, fraction == 'TP')

load_dry = read_csv('input/master_load.csv') %>%
  filter(., site != 'MC04') %>% 
  select(., id:DOP, -cfs, -ls) %>% 
  gather(., 'fraction', 'load', 10:14) %>% 
  filter(., load == 'no_access' | load == 'lost')
```

The loads at each site are as follows in the plots below, starting with the baseflow streamflow of 2015, then the concentrations in each of the averaged hydrologic periods, and then the concentrations of each fraction in each hydrologic period.

```{r load}
#TP Load at Baseflow in 2015
plt.base_load = map.wallsburg +
  geom_point(data = load_base_tp, 
             aes(long, lat, size = load),
             pch = 21, col = 'black') +
  geom_point(data = load_base_tp,
             aes(long, lat, size = load),
             pch = 16, col = "black", alpha = 0.1) +
  scale_color_discrete(name = 'No Result',
                       labels = 'Dry\nStreambed') +
  scale_size_area(breaks = c(0, 1, 3, 5),
                  max_size = 10,
                  name = 'TP Load\n(mg/s)')

plt.base_load

ggsave('output/tufte/base-load.png',
       width = 5, height = 7, dpi = 1200)

# TP Loads in each hydrologic period
plt.load_tp = map.wallsburg +
  geom_point(data = load_mean_tp,
             aes(long, lat, size = load),
             pch = 21, col = 'black') +
  geom_point(data = load_mean_tp,
             aes(long, lat, size = load), # fill
             pch = 16, col = 'black', alpha = 0.1) +
  scale_size_area(breaks = c(0, 5, 10, 20, 30),
                  max_size = 10,
                  name = "Load\n(mg/s)") +
  facet_wrap(~ hydros,
             ncol = 4,
             labeller = labeller(.multi_line = FALSE))
plt.load_tp

ggsave('output/tufte/all-load.png',
       width = 12, height = 6, dpi = 1200)
```

```{r loads by fractions, echo=FALSE, fig.height=8, fig.width=5}
# Loads by fraction in each hydrologic period
plt.load = map.wallsburg +
  geom_point(data = load_mean,
             aes(long, lat, size = load),
             pch = 21, col = "black") +
  geom_point(data = load_mean,
             aes(long, lat, size = load), # fill
             pch = 16, col = "black", alpha = 0.1) +
  scale_size_area(breaks = c(0, 5, 10, 20, 30),
                  max_size = 8,
                  name = "Load\n(mg/s)") +
  facet_grid(hydros ~ fractions,
             labeller = labeller(.multi_line = FALSE)) +
  theme(legend.position = "left")
plt.load

ggsave('output/tufte/frac-load.png',
       width = 12, height = 8, dpi = 1200)
```

Once more, the results shown spatially as in the baseflow period of 2015.

```{r gridExtra, echo=FALSE, fig.height=8, fig.width=7}
# Put all baseflow 2015 plots together for export
grid.arrange(plt.flow201507, plt.base_conc, plt.base_load, ncol = 2)
```

# Bar Charts

While the map plots shown previously provide insight in to spatial loading patterns, there was a need to focus on the profiles of Main Creek and Spring Creek through bar charts to better understand how P fractions were changing over space, time, and in relation to each other. Remember the abbreviations are such: total P (TP), particulate P (PP), dissolved reactive P (DRP), and dissolved organic P (DOP). Total dissolved P (TDP) is simply the sum of DRP and DOP, and TP is simply the sum of TDP and PP.

```{r bar-data}
load_mc_tp = filter(load_mean, stream == 'M') %>% 
  filter(., fraction == 'TP')

load_mc_3frac = filter(load_mean, stream == 'M') %>% 
  filter(., fraction != 'TP' & fraction != 'TDP')

load_mc_base = filter(load_base_tp, stream == 'M')

load_mean_3frac = filter(load_mean, fraction != 'TP' & fraction != 'TDP')

load_sc_3frac = filter(load_mean, stream == 'S') %>% 
  filter(., fraction != 'TP' & fraction != 'TDP')
```


```{r bar-01, warning=FALSE}
# Barplot for TP loads over time
# Follow this layout for other bar charts

bar = ggplot() +
  scale_fill_manual(values = c('#000000',
                               '#555555',
                               '#aaaaaa'),
                    name = "P Fraction") +
  theme_tufte() +
  theme(legend.position = 'right',
        strip.text = element_text(size = 10),
        axis.text.x = element_text(angle = 90, vjust = 0.5))

bar.mc_tp = bar +
  geom_col(data = load_mc_tp,
           aes(x = site, y = load, fill = fractions),
           position = 'stack') +
  facet_grid(hydros ~ .,
             scales = 'free') +
  labs(x = NULL, y = 'Load (mg/s)')

bar.mc_tp

ggsave('output/tufte/bar-mc-all.png',
       width = 5, height = 8, dpi = 1200)
```

```{r bar-02}

# Bar chart showing Main Creek profile in baseflow 2015

bar.mc_base = bar +
  geom_col(data = load_mc_base,
           aes(x = site, y = load, fill = fraction)) +
  labs(x = NULL,
       y = 'Load (mg/s)')

bar.mc_base

ggsave('output/tufte/bar-mc-base.png',
       width = 6, height = 4, dpi = 1200)

# Bar chart with sites along x, and loads of free scale on y
# This allows for one to compare the pattern of load increases and decreases
# across various hydrologic periods
bar.mc = bar +
  geom_col(data = load_mc_3frac,
           aes(x = site, y = load, fill = fractions),
           position = 'stack') +
  facet_grid(hydros ~ .,
             scales = 'free') +
  labs(x = NULL,
       y = 'Load (mg/s)')
bar.mc

ggsave('output/tufte/bar-mc-frac.png',
       width = 5, height = 8, dpi = 1200)

# This has the fractions scaled to 100 % (which represents TP)
# In this way, proportions can be compared across sites.
bar.mc_percent = bar +
  geom_col(data = load_mc_3frac,
           aes(x = site, y = load, fill = fractions),
           position = 'fill') +
  facet_grid(hydros ~ .) +
  labs(x = NULL,
       y = 'Percent Composition')

bar.mc_percent

ggsave('output/tufte/bar-mc-perc.png',
       width = 5, height = 8, dpi = 1200)

# For each site, which isn't really necessary if you just look vertically
# on the other graphs

bar.sites_percent = bar +
  geom_col(data = load_mean_3frac,
           aes(x = hydros, y = load, fill = fractions),
           position = 'fill') +
  facet_wrap(~ site) +
  labs(x = NULL,
       y = 'Load (mg/s)')

bar.sites_percent

ggsave('output/tufte/bar-sites.png',
       width = 8, height = 8, dpi = 1200)
```

The chart `bar.sites_percent` didn't make it into the thesis, but it highlights in a different way some of the insights explained in the thesis document about temporal trends in P loading across the watershed.

Similar charts were made for Spring Creek because Spring Creek was identified as an important contributor to P loads in Main Creek.

```{r spring-creek-bar, warning=FALSE}
bar.sc = bar +
  geom_col(data = load_sc_3frac,
           aes(x = site, y = load, fill = fractions),
           position = 'stack') +
  facet_grid(hydros ~ .,
             scales = 'free') +
  labs(x = NULL,
       y = 'Load (mg/s)')

bar.sc

ggsave('output/tufte/bar-sc-frac.png',
       width = 5, height = 8, dpi = 1200)

bar.sc_percent = bar +
  geom_col(data = load_sc_3frac,
           aes(x = site, y = load, fill = fractions),
           position = 'fill') +
  facet_grid(hydros ~ .) +
  labs(x = NULL,
       y = 'Percent Composition')

bar.sc_percent

ggsave('output/tufte/bar-sc-perc.png',
       width = 5, height = 8, dpi = 1200)
```