--- 
title: "R as GIS for Economists"
author: "Taro Mieno"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
documentclass: book
bibliography: [RGIS.bib]
biblio-style: apalike
link-citations: yes
description: "This is a minimal example of using the bookdown package to write a book. The output format for this example is bookdown::gitbook."
---

# Preface {-}

This book is being developed as part of my effort to put together course materials for my data science course targeted at upper-level undergraduate and graduate students at the University of Nebraska Lincoln. This book aims particularly at **spatial data processing for an econometric project**, where spatial variables become part of an econometric analysis. Over the years, I have seen many students and researchers spend a great deal of time just processing spatial data (often by clicking the ArcGIS (or QGIS) user interface to death), which is a waste of time from the perspective of academic productivity. My hope is that this book will help researchers become more proficient in spatial data processing and enhance the overall productivity of the fields of economics for which spatial data are essential.

**About me**

I am an Assistant Professor at the Department of Agricultural Economics at the University of Nebraska Lincoln, where I also teach Econometrics for Master's students. My research interests lie in precision agriculture, water economics, and agricultural policy. My personal website is [here](http://taromieno.netlify.com/).

**Comments and Suggestions?**

Any constructive comments and suggestions about how I can improve the book are welcome. Please send me an email at tmieno2@unl.edu or create an issue on [the GitHub page](https://github.com/tmieno2/R-as-GIS-for-Economists) of this book.

<hr>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

## Why R as GIS for Economists? {-}

R has extensive capabilities as GIS software. In my opinion, $99\%$ of your spatial data processing needs as an economist will be satisfied by R. But, there are several popular options for GIS tasks other than R:

+ Python
+ ArcGIS
+ QGIS

Here I compare them briefly and discuss why R is a good option.

### R vs Python {-}

Both R and Python depend heavily on the open source software GDAL and GEOS for their core GIS operations (GDAL for reading spatial data, and GEOS for geometrical operations like intersecting two spatial layers).^[For example, see the very first sentence of [this page](https://cran.r-project.org/web/packages/sf/index.html)] So, when you run GIS tasks in R or Python, you basically tell R or Python what you want to do; they pass the job to GDAL and GEOS, let them do the work, and return the results to you. This means that R and Python are not very different in their GIS capabilities, as they depend on the same open source software for many GIS tasks. When GDAL and GEOS get better, R and Python get better (with a short lag). Both of them have good spatial visualization tools as well. Moreover, both R and Python can communicate with QGIS and ArcGIS (as long as you have them installed, of course) and use their functionalities from within R and Python via the bridging packages: `RQGIS` and `PyQGIS` for QGIS, and `R-ArcGIS` and `ArcPy` for ArcGIS.^[We do not learn them in this book because I do not see the benefits of using them.] So, if you are more familiar with Python than R, go ahead and go for the Python option. From now on, my discussions assume that you are going for the R option, as otherwise you would not be reading the rest of the book anyway.

### R vs ArcGIS or QGIS {-}

ArcGIS is commercial software and it is quite expensive (you are likely to be able to get a significant discount if you are a student at or work for a university). On the other hand, QGIS is open source and free. It has seen significant development over the past decade, and I would say it is just as competitive as ArcGIS. QGIS also uses the open source geospatial software GDAL, GEOS, and others (SAGA, GRASS GIS). Both of them have a graphical interface that helps you implement various GIS tasks, unlike R, which requires programming.

Now, since R can use ArcGIS and QGIS through the bridging packages, a more precise question we should be asking is whether you should program GIS tasks using R (possibly using the bridging packages) or manually implement GIS tasks using the graphical interface of ArcGIS or QGIS. The answer is programming GIS tasks using R. First, manual GIS operations are hard to repeat. It is often the case that in the course of a project you need to redo the same GIS task except that the underlying datasets have changed. If you have programmed the process with R, you just run the same code and that's it. You get the desired results. If you did not program it, you need to go through many clicks on the graphical interface all over again, potentially trying to remember how you actually did it last time.^[You could take step-by-step notes of what you did, though.] Second and more important, manual operations are not scalable. It has become much more common that we need to process many large spatial datasets. Imagine you are doing the same operations on $1,000$ files using a graphical interface, or even $50$ files. Do you know who is good at doing the same tasks over and over again without complaining? A computer. Just let it do what it likes to do. You have better things to do.

Finally, should you learn ArcGIS or QGIS in addition to (or before) R? I am doubtful. As economists, the GIS tasks we need to do are not super convoluted most of the time. Suppose $\Omega_R$ and $\Omega_{AQ}$ represent the sets of GIS tasks that R and ArcGIS/QGIS can implement, respectively. Further, let $\Omega_E$ represent the set of GIS tasks economists need to implement. Then, $\Omega_E \subset \Omega_R$ $99\%$ (or maybe $95\%$ to be safe) of the time and $\Omega_E \not\subset \Omega_{AQ}\setminus\Omega_R$ $99\%$ of the time. Personally, I have never had to rely on either ArcGIS or QGIS for my research projects since I learned how to use R as GIS.

One of the things ArcGIS and QGIS can do but R cannot do ($\Omega_{AQ}\setminus\Omega_R$) is creating spatial objects by hand using a graphical user interface, like drawing polygons and lines. Another area where R lags behind ArcGIS and QGIS is 3D data visualization. But, I must say neither of them is essential for economists at the moment. Finally, sometimes it is easier and faster to make a map using ArcGIS or QGIS, especially for a complicated map.^[Let me know if you know something that is essential for economists that only ArcGIS or QGIS can do. I will add that to the list here.]

### Summary {-}

+ **You have never used any GIS software?**

Learn R first. If you find out that you really cannot complete the tasks you would like to do in R, then turn to other options.

+ **You have used ArcGIS or QGIS and do not like them because they crash often?**

Why don't you try R?^[I am not saying R does not crash. R does crash. But, often times, the fault is yours, not the software's.] You may realize you actually do not need them.

+ **You have used ArcGIS or QGIS before and are so comfortable with them, but need to program repetitive GIS tasks?**

Learn R and maybe take advantage of `R-ArcGIS` or `RQGIS`, which this book does not cover.

+ **You know for sure that you need to run only a simple GIS task once and never have to do any GIS tasks ever again?**

Stop reading and ask one of your friends to do the job. Pay him/her $\$20$ per hour, which is way below the opportunity cost of setting up either ArcGIS or QGIS and learning to do that simple task on them.


## How is this book different from other online books and resources? {-}

We are seeing an explosion of online (and free) resources that teach how to use R for spatial data processing.^[This phenomenon is largely thanks to packages like `bookdown` [@Rbookdown], `blogdown` [@Rblogdown], and `pkgdown` [@Rpkgdown] that have made creating professional content much less costly than before. Indeed, this book was built taking advantage of the `bookdown` package.] Here is an incomplete list of such resources:

+ [Geocomputation with R](https://geocompr.robinlovelace.net/)
+ [Spatial Data Science](https://keen-swartz-3146c4.netlify.app/)
+ [Spatial Data Science with R](https://www.rspatial.org/index.html)
+ [Introduction to GIS using R](https://www.jessesadler.com/post/gis-with-r-intro/)
+ [Code for An Introduction to Spatial Analysis and Mapping in R](https://bookdown.org/lexcomber/brunsdoncomber2e/)
+ [Introduction to GIS in R](https://annakrystalli.me/intro-r-gis/index.html)
+ [Intro to GIS and Spatial Analysis](https://mgimond.github.io/Spatial/index.html)
+ [Introduction to Spatial Data Programming with R](http://132.72.155.230:3838/r/)
+ [Reproducible GIS analysis with R](http://staff.washington.edu/phurvitz/r_gis/)
+ [R for Earth-System Science](http://geog.uoregon.edu/bartlein/courses/geog490/index.html)
+ [Rspatial](http://rspatial.org/index.html)
+ [NEON Data Skills](https://www.neonscience.org/resources/data-skills)
+ [Simple Features for R](https://r-spatial.github.io/sf/)
<!-- + [Nick Eubank](https://www.nickeubank.com/gis-in-r/) -->

Thanks to all these resources, it has become much easier to self-teach R for GIS work than six or seven years ago when I first started using R for GIS. Even though I have not read through all these resources carefully, I am pretty sure every topic found in this book can also be found _somewhere_ in these resources (except the demonstrations). So, you may wonder how on earth you can benefit from reading this book. It all boils down to search costs. Researchers in different disciplines require different sets of spatial data skills. The available resources are typically very general, covering many topics that economists are very unlikely to use. It is particularly hard for those who do not have much experience in GIS to identify whether particular skills are essential or not. So, they could spend a lot of time learning something that is not really useful. The value of this book lies in its deliberate incomprehensiveness. It only packages materials that satisfy the needs of most economists, cutting out many topics that are likely to be of limited use for economists.

For those who are looking for more comprehensive treatments of spatial data handling and processing in one book, I personally like [Geocomputation with R](https://geocompr.robinlovelace.net/) a lot. Increasingly, developers of R packages create websites dedicated to their packages, where you can often find vignettes (tutorials), like [Simple Features for R](https://r-spatial.github.io/sf/).

## What is going to be covered in this book? {-}

The book starts with the very basics of spatial data handling (e.g., importing and exporting spatial datasets) and moves on to more practical spatial data operations (e.g., spatial data join) that are useful for research projects. This book is still under development. Right now, only Chapter 1 is available. I will work on the rest of the book over the summer. The "coming soon" chapters are close to being done. I just need to add finishing touches to those chapters. The "wait a bit" chapters need some more work, such as adding content.

+ Chapter 1: Demonstrations of R as GIS (available)
	* groundwater pumping and groundwater level
	* precision agriculture
	* land use and weather
	* corn planted acreage and railroads
	* groundwater pumping and weather
+ Chapter 2: The basics of vector data handling using `sf` package (coming soon)
	* spatial data structure in `sf`
	* import and export vector data
	* (re)projection of spatial datasets
	* single-layer geometrical operations (e.g., create buffers, find centroids)
	* other miscellaneous basic operations
+ Chapter 3: Spatial interactions of vector datasets (coming soon)
	* spatially subsetting one layer based on another layer
	* extracting values from one layer to another layer^[`over` function in `sp` language]
+ Chapter 4: The basics of raster data handling using `raster` and `terra` packages (coming soon)
	* import and export raster data
	* stacking raster data
+ Chapter 5: Spatial interactions of vector and raster datasets (wait a bit)
	* extracting values from a raster layer to a vector layer
+ Chapter 6: Efficient spatial data processing (wait a bit)  
	* parallelization 
+ Chapter 7: Downloading publicly available spatial datasets (wait a bit)
	* Sentinel 2 (`sen2r`)
	* USDA NASS QuickStat (`tidyUSDA`)
	* PRISM (`prism`)
	* Daymet (`daymetr`)
	* USGS (`dataRetrieval`)
+ Chapter 8: Parallel computation (wait a bit)

As you can see above, this book does not spend any time on the very basics of GIS concepts. Before you start reading the book, you should know at least the following (it's not much):

+ What Geographic Coordinate System (GCS), Coordinate Reference System (CRS), and projection are ([this](https://annakrystalli.me/intro-r-gis/gis.html) is a good resource)
+ Distinctions between vector and raster data ([this](https://gis.stackexchange.com/questions/57142/what-is-the-difference-between-vector-and-raster-data-models) is a simple summary of the difference)

Finally, this book does not cover spatial statistics or spatial econometrics at all. This book is about spatial data _processing_. Spatial analysis is something you do _after_ you have processed spatial data.


## Conventions of the book and some notes {-}

Here are some notes on the conventions of this book, intended for R beginners and those who are not used to reading `rmarkdown`-generated html documents.

### Texts in gray boxes {-}

They are one of the following:

+ objects defined on R during demonstrations
+ R functions
+ R packages

When it is a function, I always put parentheses at the end like this: `st_read()`.^[This is a function that reads spatial data from a file.] Sometimes, I combine a package and function in one like this: `sf::st_read()`. This means it is a function called `st_read()` from the `sf` package.
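
As a quick illustration of the `package::function()` notation, here is a minimal sketch. It uses `stats::runif()` only because the `stats` package ships with every R installation; the object names are made up for this example, and the same pattern applies to `sf::st_read()`.

```r
# The `package::function()` form calls a function without having to
# attach the package with library() first.
set.seed(123)
x_qualified <- stats::runif(5)

# The unqualified call resolves to the same function here because
# `stats` is attached by default in every R session.
set.seed(123)
x_plain <- runif(5)

# Both calls draw the same numbers.
identical(x_qualified, x_plain)
```

The qualified form is handy when two attached packages define functions with the same name, which is why you will see calls like `dplyr::select()` later in the book.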

### Colored Boxes {-}

Codes are in blue boxes, and outcomes are in red boxes.

Codes:

```{r codes, eval = F}
runif(5)
```

Outcomes:

```{r outcomes, echo = F}
runif(5)
```

### Parentheses around codes {-}

Sometimes you will see codes enclosed by parentheses like this:

```{r notes_par}
(
  a <- runif(5)
)
```

The parentheses print what's inside of a newly created object (here `a`) without explicitly evaluating the object. So, basically I am signaling that we will be looking inside of the object that was just created.

This one prints nothing.

```{r notes_par_nodisp}
a <- runif(5)
```

### Footnotes {-}

Footnotes appear at the bottom of the page. You can easily get to a footnote by clicking on the footnote number. You can also go back to the main narrative where the footnote number is by clicking on the curved arrow at the end of the footnote. So, don't worry about having to scroll all the way up to where you were after reading footnotes.

## Session Information {-}

Here is the session information when compiling the book:

```{r session_info}
sessionInfo()	
```


<!--chapter:end:index.Rmd-->

# R as GIS: Demonstrations {#demo} 

```{r setup, echo = FALSE, results = "hide"}
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  comment = NA,
  message = FALSE,
  warning = FALSE,
  tidy = FALSE,
  cache.lazy = FALSE
)

suppressMessages(library(here))
opts_knit$set(root.dir = here())
```

```{r setwd, eval = FALSE, echo = FALSE}
setwd(here())
```

```{r, echo=FALSE, warning=FALSE, cache = FALSE}
#--- load packages ---#
suppressMessages(library(data.table))
suppressMessages(library(here))
suppressMessages(library(stringr))
suppressMessages(library(rgeos))
suppressMessages(library(sf))
suppressMessages(library(ggplot2))
suppressMessages(library(raster))
suppressMessages(library(stargazer))
suppressMessages(library(tmap))
suppressMessages(library(future.apply))
suppressMessages(library(lubridate))
suppressMessages(library(tidyverse))
#--- source functions ---#
source("Codes/Chap_1_Demonstration.R")
```

## Introduction {-}

The primary objective of this chapter is to showcase the power of R as GIS through demonstrations using mock-up econometric research projects.^[Note that this chapter does not deal with spatial econometrics at all. It is about spatial data processing, not spatial econometrics. [This](http://www.econ.uiuc.edu/~lab/workshop/Spatial_in_R.html) is a great resource for spatial econometrics in R.] Each project consists of a project overview (objective, datasets used, econometric model, and GIS tasks involved) and a demonstration. This is really not the place to learn the nuts and bolts of how R does spatial operations. Indeed, we intentionally do not explain all the details of how the R codes work. We reiterate that the main purpose of the demonstrations is to give you a better idea of how R can be used to process spatial data to help your research projects involving spatial datasets. Finally, note that these *mock-up* projects use extremely simple econometric models that completely lack the careful thought you would need in real research projects. So, don't waste your time judging the econometric models, and just focus on the GIS tasks. If you are not familiar with html documents generated by `rmarkdown`, you might benefit from reading the conventions of the book in the Preface.

### Target Audience {-}

The target audience of this chapter is those who are not very familiar with R as GIS. Knowledge of R certainly helps. But, I tried to write in a way that R beginners can still understand the power of R as GIS.^[I welcome any suggestions to improve the reading experience of inexperienced R users.] Do not get bogged down by all the complex-looking R codes. Just focus on the narratives and figures to get a sense of what R can do.

### Direction for replication {-}

Running the codes in this chapter involves reading datasets from a disk. All the datasets that will be imported are available [here](https://www.dropbox.com/sh/cyx9clgmshwc8eo/AAApv03Qpx84IGKCyF5v2rJ6a?dl=0). In this chapter, the path to files is set relative to my own working directory (which is hidden). To run the codes without having to mess with paths to the files, follow these steps:^[I thought about using the `here` package, but I found it a bit confusing for inexperienced R users.]

+ set a folder (any folder) as the working directory using `setwd()`  
+ create a folder called "Data" inside the folder designated as the working directory  
+ download the pertinent datasets from [here](https://www.dropbox.com/sh/cyx9clgmshwc8eo/AAApv03Qpx84IGKCyF5v2rJ6a?dl=0) and put them in the "Data" folder
+ run _Chap_1_Demonstration.R_ which is included in the datasets folder you have downloaded

```{r source_functions, eval = F}
source("Data/Chap_1_Demonstration.R")
```
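
The folder setup in the steps above can be sketched in R as follows. This is only an illustration under stated assumptions: `project_dir` is a hypothetical location (a temporary folder stands in for whatever folder you choose), and the datasets still have to be downloaded by hand.

```r
# Hypothetical project folder: a temporary directory stands in for
# whatever folder you choose as your working directory.
project_dir <- file.path(tempdir(), "rgis_demo")
dir.create(project_dir, showWarnings = FALSE, recursive = TRUE)

# Step 1: set the working directory (and remember the previous one).
old_wd <- setwd(project_dir)

# Step 2: create the "Data" folder that the relative paths in this chapter expect.
dir.create("Data", showWarnings = FALSE)

# Steps 3 and 4: after downloading the datasets into "Data",
# run the script only if it is where source() will look for it.
script_path <- file.path("Data", "Chap_1_Demonstration.R")
if (file.exists(script_path)) {
  source(script_path)
} else {
  message("Download the datasets into ", normalizePath("Data"), " first.")
}

# Restore the previous working directory (only because this is a sketch).
setwd(old_wd)
```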

Note that the data folder includes 183 days' worth of PRISM precipitation data for Demonstration 3, which are quite large in size (slightly less than 1 GB). If you are not replicating Demonstration 3, you can either choose not to download them or discard them if you have downloaded them already.

## Demonstration 1: The impact of groundwater pumping on depth to water table {#Demo1}

<!-- this is for making stargazer table nicer -->
```{r table_style_demo1, results="asis", echo=FALSE}
cat("
<style>
.book .book-body .page-wrapper .page-inner section.normal table
{
  width:auto;
}
.book .book-body .page-wrapper .page-inner section.normal table td,
.book .book-body .page-wrapper .page-inner section.normal table th,
.book .book-body .page-wrapper .page-inner section.normal table tr
{
  padding:0;
  border:0;
  background-color:#fff;
}
</style>
")
```

### Project Overview

---

**Objective:**

* Understand the impact of groundwater pumping on groundwater level. 

---

**Datasets**

* Groundwater pumping by irrigation wells in Chase, Dundy, and Perkins Counties in the southwest corner of Nebraska 
* Groundwater levels observed at USGS monitoring wells located in the three counties and retrieved from the National Water Information System (NWIS) maintained by USGS using the `dataRetrieval` package.

---

**Econometric Model**

In order to achieve the project objective, we will estimate the following model:

$$
 y_{i,t} - y_{i,t-1} = \alpha + \beta gw_{i,t-1} + v_{i,t}
$$

where $y_{i,t}$ is the depth to groundwater table^[the distance from the surface to the top of the aquifer] in March^[For our geographic focus of southwest Nebraska, corn is the dominant crop type. Irrigation for corn typically happens from April through September. For example, this means that the change in groundwater level ($y_{i,2012} - y_{i,2011}$) captures the impact of groundwater pumping that occurred from April through September in 2011.] in year $t$ at USGS monitoring well $i$, $gw_{i,t-1}$ is the total amount of groundwater pumping that happened within a 2-mile radius of monitoring well $i$ in year $t-1$, and $v_{i,t}$ is the error term.
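
To make the timing of the variables concrete, here is a minimal base R sketch on a toy panel. All numbers and names (`toy`, `site`, `dwt`, `gw`) are hypothetical and are not taken from the actual datasets; the demonstration below builds the same kind of variable with `dplyr`.

```r
# Toy panel (hypothetical numbers): depth to water table in March (dwt)
# and pumping within 2 miles (gw) for two monitoring wells.
toy <- data.frame(
  site = rep(c("A", "B"), each = 3),
  year = rep(2010:2012, times = 2),
  dwt  = c(50, 52, 55, 80, 81, 83),
  gw   = c(100, 150, NA, 40, 60, NA)
)

# Sort so that consecutive rows within a well are consecutive years.
toy <- toy[order(toy$site, toy$year), ]

# Next year's depth, computed within each well.
toy$dwt_next <- ave(toy$dwt, toy$site, FUN = function(x) c(x[-1], NA))

# Dependent variable: change in depth from March of year t to March of t + 1,
# paired on the same row with pumping during the irrigation season of year t.
toy$dwt_dif <- toy$dwt_next - toy$dwt

toy[, c("site", "year", "gw", "dwt_dif")]
```

In this layout, the row for year 2010 holds $y_{i,2011} - y_{i,2010}$ next to the pumping of 2010, which is the pairing the model calls for. The last year of each well has no following observation, so its difference is `NA`.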

---

**GIS tasks**

* read an ESRI shape file as an `sf` (spatial) object 
  - use `sf::st_read()`
* download depth to water table data using the `dataRetrieval` package developed by USGS 
  - use `dataRetrieval::readNWISdata()` and `dataRetrieval::readNWISsite()`
* create a buffer around USGS monitoring wells
  - use `sf::st_buffer()`
* convert a regular `data.frame` (non-spatial) with geographic coordinates into an `sf` (spatial) object
  - use `sf::st_as_sf()`  and `sf::st_set_crs()`
* reproject an `sf` object to another CRS
  - use `sf::st_transform()`
* identify irrigation wells located inside the buffers and calculate total pumping
  - use `sf::st_join()`

---

**packages**

+ Load (install first if you have not) the following packages if you intend to replicate the demonstration.

```{r demo1_packages, eval = FALSE}
library(sf)
library(dplyr)
library(lubridate)
library(stargazer)
```

There are other packages that will be loaded during the demonstration.

### Project Demonstration

The geographic focus of the project is the southwest corner of Nebraska consisting of Chase, Dundy, and Perkins Counties (see Figure \@ref(fig:NE-county) for their locations within Nebraska). Let's read a shape file of the three counties represented as polygons. We will use it later to spatially filter groundwater level data downloaded from NWIS.

```{r NE_county_data, echo = FALSE, results = "hide"}
#--- Nebraska counties ---#
NE_county <- st_read(
    dsn = "./Data", 
    layer = "cb_2018_us_county_20m"
  ) %>% 
  filter(STATEFP == "31") %>% 
  mutate(NAME = as.character(NAME)) %>% 
  st_transform(32614) 

three_counties <- st_read(dsn = "./Data", layer = "urnrd") %>% 
  st_transform(32614)
```

```{r Demo1_read_urnrd_borders}
three_counties <- st_read(dsn = "./Data", layer = "urnrd") %>% 
  #--- project to WGS84/UTM 14N ---#
  st_transform(32614)
```

```{r NE-county, fig.cap = "The location of Chase, Dundy, and Perkins Counties in Nebraska", echo =F}
#--- map the three counties ---#
tm_shape(NE_county) +
  tm_polygons() +
tm_shape(three_counties) +
  tm_polygons(col = "blue", alpha = 0.3) +
  tm_layout(frame = FALSE)
```
---

We have already collected groundwater pumping data, so let's import it. 

```{r Demo1_urnrd_gw_read}
#--- groundwater pumping data ---#
(
urnrd_gw <- readRDS("./Data/urnrd_gw_pumping.rds")
)
```

`well_id` is the unique irrigation well identifier, and `vol_af` is the amount of groundwater pumped in acre-feet. This dataset is just a regular `data.frame` with coordinates. We need to convert this dataset into an object of class `sf` so that we can later identify irrigation wells located within a 2-mile radius of USGS monitoring wells (see Figure \@ref(fig:sp-dist-wells) for the spatial distribution of the irrigation wells).

```{r convert_to_sf}
urnrd_gw_sf <- urnrd_gw %>% 
  #--- convert to sf ---#
  st_as_sf(coords = c("lon", "lat")) %>% 
  #--- set CRS WGS UTM 14 (you need to know the CRS of the coordinates to do this) ---# 
  st_set_crs(32614) 

#--- now sf ---#
urnrd_gw_sf
```
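
The difference between declaring a CRS and reprojecting is easy to miss, so here is a minimal sketch on made-up data. The `wells` data frame and its coordinates are hypothetical; unlike the dataset above, they are assumed to be longitude/latitude in WGS 84 (EPSG:4326).

```r
library(sf)

# Hypothetical well locations given as longitude/latitude in WGS 84.
wells <- data.frame(
  well_id = c(1, 2),
  lon = c(-101.70, -101.55),
  lat = c(40.50, 40.65)
)

# `coords` names the x (longitude) column first and the y (latitude) column second;
# `crs` declares the CRS the numbers are already in (EPSG:4326 = WGS 84).
wells_sf <- st_as_sf(wells, coords = c("lon", "lat"), crs = 4326)

# Declaring a CRS does not change the coordinates; st_transform() reprojects them,
# here to WGS 84 / UTM zone 14N (EPSG:32614).
wells_utm <- st_transform(wells_sf, 32614)

st_crs(wells_utm)$epsg
```

Passing `crs` to `st_as_sf()` has the same effect as calling `st_set_crs()` afterwards. Either way, you need to know which CRS the raw coordinates were recorded in.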

```{r sp-dist-wells, fig.cap = "Spatial distribution of irrigation wells", echo = FALSE, results = "hide"}
urnrd_gw_sf <- urnrd_gw %>% 
  #--- convert to sf ---#
  st_as_sf(coords = c("lon", "lat")) %>% 
  #--- set CRS WGS UTM 14 (you need to know the CRS of the coordinates to do this) ---# 
  st_set_crs(32614) 

tm_shape(three_counties) +
  tm_polygons() +
tm_shape(unique(urnrd_gw_sf, by = "well_id")) +
  tm_symbols(size = 0.1, col = "blue") +
  tm_layout(frame = FALSE)
```
---

Here are the rest of the steps we will take to obtain a regression-ready dataset for our analysis.

1. download groundwater level data observed at USGS monitoring wells from National Water Information System (NWIS) using the `dataRetrieval` package 
2. identify the irrigation wells located within the 2-mile radius of the USGS wells and calculate the total groundwater pumping that occurred around each of the USGS wells by year 
3. merge the groundwater pumping data to the groundwater level data

---

Let's download groundwater level data from NWIS first. The following code downloads groundwater level data for Nebraska from Jan 1, 1990, through Jan 1, 2016.

```{r gwl_data_download, eval = F}
#--- load the dataRetrieval package ---#
library(dataRetrieval)

#--- download groundwater level data ---#
NE_gwl <- readNWISdata(
    stateCd="Nebraska", 
    startDate = "1990-01-01", 
    endDate = "2016-01-01", 
    service = "gwlevels"
  ) %>% 
  dplyr::select(site_no, lev_dt, lev_va) %>% 
  rename(date = lev_dt, dwt = lev_va) 

#--- take a look ---#
head(NE_gwl, 10)
```

```{r read_NW_gwl, echo = F}
library(dataRetrieval)

NE_gwl <- readRDS("./Data/NE_gwl.rds")

#--- take a look ---#
head(NE_gwl, 10)
```

`site_no` is the unique monitoring well identifier, `date` is the date of groundwater level monitoring, and `dwt` is depth to water table. 

We calculate the average groundwater level in March by USGS monitoring well (right before the irrigation season starts):^[`month()` and `year()` are from the `lubridate` package. They extract month and year from a `Date` object.]

```{r avg_march_deptn}
#--- Average depth to water table in March ---#
NE_gwl_march <- NE_gwl %>% 
  mutate(
    date = as.Date(date),
    month = month(date),
    year = year(date),
  ) %>% 
  #--- select observation in March ---#
  filter(year >= 2007, month == 3) %>% 
  #--- gwl average in March ---#
  group_by(site_no, year) %>% 
  summarize(dwt  = mean(dwt))

#--- take a look ---#
head(NE_gwl_march, 10)
```

Since `NE_gwl` is missing geographic coordinates for the monitoring wells, we will download them using the `readNWISsite()` function and select only the monitoring wells that are inside the three counties.  

```{r NE_sites}
#--- get the list of site ids ---#  
NE_site_ls <- NE_gwl$site_no %>% unique()

#--- get the locations of the site ids ---#  
sites_info <- readNWISsite(siteNumbers = NE_site_ls) %>% 
  dplyr::select(site_no, dec_lat_va, dec_long_va) %>% 
  #--- turn the data into an sf object ---#
  st_as_sf(coords = c("dec_long_va", "dec_lat_va")) %>% 
  #--- NAD 83 ---#
  st_set_crs(4269) %>% 
  #--- project to WGS UTM 14 ---#
  st_transform(32614) %>% 
  #--- keep only those located inside the three counties ---#
  .[three_counties, ]
```
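
The last step above, `.[three_counties, ]`, is a spatial subset: it keeps the features of the left-hand object that intersect the object inside the brackets. Here is a minimal sketch with hypothetical geometries (`county` and `sites` are made up for illustration).

```r
library(sf)

# A hypothetical square "county" with corners (0, 0) and (10, 10),
# in projected coordinates (EPSG:32614, meters).
county <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)))),
  crs = 32614
)

# Three hypothetical monitoring wells: two inside the square, one outside.
sites <- st_as_sf(
  data.frame(site_no = c("s1", "s2", "s3"), x = c(2, 8, 15), y = c(3, 7, 5)),
  coords = c("x", "y"),
  crs = 32614
)

# x[y, ] keeps the features of x that intersect y (st_intersects() by default).
sites_inside <- sites[county, ]

sites_inside$site_no
```

Both layers must share the same CRS for this to work, which is why the monitoring wells were projected to EPSG:32614 before subsetting.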

---

We now identify irrigation wells that are located within the 2-mile radius of the monitoring wells^[This can alternatively be done using the `st_is_within_distance()` function.]. We first create polygons of 2-mile radius circles around the monitoring wells (see Figure \@ref(fig:buffer-map)).

```{r create_buffer}
buffers <- st_buffer(sites_info, dist = 2*1609.34) # 2 miles in meters
```
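
The buffer distance is interpreted in the units of the CRS, which is why the radius is converted from miles to meters. A minimal sketch with one hypothetical point (the coordinates are made up) shows the conversion and a sanity check on the resulting polygon.

```r
library(sf)

# One hypothetical monitoring well in UTM zone 14N (EPSG:32614), whose unit is meters.
well <- st_sfc(st_point(c(300000, 4480000)), crs = 32614)

# Convert 2 miles to meters (1 mile = 1609.344 meters).
radius_m <- 2 * 1609.344

# st_buffer() interprets `dist` in the units of the CRS, here meters.
buffer <- st_buffer(well, dist = radius_m)

# The buffer is a many-sided polygon approximating a circle,
# so its area is close to, but slightly below, pi * r^2.
area_ratio <- as.numeric(st_area(buffer)) / (pi * radius_m^2)
area_ratio
```

If the layer were still in a geographic CRS (longitude/latitude), a distance given in meters would not have this simple planar interpretation, so projecting first is the safer habit.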

```{r buffer-map, fig.cap = "2-mile buffers around USGS monitoring wells", echo = FALSE}
tm_shape(three_counties) +
  tm_polygons() +
tm_shape(buffers) +
  tm_polygons(col = "red", alpha = 0.2) +
tm_shape(sites_info) +
  tm_symbols(size = 0.1) +
  tm_layout(frame = FALSE)
```

We now identify which irrigation wells are inside each of the buffers and get the associated groundwater pumping values. The `st_join()` function from the `sf` package will do the trick.

```{r Demo_join_buffer_gw, cache = FALSE}
#--- find irrigation wells inside each buffer ---#
pumping_neaby <- st_join(buffers, urnrd_gw_sf)
``` 
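
To see what the join returns, here is a minimal sketch on hypothetical geometries. `toy_buffers`, `toy_wells`, and the helper `square()` are made up for illustration; only `st_join()` and the other `sf` functions are real.

```r
library(sf)

# Helper (hypothetical) that builds a square polygon from its lower-left corner.
square <- function(x0, y0, side) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + side, y0), c(x0 + side, y0 + side), c(x0, y0 + side), c(x0, y0)
  )))
}

# Two hypothetical square "buffers" in EPSG:32614.
toy_buffers <- st_sf(
  site_no = c("s1", "s2"),
  geometry = st_sfc(square(0, 0, 10), square(100, 100, 10), crs = 32614)
)

# Three hypothetical irrigation wells; all of them fall inside the first buffer only.
toy_wells <- st_as_sf(
  data.frame(well_id = 1:3, vol_af = c(10, 20, 30), x = c(1, 5, 9), y = c(1, 5, 9)),
  coords = c("x", "y"),
  crs = 32614
)

# Left join by default: one row per (buffer, well) match,
# and a single row with NA attributes for a buffer with no wells.
joined <- st_join(toy_buffers, toy_wells)

st_drop_geometry(joined)
```

The result has one row per buffer-well pair, so a buffer appears as many times as it has wells inside. That is why the pumping values are summed by monitoring well and year in the next step.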

Let's take a look at a USGS monitoring well (`site_no` = $400012101323401$).

```{r take_a_look}
filter(pumping_neaby, site_no == "400012101323401", year == 2010)
```

As you can see, this well has seven irrigation wells within its 2-mile radius in 2010.   

Now, we will get total nearby pumping by monitoring well and year. 

```{r Demo1_summary_by_buffer}
(
total_pumping_nearby <- pumping_neaby %>% 
  #--- calculate total pumping by monitoring well ---#
  group_by(site_no, year) %>% 
  summarize(nearby_pumping = sum(vol_af, na.rm = TRUE)) %>% 
  #--- NA means 0 pumping ---#  
  mutate(
    nearby_pumping = ifelse(is.na(nearby_pumping), 0, nearby_pumping)
  )
)
```

---

We now merge nearby pumping data to the groundwater level data, and transform the data to obtain the dataset ready for regression analysis.

```{r Demo_nearby_merge}
#--- regression-ready data ---#
reg_data <- NE_gwl_march %>% 
  #--- pick monitoring wells that are inside the three counties ---#
  filter(site_no %in% unique(sites_info$site_no)) %>% 
  #--- merge with the nearby pumping data ---#
  left_join(., total_pumping_nearby, by = c("site_no", "year")) %>% 
  #--- lead depth to water table ---#
  arrange(site_no, year) %>% 
  group_by(site_no) %>% 
  mutate(
    #--- lead depth ---#
    dwt_lead1 = dplyr::lead(dwt, n = 1, default = NA, order_by = year),
    #--- first order difference in dwt  ---#
    dwt_dif  = dwt_lead1 - dwt
  )  

#--- take a look ---#
dplyr::select(reg_data, site_no, year, dwt_dif, nearby_pumping)
``` 

---

Finally, we estimate the model using the `lfe` package.

```{r Demo_reg_dwt}
#--- load the lfe package for regression with fixed effects ---#
library(lfe)

#--- OLS with site_no and year FEs (error clustered by site_no) ---#
reg_dwt <- felm(dwt_dif ~ nearby_pumping | site_no + year | 0 | site_no, data = reg_data)
```
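
If you have not used fixed effects before, the sketch below shows what the `| site_no + year` part amounts to, using base R's `lm()` with dummy variables on simulated data. This is an illustration under stated assumptions, not how `felm()` works internally: the panel, the true slope of 0.02, and the effect sizes are all made up. The slope estimate from a dummy-variable regression coincides with the fixed effects estimate, but the standard errors from `lm()` are not clustered.

```r
# Hypothetical balanced panel: 4 wells observed over 5 years.
set.seed(1)
panel <- expand.grid(site_no = paste0("s", 1:4), year = 2008:2012)

# Made-up well-specific effects and a simulated pumping variable.
site_effect <- c(s1 = 1, s2 = -1, s3 = 0.5, s4 = 2)
panel$nearby_pumping <- runif(nrow(panel), 0, 100)

# Simulated outcome with a true slope of 0.02 on pumping.
panel$dwt_dif <- 0.02 * panel$nearby_pumping +
  site_effect[as.character(panel$site_no)] +
  0.1 * (panel$year - 2008) +
  rnorm(nrow(panel), sd = 0.1)

# Well and year fixed effects entered as dummy variables.
fe_fit <- lm(dwt_dif ~ nearby_pumping + factor(site_no) + factor(year), data = panel)

# Estimated slope on pumping (should be close to the true value of 0.02).
coef(fe_fit)["nearby_pumping"]
```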

Here is the regression result.

```{r reg_dwt_disp, results = "asis"}
stargazer(reg_dwt, type = "html")
```

## Demonstration 2: Precision Agriculture

```{r table_style_demo2, results="asis", echo=FALSE}
cat("
<style>
.book .book-body .page-wrapper .page-inner section.normal table
{
  width:auto;
}
.book .book-body .page-wrapper .page-inner section.normal table td,
.book .book-body .page-wrapper .page-inner section.normal table th,
.book .book-body .page-wrapper .page-inner section.normal table tr
{
  padding:0;
  border:0;
  background-color:#fff;
}
</style>
")
```

### Project Overview

---

**Objectives:**

+ Understand the impact of nitrogen on corn yield 
+ Understand how electric conductivity (EC) affects the marginal impact of nitrogen on corn yield 

---

**Datasets:**

+ The experimental design of an on-farm randomized nitrogen trial on an 80-acre field 
+ Data generated by the experiment
  * As-applied nitrogen rate
  * Yield measures 
+ Electric conductivity 

---

**Econometric Model:**

Here is the econometric model we would like to estimate:

$$
yield_i = \beta_0 + \beta_1 N_i + \beta_2 N_i^2 + \beta_3 N_i \cdot EC_i + \beta_4 N_i^2 \cdot EC_i + v_i
$$

where $yield_i$, $N_i$, $EC_i$, and $v_i$ are corn yield, nitrogen rate, EC, and error term at subplot $i$. Subplots are obtained by dividing each experimental plot into six equal-area compartments.
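
Since the second objective concerns the marginal impact of nitrogen, it helps to write that quantity out. Differentiating the model above with respect to $N_i$ gives

$$
\frac{\partial yield_i}{\partial N_i} = \beta_1 + 2\beta_2 N_i + \beta_3 EC_i + 2\beta_4 N_i \cdot EC_i
$$

so the way EC shifts the marginal impact of nitrogen is governed by $\beta_3 + 2\beta_4 N_i$, the derivative of this expression with respect to $EC_i$.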

---

**GIS tasks**

* read spatial data in various formats: R data set (rds), shapefile, and GeoPackage file
  - use `readRDS()` for the rds files and `sf::st_read()` for the shapefile and GeoPackage file
* create maps using the `ggplot2` package
  - use `ggplot2::geom_sf()`
* create subplots within experimental plots
  - user-defined function that makes use of `st_geometry()` 
* identify corn yield, as-applied nitrogen, and electric conductivity (EC) data points within each of the experimental plots and find their averages
  - use `sf::st_join()` and the `aggregate()` method for `sf` objects

---

**Preparation for replication**

+ Source (run) *Chap_1_Demonstration.R* to define `theme_for_map` and `gen_subplots()`

```{r source_demo2, eval = F}
source("Codes/Chap_1_Demonstration.R")
```

+ Load (install first if you have not) the following packages (There are other packages that will be loaded during the demonstration).

```{r demo2_packages, eval = FALSE}
library(sf)
library(dplyr)
library(ggplot2)
library(stargazer)
```

### Project Demonstration

We have already run a whole-field randomized nitrogen experiment on an 80-acre field. Let's import the trial design data.

```{r Demo2_exp_design}
#--- read the trial design data ---#
trial_design_16 <- readRDS("./Data/trial_design.rds")
```

Figure \@ref(fig:trial-fig) is the map of the trial design generated using the `ggplot2` package.^[`theme_for_map` is a user defined object that defines the theme of figures generated using `ggplot2` for this section. You can find it in **Chap_1_Demonstration.R**.]

```{r trial-fig, fig.cap = "The Experimental Design of the Randomized Nitrogen Trial"}
#--- map of trial design ---#
ggplot(data = trial_design_16) +
  geom_sf(aes(fill = factor(NRATE))) +
  scale_fill_brewer(name = "N", palette = "OrRd", direction = 1) +
  theme_for_map
```

---

We have collected yield, as-applied NH3, and EC data. Let's read in these datasets:^[Here we are demonstrating that R can read spatial data in different formats: an `sf` object saved as an R data set (.rds), a GeoPackage file (.gpkg), and a shapefile (.shp). R can read spatial data in many other formats as well.]

```{r Demo2_data_experiment, results = "hide"}
#--- read yield data (sf data saved as rds) ---#
yield <- readRDS("./Data/yield.rds")

#--- read NH3 data (GeoPackage data) ---#
NH3_data <- st_read("Data/NH3.gpkg")

#--- read ec data (shape file) ---#
ec <- st_read(dsn="Data", "ec")
```
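
If you would like to try `st_read()` without the data for this chapter, here is a minimal sketch that writes a small made-up point dataset to a GeoPackage file and to a shapefile in temporary folders, and then reads both back. The object and file names (`pts`, `ec_demo`) are hypothetical and are not part of the replication files.

```r
library(sf)

#--- three made-up points with an attribute (not the data used in this chapter) ---#
pts <- st_as_sf(
  data.frame(ec = c(30.1, 28.4, 35.7), x = c(0, 1, 2), y = c(0, 1, 0)),
  coords = c("x", "y"),
  crs = 4326
)

#--- write the same object as a GeoPackage file and as a shapefile ---#
gpkg_path <- tempfile(fileext = ".gpkg")
shp_dir <- tempfile("shp_demo_")
dir.create(shp_dir)
st_write(pts, gpkg_path, quiet = TRUE)
st_write(pts, file.path(shp_dir, "ec_demo.shp"), quiet = TRUE)

#--- read them back: a file path for GeoPackage, folder + layer name for the shapefile ---#
ec_gpkg <- st_read(gpkg_path, quiet = TRUE)
ec_shp <- st_read(dsn = shp_dir, layer = "ec_demo", quiet = TRUE)
```

Both `ec_gpkg` and `ec_shp` come back as `sf` objects, so the rest of the workflow does not depend on the file format the data was stored in.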

Figure \@ref(fig:Demo2-show-the-map) shows the spatial distribution of the three variables. A map of each variable is made first, and then the three maps are combined into one figure using the `patchwork` package^[Here is its [github page](https://github.com/thomasp85/patchwork). See the bottom of the page to find vignettes.].

```{r Demo2-show-the-map, fig.cap = "Spatial distribution of yield, NH3, and EC", fig.height = 9}

#--- yield map ---#
g_yield <- ggplot() +
  geom_sf(data = trial_design_16) +
  geom_sf(data = yield, aes(color = yield), size = 0.5) +
  scale_color_distiller(name = "Yield", palette = "OrRd", direction = 1) +
  theme_for_map

#--- NH3 map ---#
g_NH3 <- ggplot() +
  geom_sf(data = trial_design_16) +
  geom_sf(data = NH3_data, aes(color = aa_NH3), size = 0.5) +
  scale_color_distiller(name = "NH3", palette = "OrRd", direction = 1) +
  theme_for_map

#--- EC map ---#
g_ec <- ggplot() +
  geom_sf(data = trial_design_16) +
  geom_sf(data = ec, aes(color = ec), size = 0.5) +
  scale_color_distiller(name = "EC", palette = "OrRd", direction = 1) +
  theme_for_map

#--- stack the figures vertically and display  ---#
library(patchwork)
g_yield/g_NH3/g_ec
```

---

Instead of using plot as the observation unit, we would like to create subplots inside each of the plots and make them the unit of analysis because it would avoid masking the within-plot spatial heterogeneity of EC. Here, we divide each plot into six subplots^[`gen_subplots` is a user-defined function. See **Chap_1_Demonstration.R**.]:

```{r Demo2_get_subplots}
#--- generate subplots ---#
subplots <- lapply(
  1:nrow(trial_design_16), 
  function(x) gen_subplots(trial_design_16[x, ], 6)
  ) %>% 
  do.call('rbind', .) 
```
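
`gen_subplots()` itself lives in **Chap_1_Demonstration.R** and is not reproduced here. To give a rough idea of what dividing a plot into six equal-area pieces means, here is a simplified sketch that is **not** the author's function: it assumes a single made-up plot that is an axis-aligned rectangle, in which case `st_make_grid()` is enough. Real trial plots are typically rotated, which is why a dedicated function is needed.

```r
library(sf)

#--- one made-up rectangular plot, 60 units long and 10 units wide (no CRS) ---#
plot_poly <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(60, 0), c(60, 10), c(0, 10), c(0, 0)
))))

#--- split it into 6 equal-area strips along its length ---#
subplots_demo <- st_sf(
  subplot_id = 1:6,
  geometry = st_make_grid(plot_poly, n = c(6, 1))
)

#--- area of each strip ---#
st_area(subplots_demo)
```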

Figure \@ref(fig:map-subgrid) is a map of the subplots generated.

```{r map-subgrid, fig.cap = "Map of the subplots"}
#--- here is what subplots look like ---#
ggplot(subplots) +
  geom_sf() +
  theme_for_map
```

---

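We now identify the mean value of corn yield, nitrogen rate, and EC for each of the subplots using `aggregate()` (the method for `sf` objects) and `sf::st_join()`. `aggregate()` returns the subplot polygons along with the mean of the points that fall inside each of them, so `st_join()` with `join = st_equals` matches each set of means back to the subplot with the identical geometry.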

```{r Demo2_merge_join}
(
reg_data <- subplots %>% 
  #--- yield ---#
  st_join(., aggregate(yield, ., mean), join = st_equals) %>%
  #--- nitrogen ---#
  st_join(., aggregate(NH3_data, ., mean), join = st_equals) %>%
  #--- EC ---#
  st_join(., aggregate(ec, ., mean), join = st_equals)
)
``` 
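
To see the mechanics of the `aggregate()` and `st_join()` combination on something small, here is a minimal sketch with two made-up square polygons and three made-up points. All object names and numbers are hypothetical.

```r
library(sf)

#--- helper that makes a unit square whose lower-left corner is (xmin, 0) ---#
square <- function(xmin) {
  st_polygon(list(rbind(
    c(xmin, 0), c(xmin + 1, 0), c(xmin + 1, 1), c(xmin, 1), c(xmin, 0)
  )))
}

#--- two polygons side by side ---#
polys <- st_sf(plot_id = c("A", "B"), geometry = st_sfc(square(0), square(1)))

#--- three points: two fall in A, one falls in B ---#
pts <- st_as_sf(
  data.frame(yield = c(10, 20, 30), x = c(0.25, 0.75, 1.5), y = 0.5),
  coords = c("x", "y")
)

#--- mean of the points by polygon (the result carries the polygon geometry) ---#
poly_means <- aggregate(pts, polys, mean)

#--- attach the means to the polygons by matching identical geometries ---#
joined <- st_join(polys, poly_means, join = st_equals)
```

`joined` keeps `plot_id` and gains a `yield` column holding the polygon-level mean.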

Here is the visualization of the subplot-level data (Figure \@ref(fig:Demo2-subplot-fig)):  

```{r Demo2-subplot-fig, fig.cap = "Spatial distribution of subplot-level yield, NH3, and EC", fig.height = 7}
(ggplot() +
  geom_sf(data = reg_data, aes(fill = yield), color = NA) +
  scale_fill_distiller(name = "Yield", palette = "OrRd", direction = 1) +
  theme_for_map)/
(ggplot() +
  geom_sf(data = reg_data, aes(fill = aa_NH3), color = NA) +
  scale_fill_distiller(name = "NH3", palette = "OrRd", direction = 1) +
  theme_for_map)/
(ggplot() +
  geom_sf(data = reg_data, aes(fill = ec), color = NA) +
  scale_fill_distiller(name = "EC", palette = "OrRd", direction = 1) +
  theme_for_map)
``` 

---

Let's estimate the model and see the results:

```{r Demo2_results, results = "asis"}
lm(yield ~ aa_NH3 + I(aa_NH3^2) + I(aa_NH3*ec) + I(aa_NH3^2*ec), data = reg_data) %>% 
  stargazer(type = "html")
```


## Demonstration 3: Land Use and Weather

```{r table_style_demo3, results="asis", echo=FALSE}
cat("
<style>
.book .book-body .page-wrapper .page-inner section.normal table
{
  width:auto;
}
.book .book-body .page-wrapper .page-inner section.normal table td,
.book .book-body .page-wrapper .page-inner section.normal table th,
.book .book-body .page-wrapper .page-inner section.normal table tr
{
  padding:0;
  border:0;
  background-color:#fff;
}
</style>
")
```

### Project Overview

---

**Objective**

Understand the impact of past precipitation on crop choice in Iowa (IA). 

---

**Datasets**

+ IA county boundary 
+ Regular grids over IA, created using `sf::st_make_grid()` 
+ PRISM daily precipitation data downloaded using `prism` package
+ Land use data from the Cropland Data Layer (CDL) for IA in 2015, downloaded using `cdlTools` package

---

**Econometric Model**

The econometric model we would like to estimate is:

$$
 CS_i = \alpha + \beta_1 PrN_{i} + \beta_2 PrC_{i} + v_i
$$
where $CS_i$ is the area share of corn divided by that of soy in 2015 for grid $i$ (we will generate regularly-sized grids in the Demo section), $PrN_i$ is the total precipitation observed in April through May and September  in 2014, $PrC_i$ is the total precipitation observed in June through August in 2014, and $v_i$ is the error term. To run the econometric model, we need to find crop share and weather variables observed at the grids. We first tackle the crop share variable, and then the precipitation variable.

---

**GIS tasks**

+ download Cropland Data Layer (CDL) data by USDA NASS 
  * use `cdlTools::getCDL()`
+ download PRISM weather data
  * use `prism::get_prism_dailys()`
+ crop PRISM data to the geographic extent of IA 
  * use `raster::crop()`
+ create regular grids within IA, which become the observation units of the econometric analysis
  * use `sf::st_make_grid()` 
+ remove grids that share small area with IA 
  * use `sf::st_intersection()` and `sf::st_area()`
+ assign crop share and weather data to each of the generated IA grids (parallelized)
  * use `exactextractr::exact_extract()` and `future.apply::future_lapply()`
+ create maps 
  * use `tmap` package

---

**Preparation for replication**

+ Load (install first if you have not) the following packages (There are other packages that will be loaded during the demonstration).

```{r demo3_packages, eval = FALSE}
library(sf)
library(data.table)
library(dplyr)
library(raster)
library(lubridate)
library(tmap)
library(future.apply)
library(stargazer)
```

### Project Demonstration

The geographic focus of this project is IA. Let's get IA's state border (see Figure \@ref(fig:IA-map) for its map).

```{r Demo4_IA_boundary}
library("maps")

#--- IA state boundary ---#
IA_boundary <- st_as_sf(map("state", "iowa", plot = FALSE, fill = TRUE)) 
```

```{r IA-map, echo = FALSE, fig.cap = "IA state boundary", fig.margin = TRUE}
#--- map IA state border ---#
tm_shape(IA_boundary) +
  tm_polygons() +
  tm_layout(frame = FALSE) 
```

The unit of analysis is artificial grids that we create over IA. The grids are regularly-sized rectangles except around the edge of the IA state border^[We are by no means saying that this is the right geographical unit of analysis. This is just about demonstrating how R can be used for analysis done at a higher spatial resolution than the county level.]. So, let's create grids and remove those that do not overlap much with IA.

```{r Demo4_IA_grids} 
#--- create regular grids (40 rows by 40 columns) over IA ---#
IA_grids <- IA_boundary %>% 
  #--- create grids ---#
  st_make_grid(n = c(40, 40)) %>% 
  #--- convert to sf ---#
  st_as_sf() %>% 
  #--- find the intersections of IA grids and IA polygon ---#
  st_intersection(., IA_boundary) %>% 
  #--- calculate the area of each grid ---#
  mutate(
    area = as.numeric(st_area(.)),
    area_ratio = area/max(area)
  ) %>% 
  #--- keep only if the intersected area is large enough ---#
  filter(area_ratio > 0.8) %>% 
  #--- assign grid id for future merge ---#
  mutate(grid_id = 1:nrow(.))
```
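
The same grid-then-filter logic can be tried on a toy shape. Below is a minimal sketch that assumes a made-up triangular "state" with no CRS and a 4 by 4 grid, so that the areas are easy to check by hand. It uses base R subsetting instead of `dplyr`, and all names are hypothetical.

```r
library(sf)

#--- a made-up triangular area with corners (0, 0), (4, 0), and (0, 4) ---#
tri <- st_sf(
  name = "demo_area",
  geometry = st_sfc(st_polygon(list(rbind(c(0, 0), c(4, 0), c(0, 4), c(0, 0)))))
)

#--- 4 by 4 regular grid over the bounding box of the triangle ---#
cells <- st_sf(geometry = st_make_grid(tri, n = c(4, 4)))

#--- keep only the part of each cell that is inside the triangle ---#
cells <- st_intersection(cells, tri)

#--- area of each piece relative to the largest piece ---#
cells$area <- as.numeric(st_area(cells))
cells$area_ratio <- cells$area / max(cells$area)

#--- drop the pieces that were cut down by the diagonal edge ---#
kept <- cells[cells$area_ratio > 0.8, ]
kept$grid_id <- seq_len(nrow(kept))
```

Cells cut in half by the diagonal edge have an area ratio of 0.5 and are dropped, while the cells that lie fully inside the triangle are kept.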

Here is what the generated grids look like (Figure \@ref(fig:Demo4-IA-grids-map)):

```{r Demo4-IA-grids-map, fig.cap = "Map of regular grids generated over IA"}
#--- plot the grids over the IA state border ---#
tm_shape(IA_boundary) +
  tm_polygons(col = "green") +
tm_shape(IA_grids) +
  tm_polygons(alpha = 0) +
  tm_layout(frame = FALSE)
```  

---

Let's work on crop share data. You can download CDL data using the `getCDL()` function from the `cdlTools` package.

```{r Demo4_cdl_download}
#--- load the cdlTools package ---#
library(cdlTools)

#--- download the CDL data for IA in 2015 ---#
(
IA_cdl_2015 <- getCDL("Iowa", 2015)$IA2015
)
```

The cells (30 meter by 30 meter) of the imported raster layer take a value ranging from 0 to 255. Corn and soybean are represented by 1 and 5, respectively.

Figure \@ref(fig:overlap-cdl-grid) shows the map of one of the IA grids and the CDL cells it overlaps with.

```{r overlap-cdl-grid, fig.cap = "Spatial overlap of an IA grid and CDL layer", echo = FALSE, dependson = "Demo4_cdl_download"}
temp_grid <- IA_grids[100, ]

extent_grid <- temp_grid %>% 
  st_transform(., projection(IA_cdl_2015)) %>% 
  extent()

raster_ovelap <- raster::crop(IA_cdl_2015, extent_grid)

tm_shape(raster_ovelap) +
  tm_raster() +
tm_shape(temp_grid) +
  tm_borders(lwd = 3, col = "blue") 
```  

We would like to extract all the cell values within the blue border. 

We use `exactextractr::exact_extract()` to identify which cells of the CDL raster layer fall within each of the IA grids and extract land use type values. We then find the share of corn and soybean for each of the grids.

```{r Demo4_extract, results = "hide"}
#--- reproject grids to the CRS of the CDL data ---#
IA_grids_rp_cdl <- st_transform(IA_grids, projection(IA_cdl_2015))

#--- load the exactextractr package for fast raster value extractions for polygons ---#
library(exactextractr)

#--- extract crop type values and find frequencies ---#
cdl_extracted <- exact_extract(IA_cdl_2015, IA_grids_rp_cdl) %>% 
  lapply(., function (x) data.table(x)[,.N, by = value]) %>% 
  #--- combine the list of data.tables into one data.table ---#
  rbindlist(idcol = TRUE) %>% 
  #--- find the share of each land use type ---#
  .[, share := N/sum(N), by = .id] %>% 
  .[, N := NULL] %>% 
  #--- keep only the share of corn and soy ---#
  .[value %in% c(1, 5), ]  
```

We then find the corn to soy ratio for each of the IA grids.

```{r Demo4_share_calc}
#--- find corn/soy ratio ---#
corn_soy <- cdl_extracted %>% 
  #--- long to wide ---#
  dcast(.id ~ value, value.var = "share") %>% 
  #--- change variable names ---#
  setnames(c(".id", "1", "5"), c("grid_id", "corn_share", "soy_share")) %>% 
  #--- corn share divided by soy share ---#
  .[, c_s_ratio := corn_share / soy_share]
```
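
The two steps above (frequencies to shares, then long to wide) are easier to follow on a tiny example. Here is a minimal sketch in which `extracted` is a made-up stand-in for what `exact_extract()` returns: a list with one data.frame per grid holding a `value` and a `coverage_fraction` column. The land use codes other than 1 (corn) and 5 (soy) are arbitrary.

```r
library(data.table)

#--- made-up extraction results for two grids, four cells each ---#
extracted <- list(
  data.frame(value = c(1, 1, 5, 176), coverage_fraction = 1),
  data.frame(value = c(1, 5, 5, 5), coverage_fraction = 1)
)

#--- count cells by land use type within each grid ---#
shares <- lapply(extracted, function(x) data.table(x)[, .N, by = value])
shares <- rbindlist(shares, idcol = "grid_id")

#--- share of each land use type by grid ---#
shares[, share := N / sum(N), by = grid_id]

#--- keep corn and soy, reshape long to wide, and find the ratio ---#
wide <- dcast(shares[value %in% c(1, 5)], grid_id ~ value, value.var = "share")
setnames(wide, c("1", "5"), c("corn_share", "soy_share"))
wide[, c_s_ratio := corn_share / soy_share]
```

Note that the shares are computed before the non-corn, non-soy rows are dropped, so they are shares of the whole grid, not shares of the corn and soy area.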

---

We are still missing daily precipitation data at the moment. We have decided to use daily weather data from PRISM. Daily PRISM data is raster data with a cell size of 4 km by 4 km. Figure \@ref(fig:Demo4-show-prism-data) presents precipitation data downloaded for April 1, 2010. It covers the entire contiguous U.S.     

```{r Demo4-show-prism-data, fig.cap = "Map of PRISM raster data layer", echo = FALSE, fig.margin = TRUE}
prism_ex <- readRDS( "./Data/prism_ex.rds")
tm_shape(prism_ex) +
  tm_raster() +
  tm_layout(frame = FALSE)
```

Let's now download PRISM data^[You do not have to run this code to download the data. It is included in the data folder for replication ([here](https://www.dropbox.com/sh/cyx9clgmshwc8eo/AAApv03Qpx84IGKCyF5v2rJ6a?dl=0)).]. This can be done using the `get_prism_dailys()` function from the `prism` package.^[[prism github page](https://github.com/ropensci/prism)]  

<!-- not to be seen -->
```{r Demo4_get_prism, eval = FALSE}
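#--- load the prism package, which provides get_prism_dailys() ---#
library(prism)

#--- set the folder to which PRISM data is downloaded ---#
options(prism.path = "./Data/PRISM")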

get_prism_dailys(
  type = "ppt", 
  minDate = "2014-04-01", 
  maxDate = "2014-09-30", 
  keepZip = FALSE 
)
```

When we use `get_prism_dailys()` to download data^[For this project, I could have just used monthly PRISM data, which can be downloaded using the `get_prism_monthlys()` function. But, in many applications, daily data is necessary, so I wanted to illustrate how to download and process them.], it creates one folder for each day. So, I have 183 folders (one for each day from April 1 through September 30, 2014) inside the folder I designated as the download destination above with the `options()` function. 

<!-- The name of the folder is expressive about what the data inside it is about. For example, the precipitation data for April 1st, 2010 is stored in the folder called "PRISM_ppt_stable_4kmD2_20100401_bil." Inside it, you will see bunch of files with exactly the same prefix, but with different extensions.   --> 

---

We now extract precipitation values by day for each of the IA grids by geographically overlaying the IA grids onto the PRISM data layer and identifying which PRISM cells each of the IA grids encompasses. Figure \@ref(fig:Demo4-prism-crop) shows how the first IA grid overlaps with the PRISM cells^[Do not use `st_buffer()` for spatial objects in geographic coordinates (latitude, longitude) if you intend to use the created buffers for any serious analysis (it is difficult to get the right distance parameter anyway). Significant distortion will be introduced to the buffer because one degree of latitude and one degree of longitude correspond to different distances at the latitude of IA. Here, I am just creating a buffer to extract PRISM cells to display on the map.]. 

```{r Demo4-prism-crop, fig.cap = "Spatial overlap of an IA grid over PRISM cells"}
#--- read a PRISM dataset ---#
prism_whole <- raster("./Data/PRISM/PRISM_ppt_stable_4kmD2_20140401_bil/PRISM_ppt_stable_4kmD2_20140401_bil.bil") 

#--- align the CRS ---#
IA_grids_rp_prism <- st_transform(IA_grids, projection(prism_whole))

#--- crop the PRISM data for the 1st IA grid ---#
PRISM_1 <- crop(prism_whole, st_buffer(IA_grids_rp_prism[1, ], dist = 0.05))

#--- map them ---#
tm_shape(PRISM_1) +
  tm_raster() +
tm_shape(IA_grids_rp_prism[1, ]) +
  tm_polygons(alpha = 0) +
  tm_layout(frame = NA)
```

As you can see, some PRISM cells are fully inside the analysis grid, while others are only partially inside it. So, when assigning precipitation values to grids, we will use the coverage-weighted mean of precipitation^[In practice, this may not be advisable. The coverage fraction calculation by `exact_extract()` is done using latitude and longitude. Therefore, the relative magnitudes of the coverage fractions do not exactly reflect the relative magnitudes of the overlapped areas. When the spatial resolution of the source grids (grids from which you extract values) is much finer than that of the target grids (grids to which you assign values), a simple average would be very similar to a coverage-weighted mean. For example, CDL consists of 30m by 30m grids, and more than $1,000$ grids are inside one analysis grid.]. 
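
As a small numerical illustration of what a coverage-weighted mean does, here is a sketch with three made-up PRISM cell values and made-up coverage fractions for a single analysis grid.

```r
#--- made-up precipitation values (mm) of three cells overlapping one grid ---#
value <- c(2, 4, 6)

#--- made-up fraction of each cell that lies inside the grid ---#
coverage_fraction <- c(1, 0.5, 0.25)

#--- coverage-weighted mean: cells that are mostly outside count less ---#
cw_mean <- sum(value * coverage_fraction) / sum(coverage_fraction)

#--- the same number using the base R helper ---#
cw_mean_alt <- weighted.mean(value, w = coverage_fraction)

#--- simple mean for comparison ---#
simple_mean <- mean(value)
```

Here the coverage-weighted mean is $5.5/1.75 \approx 3.14$, which is lower than the simple mean of $4$ because the cell with the highest value is only a quarter inside the grid.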

Unlike the CDL layer, we have `r seq(as.Date("2014-04-01"), as.Date("2014-09-30"), "days") %>% length` raster layers to process. Fortunately, we can process many raster files at the same time very quickly by first "stacking" them and then applying the `exact_extract()` function. Using `future_lapply()`, we let $6$ cores take care of this task, with each processing 31 files except for one that handles only 28 files.^[Parallelization of extracting values from many raster layers for polygons is discussed in much more detail in Chapter \@ref(Efficient). When I tried stacking all 183 files into one stack and applying `exact_extract()`, it did not finish the job after over five minutes. So, I terminated the process in the middle. The parallelized version gets the job done in about $30$ seconds on my desktop.]
We first get all the paths to the PRISM files. 

```{r setup_parallel}
#--- get all the dates ---#
dates_ls <- seq(as.Date("2014-04-01"), as.Date("2014-09-30"), "days") 

#--- remove hyphen (str_remove_all() is from the stringr package) ---#
library(stringr)
dates_ls_no_hyphen <- str_remove_all(dates_ls, "-")

#--- get all the prism file names ---#
folder_name <- paste0("PRISM_ppt_stable_4kmD2_", dates_ls_no_hyphen, "_bil") 
file_name <- paste0("PRISM_ppt_stable_4kmD2_", dates_ls_no_hyphen, "_bil.bil") 
file_paths <- paste0("./Data/PRISM/", folder_name, "/", file_name)

#--- take a look ---#
head(file_paths)
```

We now prepare for parallelized extractions and then implement them using `future_lapply()`.

```{r go_parallel_prep}
#--- define the number of cores to use ---#
num_core <- 6

#--- prepare some parameters for parallelization ---#
file_len <- length(file_paths)
files_per_core <- ceiling(file_len/num_core)

#--- prepare for parallel processing (multisession replaces the deprecated multiprocess) ---#
plan(multisession, workers = num_core)

#--- reproject IA grids to the CRS of PRISM data ---#
IA_grids_reprojected <- st_transform(IA_grids, projection(prism_whole))
```

Here is the function that we run in parallel over `r num_core` cores. 

```{r define_function_prism_get, eval = F}
#--- define the function to extract PRISM values by block of files ---#
extract_by_block <- function(i, files_per_core) {

  #--- index of the first file processed by core i ---#
  start_file_index <- (i-1) * files_per_core + 1

  #--- indexes for files to process (blocks must not overlap) ---#
  file_index <- seq(
    from = start_file_index,
    to = min((start_file_index + files_per_core - 1), file_len),
    by = 1
  )

  #--- extract values ---# 
  data_temp <- file_paths[file_index] %>% # get file names
    #--- stack files ---#
    stack() %>% 
    #--- extract ---#
    exact_extract(., IA_grids_reprojected) %>% 
    #--- combine into one data set ---#
    rbindlist(idcol = "ID") %>% 
    #--- wide to long ---#
    melt(id.var = c("ID", "coverage_fraction")) %>% 
    #--- calculate "area"-weighted mean ---#
    .[, .(value = sum(value * coverage_fraction)/sum(coverage_fraction)), by = .(ID, variable)]

  return(data_temp)
}
```
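
Because each day must be processed by exactly one core (otherwise its precipitation would be counted twice when we sum by period), it is worth checking how the file indexes are split into blocks. Here is a minimal sketch of the index arithmetic alone, with the number of files and cores hard-coded to the values used in this demonstration.

```r
#--- 183 daily files split across 6 cores ---#
file_len <- 183
num_core <- 6
files_per_core <- ceiling(file_len / num_core)

#--- file indexes handled by each core ---#
block_index <- lapply(seq_len(num_core), function(i) {
  start <- (i - 1) * files_per_core + 1
  end <- min(i * files_per_core, file_len)
  seq(from = start, to = end, by = 1)
})

#--- number of files per core ---#
lengths(block_index)

#--- every file index appears exactly once across blocks ---#
all(unlist(block_index) == seq_len(file_len))
```

The first five blocks hold 31 files each and the last one holds the remaining 28.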

Now, let's run the function in parallel and calculate precipitation by period.

```{r parallel_prism_not_run, eval = FALSE}
#--- run the function ---#
precip_by_period <- future_lapply(1:num_core, function(x) extract_by_block(x, files_per_core)) %>% rbindlist() %>% 
  #--- recover the date ---#
  .[, variable := as.Date(str_extract(variable, "[0-9]{8}"), "%Y%m%d")] %>% 
  #--- change the variable name to date ---#
  setnames("variable", "date") %>% 
  #--- define critical period ---#
  .[,critical := "non_critical"] %>% 
  .[month(date) %in% 6:8, critical := "critical"] %>% 
  #--- total precipitation by critical dummy  ---#
  .[, .(precip=sum(value)), by = .(ID, critical)] %>%
  #--- long to wide ---#
  dcast(ID ~ critical, value.var = "precip")
```
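
The post-processing part of the code above (flag the critical period, sum by period, and reshape) can be checked on a toy dataset. Here is a minimal sketch with made-up daily values for two grids. The dates are chosen to sit right at the boundaries of the June through August window.

```r
library(data.table)

#--- made-up daily precipitation for two grids ---#
daily <- data.table(
  ID = rep(1:2, each = 4),
  date = rep(as.Date(c("2014-05-31", "2014-06-01", "2014-08-31", "2014-09-01")), times = 2),
  value = c(1, 2, 3, 5, 10, 20, 30, 45)
)

#--- June through August is the critical period ---#
daily[, critical := ifelse(month(date) %in% 6:8, "critical", "non_critical")]

#--- total precipitation by grid and period, then long to wide ---#
by_period <- dcast(
  daily[, .(precip = sum(value)), by = .(ID, critical)],
  ID ~ critical,
  value.var = "precip"
)
```

`by_period` has one row per grid with `critical` and `non_critical` columns, which are the two regressors of the econometric model.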

```{r read_precip_period, echo = FALSE}
# saveRDS(precip_by_period, "./Data/precip_by_period.rds")
precip_by_period <- readRDS( "./Data/precip_by_period.rds")
```

We now have grid-level crop share and precipitation data. 

---

Let's merge them and run a regression.^[We can match on `grid_id` from `corn_soy` and `ID` from `precip_by_period` because `grid_id` is identical to the row number and the ID variables were created so that the ID value of $i$ corresponds to the $i$ th row of `IA_grids`.]

```{r Demo4_reg}
#--- merge the crop share and precipitation data ---#
reg_data <- corn_soy[precip_by_period, on = c(grid_id = "ID")]

#--- OLS ---#
reg_results <- lm(c_s_ratio ~ critical + non_critical, data = reg_data)
```

Here is the regression results table.


```{r , results = "asis"}
#--- regression table ---#
stargazer(reg_results, type = "html")
```

Again, do not read too much into the results as the econometric model is terrible.  


## Demonstration 4: The Impact of Railroad Presence on Corn Planted Acreage

```{r table_style_demo4, results="asis", echo=FALSE}
cat("
<style>
.book .book-body .page-wrapper .page-inner section.normal table
{
  width:auto;
}
.book .book-body .page-wrapper .page-inner section.normal table td,
.book .book-body .page-wrapper .page-inner section.normal table th,
.book .book-body .page-wrapper .page-inner section.normal table tr
{
  padding:0;
  border:0;
  background-color:#fff;
}
</style>
")
```

### Project Overview

---

**Objective**

+ Understand the impact of railroad on corn planted acreage in Illinois

---

**Datasets**

+ USDA corn planted acreage for Illinois downloaded from the USDA National Agricultural Statistics Service (NASS) QuickStats service using the `tidyUSDA` package 
+ US railroads (line data) downloaded from [here](https://catalog.data.gov/dataset/tiger-line-shapefile-2015-nation-u-s-rails-national-shapefile)

---

**Econometric Model**

We will estimate the following model:

$$
  y_i = \beta_0 + \beta_1 RL_i + v_i
$$

where $y_i$ is corn planted acreage in county $i$ in Illinois, $RL_i$ is the total length of railroad, and $v_i$ is the error term.

---

**GIS tasks**

+ Download USDA corn planted acreage by county as a spatial dataset (`sf` object)
  * use `tidyUSDA::getQuickstat()`
+ Import US railroad shape file as a spatial dataset (`sf` object) 
  * use `sf::st_read()`
+ Spatially subset (crop) the railroad data to the geographic boundary of Illinois 
  * use `sf_1[sf_2, ]`
+ Find railroads for each county (cross-county railroad will be chopped into pieces for them to fit within a single county)
  * use `sf::st_intersection()`        
+ Calculate the travel distance of each railroad piece
  * use `sf::st_length()`

---

**Preparation for replication**

+ Load (install first if you have not) the following packages (There are other packages that will be loaded during the demonstration).

```{r demo4_packages, eval = FALSE}
library(sf)
library(ggplot2)
library(dplyr)
library(stargazer)
```

### Project Demonstration

We first download corn planted acreage data for 2018 from the USDA NASS QuickStats service using the `tidyUSDA` package^[In order to actually download the data, you need to obtain an API key [here](https://quickstats.nass.usda.gov/api). Once I obtained the API key, I stored it under the name "usda_nass_qs_api" using `key_set()` from the `keyring` package. In the code below, I retrieve the API key using `key_get("usda_nass_qs_api")`.].

```{r get_quicknass, results = "hide"}
library(keyring)
library(tidyUSDA)

(
IL_corn_planted <- getQuickstat(
    key = key_get("usda_nass_qs_api") ,
    program = "SURVEY",
    data_item = "CORN - ACRES PLANTED",
    geographic_level = "COUNTY",
    state = "ILLINOIS",
    year = "2018",
    geometry = TRUE
  )  %>% 
  #--- keep only some of the variables ---#
  dplyr::select(year, NAME, county_code, short_desc, Value)
)
```

```{r IL_corn_planted, echo = F}
IL_corn_planted
```

A nice thing about this function is that, with `geometry = TRUE`, the data is downloaded as an `sf` object with county geometry. So, you can immediately plot it (Figure \@ref(fig:map-il-corn-acreage)) and use it for later spatial interactions without having to merge the downloaded data to independent county boundary data.^[`theme_for_map` is a user defined object that defines the theme of figures generated using `ggplot2` for this section. You can find it in **Chap_1_Demonstration.R**.] 

```{r map-il-corn-acreage, fig.cap = "Map of Corn Planted Acreage in Illinois in 2018"}
ggplot(IL_corn_planted) +
  geom_sf(aes(fill = Value/1000)) +
  scale_fill_distiller(name = "Planted Acreage (1000 acres)", palette = "YlOrRd", trans = "reverse") +
  theme(
    legend.position = "bottom"
  ) +
  theme_for_map
```

---

Let's import the U.S. railroad data and reproject to the CRS of `IL_corn_planted`:

```{r Demo5_rail, dependson = "map_il_corn_acreage"}
rail_roads <- st_read(dsn = "./Data/", layer = "tl_2015_us_rails") %>% 
  st_transform(st_crs(IL_corn_planted))
```

Here is what it looks like:

```{r Demo5-rail-plot, fig.cap = "Map of Railroads"}
ggplot(rail_roads) +
  geom_sf() +
  theme_for_map
```

We now crop it to the Illinois state border (Figure \@ref(fig:Demo5-rail-IL-plot)) using `sf_1[sf_2, ]`:

```{r crop_to_IL_run, echo = FALSE, dependson = "map_il_corn_acreage"}
rail_roads_IL <- rail_roads[IL_corn_planted, ]
```

```{r crop_to_IL, eval = FALSE, dependson = "map_il_corn_acreage"}
rail_roads_IL <- rail_roads[IL_corn_planted, ]
```

```{r Demo5-rail-IL-plot, fig.cap = "Map of railroads in Illinois"}
ggplot() +
  geom_sf(data = rail_roads_IL) +
  theme_for_map
```
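
One thing to keep in mind about `sf_1[sf_2, ]` is that it selects the features of `sf_1` that intersect with `sf_2` as they are, without cutting them at the boundary of `sf_2`. Here is a minimal sketch with one made-up square and two made-up lines that shows this (all names are hypothetical).

```r
library(sf)

#--- a unit square standing in for a state ---#
poly_sf <- st_sf(
  name = "demo_state",
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
  ))))
)

#--- one line passing through the square and one line far away from it ---#
lines_sf <- st_sf(
  line_id = c("crossing", "outside"),
  geometry = st_sfc(
    st_linestring(rbind(c(-1, 0.5), c(2, 0.5))),
    st_linestring(rbind(c(3, 3), c(4, 4)))
  )
)

#--- spatial subsetting: keep the lines that intersect with the square ---#
lines_in <- lines_sf[poly_sf, ]

#--- the selected line keeps its full length (3), including the parts outside ---#
st_length(lines_in)
```

This is why railroad segments that stick out of the state boundary can remain after subsetting, and why `st_intersection()` is the tool to use when the lines need to be cut at polygon boundaries.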

Let's now find railroads for each county, where cross-county railroads will be chopped into pieces so each piece fits completely within a single county, using `st_intersection()`.

```{r intersect_rails, dependson = "map_il_corn_acreage"}
rails_IL_segmented <- st_intersection(rail_roads_IL, IL_corn_planted) 
```

Here are the railroads for Richland County:

```{r map_seg_rail, dependson = "intersect_rails", fig.height = 6}
ggplot() + 
  geom_sf(data = dplyr::filter(IL_corn_planted, NAME == "Richland")) +
  geom_sf(data = dplyr::filter(rails_IL_segmented, NAME == "Richland"), aes( color = LINEARID )) +
  theme(
    legend.position = "bottom"
  ) +
  theme_for_map
```

We now calculate the travel distance (Great-circle distance) of each railroad piece using `st_length()` and then sum them up by county to find total railroad length by county.

```{r rail_total_county}
(
rail_length_county <- mutate(
    rails_IL_segmented, 
    length_in_m = as.numeric(st_length(rails_IL_segmented)),
  ) %>% 
  #--- group by county ID ---#
  group_by(county_code) %>% 
  #--- sum rail length by county ---#
  summarize(length_in_m = sum(length_in_m)) %>% 
  #--- geometry no longer needed ---#
  st_drop_geometry()
)
```
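
Here is a minimal sketch of the chop-then-sum logic on made-up data: two adjacent unit squares stand in for counties, and a single line crosses their shared border. There is no CRS here, so `st_length()` returns planar lengths in the units of the coordinates. All names are hypothetical.

```r
library(sf)

#--- helper that makes a unit square whose lower-left corner is (xmin, 0) ---#
square <- function(xmin) {
  st_polygon(list(rbind(
    c(xmin, 0), c(xmin + 1, 0), c(xmin + 1, 1), c(xmin, 1), c(xmin, 0)
  )))
}

#--- two adjacent "counties" ---#
counties <- st_sf(county = c("A", "B"), geometry = st_sfc(square(0), square(1)))

#--- one "railroad" of length 1.5 that crosses the county border at x = 1 ---#
rail <- st_sf(
  rail_id = "r1",
  geometry = st_sfc(st_linestring(rbind(c(0.25, 0.5), c(1.75, 0.5))))
)

#--- chop the line at the county borders: one piece per county ---#
pieces <- st_intersection(rail, counties)
pieces$length <- as.numeric(st_length(pieces))

#--- total length by county ---#
length_by_county <- tapply(pieces$length, pieces$county, sum)
```

Each county ends up with a piece of length 0.75, and the pieces add up to the length of the original line.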

---

We merge the railroad length data to the corn planted acreage data and estimate the model.

```{r merge_data}
reg_data <- left_join(IL_corn_planted, rail_length_county, by = "county_code") 
```

```{r results = "asis"}
lm(Value ~ length_in_m, data = reg_data) %>% 
  stargazer(type = "html")
```

## Demonstration 5: Groundwater use for agricultural irrigation

```{r table_style_demo5, results="asis", echo=FALSE}
cat("
<style>
.book .book-body .page-wrapper .page-inner section.normal table
{
  width:auto;
}
.book .book-body .page-wrapper .page-inner section.normal table td,
.book .book-body .page-wrapper .page-inner section.normal table th,
.book .book-body .page-wrapper .page-inner section.normal table tr
{
  padding:0;
  border:0;
  background-color:#fff;
}
</style>
")
```

### Project Overview

---

**Objective**

+ Understand the impact of monthly precipitation on groundwater use for agricultural irrigation

---

**Datasets**

+ Annual groundwater pumping by irrigation wells in Kansas for 2010 and 2011 (originally obtained from the Water Information Management & Analysis System (WIMAS) database)
+ Daymet^[[Daymet website](https://daymet.ornl.gov/)] daily precipitation and maximum temperature downloaded using `daymetr` package

---

**Econometric Model**

The econometric model we would like to estimate is:

$$
   y_{i,t}  = \alpha +  P_{i,t} \beta + T_{i,t} \gamma + \phi_i + \eta_t + v_{i,t}
$$

where $y_{i,t}$ is the total groundwater extracted by well $i$ in year $t$, $P_{i,t}$ and $T_{i,t}$ are the collections of monthly total precipitation and monthly mean maximum temperature from April through September in year $t$, respectively, $\phi_i$ is the well fixed effect, $\eta_t$ is the year fixed effect, and $v_{i,t}$ is the error term. 

---

**GIS tasks**

+ download Daymet precipitation and maximum temperature data for each well from within R in parallel
  * use `daymetr::download_daymet()` and `future.apply::future_lapply()`

---

### Project Demonstration

We have already collected annual groundwater pumping data by irrigation wells in 2010 and 2011 in Kansas from the Water Information Management & Analysis System (WIMAS) database. Let's read in the groundwater use data.  

```{r Demo5_gw}
#--- read in the data ---#
(
gw_KS_sf <- readRDS( "./Data/gw_KS_sf.rds") 
)
```

We have `r length(unique(gw_KS_sf$well_id))` wells in total, and each well has records of groundwater pumping (`af_used`) for years 2010 and 2011. Here is the spatial distribution of the wells. 

```{r Demo5_map, echo = FALSE}
KS_counties <- readRDS("./Data/KS_county_borders.rds")

tm_shape(KS_counties) +
  tm_polygons() +
tm_shape(gw_KS_sf) +
  tm_symbols(size = 0.05, col = "black")
```

<!-- 
#=========================================
# Daymet data download and processing 
#=========================================
-->

--- 

We now need to get monthly precipitation and maximum temperature data. We have decided to use [Daymet](https://daymet.ornl.gov/) weather data. Here we use the `download_daymet()` function from the `daymetr` package^[[daymetr vignette](https://cran.r-project.org/web/packages/daymetr/vignettes/daymetr-vignette.html)] that allows us to download all the weather variables for a specified geographic location and time period^[See [here]() for a fuller explanation of how to use the `daymetr` package.]. We write a wrapper function that downloads Daymet data and then processes it to find monthly total precipitation and mean maximum temperature^[This may not be ideal for a real research project because the original raw data is not kept. It is often the case that your econometric plan changes over the course of your project (e.g., using other weather variables or using different temporal aggregation of weather variables instead of monthly aggregation). When this happens, you need to download the same data all over again.]. We then loop over the `r length(unique(gw_KS_sf$well_id))` wells, which is parallelized using the `future_lapply()` function^[For parallelized computation, see Chapter \@ref(prallel)] from the `future.apply` package. This process takes about an hour on my Mac with parallelization on 7 cores. The data is available in the data repository for this course (named as "all_daymet.rds"). 

```{r Demo5_getdaymet}
library(daymetr)
library(future.apply)

#--- get the geographic coordinates of the wells ---#
well_locations <- gw_KS_sf %>%
  unique(by = "well_id") %>% 
  dplyr::select(well_id) %>% 
  cbind(., st_coordinates(.))

#--- define a function that downloads Daymet data by well and process it ---#
get_daymet <- function(i) {

  temp_site <- well_locations[i, ]$well_id
  temp_long <- well_locations[i, ]$X
  temp_lat <- well_locations[i, ]$Y

  data_temp <- download_daymet(
      site = temp_site,
      lat = temp_lat,
      lon = temp_long,
      start = 2010,
      end = 2011,
      #--- if TRUE, tidy data is returned ---#
      simplify = TRUE,
      #--- if TRUE, the downloaded data can be assigned to an R object ---#
      internal = TRUE
    ) %>% 
    data.table() %>% 
    #--- keep only precip and tmax ---#
    .[measurement %in% c("prcp..mm.day.", "tmax..deg.c."), ] %>%  
    #--- recover calendar date from the day of the year ---#
    .[, date := as.Date(paste(year, yday, sep = "-"), "%Y-%j")] %>% 
    #--- get month ---#
    .[, month := month(date)] %>% 
    #--- keep only April through September ---#
    .[month %in% 4:9,] %>% 
    .[, .(site, year, month, date, measurement, value)] %>% 
    #--- long to wide ---#
    dcast(site + year + month + date~ measurement, value.var = "value") %>% 
    #--- change variable names ---#
    setnames(c("prcp..mm.day.", "tmax..deg.c."), c("prcp", "tmax")) %>% 
    #--- find the total precip and mean tmax by month-year ---#
    .[, .(prcp = sum(prcp), tmax = mean(tmax)) , by = .(month, year)] %>% 
    .[, well_id := temp_site]

  #--- free up memory before returning (code after return() is never run) ---#
  gc()

  return(data_temp)
}
```

Here is what one run of `get_daymet()` (for the first well) returns:

```{r Demo5_one_run}
#--- one run ---#
(
returned_data <- get_daymet(1)[]
)
```
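
The date handling inside `get_daymet()` can be checked without downloading anything. The sketch below uses made-up daily values (the data frame `toy_daily` and its numbers are hypothetical, not Daymet output) and base R only. It recovers the calendar date from the year and the day of the year, extracts the month, and then aggregates daily precipitation to monthly totals.

```{r }
#--- hypothetical daily data for 2010: three days in April and two in May ---#
toy_daily <- data.frame(
  year = 2010,
  yday = c(91, 92, 93, 121, 122),
  prcp = c(0, 5, 2, 10, 0)
)

#--- recover calendar date from year and day of the year ---#
toy_daily$date <- as.Date(paste(toy_daily$year, toy_daily$yday, sep = "-"), "%Y-%j")

#--- get month as an integer ---#
toy_daily$month <- as.integer(format(toy_daily$date, "%m"))

#--- total precipitation by year-month ---#
toy_monthly <- aggregate(prcp ~ year + month, data = toy_daily, FUN = sum)

toy_monthly
```

Since 2010 is not a leap year, day 91 is April 1 and day 121 is May 1, so the three April values and the two May values are summed separately.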

We get the number of cores you can use by `RhpcBLASctl::get_num_procs()` and parallelize the loop over wells using `future_lapply()`.^[For Mac users, `mclapply` or `pbmclapply` (`mclapply` with progress bar) are good alternatives.]

```{r eval = FALSE}
#--- prepare parallelized process ---#
library(RhpcBLASctl) 
num_core <- get_num_procs() - 1

#--- without plan(), future_lapply() runs sequentially ---#
plan(multisession, workers = num_core)

#--- run get_daymet with parallelization ---#
(
all_daymet <- future_lapply(1:nrow(well_locations), get_daymet) %>% 
  rbindlist() 
)
```

```{r all_daymet_read, echo = FALSE}
all_daymet <- readRDS("./Data/all_daymet.rds")
all_daymet
```

---

Before merging the Daymet data, we need to reshape the data into a wide format to get monthly precipitation and maximum temperature as columns.  

```{r Demo5_long_wide}
#--- long to wide ---#
daymet_to_merge <- dcast(all_daymet, well_id + year ~ month, value.var = c("prcp", "tmax"))

#--- take a look ---#
daymet_to_merge
```
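
The column names used in the regression below (`prcp_4`, `tmax_4`, and so on) come from how `dcast()` names the new columns when more than one `value.var` is supplied. Here is a small sketch on made-up data (`toy_long` is hypothetical) that shows the naming pattern.

```{r }
library(data.table)

#--- hypothetical long data: one well, one year, two months ---#
toy_long <- data.table(
  well_id = 1,
  year = 2010,
  month = c(4, 5),
  prcp = c(50, 80),
  tmax = c(18, 24)
)

#--- long to wide: one column per variable-month pair ---#
toy_wide <- dcast(toy_long, well_id + year ~ month, value.var = c("prcp", "tmax"))

#--- column names are "variable name"_"month" ---#
names(toy_wide)
```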

Now, let's merge the weather data to the groundwater pumping dataset.

```{r Demo5_merge_weather}
(
reg_data <- data.table(gw_KS_sf) %>% 
  #--- keep only the relevant variables ---#
  .[, .(well_id, year, af_used)] %>% 
  #--- join ---#
  daymet_to_merge[., on = c("well_id", "year")]
)
```

---

Let's run the regression and display the results.

```{r Demo5_reg_display, results = "asis"}
#--- load the lfe and stargazer packages ---#
library(lfe)
library(stargazer)

#--- run FE ---#
reg_results <- felm(
  af_used ~ 
  prcp_4 + prcp_5 + prcp_6 + prcp_7 + prcp_8 + prcp_9 +
  tmax_4 + tmax_5 + tmax_6 + tmax_7 + tmax_8 + tmax_9
  |well_id + year| 0 | well_id,
  data = reg_data
)

#--- display regression results ---#
stargazer(reg_results, type = "html")
```

That's it. Do not bother trying to read into the regression results. Again, this is just an illustration of how R can be used to prepare a regression-ready dataset with spatial variables.  


<!--chapter:end:Demonstration.Rmd-->

--- 
title: "Download Publicly Accessible Spatial Datasets using R"
author: "Taro Mieno"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html: 
    number_sections: yes
    toc_float: yes
    toc: yes
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
# bibliography: skeleton.bib
link-citations: yes
---

# Introduction

There are many publicly available spatial datasets that can be downloaded using R. Programming the download in R instead of manually downloading data from websites can save a lot of time and also enhances the reproducibility of your analysis. In this section, we will introduce some of these datasets and show how to download them.  

## USDA NASS QuickStat (`tidyUSDA`)

There are several packages that let you download data from the USDA NASS QuickStat. Here we use the `tidyUSDA` package. A nice thing about [tidyUSDA](https://github.com/bradlindblad/tidyUSDA) is that it gives you an option to download data as an `sf` object, which means you can immediately visualize the data or spatially interact it with other spatial objects.

The first thing you want to do is to get an API key from this [website](https://quickstats.nass.usda.gov/api). We need an API key to actually download data. 


```{r }
library(tidyUSDA)  
library(ggplot2)
library(dplyr)
library(keyring)  
```

You can download data using `getQuickstat()`. There are a number of options you can use to narrow down the scope of the data you are downloading, including `data_item`, `geographic_level`, `year`, `commodity`, and so on. See its [manual](https://cran.r-project.org/web/packages/tidyUSDA/tidyUSDA.pdf) for the full list of parameters you can set. As an example, the code below downloads corn-related data by county in Illinois for year 2016 as an `sf` object.

```{r il_corn_download}
(
IL_corn_yield <- getQuickstat(
    key = key_get("usda_nass_qs_api"),
    program = "SURVEY",
    commodity = "CORN",
    geographic_level = "COUNTY",
    state = "ILLINOIS",
    year = "2016",
    geometry = TRUE
  )  %>% 
  #--- keep only some of the variables ---#
  dplyr::select(year, NAME, county_code, short_desc, Value)
)
```

As you can see, it is an `sf` object with a `geometry` column because of the `geometry = TRUE` option. This means that you can immediately create a map with the data (Figure \@ref(fig:corn-yield-IL)):

```{r corn-yield-IL, fig.cap = "Corn Yield (bu/acre) in Illinois in 2016"}
ggplot() +
  geom_sf(
    data = filter(IL_corn_yield, short_desc == "CORN, GRAIN - YIELD, MEASURED IN BU / ACRE"), 
    aes(fill = Value)
  )
```

You can download data for multiple states and years at the same time like below (of course, if you want the whole U.S., don't specify the state parameter.). 

```{r il_co_necorn_download}
(
IL_CO_NE_corn <- getQuickstat(
    key = key_get("usda_nass_qs_api"),
    program = "SURVEY",
    commodity = "CORN",
    geographic_level = "COUNTY",
    state = c("ILLINOIS", "COLORADO", "NEBRASKA"),
    year = paste(2014:2018),
    geometry = TRUE
  ) %>% 
  #--- keep only some of the variables ---#
  dplyr::select(year, NAME, county_code, short_desc, Value)
)
```


### Look for parameter values

This package comes with objects that list all the possible values you can use for many of these parameters. For example, suppose you know you would like irrigated corn yield in Colorado, but you are not sure what parameter value (string) you should supply to the `data_item` parameter. Then, you can do this:^[Of course, you can alternatively go to the QuickStat website and look for the text values.]

```{r see_values_dataitem}
#--- get all the possible values for data_item ---#
all_items <- tidyUSDA::allDataItem 

#--- take a look at the first six ---#
head(all_items)
```

You can use key words like "CORN", "YIELD", "IRRIGATED" to narrow the entire list: 

```{r }
all_items %>% 
  grep(pattern = "CORN", ., value = TRUE) %>% 
  grep(pattern = "YIELD", ., value = TRUE) %>% 
  grep(pattern = "IRRIGATED", ., value = TRUE)  
```

Looking at the list, we know the exact text value we want, which is the first entry of the vector.

```{r co_corn_download}
(
CO_ir_corn_yield <- getQuickstat(
    key = key_get("usda_nass_qs_api"),
    program = "SURVEY",
    data_item = "CORN, GRAIN, IRRIGATED - YIELD, MEASURED IN BU / ACRE",
    geographic_level = "COUNTY",
    state = "COLORADO",
    year = "2018",
    geometry = TRUE
  ) %>% 
  #--- keep only some of the variables ---#
  dplyr::select(year, NAME, county_code, short_desc, Value)
)
```

Here is the complete list of objects that give you the possible values of the parameters for `getQuickstat()`. 

```{r eval = F}
tidyUSDA::allCategory 
tidyUSDA::allSector 
tidyUSDA::allGroup 
tidyUSDA::allCommodity 
tidyUSDA::allDomain 
tidyUSDA::allCounty 
tidyUSDA::allProgram 
tidyUSDA::allDataItem 
tidyUSDA::allState 
tidyUSDA::allGeogLevel 
```

### Caveats

You cannot retrieve more than $50,000$ rows of data at a time (the limit is set by QuickStat). The query below requests far more than $50,000$ observations and fails. In this case, you need to narrow the search and chop the task into smaller tasks. 

```{r cav_1, error = TRUE}
many_states_corn <- getQuickstat(
    key = key_get("usda_nass_qs_api"),
    program = "SURVEY",
    commodity = "CORN",
    geographic_level = "COUNTY",
    state = c("ILLINOIS", "COLORADO", "NEBRASKA", "IOWA", "KANSAS"),
    year = paste(1995:2018),
    geometry = TRUE
  ) 
```
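
One way to chop the task is to loop over state-year pairs so that each request stays below the limit, and then stack the results. The helper below is only a sketch: `download_in_chunks()` and its `query_fun` argument are names made up for illustration, and a stand-in function (`fake_query()`) replaces the actual `getQuickstat()` call so that the logic can be run without an API key.

```{r }
#--- hypothetical helper: run one query per state-year pair and stack the results ---#
download_in_chunks <- function(states, years, query_fun) {

  #--- all the state-year combinations ---#
  tasks <- expand.grid(state = states, year = years, stringsAsFactors = FALSE)

  #--- one (small) query per row of tasks ---#
  results <- lapply(
    seq_len(nrow(tasks)),
    function(i) query_fun(state = tasks$state[i], year = tasks$year[i])
  )

  #--- combine the list of results into a single data.frame ---#
  do.call(rbind, results)
}

#--- stand-in for a real query (returns one row per request) ---#
fake_query <- function(state, year) {
  data.frame(state = state, year = year, Value = NA_real_)
}

chunked_data <- download_in_chunks(
  states = c("ILLINOIS", "COLORADO"),
  years = c("2017", "2018"),
  query_fun = fake_query
)

chunked_data
```

In practice, `query_fun` would be a small wrapper around `getQuickstat()` that passes `state` and `year` through and keeps the other arguments fixed as in the query above.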

Another caveat is that the query returns an error when there are no observations that satisfy your query criteria. For example, even though "CORN, GRAIN, IRRIGATED - YIELD, MEASURED IN BU / ACRE" is one of the values you can use for `data_item`, there is no entry for the statistic in Illinois in 2018. Therefore, the following query fails.

```{r cav_2, error = TRUE}
many_states_corn <- getQuickstat(
    key = key_get("usda_nass_qs_api"),
    program = "SURVEY",
    data_item = "CORN, GRAIN, IRRIGATED - YIELD, MEASURED IN BU / ACRE",
    geographic_level = "COUNTY",
    state = "ILLINOIS",
    year = "2018",
    geometry = TRUE
  ) 
```
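
When you loop over many queries, a single empty query would stop the whole loop. A common way around this is to wrap each query in `tryCatch()` and return `NULL` on failure. The sketch below uses made-up names (`try_query()`, `failing_query()`, and `working_query()`), with stand-in functions in place of actual `getQuickstat()` calls.

```{r }
#--- hypothetical wrapper: return NULL instead of stopping when a query fails ---#
try_query <- function(query_fun, ...) {
  tryCatch(
    query_fun(...),
    error = function(e) NULL
  )
}

#--- stand-in for a query that has no matching observations ---#
failing_query <- function(state) {
  stop("no data for ", state)
}

#--- stand-in for a query that succeeds ---#
working_query <- function(state) {
  data.frame(state = state, Value = 1)
}

query_results <- list(
  try_query(failing_query, state = "ILLINOIS"),
  try_query(working_query, state = "COLORADO")
)

#--- NULL elements are ignored when the results are stacked ---#
do.call(rbind, query_results)
```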


## PRISM

But, we are just interested in Kansas. In order to avoid carrying around unnecessary data while processing them and make processing faster, we can crop only the portion that overlaps with Kansas.  

```{r }
#--- get the geographic extent of the groundwater data ---#
extent_KS <- extent(gw_KS_sf_2010)

#--- read PRISM data and crop the PRISM data based on the extent ---#
prism_KS <- raster("./Data/PRISM/PRISM_ppt_stable_4kmD2_20100401_bil/PRISM_ppt_stable_4kmD2_20100401_bil.bil") %>% 
  crop(., extent_KS)
```

Here is the map of the cropped precipitation data superimposed on the Kansas county boundaries.

```{r echo = FALSE}
tm_shape(KS_counties) +
  tm_polygons(col = NA) +
tm_shape(prism_KS) +
  tm_raster(title = "", alpha = 0.4) 
```

We now try to extract the precipitation value by day for each of the wells by geographically overlaying the wells onto the PRISM data and identifying which PRISM grid cell each well falls within.  

```{r echo = FALSE}
tm_shape(prism_KS) +
  tm_raster(title = "") +
tm_shape(gw_KS_sf_2010) +
  tm_symbols(size = 0.05)
```

## Daymet

```{r }
library(daymetr)

#--- tiles ---#
tm_shape(tile_outlines) +
  tm_polygons() +
tm_shape(st_transform(IA_grids, st_crs(tile_outlines))) +
  tm_polygons(col = "red") 
```

```{r }
tiles_to_get <- tile_outlines %>% 
  st_as_sf() %>% 
  .[st_transform(IA_grids, st_crs(tile_outlines)), ] %>% 
  .$TileID
```

Note that when you download all the Daymet tiles data following the code below, the total size of the downloaded files will be about 3GB.

```{r }
library(daymetr)

query_par_data <- expand.grid(year = 2012:2014, param = c("tmax", "prcp"), tile = tiles_to_get) 

get_daymet_tiles <- function(i) {

  year <- query_par_data[i, "year"]
  param <- query_par_data[i, "param"]
  tile <- query_par_data[i, "tile"]

  download_daymet_tiles(
    tiles = tile, 
    start = year,
    end = year,
    path = "./Data/Daymet",
    param = param
  )

}

plan(multiprocess, workers = 10)  

future_lapply(1:nrow(query_par_data), get_daymet_tiles)
``` 

```{r }

daymet_12 <- stack("./Data/Daymet/prcp_2012_11742.nc", varname = "prcp")

daymet_12@layers[[1]] %>%  plot
daymet_12@layers[[80]] %>%  plot
daymet_12@layers[[140]] %>% getValues() %>%  hist

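#--- incomplete step (left-hand side missing), commented out so that the chunk can be parsed ---#
#  %>% 
#   brick(quick =TRUE)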

plot(daymet_12)

library(ncdf4)

daymet_12 <- nc_open("./Data/Daymet/prcp_2012_11742.nc") 

daymet_12

lon <- ncvar_get(daymet_12, "x")
dim(lon)

lat <- ncvar_get(daymet_12, "y")
dim(lat)

time <- ncvar_get(daymet_12, "time")
dim(time)


prcp <- ncvar_get(daymet_12, "prcp")

prcp[, , 1] %>%  as.vector %>%  hist


daymet_12 <- raster("./Data/Daymet/prcp_2012_11744.nc", varname = "prcp", band = 6)

plot(daymet_12)

IA_grids <- st_transform(IA_grids, projection(daymet_12)) 

data <- exact_extract(daymet_12, IA_grids) %>% rbindlist(idcol = TRUE)

plot(daymet_12)
plot(IA_grids, add=TRUE)


# library("maps")
# states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
# counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
```

## USGS

[USGS R](https://owi.usgs.gov/R/index.html)

### Groundwater level data

```{r }
library(dataRetrieval)
state_ls <- state.abb

data_ne <- whatNWISdata(stateCd = "NE")

KS_gwl <- readNWISdata(
    stateCd="Kansas", 
    startDate = "1980-01-01", 
    endDate = "2020-01-01", 
    service = "gwlevels"
  ) %>% 
  select(site_no, lev_dt, lev_va) %>% 
  rename(date = lev_dt, dwt = lev_va)

KS_site_ls <- unique(KS_gwl$site_no)

sites_info <- readNWISsite(siteNumbers = KS_site_ls) %>% 
  select(site_no, dec_lat_va, dec_long_va) %>% 
  st_as_sf(coords = c("dec_long_va", "dec_lat_va")) %>% 
  st_set_crs(4269)
```

### Nitrogen concentration

```{r }

SaltLake_totalN <- readNWISdata(
  bBox = c(-113.0428, 40.6474, -112.0265, 41.7018), 
  service = "qw", 
  parameterCd = "00600", 
  startDate = "2020-01-01"
)

attributes(SaltLake_totalN)

attr(SaltLake_totalN, "variableInfo")


#--- USGS parameter codes ---#
# total nitrogen: 00600
# phosphorus: 00665
```


### Water temperature
```{r }
MauiCo_avgdailyQ <- readNWISdata(
  stateCd = "Hawaii", 
  service = "dv", 
  #--- 00060 is the USGS parameter code for discharge; water temperature is 00010 ---#
  parameterCd = "00060"
)

head(MauiCo_avgdailyQ)
```

### WQP 

[WQP user guide](https://www.waterqualitydata.us/portal_userguide/)

[WQP query](https://www.waterqualitydata.us/webservices_documentation/#WQPWebServicesGuide-Submitting)

```{r }
wqpcounts_ia <- readWQPdata(
  statecode="US:19", 
  querySummary = TRUE
)

IA_lake_phosphorus <- readWQPdata(
  statecode = "IA", 
  siteType = "Lake, Reservoir, Impoundment", 
  characteristicName = "Phosphorus", 
  startDate = "1990-10-01" 
) %>% 
select(MonitoringLocationIdentifier, ResultMeasureValue, ActivityStartDate)

#--- get longitude and latitude ---#
siteInfo <- attr(IA_lake_phosphorus, "siteInfo") %>% 
  select(MonitoringLocationIdentifier, dec_lat_va, dec_lon_va) %>% 
  st_as_sf(coords = c("dec_lon_va", "dec_lat_va")) %>% 
  st_set_crs(4269)

attributes(IA_lake_phosphorus)
```

## Daymet


<!-- ## Crop Data Layer   -->


## Water quality 

[USGS online ](https://owi.usgs.gov/R/training-curriculum/usgs-packages/)


## Sentinel 2


## NOAA (rnoaa)


<!--chapter:end:DownloadSpatialData.Rmd-->

# Website url

[R as GIS for Economists](https://tmieno2.github.io/R-as-GIS-for-Economists/)

# Publish on Github 

+ Create a repository on Github
+ link the account to the folder using atom
+ commit and push
+ change the setting on Github so that master/docs becomes the source (see [here](https://bookdown.org/yihui/bookdown/github.html)) 

# Files not to push to Github
+ this file
+ knit_rmd.R
+ Codes/

# Files
+ all the files originate from Box/Teaching/AAEA R/GIS

# Pandoc (html to doc) for editing

pandoc -f html -t docx -o temp.docx .html

<!--chapter:end:Notes.Rmd-->

# Loop and Parallel Computing


```{r library_02,echo=FALSE,warning=FALSE}
suppressMessages(library(data.table))
suppressMessages(library(nycflights13))
suppressMessages(library(maptools))
suppressMessages(library(tictoc))
suppressMessages(library(raster))
suppressMessages(library(broom))
suppressMessages(library(tidyverse))
suppressMessages(library(stargazer))
suppressMessages(library(microbenchmark))
```

```{r chunk_set_02,echo=FALSE}
library(knitr)
opts_chunk$set(
  echo= TRUE,
  comment = NA,
  cache = TRUE, 
  message = FALSE,
  warning = FALSE,
  tidy=FALSE,
  #--- figure related ---#
  fig.align='center',
  fig.width=5,
  fig.height=4
  # dev='pdf'
  )
```


## Introduction

In this lecture, we will learn how to program repetitive operations effectively and fast. We sometimes need to run the same process over and over again, often with slight changes in parameters. In such a case, it is very inefficient (time-consuming) and messy to write all of the steps one by one. For example, suppose you are interested in knowing the squares of 1 through 5 with a step of 1 ($[1, 2, 3, 4, 5]$). The following code certainly works:

```{r tedious}
1^2 
2^2 
3^2 
4^2 
5^2 
``` 

However, imagine you have to do this for 1000 integers. Yes, you don't want to write each one of them one by one as that would occupy 1000 lines of your code. Moreover, it's just tedious to do so. Things will be even worse if you need to repeat much more complicated processes like Monte Carlo simulations. So, let's learn how to write a program to do repetitive jobs effectively and fast. 

Here are the specific learning objectives of this chapter.

1. Learn how to use for loop and `lapply()` to complete repetitive jobs 
2. Learn how not to loop things that can be easily vectorized
3. Learn how to parallelize repetitive jobs using the `future_lapply()` function from the `future.apply` package

## What is loop? 

Looping means repeatedly evaluating the same process, with only the parameters changing from one run to the next. In the example above, the **same** process is the action of squaring. This does not change among the processes you run. What changes is what you square. There is a very efficient way of coding such processes. 

### For loop

Here is how **for loop** works in general:

```{r loop_explain, eval = FALSE}
for (x in a_list_of_values){
  you do what you want to do on x
}
```

As an example, let's use this looping syntax to do the same as above:

```{r loop}
for (x in 1:5){
  print(x^2)
}
```

Here, a list of values is $1:5$ ($[1, 2, 3, 4, 5]$). For each value of the list, you square it ($x^2$) and then print it (`print()`). If you want to get the square of $1:1000$, the only thing you need to change is the list of values as in:

```{r loop_more, eval=FALSE}
#--- evaluation not reported as it's too long ---#
for (x in 1:1000){
  print(x^2)
}
```

So, the length of the code does not depend on how many repeats you do, which is an obvious improvement over manual typing of every single process one by one. Note that you do not have to use $x$ to refer to an object you are going to use. It could be any combination of letters as long as you use it when you code what you want to do inside the loop. So, this would work just fine,

```{r silly_ex}
for (bluh_blugh_bluh in 1:5){
  print(bluh_blugh_bluh^2)
}
```
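
The loops above only print the results. If you want to keep them for later use, a common pattern is to create a container of the right length first and fill it inside the loop. Here is a minimal sketch (`squares` is just a name chosen for illustration):

```{r }
#--- create an empty numeric vector of length 5 ---#
squares <- numeric(5)

#--- fill the i-th element with the square of i ---#
for (i in 1:5){
  squares[i] <- i^2
}

squares
```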

### While loop

Another type of loop is the **while loop**. This is useful when you do not have a specific list of values to loop over a priori and want certain conditions to be satisfied before the repeated process is terminated. We do not talk about this loop much here. I will just provide one example that illustrates how this type of loop works. 

```{r while}
x <- 0
while (x < 6){
  print(x^2)
  x <- x + 1
}
```
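
The example above could just as well be written as a for loop. A while loop becomes more natural when you cannot tell in advance how many iterations you need. As a simple illustration (the variable names are made up), the following code keeps doubling a number until it exceeds 1000 and counts how many doublings it took.

```{r }
value <- 1
num_doublings <- 0

#--- keep doubling until the value exceeds 1000 ---#
while (value <= 1000){
  value <- value * 2
  num_doublings <- num_doublings + 1
}

#--- number of doublings needed and the final value ---#
num_doublings
value
```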

### Looping using the `lapply()` function

You can do looping using the `lapply()` function^[`lapply()` is only one of the family of `apply()` functions. We do not talk about other types of `apply()` functions here (e.g., `apply()`, `sapply()`, `mapply()`, `tapply()`). Personally, I found myself only rarely using them. But, if you are interested in learning those, take a look at [here](https://www.datacamp.com/community/tutorials/r-tutorial-apply-family#gs.b=aW_Io) or [here](https://www.r-bloggers.com/using-apply-sapply-lapply-in-r/).]. In general, here is how it works:

$$
lapply(A,B)
$$

where $A$ is the list of values you go through one by one in the order the values are stored, and $B$ is the function you would like to apply to each of the values in $A$. For example, the following code does exactly the same thing as the above for loop example.

```{r lapply}
lapply(1:5, function(x){x^2})
```

Here, $A$ is $[1, 2, 3, 4, 5]$. In $B$ you have a function that takes $x$ and squares it. So, the above code applies the function to each of $[1, 2, 3, 4, 5]$ one by one. In many circumstances, you can write the same looping actions in a much more concise manner using the `lapply()` function than explicitly writing out the loop process as in the above for loop examples. You might have noticed that the output is a list. Yes, `lapply()` returns the outcomes in a list. That is where the **l** in `lapply()` comes from.  

When the operation you would like to repeat becomes complicated (almost always the case), it is advisable that you create a function of that process first. 

```{r def_fcn}
#--- define the function first ---#
square_it <- function(x){
  return(x^2)
}

#--- lapply using the pre-defined function ---#
lapply(1:5, square_it)
```
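
If you would rather have a vector than a list, you can either `unlist()` the output of `lapply()` or use `sapply()`, which simplifies the output to a vector when it can. Here is a short sketch with the same squaring operation (`sq_list` is a name chosen for illustration):

```{r }
#--- lapply() returns a list ---#
sq_list <- lapply(1:5, function(x) x^2)
class(sq_list)

#--- flatten the list into a numeric vector ---#
unlist(sq_list)

#--- sapply() simplifies the output to a vector when possible ---#
sapply(1:5, function(x) x^2)
```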
  

### Looping over multiple variables using lapply()  

`lapply()` allows you to loop over only one variable. However, it is often the case that you want to loop over multiple variables^[The `map2()` function from the `purrr` package allows you to loop over two variables.]. This is easy to achieve. The trick is to create a `data.frame` of the variables where the complete list of the combinations of the variables is stored, and then loop over the rows of the `data.frame`. As an example, suppose we are interested in understanding the sensitivity of corn revenue to corn price and applied nitrogen amount. We consider the range of \$3.0/bu to \$5.0/bu for corn price and 0 lb/acre to 300 lb/acre for applied nitrogen. 

```{r define_vectors}
#--- corn price vector ---#
corn_price_vec <- seq(3, 5, by = 1)

#--- nitrogen vector ---#
nitrogen_vec <- seq(0, 300, by = 100)
```

After creating vectors of the parameters, you combine them to create a complete combination of the parameters using the `expand.grid()` function, and then convert it to a `data.frame` object^[Converting to a `data.frame` is not strictly necessary.].

```{r param_mat}
#--- create a data.frame that holds parameter sets to loop over ---#
parameters_data <- expand.grid(corn_price = corn_price_vec, nitrogen = nitrogen_vec) %>% 
  #--- convert the matrix to a data.frame ---#
  data.frame()

#--- take a look ---#
parameters_data
```

We now define a function that takes a row number, refers to `parameters_data` to extract the parameters stored at the row number, and then calculates corn yield and revenue based on the extracted parameters. 

```{r define_rev_function}
  gen_rev_corn <- function(i) {

    #--- define corn price ---#
    corn_price <- parameters_data[i,'corn_price']

    #--- define nitrogen  ---#
    nitrogen <- parameters_data[i,'nitrogen']

    #--- calculate yield ---#
    yield <- 240 * (1 - exp(0.4 - 0.02 * nitrogen))

    #--- calculate revenue ---#
    revenue <- corn_price * yield 

    #--- combine all the information you would like to have  ---#
    data_to_return <- data.frame(
      corn_price = corn_price,
      nitrogen = nitrogen,
      revenue = revenue
    )

    return(data_to_return)
  }
``` 

This function takes $i$ (which acts as a row number within the function) and extracts corn price and nitrogen from the $i$th row of `parameters_data`, which are then used to calculate yield and revenue^[Yield is generated based on the Mitscherlich-Baule functional form. Yield increases at a decreasing rate as you apply more nitrogen, and yield eventually hits the plateau.]. Finally, it returns a `data.frame` of all the information you used (the parameters and the outcomes).

```{r revenue_data}
#--- loop over all the parameter combinations ---#
rev_data <- lapply(1:nrow(parameters_data), gen_rev_corn)

#--- take a look ---#
rev_data
```

Successful! Now, for us to use the outcome for other purposes like further analysis and visualization, we would need to have all the results combined into a single data.frame instead of a list of data.frames. To do this, use either `bind_rows()` from the `dplyr` package or `rbindlist()` from the `data.table` package.

```{r bind_rows}
#--- bind_rows ---#
bind_rows(rev_data)

#--- rbindlist ---#
rbindlist(rev_data)
```
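
As an aside, base R also has `mapply()`, which steps through two or more vectors simultaneously. Combined with `expand.grid()`, it gives another way to loop over parameter combinations. The sketch below is self-contained: it uses small price and nitrogen vectors and the same yield formula as above, and the object name `toy_params` is made up for illustration.

```{r }
#--- all the combinations of corn price and nitrogen ---#
toy_params <- expand.grid(corn_price = c(3, 4, 5), nitrogen = c(0, 100))

#--- loop over the two columns simultaneously ---#
toy_params$revenue <- mapply(
  function(p, n) p * 240 * (1 - exp(0.4 - 0.02 * n)),
  toy_params$corn_price,
  toy_params$nitrogen
)

toy_params
```

Here `mapply()` returns a numeric vector with one revenue value per row, so it can be assigned directly as a new column.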

### Do you really need to loop? 

Actually, in practice we should not have used a for loop or `lapply()` in any of the examples above^[By the way, note that `lapply()` is no magic. It's basically a for loop and not really any faster than a for loop.]. This is because they can be easily vectorized. Vectorized operations are those that take vectors as inputs and work on each element of the vectors in parallel^[This does not mean that the process is parallelized by using multiple cores.]. 

A typical example of a vectorized operation would be this:

```{r vec_1}
#--- define numeric vectors ---#
x <- 1:1000
y <- 1:1000

#--- element wise addition ---#
z_vec <- x + y 
```

A non-vectorized version of the same calculation is this:

```{r }
z_la <- lapply(1:1000, function(i) x[i] + y[i]) %>%  unlist

#--- check if identical with z_vec ---#
all.equal(z_la, z_vec)
```

Both produce the same results. However, R is written in a way that is much better at doing vectorized operations. Let's time them using the `microbenchmark()` function from the `microbenchmark` package. Here, we do not `unlist()` after `lapply()` to just focus on the addition part.

```{r}
library(microbenchmark)

microbenchmark(
  #--- vectorized ---#
  "vectorized" = { x + y }, 
  #--- not vectorized ---#
  "not vectorized" = { lapply(1:1000, function(i) x[i] + y[i])},
  times = 100, 
  unit = "ms"
)
```

As you can see, the vectorized version is faster. The time difference comes from R having to conduct many more internal checks and hidden operations for the non-vectorized one^[See [this](http://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/) or [this](https://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r) to have a better understanding of why non-vectorized operations can be slower than vectorized operations.]. Yes, we are talking about a fraction of a millisecond here. But, as the objects to be operated on get larger, the difference between vectorized and non-vectorized operations can become substantial^[See [here](http://www.win-vector.com/blog/2019/01/what-does-it-mean-to-write-vectorized-code-in-r/) for a good example of such a case. R is often regarded as very slow compared to other popular software. But, many of such claims come from not vectorizing operations that can be vectorized. Indeed, many of the base and old R functions are written in C. More recent functions rely on C++ via the `Rcpp` package.]. 
    
The `lapply()` examples can be easily vectorized.        

Instead of this:

```{r lap_1, eval = FALSE}
lapply(1:1000, square_it)
```

You can just do this:

```{r vec_square, eval = FALSE}
square_it(1:1000)
```

You can also easily vectorize the revenue calculation demonstrated above. First, define the function differently so that revenue calculation can take corn price and nitrogen vectors and return a revenue vector.

```{r define_rev_simple}
  gen_rev_corn_short <- function(corn_price, nitrogen) {

    #--- calculate yield ---#
    yield <- 240 * (1 - exp(0.4 - 0.02 * nitrogen))

    #--- calculate revenue ---#
    revenue <- corn_price * yield 

    return(revenue)
  }
```  

Then use the function to calculate revenue and assign it to a new variable in the `parameters_data` data.

```{r no_need_to_loop}
 rev_data_2 <- mutate(
    parameters_data,
    revenue = gen_rev_corn_short(corn_price, nitrogen)
  ) 
```

Let's compare the two:

```{r }
microbenchmark(

  #--- vectorized ---#
  "vectorized" = { rev_data <- mutate(parameters_data, revenue = gen_rev_corn_short(corn_price, nitrogen)) },
  #--- not vectorized (loop over rows and then stack the results) ---#
  "not vectorized" = { rev_data <- lapply(1:nrow(parameters_data), gen_rev_corn) %>% rbindlist() },
  times = 100, 
  unit = "ms"
)
```

Yes, the vectorized version is faster. So, the lesson here is that if you can vectorize, then vectorize instead of using `lapply()`. But, of course, in many cases things simply cannot be vectorized. 


## Parallelization of embarrassingly parallel processes

Parallelization of computation involves distributing the task at hand to multiple cores so that multiple processes are done in parallel. Here, we learn how to parallelize computation in R. Our focus is on the so-called **embarrassingly** parallel processes. Embarrassingly parallel processes refer to a collection of processes where each process is completely independent of the others. That is, one process does not use the outputs of any of the other processes. The example of integer squaring is embarrassingly parallel. In order to calculate $1^2$, you do not need to use the result of $2^2$ or any other squares. Embarrassingly parallel processes are very easy to parallelize because you do not have to worry about which process needs to be completed first to make other processes happen. Fortunately, most of the processes you are interested in parallelizing fall under this category^[A good example of a non-embarrassingly parallel process is dynamic optimization via backward induction. You need to know the optimal solution at $t = T$ before you find the optimal solution at $t = T-1$.].  

We will use the `future.apply` package for parallelization^[There are many other options including the `parallel`, `foreach` packages.]. Using the package, parallelizing is really a piece of cake as it is basically the same syntactically as `lapply()`. 

```{r load_fapply, message=FALSE, warning=FALSE}
#--- load packages ---#
library(future.apply) 
```

You can find out how many cores you have available for parallel computation on your computer using the `get_num_procs()` function from the `RhpcBLASctl` package.

```{r detect_cores}
library(RhpcBLASctl)

#--- number of all cores ---#
get_num_procs()
```

Before you implement parallelized `lapply()`, we need to declare what backend process we will be using with the `plan()` function. Here, we use `plan(multiprocess)`^[If you are a Mac or Linux user, then the `multicore` process will be used, while the `multisession` process will be used if you are using a Windows machine. The `multicore` process is superior to the `multisession` process. See [this lecture note](https://raw.githack.com/uo-ec607/lectures/master/12-parallel/12-parallel.html) on parallel programming using R by Dr. Grant McDermott at the University of Oregon for the distinctions between the two and many other useful concepts for parallelization. At the time of writing, if you run R through RStudio, the `multiprocess` option is always redirected to the `multisession` option because of the instability of `multicore` processing in RStudio. If you use Linux or Mac and want to take full advantage of `future.apply`, you should run your R programs outside of RStudio, at least for now.]. In the `plan()` function, you can specify the number of workers. Here I will use the total number of cores less 1^[This way, you can have one more core available to do other tasks comfortably. However, if you don't mind having your computer completely devoted to the processing task at hand, then there is no reason not to use all the cores.].  

```{r plan}
plan(multiprocess, workers = get_num_procs() - 1)
```

`future_lapply()` works exactly like `lapply()`. 

```{r flapply}
sq_ls <- future_lapply(1:1000, function(x) x^2)
```

This is it. The only difference you see from the serialized processing using `lapply()` is that you changed the function name to `future_lapply()`. 

Okay, now we know how we parallelize computation. Let's check how much improvement in implementation time we got by parallelization. 

```{r do_mb}
microbenchmark(
  #--- parallelized ---#
  "parallelized" = { sq_ls <- future_lapply(1:1000, function(x) x^2) }, 
  #--- non-parallelized ---#
  "not parallelized" = { sq_ls <- lapply(1:1000, function(x) x^2) },
  times = 100, 
  unit = "ms"
)
```

Hmmmm, okay, so parallelization made the code slower... How could this be? This is because communicating jobs to each core takes some time as well. So, if each of the iterative processes is super fast (like this example where you just square a number), the time spent on communicating with the cores outweighs the time saved by parallel computation. Parallelization is more beneficial when each of the repetitive processes takes long. 

One of the very good use cases of parallelization is MC simulation. The following MC simulation tests whether correlation between an independent variable and error term would cause bias (yes, we know the answer). The `MC_sim` function first generates a dataset (50,000 observations) according to the following data generating process:

$$
 y = 1 + x + v
$$
where $\mu \sim N(0,1)$, $x \sim N(0,1) + \mu$, and $v \sim N(0,1) + \mu$. The $\mu$ term causes correlation between $x$ (the covariate) and $v$ (the error term). The function then estimates the coefficient on $x$ via OLS and returns the estimate. We would like to repeat this process 1,000 times to understand the properties of the OLS estimator under this data generating process. This Monte Carlo simulation is embarrassingly parallel because each iteration is independent of the others. 
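
To see what the simulation should find, the bias can be worked out by hand. This is only a sketch that assumes the three normal draws are mutually independent (as they are in the code that defines `MC_sim`). $e_x$ and $e_v$ are names introduced here for the idiosyncratic $N(0,1)$ parts of $x$ and $v$, so that $x = e_x + \mu$ and $v = e_v + \mu$:

$$
\mathrm{plim}\; \hat{\beta}_{OLS} = 1 + \frac{Cov(x, v)}{Var(x)} = 1 + \frac{Var(\mu)}{Var(e_x) + Var(\mu)} = 1 + \frac{1}{2} = 1.5
$$

So, under these assumptions, the OLS estimates should center around 1.5 rather than the true coefficient of 1.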

```{r def_MC}
#--- one iteration of the MC simulation ---#
MC_sim <- function(i){

  N <- 50000 # sample size

  #--- generate data ---#
  mu <- rnorm(N) # the common term shared by both x and v
  x <- rnorm(N) + mu # independent variable
  v <- rnorm(N) + mu # error
  y <- 1 + x + v # dependent variable
  data <- data.table(y = y, x = x)

  #--- OLS ---# 
  reg <- lm(y~x, data = data) # OLS

  #--- return the coef ---#
  return(reg$coef['x'])
}
```

Let's run one iteration:

```{r one_run, eval = FALSE}
tic()
MC_sim(1)
toc()
```

```{r one_run_eval, echo = FALSE}
tic.clearlog()
tic()
MC_sim(1)
toc(log = TRUE, quiet = TRUE)
log_txt <- tic.log(format = FALSE)
time_elapsed <- log_txt[[1]]$toc - log_txt[[1]]$tic
time_elapsed
```

Okay, so it takes `r time_elapsed` seconds for one iteration. Now, let's run this 1000 times with and without parallelization.

**Not parallelized**

```{r benchmack_non_par, eval = FALSE}
#--- non-parallel ---#
tic()
sq_ls <- lapply(1:1000, MC_sim)
toc()
```

```{r benchmack_non_par_do, echo = FALSE}
#--- non-parallel ---#
tic.clearlog()
tic()
sq_ls <- lapply(1:1000, MC_sim)
toc(log = TRUE, quiet = TRUE)
log_txt <- tic.log(format = FALSE)
time_elapsed_ser <- log_txt[[1]]$toc - log_txt[[1]]$tic
time_elapsed_ser
```

**Parallelized**

```{r benchmack_par, eval = FALSE}
#--- parallel ---#
tic()
sq_ls <- future_lapply(1:1000, MC_sim)
toc()
```

```{r benchmack_par_do, echo = FALSE}
#--- parallel ---#
tic.clearlog()
tic()
sq_ls <- future_lapply(1:1000, MC_sim)
toc(log = TRUE, quiet = TRUE)
log_txt <- tic.log(format = FALSE)
time_elapsed_par <- log_txt[[1]]$toc - log_txt[[1]]$tic
time_elapsed_par
```

As you can see, parallelization makes it much quicker, with a noticeable difference in elapsed time. We made the code `r round(time_elapsed_ser/time_elapsed_par, digits = 2)` times faster. However, we did not make the process $7$ times faster even though we used 7 cores for the parallelized process. This is because of the overhead associated with distributing tasks to the cores. The relative advantage of parallelization would be even greater if each iteration took more time. For example, if you run a process that takes about 2 minutes per iteration 1000 times, it would take approximately 33 hours and 20 minutes. But it may take only 8 hours if you parallelize it on 7 cores, or maybe even 2 hours if you run it on 30 cores.
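
One detail worth keeping in mind when parallelizing simulations is random number generation. The sketch below assumes the `future.apply` package is installed and uses a hypothetical, smaller version of the simulation (`mc_small()` is a name made up here, not part of the code above). Passing an integer to the `future.seed` argument of `future_lapply()` asks for parallel-safe random number streams that are reproducible across runs.

```{r, eval = FALSE}
library(future.apply)

plan(multisession, workers = 2)

#--- hypothetical small-scale version of the MC simulation ---#
mc_small <- function(i) {
  N <- 1000
  mu <- rnorm(N) # common term shared by x and v
  x <- rnorm(N) + mu
  v <- rnorm(N) + mu
  y <- 1 + x + v
  unname(coef(lm(y ~ x))["x"])
}

#--- run twice with the same seed ---#
res_1 <- future_lapply(1:100, mc_small, future.seed = 123)
res_2 <- future_lapply(1:100, mc_small, future.seed = 123)

#--- are the two runs identical? ---#
identical(res_1, res_2)

#--- average of the estimates ---#
mean(unlist(res_1))
```

With this data generating process, the average of the estimates should be close to 1.5 rather than 1, which is the bias the simulation is designed to reveal.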

## Exercises

### Exercise 1: Vectorization and simple loop

We have two vectors:

```{r }
 x <- rnorm(1000)
 y <- rnorm(1000)
```


**Q 1.1:** Do vectorized addition of x and y (Method 1).

&nbsp;
&nbsp;

**Q 1.2:** Use the **lapply()** function to do element-wise addition of x and y (Method 2).

&nbsp;
&nbsp;

**Q 1.3:** Parallelize the loop (Method 2) using the **future_lapply()** function (Method 3)

&nbsp;
&nbsp;

**Q 1.4:** Using the **microbenchmark()** function, time all three methods and identify the fastest method. Explain why one is faster or slower than the others. 

&nbsp;
&nbsp;

### Exercise 2: Monte Carlo Simulation (small)

We know the sample average is an unbiased estimator of the expected value of a random variable. Let's confirm this numerically using Monte Carlo simulation. The random variable we will work on is distributed as $N(1, 1)$.

One iteration looks like this:

```{r , eval = FALSE}
#--- sample size ---#
N <- 1000
 
#--- sample from N(1,1) ---# 
x <- rnorm(N, mean = 1, sd = 1)

#--- get the sample mean ---#
mean(x)
```  

**Q 2.1** Conduct 1000 iterations of the same process using the **lapply** function and check if indeed sample mean is an unbiased estimator of the expected value. 

&nbsp;
&nbsp;


**Q 2.2** Parallelize the MC simulation using the **future_lapply()** function.

&nbsp;
&nbsp;


**Q 2.3** Use the **tic()** and **toc()** functions to see which is faster.

&nbsp;
&nbsp;


### Exercise 3: Monte Carlo Simulation (larger)

For this exercise, change the sample size to 1,000,000 and redo the MC simulations using **lapply()** and **future_lapply()**, and finally confirm that the parallelized version is faster. Why is the parallelized version faster in this case while it was slower in the previous case?

&nbsp;
&nbsp;
&nbsp;
&nbsp;


```{r echo = FALSE, eval = FALSE}
get_mean <- function(i){

  #--- sample size ---#
  N <- 1000000
   
  #--- sample from N(1,1) ---# 
  x <- rnorm(N, mean = 1, sd = 1)

  #--- get the sample mean ---#
  mean(x)
}

tic()
res <- lapply(1:1000, get_mean)
toc()

library(future.apply)
plan(multiprocess, workers = 7)

tic()
res <- future_lapply(1:1000, get_mean)
toc()
```


### Exercise 4 (parallel computation)

**Note: This exercise requires knowledge of spatial data handling in R.** 

First, run the following code to get land use data from the Cropland Data Layer (CDL) dataset. 

```{r IA_CDL}
#--- load the cdlTools package ---#
library(cdlTools)

#--- download the CDL data for Iowa in 2015 ---#
IA_cdl_2015 <- getCDL("Iowa", 2015)$IA2015
```

Here is what the downloaded CDL data looks like: 

```{r plot_cdl}
#--- plot ---#
plot(IA_cdl_2015)
```

Next, run the following code to create regular grids over Iowa's state border.

```{r IA_grids}
library("sf")
library("maps")

#--- IA boundary ---#
IA_boundary <- st_as_sf(map("state", plot = FALSE, fill = TRUE)) %>% 
  #--- select Iowa ---#
  filter(ID == "iowa") 

#--- create regular grids (40 rows by 40 columns) over Iowa ---#
IA_grids <- st_make_grid(IA_boundary, n = c(40, 40)) %>% 
  st_as_sf() %>% 
  #--- cut out the non-intersected parts of grids ---#
  st_intersection(., IA_boundary) %>% 
  #--- calculate the area of each grid ---#
  mutate(
    area = as.numeric(st_area(.)),
    area_ratio = area/max(area)
  ) %>% 
  #--- keep only if the area is large enough ---#
  filter(area_ratio > 0.8) %>% 
  #--- assign grid id for future merge ---#
  mutate(grid_id = 1:nrow(.)) %>% 
  #--- reproject to the CRS of the CDL data ---#
  st_transform(projection(IA_cdl_2015))

#--- plot the grids over the IA state border ---#
library(tmap)

tm_shape(IA_boundary) +
  tm_polygons(col = "green") +
tm_shape(IA_grids) +
  tm_polygons(alpha = 0)
```

The objective here is to get crop shares by crop type for each of the grids in **IA_grids**. The final output should be a single dataset that holds crop share information for all the grids, along with a grid identifier.  

**[Hint:]**  Here is how to get a crop share table for a grid. 

```{r }
temp_grid <- IA_grids[1, ]

crop_share <- crop(IA_cdl_2015, extent(temp_grid)) %>% 
  freq() 
```

**Q 4.1** Use the **lapply()** function to complete the task

&nbsp;
&nbsp;

**Q 4.2** Use the **future_lapply()** function to complete the task

&nbsp;
&nbsp;

---

```{r sessioninfo}
sessionInfo()
```

<!--chapter:end:ParallelComputing.Rmd-->

--- 
title: "Handle raster data"
author: "Taro Mieno"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html: 
    css: "tufte_mine.css"
    number_sections: yes
    toc: yes
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
# bibliography: skeleton.bib
link-citations: yes
---


```{r setup, echo = FALSE}
library(tufte)
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  comment = NA,
  message = FALSE,
  warning = FALSE,
  tidy = FALSE,
  cache.lazy = FALSE
  #--- figure ---#
  # dpi=400
)

opts_knit$set(
  root.dir = "/Users/tmieno2/Box/Teaching/AAEA R/GIS"
)
```

```{r, echo=FALSE, warning=FALSE, cache = FALSE}
#--- load packages ---#
suppressMessages(library(data.table))
suppressMessages(library(stringr))
suppressMessages(library(raster))
# setwd("/Users/tmieno2/Box/Teaching/AAEA R/GIS")
```

```{r, echo = F, eval = F}
setwd("/Users/tmieno2/Box/Teaching/AAEA R/GIS")
```

# Before you start

## What you need to know before you start
+ What raster data is


## Highlights of what you will (or will not) learn 
+ read and write raster files
+ stack raster files
+ quickly plot raster files (not creating publication-quality maps)
+ access values from raster objects

# Introduction  


In this section, we will learn how to use the `raster` and `terra` packages to handle raster data. The `terra` package has been developed to replace the `raster` package, and the first beta version of the `terra` package^[[github page](https://github.com/rspatial/terra)] was just released on CRAN on 20 March 2020. `terra` is written in C++ and is thus faster than the `raster` package in many raster data operations. 

The topics we cover here are limited to only a small portion of the full capability of the `raster` and `terra` packages. For example, we do not cover raster arithmetic, focal operations, or aggregation, which I consider rarely useful for economists^[I do raster aggregation in Chapter X. However, it is only for the purpose of generating raster data of different cell densities to examine the performance of some raster operations.]. Those who are interested in a fuller treatment of the `raster` package are referred to Chapters 3, 4, and 5 of [Geocomputation with R](https://geocompr.robinlovelace.net/). For economists, most of the time, raster data is source data from which values are extracted for spatial objects in vector form^[For example, extracting land type values for a county in Iowa, or extracting precipitation values from PRISM raster data for counties in the U.S.]. This is because the unit of analysis tends to be geographic units represented in vector form (e.g., counties, school districts) rather than regular square grids without any economic or social meaning. Therefore, we will introduce only the essential knowledge of raster data operations required to effectively implement the task of extracting values, which will be covered extensively in Chapter . 

You may wonder why we still learn `raster` even though `terra` is its replacement and faster. This is because other packages that are useful for us economists (critical if you are handling many large spatial datasets and need speed), such as `exactextractr` and `velox`, were written to work with `raster` object classes and have not yet been adapted to support `terra` object classes. I consider `exactextractr` and `velox` critical for economists, especially those who use large, spatially fine raster datasets with many temporal dimensions. As I stated earlier, raster data extraction will be by far the most common use case of raster data for economists and also the most time-consuming part of the whole raster data handling experience. `terra::extract()` is much faster than `raster::extract()`, which is unbearably slow for large datasets. Unfortunately, `terra::extract()` is still much slower than the extraction functions provided by the `exactextractr` and `velox`^[A whole chapter is dedicated to raster value extraction in Chapter , where these packages are introduced.] packages for large datasets^[CDL data is a good example of large (or spatially fine) raster data, with a cell size of 30 meters by 30 meters.]. Since `exactextractr` and `velox` work only with objects defined by the `raster` package, you need to convert a `terra` object to a `raster` object if you would like to take advantage of those functions. This also means that we need to learn the difference in raster object classes between the two packages. This problem should be resolved in a matter of a couple of years (or even less), as most of the spatial packages will add support for `terra`. The good news is that learning both packages does not take much time. We learn only a fraction of what they are capable of, remember? 

Finally, another package you might want to keep an eye on for raster (and vector) data handling is the `stars` package^[For the package details, see [here](https://r-spatial.github.io/stars/index.html). A few vignettes are available from the "Articles" tab.]. It provides a more consistent way of handling spatiotemporal data than the `raster` and `terra` packages. So far, the advantages this package brings do not seem worth the time I would need to spend learning it. 


# Raster data handling using the `raster` and `terra` packages (Only the bare minimum) 

## `raster` package: `RasterLayer`, `RasterStack`, and `RasterBrick`

Let's start with taking a look at raster data. We will download CDL data for Iowa in 2015. 

```{r read_the_IA_cdl_data}
library(cdlTools)

#--- download the CDL data for Iowa in 2015 ---#
IA_cdl_2015 <- getCDL("Iowa", 2015)$IA2015

#--- take a look ---#
IA_cdl_2015
```

Evaluating the imported raster object provides you with information about the raster data, such as dimensions (number of rows, number of columns, number of cells), spatial resolution (30 meters by 30 meters for this raster data), extent, CRS, and the minimum and maximum values recorded in this raster layer. The class of the downloaded data is `RasterLayer`. A `RasterLayer` consists of only one layer, meaning that only a single variable is associated with the cells (here, it is the land use category code, stored as an integer).

Among these spatial characteristics, you often need to extract the CRS of a raster object before you interact it with vector data^[e.g., extracting values from a raster layer to vector data, or cropping a raster layer to the spatial extent of vector data.], which can be done using `projection()`:

```{r chars}
projection(IA_cdl_2015)
```
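
If you would like to experiment without downloading the CDL data, the same characteristics can be inspected on a small artificial raster. The following is a minimal sketch, assuming only that the `raster` package is installed; `r_toy` and its dimensions are made up for illustration.

```{r, eval = FALSE}
library(raster)

#--- a toy raster: 20 rows by 10 columns over x in [0, 10] and y in [0, 10] ---#
r_toy <- raster(
  nrows = 20, ncols = 10, 
  xmn = 0, xmx = 10, ymn = 0, ymx = 10,
  crs = "+proj=longlat +datum=WGS84"
)

#--- assign cell values ---#
values(r_toy) <- 1:ncell(r_toy)

#--- dimensions ---#
nrow(r_toy)
ncol(r_toy)
ncell(r_toy)

#--- spatial resolution (x, y) ---#
res(r_toy)

#--- extent and CRS ---#
extent(r_toy)
projection(r_toy)
```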

---

You can stack multiple raster layers of the **same spatial resolution and extent** to create a `RasterStack` using `raster::stack()`. Oftentimes, processing a multi-layer object has computational advantages over processing multiple single-layer objects one by one^[You will see this in Chapter 5 where we learn how to extract values from a raster layer for vector data.]. 

To create a `RasterStack` and `RasterBrick`, let's download the CDL data for IA in 2016 and stack it with the 2015 data.

```{r make_stack}
#--- download the CDL data for Iowa in 2016 ---#
IA_cdl_2016 <- getCDL("Iowa", 2016)$IA2016 

#--- stack the two ---#
IA_cdl_stack <- stack(IA_cdl_2015, IA_cdl_2016)

#--- take a look ---#
IA_cdl_stack
```

`IA_cdl_stack` is of class `RasterStack`, and it has two layers of variables: CDL for 2015 and 2016. You can make it a `RasterBrick` using `raster::brick()`:

```{r make_brick}
#--- convert the stack to a brick ---#
IA_cdl_brick <- brick(IA_cdl_stack)

#--- or this works as well ---#
# IA_cdl_brick <- brick(IA_cdl_2015, IA_cdl_2016)

#--- take a look ---#
IA_cdl_brick
```

You probably noticed that it took some time to create the `RasterBrick` object^[Read [here](https://geocompr.robinlovelace.net/spatial-class.html#raster-classes) for the subtle difference between `RasterStack` and `RasterBrick`]. While spatial operations on `RasterBrick` are supposedly faster than `RasterStack`, the time to create a `RasterBrick` object itself is often long enough to kill the speed advantage entirely^[We will see this in Chapter , where we compare the speed of data extraction from `RasterStack` and `RasterBrick` objects.]. Often, the three raster object types are collectively referred to as `Raster`$^*$ objects for shorthand in the documentation of the `raster` and other related packages.
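
The relationship among the three classes can also be seen on toy data. This is a small sketch, assuming only the `raster` package; the object names (`r_a`, `r_b`, `toy_stack`, `toy_brick`) are hypothetical.

```{r, eval = FALSE}
library(raster)

#--- two toy layers with the same extent and resolution ---#
r_a <- raster(nrows = 5, ncols = 5, vals = 1:25)
r_b <- raster(nrows = 5, ncols = 5, vals = 25:1)

#--- stack them, and then convert the stack to a brick ---#
toy_stack <- stack(r_a, r_b)
toy_brick <- brick(toy_stack)

#--- check the classes and the number of layers ---#
class(toy_stack)
class(toy_brick)
nlayers(toy_stack)

#--- [[ ]] pulls out a single layer as a RasterLayer ---#
toy_stack[[2]]
```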

## `terra` package: `SpatRaster`

The `terra` package has only one object class for raster data, `SpatRaster`, and no distinction between one-layer and multi-layer rasters is necessary, which is nice. Let's first convert a `RasterLayer` to a `SpatRaster` using the `rast()` function.

```{r spat_raster}
library(terra)

#--- convert to a SpatRaster ---#
IA_cdl_2015_sr <- rast(IA_cdl_2015)

#--- take a look ---#
IA_cdl_2015_sr
```

You can see that the number of layers (`nlyr`) is 1 of course because the original object is a `RasterLayer`, which by definition has only one layer. Now, let's convert a `RasterStack` to a `SpatRaster`.

```{r spat_raster_nl}
#--- convert to a SpatRaster ---#
IA_cdl_stack_sr <- rast(IA_cdl_stack)

#--- take a look ---#
IA_cdl_stack_sr
```

Again, it is a `SpatRaster`, and you now see that the number of layers is 2. We just confirmed that `terra` has only one class for raster data whether it is single-layer or multiple-layer ones.

Instead of `projection()`, you use `crs()` to extract the CRS.

```{r }
crs(IA_cdl_2015_sr)
```

## Converting a `SpatRaster` object to a `Raster`$^*$ object.

You can convert a `SpatRaster` object back to a `Raster`$^*$ object using `raster()`, `stack()`, and `brick()`. Keep in mind that if you use `raster()` on a `SpatRaster` that has multiple layers, the resulting `RasterLayer` object has only one layer. 

```{r convert_back}
#--- RasterLayer (one layer lost) ---#
IA_cdl_stack_sr %>% raster()

#--- RasterStack ---#
IA_cdl_stack_sr %>% stack()

#--- RasterBrick (this takes some time) ---#
IA_cdl_stack_sr %>% brick()
```

# Read and write a raster data file  

Sometimes we can download raster data as we saw in Section 3.1. But most of the time, you need to read raster data stored as a file. Raster data files come in numerous formats. For example, PRISM comes in the Band Interleaved by Line (BIL) format, and some of the Daymet data comes in the netCDF format. Other popular formats include GeoTiff, SAGA, ENVI, and many others. 

## `raster`

You can use the `raster::raster()` function to read raster data in many common formats, and it is almost always the case that the raster data you have can be read with this function. Here, we read a GeoTiff file (a file with the .tif extension).

```{r read_no_eval, eval = F}
#--- general syntax ---#
raster(filename)

#--- an example ---#
IA_cdl_15 <- raster("./Data/IA_cdl_2015.tif") 
```

```{r read_eval, echo = F}
IA_cdl_15 <- raster("./Data/IA_cdl_2015.tif") 
```

One important thing to note here is that the cell values of the raster data are actually not in memory when you "read" raster data from a file. You can check this with the `inMemory()` function.

```{r in_memory}
#--- check if in memory ---#
inMemory(IA_cdl_15)
```

You basically just established a connection to the file. This helps to reduce the memory footprint of raster data handling. 
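
The difference between an object created in the session and one read from a file can be checked with a toy example. This is a sketch under the assumption that the `raster` package (with GeoTiff write support) is installed; the temporary file and object names are hypothetical. `readAll()` is the `raster` function that forces the cell values into memory.

```{r, eval = FALSE}
library(raster)

#--- a toy raster created in the session ---#
r_created <- raster(nrows = 10, ncols = 10, vals = runif(100))

#--- write it to a temporary GeoTiff file and read it back ---#
tmp_file <- tempfile(fileext = ".tif")
writeRaster(r_created, tmp_file, overwrite = TRUE)
r_from_file <- raster(tmp_file)

#--- are the cell values in memory? ---#
inMemory(r_created)
inMemory(r_from_file)

#--- force the values into memory ---#
r_loaded <- readAll(r_from_file)
inMemory(r_loaded)
```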

---

To write a `Raster`$^*$ object to a file, you can use `raster::writeRaster()`. 

```{r write_raster, eval = F}
#--- syntax ---#
writeRaster(raster object , file name, format) 

#--- example ---#
writeRaster(IA_cdl_stack, "./Data/IA_cdl_stack.tif", format = "GTiff", overwrite = TRUE) 
```

The above code saves the `RasterStack` object as a GeoTiff file^[There are many other alternative formats (see [here](https://www.rdocumentation.org/packages/raster/versions/3.0-12/topics/writeRaster)). I picked GeoTiff just because it is the format I am most familiar with. I do not see a good reason to save it in a format other than GeoTiff, so I always pick GeoTiff. Finally, note that the format option can be dropped, as `writeRaster()` infers the format from the extension of the file name.]. `overwrite = TRUE` is necessary if a file with the same name already exists and you are overwriting it. `writeRaster()` can be frustratingly slow for a large `Raster`$^*$ object. `terra::writeRaster()`, which I introduce below, is much faster.

## `terra`

You use `terra::rast()` to read raster data:

```{r read_no_eval_terra, eval = F}
#--- general syntax ---#
rast(filename)

#--- an example ---#
IA_cdl_2015_sr <- rast("./Data/IA_cdl_2015.tif") 
```

```{r read_eval_terra, echo = F}
IA_cdl_2015_sr <- rast("./Data/IA_cdl_2015.tif") 
```

Just like `raster::raster()`, it does not read cell values into memory.
 
---

You can write a `SpatRaster` object using `terra::writeRaster()`. It works the same way as `raster::writeRaster()`. 

```{r write_terra, eval = F}
terra::writeRaster(IA_cdl_stack_sr, "./Data/IA_cdl_stack.tif", format = "GTiff", overwrite = TRUE)
```

## Speed comparison

Here, we compare the speed of writing raster data using `terra::writeRaster()` and `raster::writeRaster()`.

```{r write_comp, eval = F}
#--- terra::writeRaster (faster) ---#
tic()
terra::writeRaster(IA_cdl_stack_sr, "./Data/IA_cdl_stack.tif", format = "GTiff", overwrite = TRUE)
toc()

#--- raster::writeRaster (slow) ---#
tic()
raster::writeRaster(IA_cdl_stack, "./Data/IA_cdl_stack.tif", format = "GTiff", overwrite = TRUE)
toc()
```

You can save a `Raster`$^*$ object using `terra::writeRaster()`, but you do not get any speed advantage.

```{r terra_write_no_speed}
#--- terra::writeRaster with RasterStack (no speed advantage) ---#
tic()
terra::writeRaster(IA_cdl_stack, "./Data/IA_cdl_stack.tif", format = "GTiff", overwrite = TRUE)
toc() 
```

## Access values 

You can access the values stored in a `Raster`$^*$ object using the `getValues()` function. For a `SpatRaster`, you can use `values()`.

`raster` way:

```{r getvalues}
#--- raster::getValues ---#
values_from_stack <- getValues(IA_cdl_stack) 

#--- take a look ---#
head(values_from_stack)
```

`terra` way:
```{r values}
#--- terra::values ---#
values_from_rs <- values(IA_cdl_stack_sr) 

#--- take a look ---#
head(values_from_rs)
``` 

The returned values come in matrix form, with one column per layer. Instead of getting all the values, you can get a portion of them with `getValuesBlock()` by specifying the region for which you would like to get values. It is used extensively inside the `exact_extract()` function, which we show to be one of the fastest ways to extract values. However, if you find yourself using `getValuesBlock()`, it is very likely that you are wasting your time by not using a faster alternative. See Chapter X for further discussion of fast value extraction.  
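
To make the shape of the returned objects concrete, here is a small sketch on toy data, assuming only the `raster` package. The objects (`s_toy`, `all_values`, `block_values`) are hypothetical and are not used elsewhere. The block below asks for the first two rows and the first three columns of the raster.

```{r, eval = FALSE}
library(raster)

#--- a toy two-layer stack (5 rows by 5 columns) ---#
s_toy <- stack(
  raster(nrows = 5, ncols = 5, vals = 1:25),
  raster(nrows = 5, ncols = 5, vals = 101:125)
)

#--- all the values: one row per cell, one column per layer ---#
all_values <- getValues(s_toy)
dim(all_values)

#--- values only for rows 1-2 and columns 1-3 of the raster ---#
block_values <- getValuesBlock(s_toy, row = 1, nrows = 2, col = 1, ncols = 3)
block_values
```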


[RasterOption](https://www.gis-blog.com/increasing-the-speed-of-raster-processing-with-r-part-13/)

<!-- ```{r }
rasterOptions()  
``` -->

<!--chapter:end:RasterDataBasics.Rmd-->

`r if (knitr::is_html_output()) '
# References {-}
'`

<!--chapter:end:References.Rmd-->

--- 
title: "Spatial Interactions of Vector and Raster Data: Extracting Values from Raster Data for Vector Data"
author: "Taro Mieno"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html: 
    css: "tufte_mine.css"
    number_sections: yes
    toc: yes
    toc_depth: 2
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
# bibliography: skeleton.bib
link-citations: yes
---


```{r setup, echo = FALSE}
library(tufte)
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  comment = NA,
  message = FALSE,
  warning = FALSE,
  tidy = FALSE,
  cache.lazy = FALSE
)

opts_knit$set(
  root.dir = "/Users/tmieno2/Box/Teaching/AAEA R/GIS"
)
```

```{r, echo=FALSE, warning=FALSE, cache = FALSE}
#--- load packages ---#
suppressMessages(library(data.table))
suppressMessages(library(exactextractr))
suppressMessages(library(stringr))
suppressMessages(library(sf))
suppressMessages(library(ggplot2))
suppressMessages(library(raster))
suppressMessages(library(tidyverse))
suppressMessages(library(tictoc))
suppressMessages(library(stargazer))
suppressMessages(library(tmap))
suppressMessages(library(future.apply))
suppressMessages(library(lubridate))
```


```{r figure_setup, echo = FALSE}
theme_update(
  axis.title.x = element_text(size=12,angle=0,hjust=.5,vjust=-0.3,face="plain",family="Times"),
  axis.title.y = element_text(size=12,angle=90,hjust=.5,vjust=.9,face="plain",family="Times"),

  axis.text.x = element_text(size=10,angle=0,hjust=.5,vjust=1.5,face="plain",family="Times"),
  axis.text.y = element_text(size=10,angle=0,hjust=1,vjust=0,face="plain",family="Times"),

  axis.ticks = element_line(size=0.3, linetype="solid"),
  # axis.ticks = element_blank(),
  axis.ticks.length = unit(.15,'cm'),
  # axis.ticks.margin = unit(.1,'cm'),
  # axis.text = element_text(margin=unit(.1,'cm')),

  #--- legend ---#
  legend.text = element_text(size=10,angle=0,hjust=0,vjust=0,face="plain",family="Times"),
  legend.title = element_text(size=10,angle=0,hjust=0,vjust=0,face="plain",family="Times"),
  legend.key.size = unit(0.5, "cm"),

  #--- strip (for faceting) ---#
  strip.text = element_text(size = 10,family="Times"),

  #--- plot title ---#
  plot.title=element_text(family="Times", face="bold", size=12),

  #--- margin ---#
  # plot.margin = margin(0, 0, 0, 0, "cm"),

  #--- panel ---#
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_blank(),
  panel.border = element_rect(fill=NA)
  )
```


# Introduction

# Cropping (Spatial subsetting) to the Area of Interest

--- 

We use PRISM data as a raster data layer and Illinois county borders as vector data. 

```{r prism_map, echo = F, fig.margin = T, fig.width = 6}
#--- the file name of the PRISM data just downloaded ---#
prism_file <- here::here("Data", "PRISM_tmax_stable_4kmD2_20180701_bil", "PRISM_tmax_stable_4kmD2_20180701_bil.bil")

#--- read in the prism data ---#
prism_tmax_0701 <- raster(prism_file) 

tm_shape(prism_tmax_0701) +
  tm_raster()
```

Let's download the tmax data for July 1, 2018 (map on the right).

```{r prism_download, eval = FALSE}
library(prism)

#--- set the folder to which the downloaded PRISM data are saved ---#
# Here, it is the "Data" folder under the project root
options(prism.path = here::here("Data"))

#--- download PRISM tmax data ---#
get_prism_dailys(
  type = "tmax", 
  dates = "2018-07-01", 
  keepZip = FALSE 
)

#--- the file name of the PRISM data just downloaded ---#
prism_file <- here::here("Data", "PRISM_tmax_stable_4kmD2_20180701_bil", "PRISM_tmax_stable_4kmD2_20180701_bil.bil")

#--- read in the prism data ---#
prism_tmax_0701 <- raster(prism_file) 
```

```{r ks_county_map, echo = FALSE, fig.margin = TRUE, fig.height = 10}
library(maps)

#--- IL boundary ---#
IL_county <- st_as_sf(map("county", "illinois", plot = FALSE, fill = TRUE)) %>% 
  st_transform(projection(prism_tmax_0701))

#--- gen map ---#
tm_shape(IL_county) +
  tm_polygons() +
tm_layout(
  frame = NA
)
```

We now get Illinois county data (map on the right) from the `maps` package.

```{r get_IL_county, eval = F}
library(maps)

#--- IL boundary ---#
IL_county <- st_as_sf(map("county", "illinois", plot = FALSE, fill = TRUE)) %>% 
  st_transform(projection(prism_tmax_0701))
```

---

Sometimes, it is convenient to crop (spatially limit) a raster layer to the specific area of interest so that you do not have to carry around unnecessary parts of the raster layer. Moreover, it takes less time to extract values from a raster layer when the size of the raster layer is smaller. You can crop a raster layer by using `raster::crop()`. It works like this:

```{r crop_syntax, eval = FALSE}
#--- syntax ---#
crop(raster object, geographic extent)
```

To find the geographic extent of vector data, you can use `raster::extent()`.

```{r raster_extent_sf}
extent(IL_county)
```

As you can see, it consists of four values: xmin, xmax, ymin, and ymax. The four pairs of these values, (xmin, ymin), (xmin, ymax), (xmax, ymin), and (xmax, ymax), form the corners of a rectangle that encompasses the IL state boundary. We will crop the PRISM raster layer to this rectangle:

```{r crop_prism_to_IL}
#--- crop the entire PRISM to its IL portion ---#
prism_for_IL <- crop(prism_tmax_0701, extent(IL_county))
```
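
The mechanics of cropping are easy to verify on a toy raster. The following sketch assumes only the `raster` package; `r_toy`, `toy_extent`, and the numbers in them are made up for illustration. Note that `extent()` takes its values in the order of xmin, xmax, ymin, and ymax.

```{r, eval = FALSE}
library(raster)

#--- a toy 10 by 10 raster covering x in [0, 10] and y in [0, 10] ---#
r_toy <- raster(
  nrows = 10, ncols = 10, 
  xmn = 0, xmx = 10, ymn = 0, ymx = 10, 
  vals = 1:100
)

#--- a hypothetical area of interest: xmin, xmax, ymin, ymax ---#
toy_extent <- extent(2, 6, 3, 8)

#--- crop ---#
r_cropped <- crop(r_toy, toy_extent)

#--- the cropped raster has fewer rows and columns ---#
dim(r_toy)
dim(r_cropped)
extent(r_cropped)
```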

As you can see below, the cropped raster layer extends beyond the IL state boundary.  

```{r prism_ks_viz}
tm_shape(prism_for_IL) +
  tm_raster() +
tm_shape(IL_county) +
  tm_polygons(alpha = 0)
```

<!-- 
You can mask the values (set values to NA) outside of the vectors data.

```{r mask_prism, eval = F}
#--- syntax ---#
mask(raster object, sf object)

#--- example ---#
masked_prism_IL <- mask(prism_for_IL, IL_county)
```

```{r mask_prism_run, echo = F}
#--- example ---#
masked_prism_IL <- mask(prism_for_IL, IL_county)
```

```{r prism_ks_masked_viz}
tm_shape(masked_prism_IL) +
  tm_raster() +
tm_shape(IL_county) +
  tm_polygons(alpha = 0)
```

 -->


# Extracting Values from Raster Layers for Vector Data 

In this section, we will learn how to extract information from raster layers for spatial units represented as vector data (points and polygons). The raster data we use for illustration is PRISM data. 

## Data Preparation

For the illustrations in this section, we use the following:

+ Kansas county borders (polygons data)
+ irrigation wells in Kansas (points data)
+ PRISM tmax data we downloaded above, cropped to Kansas state border (raster data)

**Kansas county borders:**

```{r prepare_KS_prism}
#--- KS county border ---#
KS_county <- st_as_sf(map("county", "kansas", plot = FALSE, fill = TRUE)) %>% 
  st_transform(projection(prism_tmax_0701))
```

**Irrigation wells in Kansas:**

```{r import_KS_wells}
#--- read in the KS points data ---#
KS_wells <- readRDS(here::here("Data", "Chap_5_wells_KS.rds")) 

#--- take a look ---#
KS_wells
```

**PRISM tmax data cropped to Kansas**

```{r prism_cropped_to_KS}
#--- crop to KS ---#
prism_tmax_0701_KS <- crop(prism_tmax_0701, KS_county)
```

---

Here is how the wells are spatially distributed over the PRISM grids and KS county borders:

```{r tamx_prism_wells}
tm_shape(prism_tmax_0701_KS) +
  tm_raster(title = "tmax", alpha = 0.7) +
tm_shape(KS_county) +
  tm_polygons(alpha = 0) +
tm_shape(KS_wells) +
  tm_symbols(size = 0.02) +
  tm_layout(
    frame = FALSE, 
    legend.outside = TRUE,
    legend.outside.position = "bottom"
  )
```

## Points against a RasterLayer or a RasterStack

You can extract values from raster layers to points using `raster::extract()`.  First, we demonstrate how to extract tmax values from a RasterLayer.

```{r extract_tmax, eval = F}
#--- syntax ---#
raster::extract(raster object, vector data)

#--- extract tmax values ---#
tmax_from_prism <- raster::extract(prism_tmax_0701_KS, KS_wells)

#--- take a look ---#
head(tmax_from_prism, 20)
```

```{r extract_tmax_run, echo = F}
#--- extract tmax values ---#
tmax_from_prism <- raster::extract(prism_tmax_0701_KS, KS_wells)

#--- take a look ---#
head(tmax_from_prism, 20)
```

As you can see, the `extract()` function returns the values as a vector. Since the order of the values is consistent with the order of the observations in the target points data, you can simply assign the vector as a new variable of the target data as follows:

```{r, cache=TRUE}
KS_wells$tmax_07_01 <- tmax_from_prism
```   
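
The order-preserving behavior can be checked on toy data where the answer is known in advance. This is a minimal sketch, assuming only the `raster` package; for simplicity, the hypothetical points in `pts` are passed as a plain coordinate matrix rather than as an `sf` object.

```{r, eval = FALSE}
library(raster)

#--- a toy raster whose cell values are filled row by row from the top-left ---#
r_toy <- raster(
  nrows = 10, ncols = 10, 
  xmn = 0, xmx = 10, ymn = 0, ymx = 10, 
  vals = 1:100
)

#--- three hypothetical points ---#
pts <- data.frame(
  pt_id = 1:3,
  x = c(0.5, 9.5, 0.5),
  y = c(9.5, 0.5, 8.5)
)

#--- extract and assign as a new variable (same order as the points) ---#
pts$value <- raster::extract(r_toy, as.matrix(pts[, c("x", "y")]))

pts
```

The first point falls in the top-left cell, the second in the bottom-right cell, and the third in the first cell of the second row, so you can confirm by eye that each extracted value sits on the row of the point it belongs to.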

Extracting values from a `RasterStack` (or `RasterBrick`) works the same way. Here, we stack two copies of `prism_tmax_0701_KS`^[In practice, you may stack tmax data for different dates.], and then extract values from the stack.

```{r extract_tmax_run_stack}
prism_tmax_stack <- stack(prism_tmax_0701_KS, prism_tmax_0701_KS)

#--- extract tmax values ---#
tmax_from_prism_stack <- raster::extract(prism_tmax_stack, KS_wells)

#--- take a look ---#
head(tmax_from_prism_stack, 10)
```

Instead of a vector, the returned object is a matrix with columns representing raster layers.


## Polygons against a RasterLayer or a RasterStack

You can use the same `extract()` function to extract values from a raster layer for polygons. However, the `exact_extract()` function from the `exactextractr` package is a better alternative. `exact_extract()` is faster and more accurate than `extract()` (more on this later).^[See [here](https://github.com/isciences/exactextract) for how it does extraction tasks differently from other major GIS software. Unfortunately, `exact_extract()` does not work with points data at the moment.] The syntax of `exact_extract()` is very similar to that of `extract()`. 

```{r eval = FALSE}
exact_extract(source_raster, target_polygon) 
```

So, to get tmax values from the PRISM raster layer for KS county polygons, the following does the job: 

```{r exact_extract, eval = F}
library("exactextractr")

#--- extract values from the raster for each county ---#
tmax_by_county <- exact_extract(prism_tmax_0701_KS, KS_county)  

#--- take a look at the first 6 rows of the first two list elements ---#
tmax_by_county[1:2] %>% lapply(function(x) head(x))
```

```{r exact_extract_run, echo = F, results = "hide"}
library("exactextractr")

#--- extract values from the raster for each county ---#
tmax_by_county <- exact_extract(prism_tmax_0701_KS, KS_county)  
```

```{r show_the_results, echo = F}
#--- take a look at the first 6 rows of the first two list elements ---#
tmax_by_county[1:2] %>% lapply(function(x) head(x))
```

As you can see, `exact_extract()` returns a list, where its $i$th element corresponds to the $i$th row of the polygon data (`KS_county`). Each element of the list has two columns: `value` and `coverage_fraction`. `value` is the tmax value of an intersecting raster cell, and `coverage_fraction` is the fraction of that cell's area that falls inside the polygon, which is useful for computing coverage-weighted summaries of the extracted values. To summarize the values by polygon and merge them to the polygon data, you can first use the `bind_rows()` function to combine the list elements into one dataset. In doing so, you can use the `.id` option to create a new identifier column that links each row to the list element it came from^[`data.table` users can use `rbindlist()` with the `idcol` option.].  

```{r combine_after_ee}
tmax_combined <- bind_rows(tmax_by_county, .id = "id")
```

We can now summarize the data by `id`. Here, we calculate the coverage-weighted mean of tmax.

```{r transform_after_ee}
tmax_by_id <- tmax_combined %>% 
  #--- convert from character to numeric  ---#
  mutate(id = as.numeric(id)) %>% 
  #--- group summary ---#
  group_by(id) %>% 
  summarise(tmax = sum(value * coverage_fraction) / sum(coverage_fraction))
```
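
To make the arithmetic concrete, here is a minimal sketch that applies the same steps to a made-up list. The object `toy_list` and its numbers are hypothetical; they only mimic the structure of what `exact_extract()` returns.

```r
library(dplyr)

#--- hypothetical stand-in for the list returned by exact_extract() ---#
toy_list <- list(
  data.frame(value = c(30, 32), coverage_fraction = c(1, 0.5)),
  data.frame(value = c(28, 35, 31), coverage_fraction = c(0.25, 1, 0.75))
)

#--- same steps as above: combine, then take the coverage-weighted mean ---#
toy_means <- bind_rows(toy_list, .id = "id") %>%
  mutate(id = as.numeric(id)) %>%
  group_by(id) %>%
  summarise(tmax = sum(value * coverage_fraction) / sum(coverage_fraction))

toy_means
```

For the first element, the weighted mean is (30 × 1 + 32 × 0.5) / 1.5, or about 30.67, which is lower than the simple average of 31 because the cell with value 32 is only half covered.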

Remember that the `id` values are row numbers in the polygon data (`KS_county`). So, we can assign the tmax values to `KS_county` as follows:

```{r asign_values_after_ee}
KS_county$tmax_07_01 <- tmax_by_id$tmax
```
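
If all you need is the coverage-weighted mean, `exact_extract()` can also compute it directly when you pass the name of a summary operation as its third argument. The following is a minimal sketch on a made-up raster and polygon (`toy_raster` and `toy_poly` are hypothetical objects, not part of the datasets used in this chapter). Please check the documentation of your installed version of `exactextractr` for the supported operations.

```r
library(raster)
library(sf)
library(exactextractr)

#--- hypothetical 4 x 4 raster covering the square [0, 4] x [0, 4] ---#
toy_raster <- raster(matrix(1:16, nrow = 4), xmn = 0, xmx = 4, ymn = 0, ymx = 4)

#--- hypothetical polygon that fully covers one cell and half of its neighbor ---#
toy_poly <- st_sf(
  id = 1,
  geometry = st_sfc(st_polygon(list(
    rbind(c(0, 0), c(1.5, 0), c(1.5, 1), c(0, 1), c(0, 0))
  )))
)

#--- coverage-weighted mean in one step ---#
mean_direct <- exact_extract(toy_raster, toy_poly, "mean", progress = FALSE)

#--- the same quantity computed by hand from the cell-level output ---#
cells <- exact_extract(toy_raster, toy_poly, progress = FALSE)[[1]]
mean_manual <- sum(cells$value * cells$coverage_fraction) / sum(cells$coverage_fraction)

c(direct = mean_direct, manual = mean_manual)
```

The one-step version returns one value per polygon in the order of the polygon data, so it can be assigned to a new column without going through `bind_rows()`.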

Extracting values from `RasterStack` (or `RasterBrick`) works in exactly the same manner as `RasterLayer`.

```{r exatrac_from_stack_run, echo = F, results = "hide"}
#--- extract from a stack ---#
tmax_by_county_stack <- exact_extract(prism_tmax_stack, KS_county) %>% 
  bind_rows(.id = "id")
```

```{r exatrac_from_stack, eval = FALSE}
#--- extract from a stack ---#
tmax_by_county_stack <- exact_extract(prism_tmax_stack, KS_county) %>% 
  bind_rows(.id = "id")
```

```{r take_a_look}
#--- take a look ---#
head(tmax_by_county_stack)
```

# Some notes on the speed

## `exact_extract()` vs `raster::extract()`

`exact_extract()` uses C++ as the backend. Therefore, it is considerably faster than `raster::extract()`. Let's time both and see the difference. 

```{r extract_from_polygons}
library(tictoc)

#--- exact extract ---#
tic()
exact_extract_temp <- exact_extract(prism_tmax_0701_KS, KS_county, progress = FALSE)  
toc()

#--- extract with weights ---#
tic()
extract_temp <- raster::extract(prism_tmax_0701_KS, KS_county, weights = TRUE)  
toc()
```

As you can see, the difference is clear. The gap in computation time grows as the number of polygons and the number of raster cells increase. Indeed, for fairly large raster and polygon data, `raster::extract()` becomes unacceptably slow. 

## `RasterLayer` vs `RasterStack`

Extracting values from a RasterStack takes less time than extracting values from the individual layers one at a time. This can be observed below.   

```{r compare_speed}
#--- extract from 5 layers one at a time ---#
tic()
temp <- exact_extract(prism_tmax_0701_KS, KS_county, progress = FALSE)
temp <- exact_extract(prism_tmax_0701_KS, KS_county, progress = FALSE)
temp <- exact_extract(prism_tmax_0701_KS, KS_county, progress = FALSE)
temp <- exact_extract(prism_tmax_0701_KS, KS_county, progress = FALSE)
temp <- exact_extract(prism_tmax_0701_KS, KS_county, progress = FALSE)
toc()

#--- extract from a 2-layer stack ---#
prism_tmax_stack_2 <- stack(
    prism_tmax_0701_KS, 
    prism_tmax_0701_KS, 
    quick = TRUE
  )

tic()
temp <- exact_extract(prism_tmax_stack_2, KS_county, progress = FALSE)
toc()

#--- extract from a 5-layer stack ---#
prism_tmax_stack_5 <- stack(
    prism_tmax_0701_KS, 
    prism_tmax_0701_KS, 
    prism_tmax_0701_KS, 
    prism_tmax_0701_KS, 
    prism_tmax_0701_KS, 
    quick = TRUE
  )

tic()
temp <- exact_extract(prism_tmax_stack_5, KS_county, progress = FALSE)
toc()
```

The reduction in computation time makes sense. Since all the layers have exactly the same geographic extent and resolution, the polygons-grids correspondence is found once and then reused across all the layers of the RasterStack. By comparing the 2-layer and 5-layer stacks, you can see that having additional layers is not very costly. On the other hand, if you run `exact_extract()` on the RasterLayers individually, you find the polygons-grids correspondence every time, which is a waste of time. This clearly suggests that when you are processing many layers of the same spatial resolution and extent, you should first stack them and then extract values from all of them at once, instead of processing them one by one, as long as your memory allows you to do so. There is much more to discuss about the computation speed using RasterLayer and RasterStack. Those who are interested in this topic can go to [Chapter ?](link_here).


<!--chapter:end:SpatialInteractionVectorRaster.Rmd-->

# Spatial Interactions of Vector Data: Subsetting and Joining

```{r setup, echo = FALSE}
library(tufte)
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  comment = NA,
  message = FALSE,
  warning = FALSE,
  tidy = FALSE,
  cache.lazy = FALSE
)

opts_knit$set(
  root.dir = "/Users/tmieno2/Box/Teaching/AAEA R/GIS"
)
```

```{r, echo=FALSE, warning=FALSE, cache = FALSE}
#--- load packages ---#
suppressMessages(library(data.table))
suppressMessages(library(exactextractr))
suppressMessages(library(stringr))
suppressMessages(library(sf))
suppressMessages(library(ggplot2))
suppressMessages(library(raster))
suppressMessages(library(tidyverse))
suppressMessages(library(tictoc))
suppressMessages(library(stargazer))
suppressMessages(library(tmap))
suppressMessages(library(future.apply))
suppressMessages(library(lubridate))
```


```{r figure_setup, echo = FALSE, cache = FALSE}
theme_update(
  axis.title.x = element_text(size=12,angle=0,hjust=.5,vjust=-0.3,face="plain",family="Times"),
  axis.title.y = element_text(size=12,angle=90,hjust=.5,vjust=.9,face="plain",family="Times"),

  axis.text.x = element_text(size=10,angle=0,hjust=.5,vjust=1.5,face="plain",family="Times"),
  axis.text.y = element_text(size=10,angle=0,hjust=1,vjust=0,face="plain",family="Times"),

  axis.ticks = element_line(size=0.3, linetype="solid"),
  # axis.ticks = element_blank(),
  axis.ticks.length = unit(.15,'cm'),
  # axis.ticks.margin = unit(.1,'cm'),
  # axis.text = element_text(margin=unit(.1,'cm')),

  #--- legend ---#
  legend.text = element_text(size=10,angle=0,hjust=0,vjust=0,face="plain",family="Times"),
  legend.title = element_text(size=10,angle=0,hjust=0,vjust=0,face="plain",family="Times"),
  legend.key.size = unit(0.5, "cm"),

  #--- strip (for faceting) ---#
  strip.text = element_text(size = 10,family="Times"),

  #--- plot title ---#
  plot.title=element_text(family="Times", face="bold", size=12),

  #--- margin ---#
  # plot.margin = margin(0, 0, 0, 0, "cm"),

  #--- panel ---#
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_blank(),
  panel.border = element_rect(fill=NA)
  )
```


## Introduction

## Topological relations

Before we learn spatial subsetting and joining, we first look at topological relations. Topological relations refer to the way multiple spatial objects are spatially related to one another. You can identify various types of spatial relations using the `sf` package. Our main focus is on the intersections of spatial objects, which can be found using `st_intersects()`.^[I would say it is very rare that you use other topological relations like `st_within()` or `st_touches()`.] We also briefly cover `st_is_within_distance()`^[Run `?geos_binary_pred` to see other topological relations you can find.]. 

---

We first create `sf` objects we are going to use for illustrations.

**POINTS**

```{r create_points}
#--- create points ---#
point_1 <- st_point(c(2, 2))
point_2 <- st_point(c(1, 1))
point_3 <- st_point(c(1, 3))

#--- combine the points to make a single  sf of points ---#
(
points <- list(point_1, point_2, point_3) %>% 
  st_sfc() %>% 
  st_as_sf() %>% 
  mutate(name = c("point 1", "point 2", "point 3"))
)
```

---

**LINES**

```{r create_lines}
#--- create lines ---#
line_1 <- st_linestring(rbind(c(0, 0), c(2.5, 0.5)))
line_2 <- st_linestring(rbind(c(1.5, 0.5), c(2.5, 2)))

#--- combine the lines to make a single sf of lines ---#
(
lines <- list(line_1, line_2) %>% 
  st_sfc() %>% 
  st_as_sf() %>% 
  mutate(name = c("line 1", "line 2"))
)
```

---

**POLYGONS**

```{r create_polygons}
#--- create polygons ---#
polygon_1 <- st_polygon(list(
  rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)) 
))

polygon_2 <- st_polygon(list(
  rbind(c(0.5, 1.5), c(0.5, 3.5), c(2.5, 3.5), c(2.5, 1.5), c(0.5, 1.5)) 
))

polygon_3 <- st_polygon(list(
  rbind(c(0.5, 2.5), c(0.5, 3.2), c(2.3, 3.2), c(2, 2), c(0.5, 2.5)) 
))

#--- combine the polygons to make an sf of polygons ---#
(
polygons <- list(polygon_1, polygon_2, polygon_3) %>% 
  st_sfc() %>% 
  st_as_sf() %>% 
  mutate(name = c("polygon 1", "polygon 2", "polygon 3"))
)
```

---

Figure \@ref(fig:plot-point-polygons) shows how they look: 

```{r plot-point-polygons, fig.cap = "Visualization of the points, lines, and polygons"}
ggplot() +
  geom_sf(data = polygons, aes(fill = name), alpha = 0.3) +
  scale_fill_discrete(name = "Polygons") +
  geom_sf(data = lines, aes(color = name)) +
  scale_color_discrete(name = "Lines") + 
  geom_sf(data = points, aes(shape = name), size = 3) +
  scale_shape_discrete(name = "Points")  
``` 

### st_intersects()  

This function identifies which `sfg` objects in an `sf` (or `sfc`) intersect with `sfg` object(s) in another `sf`. For example, you can use the function to identify which well is located within which NRD. `st_intersects()` is the most commonly used topological relation, and it is the default topological relation used when performing spatial subsetting and joining, which we will cover later. 

---

**points and polygons**

```{r intersects_point_polygons}
st_intersects(points, polygons)
```
As you can see, the output is a list of which polygon(s) each of the points intersects with. The 1, 2, and 3 in the first row mean that the 1st (polygon 1), 2nd (polygon 2), and 3rd (polygon 3) objects of `polygons` intersect with the first point (point 1) of the `points` object. The fact that point 1 is considered to be intersecting with polygon 2 means that the area inside the border is considered a part of the polygon (of course). Point 1 also sits on a vertex of polygon 1 and of polygon 3, so a point on the border counts as intersecting as well. 

If you would like the results of `st_intersects()` in a matrix form with boolean values filling the matrix, you can add the `sparse = FALSE` option. 

```{r, cache=TRUE}
st_intersects(points, polygons, sparse = FALSE)
```

---

**lines and polygons**

```{r intersects_lines_polygons}
st_intersects(lines, polygons)
```

The output is a list of which polygon(s) each of the lines intersects with. 

---

**polygons and polygons**

For polygons vs polygons interaction, `st_intersects()` identifies any polygons that either touch (even at a point) or share some area.

```{r, cache=TRUE}
st_intersects(polygons, polygons)
```
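
To see the "touch even at a point" behavior in isolation, here is a minimal sketch with two made-up squares (`square_a` and `square_b` are hypothetical objects) that share only the corner at (1, 1). It contrasts `st_intersects()` with `st_overlaps()`, another predicate from the `sf` package that requires the two polygons to share some area.

```r
library(sf)

#--- two hypothetical unit squares that share only the corner (1, 1) ---#
square_a <- st_sfc(st_polygon(list(
  rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
)))
square_b <- st_sfc(st_polygon(list(
  rbind(c(1, 1), c(2, 1), c(2, 2), c(1, 2), c(1, 1))
)))

#--- touching at a single point is enough to intersect ---#
touch_intersects <- st_intersects(square_a, square_b, sparse = FALSE)

#--- but the squares do not share any area ---#
touch_overlaps <- st_overlaps(square_a, square_b, sparse = FALSE)

c(intersects = touch_intersects[1, 1], overlaps = touch_overlaps[1, 1])
```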

### st_intersection() {#st_intersection}

Instead of getting just indices of intersecting objects, __st_intersection()__ returns intersecting spatial objects. 

---

**lines and polygons**

The following code gets the intersection of line 2 and the polygons.

```{r, cache=TRUE}
intersections <- st_intersection(lines[2, ], polygons) %>% 
  mutate(int_name = paste0(name, "-", name.1))

#--- take a look ---#
intersections
```

As you can see in Figure \@ref(fig:lines-polygons-int) below, each instance of the intersections of the line and polygons becomes an observation (line 2-polygon 1 and line 2-polygon 2). Note also that the part of the line that did not intersect is cut out and does not remain in the returned `sf`.^[See Chapter 1, Demonstration 3 for an example of lines-polygons intersection in an economic study.]

```{r lines-polygons-int, fig.cap = "The outcome of the intersections of the lines and polygons"}
ggplot() +
  #--- here are all the original polygons  ---#
  geom_sf(data = polygons, aes(fill = name), alpha = 0.1) +
  #--- here is what is returned after st_intersection ---#
  geom_sf(data = intersections, aes(color = int_name), size = 1.5)
```

---

**polygons and polygons**

The following code gets the intersection of polygon 1 and polygon 3 with polygon 2.

```{r, cache=TRUE}
intersections <- st_intersection(polygons[c(1,3), ], polygons[2, ]) %>% 
  mutate(int_name = paste0(name, "-", name.1))

#--- take a look ---#
intersections
```

As you can see in Figure \@ref(fig:polygons-polygons-int), each instance of the intersections of polygons 1 and 3 against polygon 2 becomes an observation (polygon 1-polygon 2 and polygon 3-polygon 2). Just like the lines-polygons case, the non-intersecting parts of polygons 1 and 3 are cut out and do not remain in the returned `sf`. We will see later that `st_intersection()` can be used to find area-weighted values from the intersecting polygons with help from `st_area()`.  

```{r polygons-polygons-int, fig.cap = "The outcome of the intersections of polygon 2 and polygons 1 and 3"}
ggplot() +
  #--- here are all the original polygons  ---#
  geom_sf(data = polygons, aes(fill = name), alpha = 0.1) +
  #--- here is what is returned after st_intersection ---#
  geom_sf(data = intersections, aes(fill = int_name))
```
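
As a small preview of how `st_intersection()` and `st_area()` work together, the sketch below recreates polygon 1 and polygon 2 from above and computes the area of their overlap. It is self-contained, so the object names (`poly_a`, `poly_b`, `overlap`) are introduced only for this illustration.

```r
library(sf)

#--- same coordinates as polygon 1 and polygon 2 above ---#
poly_a <- st_sfc(st_polygon(list(
  rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))
)))
poly_b <- st_sfc(st_polygon(list(
  rbind(c(0.5, 1.5), c(0.5, 3.5), c(2.5, 3.5), c(2.5, 1.5), c(0.5, 1.5))
)))

#--- geometry of the overlapping part ---#
overlap <- st_intersection(poly_a, poly_b)

#--- area of the overlap and its share of polygon 1 ---#
overlap_area <- st_area(overlap)
share_of_a <- overlap_area / st_area(poly_a)

c(area = overlap_area, share = share_of_a)
```

The overlap is the rectangle that runs from 0.5 to 2 horizontally and from 1.5 to 2 vertically, so its area is 0.75. Shares like `share_of_a` are what serve as weights in area-weighted calculations.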

### st_is_within_distance()  

As the name suggests, this function identifies whether two spatial objects are within the distance you specify^[This function can be useful for identifying neighbors. For example, you may want to find irrigation wells located around well $i$ to label them as well $i$'s neighbors.].  

Let's first create two sets of points. 

```{r create_random_points}
set.seed(38424738)

points_set_1 <- lapply(1:5, function(x) st_point(runif(2))) %>% 
  st_sfc() %>% st_as_sf() %>% 
  mutate(id = 1:nrow(.))

points_set_2 <- lapply(1:5, function(x) st_point(runif(2))) %>% 
  st_sfc() %>% st_as_sf() %>% 
  mutate(id = 1:nrow(.))
```

Here is how they are spatially distributed (Figure \@ref(fig:map-points-points-points)). Instead of circles of points, their corresponding `id` (or equivalently row number here) values are displayed.

```{r map-points-points-points, fig.cap = "The locations of the set of points"}
ggplot() +
  geom_sf_text(data = points_set_1, aes(label = id), color = "red") +
  geom_sf_text(data = points_set_2, aes(label = id), color = "blue") 
```

We want to know which of the blue points (`points_set_2`) are located within 0.2 of each of the red points (`points_set_1`). The following figure (Figure \@ref(fig:points-points-within)) gives us the answer visually.

```{r points-points-within, fig.cap = "The blue points within 0.2 radius of the red points"}
#--- create 0.2 buffers around points in points_set_1 ---#
buffer_1 <- st_buffer(points_set_1, dist = 0.2)

ggplot() +
  geom_sf(data = buffer_1, color = "red", fill = NA) +
  geom_sf_text(data = points_set_1, aes(label = id), color = "red") +
  geom_sf_text(data = points_set_2, aes(label = id), color = "blue") 
```

Confirm your visual inspection results with the outcome of the following code using the `st_is_within_distance()` function.

```{r within_distance_blue_red}
st_is_within_distance(points_set_1, points_set_2, dist = 0.2)
```
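
One way to convince yourself of what the function does is to compare it against pairwise distances. The following is a minimal sketch with made-up points (`red_pts` and `blue_pts` are hypothetical and have no CRS, so distances are plain Euclidean distances): it checks that `st_is_within_distance()` agrees with thresholding the matrix returned by `st_distance()`.

```r
library(sf)

#--- hypothetical points with hand-picked coordinates ---#
red_pts <- st_sfc(st_point(c(0, 0)), st_point(c(1, 1)))
blue_pts <- st_sfc(
  st_point(c(0.1, 0)),
  st_point(c(0.9, 1.15)),
  st_point(c(0.5, 0.5))
)

#--- rows are red points, columns are blue points ---#
within_mat <- st_is_within_distance(red_pts, blue_pts, dist = 0.2, sparse = FALSE)

#--- the same answer from pairwise distances ---#
dist_mat <- st_distance(red_pts, blue_pts)
same_answer <- all(within_mat == (dist_mat <= 0.2))

within_mat
same_answer
```

For large datasets, computing the full distance matrix can be memory intensive, which is one reason to prefer the sparse list that `st_is_within_distance()` returns by default.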

## Spatial Subsetting (or Flagging)

Spatial subsetting refers to operations that narrow down the geographic scope of a spatial object based on another spatial object. We illustrate spatial subsetting using Kansas county borders, the boundary of the High-Plains Aquifer (HPA), and agricultural irrigation wells in Kansas.    

First, let's import all the files we will use in this section. 

```{r hp_import, results = "hide"}
#--- Kansas county borders ---#
KS_counties <- readRDS("./Data/KS_county_borders.rds")

#--- HPA boundary ---#
hpa <- st_read(dsn = "./Data", layer = "hp_bound2010") %>% 
  .[1, ] %>% 
  st_transform(st_crs(KS_counties))  

#--- all the irrigation wells in KS ---#
KS_wells <- readRDS("./Data/Kansas_wells.rds") %>% 
  st_transform(st_crs(KS_counties))

#--- US railroad ---#
rail_roads <- st_read(dsn = "./Data/", layer = "tl_2015_us_rails") %>% 
  st_transform(st_crs(KS_counties)) 
```

### polygons vs polygons

The following map (Figure \@ref(fig:overlap-KS-county-HPA)) shows the Kansas portion of the HPA and KS counties.

```{r overlap-KS-county-HPA, fig.cap = "Kansas portion of High-Plains Aquifer and Kansas counties"}
#--- add KS counties layer ---#
tm_shape(KS_counties) +
  tm_polygons() +
#--- add High-Plains Aquifer layer ---#
tm_shape(hpa) +
  tm_fill(col = "blue", alpha = 0.3)
```

The goal here is to select only the counties that intersect with the HPA boundary. When subsetting a data.frame by specifying the row numbers you would like to select, you can do 

```{r eval = FALSE}
data.frame[vector of row numbers, ]
```

Spatial subsetting of sf objects uses a similar syntax:   

```{r syntax_subset, eval = FALSE}
sf_1[sf_2, ]
```

where you are subsetting `sf_1` based on `sf_2`. Instead of row numbers, you provide another sf object in their place. The following code spatially subsets KS counties based on the HPA boundary.

```{r spatial_subset}
counties_in_hpa <- KS_counties[hpa, ]
```

See the results below in Figure \@ref(fig:).

```{r default_subset, fig.cap = "The results of spatially subsetting KS counties based on HPA boundary"}
#--- add KS counties layer ---#
tm_shape(counties_in_hpa) +
  tm_polygons() +
#--- add High-Plains Aquifer layer ---#
tm_shape(hpa) +
  tm_fill(col = "blue", alpha = 0.3)
```

You can see that only the counties that intersect with the HPA boundary remained. This is because when you use the above syntax of `sf_1[sf_2, ]`, the default underlying topological relation is `st_intersects()`. So, if an object in `sf_1` intersects with any of the objects in `sf_2` even slightly, then it will remain after subsetting. 

You can specify the spatial operation to be used as an option as in 

```{r eval = FALSE}
sf_1[sf_2, op = topological_relation_type] 
```

For example, if you only want counties that are completely within the HPA boundary, you can do the following (the map of the results in Figure \@ref(fig:within-subset)):

```{r st_within}
counties_within_hpa <- KS_counties[hpa, , op = st_within]
```
 
```{r within-subset, fig.cap = "Kansas counties that are completely within HPA boundary"}
#--- add KS counties layer ---#
tm_shape(counties_within_hpa) +
  tm_polygons() +
#--- add High-Plains Aquifer layer ---#
tm_shape(hpa) +
  tm_fill(col = "blue", alpha = 0.3)
```
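
The difference between the default and `op = st_within` is easy to see on a toy example. The sketch below uses made-up squares (`toy_region`, `toy_units`, and the helper `make_square()` are all hypothetical names introduced for this illustration): one unit lies inside the region, one straddles its border, and one is outside.

```r
library(sf)

#--- hypothetical helper: a square with lower-left corner (xmin, ymin) ---#
make_square <- function(xmin, ymin, size) {
  st_polygon(list(rbind(
    c(xmin, ymin),
    c(xmin + size, ymin),
    c(xmin + size, ymin + size),
    c(xmin, ymin + size),
    c(xmin, ymin)
  )))
}

#--- the region used for subsetting: the square [0, 3] x [0, 3] ---#
toy_region <- st_sf(geometry = st_sfc(make_square(0, 0, 3)))

#--- three units: inside, straddling the border, and outside ---#
toy_units <- st_sf(
  name = c("inside", "straddling", "outside"),
  geometry = st_sfc(
    make_square(0.5, 0.5, 1),
    make_square(2.5, 0.5, 1),
    make_square(5, 0, 1)
  )
)

#--- default (st_intersects): keeps "inside" and "straddling" ---#
kept_default <- toy_units[toy_region, ]$name

#--- st_within: keeps only "inside" ---#
kept_within <- toy_units[toy_region, , op = st_within]$name

kept_default
kept_within
```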

<!-- 
#%%%%%%%%%%%%%%%%%%%%%
# Points vs Polygons 
#%%%%%%%%%%%%%%%%%%%%%
-->

### points vs polygons

The following map (Figure \@ref(fig:map-wells-county)) shows the Kansas portion of the HPA and all the irrigation wells in KS.

```{r map-wells-county, fig.cap = "A map of Kansas irrigation wells and HPA"}
tm_shape(KS_wells) +
  tm_symbols(size = 0.1) +
tm_shape(hpa) +
  tm_polygons(col = "blue", alpha = 0.1) 
```

We can select only wells that reside within the HPA boundary using the same syntax as the above example.

```{r wells_hpa}
KS_wells_in_hpa <- KS_wells[hpa, ]
```

As you can see in Figure \@ref(fig:map-wells-in-hpa) below, only the wells that are inside (or intersect with) the HPA remained, as the default topological relation is `st_intersects()`.  

```{r map-wells-in-hpa, fig.cap = "Kansas irrigation wells that intersect the HPA"}
tm_shape(KS_wells_in_hpa) +
  tm_symbols(size = 0.1) +
tm_shape(hpa) +
  tm_polygons(col = "blue", alpha = 0.1) 
```

<!-- 
#%%%%%%%%%%%%%%%%%%%%%
# Lines vs Polygons 
#%%%%%%%%%%%%%%%%%%%%%
-->

### lines vs polygons

The following map (Figure \@ref(fig:mapl-lines-county)) shows the Kansas counties and U.S. railroads.

```{r mapl-lines-county, fig.cap = "U.S. railroads and Kansas county boundary", dependson = "hp_import"}
ggplot() +
  geom_sf(data = rail_roads, col = "blue") +
  geom_sf(data = KS_counties, fill = NA)  
```

We can select only railroads that intersect with Kansas.

```{r railroads_ks_county}
railroads_KS <- rail_roads[KS_counties, ]
```

As you can see in Figure \@ref(fig:map-rail-ks) below, only the railroads that intersect with Kansas were selected. Note that the lines that go beyond the Kansas boundary are also selected in their entirety. Remember, the default is `st_intersects()`. If you would like the lines beyond the state boundary to be cut out, but the intersecting parts of those lines to remain, use `st_intersection()`.

```{r map-rail-ks, fig.cap = "Railroads that intersect Kansas county boundary"}
tm_shape(railroads_KS) +
  tm_lines(col = "blue") +
tm_shape(KS_counties) +
  tm_polygons(alpha = 0)  +
  tm_layout(frame = FALSE) 
```
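
Here is a minimal sketch of the difference between keeping a whole line and clipping it, using made-up geometries (`toy_state` and `toy_rail` are hypothetical). The line is 5 units long, but only 3 units of it fall inside the square.

```r
library(sf)

#--- hypothetical state boundary: the square [0, 3] x [0, 3] ---#
toy_state <- st_sfc(st_polygon(list(
  rbind(c(0, 0), c(3, 0), c(3, 3), c(0, 3), c(0, 0))
)))

#--- hypothetical railroad that starts and ends outside the state ---#
toy_rail <- st_sfc(st_linestring(rbind(c(-1, 1), c(4, 1))))

#--- length of the whole line (what spatial subsetting would keep) ---#
full_length <- st_length(toy_rail)

#--- clip the line to the state and measure the remaining part ---#
clipped_rail <- st_intersection(toy_rail, toy_state)
clipped_length <- st_length(clipped_rail)

c(full = full_length, clipped = clipped_length)
```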

### Flagging instead of subsetting

Sometimes, you just want to flag whether two spatial objects intersect or not, instead of dropping non-overlapping observations. In that case, you can use `st_intersects()`.

---

**Counties (polygons) against HPA boundary (polygons)**

```{r county_hpa}
#--- county ---#
KS_counties <- mutate(KS_counties, intersects_hpa  = st_intersects(KS_counties, hpa, sparse = FALSE))

#--- take a look ---#
dplyr::select(KS_counties, COUNTYFP, intersects_hpa)
```

---

**Wells (points) against HPA boundary (polygons)**

```{r well_hpa_flag}
#--- wells ---#
KS_wells <- mutate(KS_wells, in_hpa  = st_intersects(KS_wells, hpa, sparse = FALSE))

#--- take a look ---#
dplyr::select(KS_wells, site, in_hpa)
```

---

**U.S. railroads (lines) against Kansas county (polygons)**

Unlike the previous two cases, multiple objects (lines) are checked against multiple objects (polygons) for intersection^[Of course, this situation arises for a polygons-polygons case as well. The above polygons-polygons example was an exception because `hpa` has only one polygon object.]. Therefore, we cannot use the strategy we took above of returning a vector of true or false using the `sparse = FALSE` option. Here, we need to count the number of intersecting counties and then assign `TRUE` if the number is greater than 0. 

```{r lines_ks_flag}
#--- check the number of intersecting KS counties ---#
int_mat <- st_intersects(rail_roads, KS_counties) %>% 
  lapply(length) %>% 
  unlist() 

#--- railroads ---#
rail_roads <- mutate(rail_roads, intersect_ks  = int_mat > 0)

#--- take a look ---#
dplyr::select(rail_roads, LINEARID, intersect_ks)
```
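
The counting step can also be written with base R's `lengths()`, which returns the length of each element of a list. Here is a minimal sketch with made-up geometries (`toy_polys` and `toy_lines` are hypothetical objects):

```r
library(sf)

#--- two hypothetical adjacent unit squares ---#
toy_polys <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
  st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
)

#--- two hypothetical lines ---#
toy_lines <- st_sfc(
  #--- runs through both squares ---#
  st_linestring(rbind(c(0.5, 0.5), c(1.5, 0.5))),
  #--- far away from both squares ---#
  st_linestring(rbind(c(3, 3), c(4, 4)))
)

#--- number of intersecting polygons for each line ---#
n_intersecting <- lengths(st_intersects(toy_lines, toy_polys))

#--- TRUE if a line intersects at least one polygon ---#
intersects_any <- n_intersecting > 0

n_intersecting
intersects_any
```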

## Spatial Join

By spatial join, we mean spatial operations that involve the following:

1. overlay one spatial layer (target layer) onto another spatial layer (source layer) 
2. for each of the observations in the target layer
    + identify which objects in the source layer it geographically intersects with (or is close to)
    + extract values associated with the intersecting objects in the source layer (and summarize if necessary)
    + assign the extracted values to the object in the target layer

For economists, this is probably the most common motivation for using GIS software, with the ultimate goal being to include the spatially joined variables as covariates in regression analysis. 

We can classify spatial join into four categories by the type of the underlying spatial objects:

+ vector-vector: vector data (target) against vector data (source)  
+ vector-raster: vector data (target) against raster data (source)  
+ raster-vector: raster data (target) against vector data (source)  
+ raster-raster: raster data (target) against raster data (source)  

Among the four, our focus here is the first case. The second case will be discussed in Chapter 5. We will not cover the third and fourth cases in this class.^[This is because it is almost always the case that our target data is vector data (e.g., cities or farm fields as points, political boundaries as polygons, etc).]  

Category 1 can be further broken down into different subcategories depending on the type of spatial objects (point, line, and polygon). Here, we will ignore any spatial joins that involve lines. This is because objects represented by lines are rarely either the units of observation in econometric analysis or the source data that we extract values from.^[Note that we did not extract any attribute values of railroads in Chapter 1, Demonstration 4. We just calculated the travel length of the railroads, which does not fall under our definition of spatial join.] So, here is the list of the types of spatial joins we will learn.  

1. points (target) against polygons (source)
2. polygons (target) against points (source)
3. polygons (target) against polygons (source)

<!-- 
#=========================================
# Spatial Joining 
#=========================================
-->

### Case 1: points (target) vs polygons (source)

In Case 1, for each of the observations (points) in the target data, we find which polygon in the source file it intersects, and then assign the value associated with the polygon to the point^[You can see a practical example of this case in action in Demonstration 1 of Chapter X.]. In order to achieve this, we can use the `st_join()` function, whose syntax is as follows:    

```{r syntax_st_join, eval = FALSE}
st_join(target_sf, source_sf)
```

Similar to spatial subsetting, the default topological relation is `st_intersects()`^[While it is unlikely you face the need to change the topological relation, you could do so using the `join` option.]. 

We use the KS irrigation wells data (points) and KS county boundary data (polygons) for a demonstration. Our goal is to assign the county-level corn price information from the KS county data to wells. First let me create and add a fake county-level corn price variable to the KS county data.  

```{r create_corn_price}
KS_corn_price <- KS_counties %>%  
  mutate(
    corn_price = seq(3.2, 3.9, length = nrow(.)) 
  ) %>% 
  dplyr::select(COUNTYFP, corn_price)
```  

Here is the map of KS county color-differentiated by fake corn price (Figure \@ref(fig:map-corn-price)):

```{r map-corn-price, fig.cap = "Map of county-level fake corn price"}
tm_shape(KS_corn_price) + 
  tm_polygons(col = "corn_price") +
  tm_layout(frame = FALSE, legend.outside = TRUE)
```

For this particular context, the following code will do the job: 

```{r st_join_KS}
#--- spatial join ---#
(
KS_wells_County <- st_join(KS_wells, KS_corn_price)
)
```

You can see from Figure \@ref(fig:map-corn-wells) below that all the wells inside the same county have the same corn price value. 

```{r map-corn-wells, fig.cap = "Map of wells color-differentiated by corn price"}
tm_shape(KS_counties) +
  tm_polygons() +
tm_shape(KS_wells_County) +
  tm_symbols(col = "corn_price", size = 0.2) +
  tm_layout(frame = FALSE, legend.outside = TRUE)
```
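
To see what `st_join()` does with a point that falls outside every polygon, here is a minimal sketch with made-up data (`toy_counties` and `toy_wells` are hypothetical objects). By default the join keeps all the rows of the target data and fills in `NA`; setting `left = FALSE` drops the unmatched rows instead.

```r
library(sf)

#--- two hypothetical counties with a fake corn price ---#
toy_counties <- st_sf(
  corn_price = c(3.2, 3.9),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
    st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
  )
)

#--- three hypothetical wells: one per county and one outside both ---#
toy_wells <- st_sf(
  site = 1:3,
  geometry = st_sfc(
    st_point(c(0.5, 0.5)),
    st_point(c(1.5, 0.5)),
    st_point(c(5, 5))
  )
)

#--- default: all wells are kept, well 3 gets NA ---#
joined_left <- st_join(toy_wells, toy_counties)

#--- left = FALSE: only wells that intersect a county are kept ---#
joined_inner <- st_join(toy_wells, toy_counties, left = FALSE)

joined_left
joined_inner
```

Checking for `NA` values after a join is a quick way to spot points that lie outside the source polygons, for example because of a CRS mismatch or imprecise coordinates.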

### Case 2: polygons (target) vs points (source)

In Case 2, for each of the observations (polygons) in the target data, we find which observations (points) in the source file it intersects, and then assign the values associated with the points to the polygon. We use the same function: `st_join()`^[You can see a practical example of this case in action in Demonstration 2 of Chapter X.]. 

Suppose you are now interested in county-level analysis and you would like to get county-level total groundwater pumping. The target file is `KS_counties`, and the source file is `KS_wells`.

```{r st_join_polygon_point}
#--- spatial join ---#
KS_County_wells <- st_join(KS_counties, KS_wells)

#--- take a look ---#
dplyr::select(KS_County_wells, COUNTYFP, site, af_used)
```

As you can see, the resulting dataset consists of all the unique polygon-point intersecting combinations. For each of the polygons, you will have as many observations as the number of wells that intersect with the polygon. Once you have joined the two layers, you can find statistics by polygon (county here). Since we want groundwater extraction by county, the following does the job.

```{r summary_after_join}
KS_County_wells %>% 
  group_by(COUNTYFP) %>% 
  summarize(af_used = sum(af_used, na.rm = TRUE)) 
```

Of course, it is just as easy to get other types of statistics by simply modifying the `summarize()` part.

However, this two-step process can actually be done in one step using the `aggregate()` function as follows:

```{r demo_aggregate}
#--- mean ---#
aggregate(KS_wells, KS_counties, FUN = mean)

#--- sum ---#
aggregate(KS_wells, KS_counties, FUN = sum)
```

Notice that the `mean()` function was applied to all the columns in `KS_wells`, including site id number. So, you might want to select variables you want to join before you apply the `aggregate()` function like this:  

```{r agg_select}
aggregate(dplyr::select(KS_wells, af_used), KS_counties, FUN = mean)
```

### Case 3: polygons (target) vs polygons (source)

For this case, `st_join(target_sf, source_sf)` will return all the unique intersecting polygon-polygon combinations with the information of the polygon from source_sf attached.  

We will use county-level corn acres in Iowa in 2018 from USDA NASS^[See [here](link_here) for how to download Quick Stats data from within R.] and Hydrologic Units^[See [here](https://water.usgs.gov/GIS/huc.html) for an explanation of what they are. You do not really need to know what HUC units are to understand what's done in this section.]. Our objective here is to find corn acres by HUC unit based on the county-level corn acres data^[Yes, there will be substantial measurement errors as the source polygons (corn acres by county) are large relative to the target polygons (HUC units). But, this serves as a good illustration of a polygon-polygon join.].   

We first import the Iowa corn acre data:

```{r IA_corn_data}
#--- IA county-level corn acres ---#
IA_corn <- readRDS("./Data/IA_corn.rds")

#--- take a look ---#
IA_corn
```

Here is the map of IA county color-differentiated by corn acres (Figure \@ref(fig:map-IA-corn)):

```{r map-IA-corn, fig.cap = "Map of Iowa counties color-differentiated by corn planted acreage"}
#--- here is the map ---#
tm_shape(IA_corn) +
  tm_polygons(col = "acres") +
  tm_layout(frame = FALSE, legend.outside = TRUE)
```

Now import the HUC units data:

```{r HUC_import, results = "hide"}
#--- import HUC units ---#
HUC_IA <- st_read(dsn = "./Data/huc250k_shp", layer = "huc250k") %>% 
  dplyr::select(HUC_CODE) %>% 
  #--- reproject to the CRS of IA ---#
  st_transform(st_crs(IA_corn)) %>% 
  #--- select HUC units that overlap with IA ---#
  .[IA_corn, ]
``` 

Here is the map of HUC units (Figure \@ref(fig:HUC-map)):

```{r HUC-map, fig.cap = "Map of HUC units that intersect with Iowa state boundary"}
tm_shape(HUC_IA) +
  tm_polygons() +
  tm_layout(frame = FALSE, legend.outside = TRUE)
```

IA county with HUC units superimposed on top (Figure \@ref(fig:HUC-county-map)):

```{r HUC-county-map, fig.cap = "Map of HUC units superimposed on Iowa counties"}
tm_shape(IA_corn) +
  tm_polygons(col = "acres") +
tm_shape(HUC_IA) +
  tm_polygons(alpha = 0) +
  tm_layout(frame = FALSE, legend.outside = TRUE)
```

```{r join_HUC_acres}
HUC_joined <- st_join(HUC_IA, IA_corn)
```


#### Area-weighted average (use area-preserving projection)

To calculate an area-weighted average, we can first use `st_intersection()`. For each of the polygons in the target layer, this function finds the intersecting polygons from the source data, and then divides the target polygon into parts based on the boundaries of the intersecting polygons. 

```{r intersection}
HUC_intersections <- st_intersection(HUC_IA, IA_corn) %>% 
  arrange(HUC_CODE) %>% 
  mutate(huc_county = paste0(HUC_CODE, "-", county_code))
```

The key difference from the `st_join()` example (see \@ref(st_intersection)) is that it returns a geometry variable that represents the intersecting area of the HUC units and the counties as shown in Figure \@ref(fig:inter-ex) below. 

```{r inter-ex, fig.cap = "Intersections of a HUC unit and Iowa counties"}
tm_shape(filter(HUC_intersections, HUC_CODE == "07020009")) + 
  tm_polygons(col = "huc_county") +
  tm_layout(frame = FALSE)
```

```{r map_area_weighted}
HUC_aw_acres <- HUC_intersections %>% 
  #--- get area ---#
  mutate(area = as.numeric(st_area(.))) %>% 
  #--- get area-weight by HUC unit ---#
  group_by(HUC_CODE) %>% 
  mutate(weight = area / sum(area)) %>% 
  #--- calculate area-weighted corn acreage ---#
  summarize(aw_acres = sum(weight * acres))  
```
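To see the arithmetic behind the area weights, here is a minimal sketch on made-up polygons. The two squares in `toy_src` (stand-ins for counties) and the rectangle in `toy_tgt` (a stand-in for a HUC unit) are hypothetical and have no CRS, so areas are simply planar. The target overlaps the first square over an area of 2 and the second over an area of 4, so the weights are 1/3 and 2/3.

```{r sketch_aw_toy}
library(sf)
library(dplyr)

#--- two hypothetical source polygons (stand-ins for counties) ---#
toy_src <- st_sf(
  acres = c(100, 200),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)))),
    st_polygon(list(rbind(c(2, 0), c(4, 0), c(4, 2), c(2, 2), c(2, 0))))
  )
)

#--- one hypothetical target polygon (stand-in for a HUC unit) ---#
toy_tgt <- st_sf(
  id = "A",
  geometry = st_sfc(
    st_polygon(list(rbind(c(1, 0), c(4, 0), c(4, 2), c(1, 2), c(1, 0))))
  )
)

#--- same steps as above: intersect, get area, weight, and summarize ---#
toy_aw <- st_intersection(toy_tgt, toy_src) %>%
  mutate(area = as.numeric(st_area(.))) %>%
  group_by(id) %>%
  mutate(weight = area / sum(area)) %>%
  summarize(aw_acres = sum(weight * acres))

#--- (1/3) * 100 + (2/3) * 200 ---#
toy_aw$aw_acres
```

Note that the weights sum to one within each target polygon, so the result is a weighted average of the source values, not an apportioned total.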


<!--chapter:end:SpatialInteractionVectorVector.Rmd-->

# Handle vector data using the sf package {#vector-basics}

```{r setup, echo = FALSE}
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  comment = NA,
  message = FALSE,
  warning = FALSE,
  tidy = FALSE,
  cache.lazy = FALSE
)

opts_knit$set(root.dir = "~/Box/Teaching/AAEA R/GIS")
```

```{r setwd, eval = FALSE, echo = FALSE}
setwd("~/Box/Teaching/AAEA R/GIS")
```

```{r, echo=FALSE, warning=FALSE, cache = FALSE}
#--- load packages ---#
suppressMessages(library(data.table))
suppressMessages(library(here))
suppressMessages(library(stringr))
suppressMessages(library(sf))
suppressMessages(library(ggplot2))
suppressMessages(library(raster))
suppressMessages(library(stargazer))
suppressMessages(library(tmap))
suppressMessages(library(future.apply))
suppressMessages(library(lubridate))
suppressMessages(library(tidyverse))
```

## Introduction

In this chapter we learn how to use the `sf` package to handle and operate on spatial datasets. The `sf` package uses the class of simple feature (`sf`)^[yes, it is the same as the package name] for spatial objects in R. We first learn how `sf` objects store and represent spatial datasets. We then move on to the following practical topics:

+ read and write a shapefile and spatial data in other formats (and why you might not want to use the shape file system any more, but use other alternative formats)
+ project and reproject spatial objects
+ convert `sf` objects into `sp` objects, and vice versa
+ confirm that `dplyr` works well with `sf` objects
+ implement non-interactive (does not involve two `sf` objects) geometric operations on `sf` objects
  * create buffers 
  * find the area of polygons
  * find the centroid of polygons
  * calculate the length of lines

### `sf` or `sp`?

The `sf` package was designed to replace the `sp` package (both developed by Edzer Pebesma), which has been one of the most popular and powerful spatial packages in R for more than a decade. It has been about four years since the `sf` package was first registered on CRAN. A couple of years back, many other spatial packages did not have support for the package yet. In this [blog post](https://www.r-bloggers.com/should-i-learn-sf-or-sp-for-spatial-r-programming/) that asked the question of whether one should learn `sp` or `sf`, the author said:

"That's a tough question. If you have time, I would say, learn to use both. sf is pretty new, so a lot of packages that depend on spatial classes still rely on sp. So you will need to know sp if you want to do any integration with many other packages, including raster (as of March 2018).

However, in the future we should see an increasing shift toward the sf package and greater use of sf classes in other packages. I also think that sf is easier to learn to use than sp."

The future has come, and it's not a tough question anymore. I cannot think of any major spatial packages that do not support the `sf` package, and `sf` has largely become the standard for handling vector data in $R$. Thus, this lecture note does not cover how to use `sp` at all.^[except we learn how to convert back and forth between `sf` objects and `sp` objects just in case you need `sp` objects.]

`sf` has several advantages over the `sp` package [@pebesma2018simple].^[There are cases where `sp` is faster than `sf` at completing the same task. For example, see the answer to [this question](https://gis.stackexchange.com/questions/324952/spover-vs-sfst-intersection-in-r). But I suspect the difference between the two is practically negligible even with datasets bigger than the test data.] First, it cut off the tie that `sp` had with the ESRI shapefile system, which had a somewhat loose way of representing spatial data. Instead, it uses _simple feature access_, which is an open standard supported by the Open Geospatial Consortium (OGC). Another important benefit is its compatibility with the `tidyverse` package, which includes widely popular packages like `ggplot2` and `dplyr`. Consequently, map-making with `ggplot()` and data wrangling with a family of `dplyr` functions come very naturally to many $R$ users. `sp` objects have different slots for spatial information and attributes data, and they are not amenable to the `dplyr` way of data transformation.


## Spatial Data Structure

Here, we learn how the `sf` package stores spatial data, along with the definitions of three key `sf` objects: simple feature geometry (`sfg`), simple feature geometry list-column (`sfc`), and simple feature (`sf`). The `sf` package provides a simple way of storing geographic information and the attributes of the geographic units in a single dataset. This special type of dataset is called simple feature (`sf`). It is best to take a look at an example to see how this is achieved. We use North Carolina county boundaries with county attributes (Figure \@ref(fig:nc-county)).  

```{r nc_import}
#--- a dataset that comes with the sf package ---#
nc <- st_read(system.file("shape/nc.shp", package="sf")) 
```

```{r nc-county, fig.cap = "North Carolina county boundary"}
library(tmap)
tm_shape(nc) +
  tm_polygons() +
  tm_layout(frame = NA)
```

As you can see below, this dataset is of class `sf` (and `data.frame` at the same time).

```{r class_sf}
class(nc)
```

Now, let's take a look inside of `nc`.

```{r }
#--- take a look at the data ---#
head(nc)
```

Just like a regular `data.frame`, you see a number of variables (attributes) except that you have a variable called `geometry` at the end. Each row represents a single geographic unit (here, county). Ashe County (1st row) has area of $0.114$, FIPS code of $37009$, and so on. And the entry in `geometry` column at the first row represents the geographic information of Ashe County. An entry in the `geometry` column is a simple feature geometry (`sfg`), which is an $R$ object that represents the geographic information of a single geometric feature (county in this example). There are different types of `sfg`s (`POINT`, `LINESTRING`, `POLYGON`, `MULTIPOLYGON`, etc). Here, `sfg`s representing counties in NC are of type `MULTIPOLYGON`. Let's take a look inside the `sfg` for Ashe County using `st_geometry()`.

```{r geometry}
st_geometry(nc[1, ])[[1]][[1]]
```

As you can see, the `sfg` consists of a number of points (pairs of two numbers). Connecting the points in the order they are stored delineates the Ashe County boundary.

```{r }
plot(st_geometry(nc[1, ])) 
``` 

We will take a closer look at different types of `sfg` in the next section. 

Finally, the `geometry` variable is a list of individual `sfg`s, called simple feature geometry list-column (`sfc`).

```{r }
dplyr::select(nc, geometry)
``` 

Elements of a geometry list-column are allowed to be different in nature from other elements^[just like a regular `list` object can contain mixed types of elements: numeric, character, etc]. In the `nc` data, all the elements (`sfg`s) in the `geometry` column are `MULTIPOLYGON`. However, you could also have `LINESTRING` or `POINT` objects mixed with `MULTIPOLYGON` objects in a single `sf` object if you would like. 
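As a quick check of how the three layers fit together, the sketch below inspects the classes of the geometry column and of one of its elements, and then builds a small mixed-type `sfc`. The object names (`nc_chk`, `mixed_sfc`) are made up for this illustration; the dataset is the same North Carolina file that ships with `sf`.

```{r sketch_sfc_classes}
library(sf)

#--- the NC data that comes with the sf package ---#
nc_chk <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

#--- the geometry column is an sfc ---#
class(st_geometry(nc_chk))

#--- each element of the sfc is an sfg ---#
class(st_geometry(nc_chk)[[1]])

#--- geometry types present in the data ---#
table(st_geometry_type(nc_chk))

#--- an sfc that mixes a POINT and a LINESTRING ---#
mixed_sfc <- st_sfc(
  st_point(c(0, 0)),
  st_linestring(rbind(c(0, 0), c(1, 1)))
)

class(mixed_sfc)
```

When all elements share a type, the `sfc` class carries that type in its name; when types are mixed, `sf` falls back to a generic geometry class.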

## Simple feature geometry, simple feature geometry list-column, and simple feature

Here, we learn how different types of `sfg` are constructed. We also learn how to create `sfc` and `sf` from `sfg` from scratch.^[I must say that creating spatial objects from scratch yourself is an unnecessary skill for many of us as an economist. But, it is still good to know the underlying structure of the data. Also, occasionally the need arises. For example, I had to construct spatial objects from scratch when I designed on-farm randomized nitrogen trials. In such cases, it is of course necessary to understand how different types of `sfg` are constructed, create `sfc` from a collection of `sfg`s, and then create an `sf` from a `sfc`.]    

### Simple feature geometry (`sfg`)

The `sf` package uses a class of `sfg` (simple feature geometry) objects to represent a geometry of a single geometric feature (say, a city as a point, a river as a line, county and school district as polygons). There are different types of `sfg`s. Here are some example feature types that we commonly encounter as an economist^[You will hardly see the other geometry types: MULTIPOINT and GEOMETRYCOLLECTION. You may see GEOMETRYCOLLECTION after intersecting two spatial objects. You can see [here](https://r-spatial.github.io/sf/articles/sf1.html#sfg-simple-feature-geometry-1) if you are interested in learning what they are.]:

+ `POINT`: area-less feature that represents a point (e.g., well, city, farmland) 
+ `LINESTRING`: (e.g., a tributary of a river) 
+ `MULTILINESTRING`: (e.g., river with more than one tributary) 
+ `POLYGON`: geometry with a positive area (e.g., county, state, country)
+ `MULTIPOLYGON`: collection of polygons to represent a single object (e.g., countries with islands: U.S., Japan)

---

`POINT` is the simplest geometry type, and is represented by a vector of two^[or three to represent a point in the three-dimensional space] numeric values. An example below shows how a `POINT` feature can be made from scratch^[we will learn how to make `sfg` objects from scratch because it helps to better understand how the spatial data is stored.]:

```{r sf_point}
#--- create a POINT ---#
a_point <- st_point(c(2,1))
```

The `st_point()` function creates a `POINT` object when supplied with a vector of two numeric values. If you check the class of the newly created object,

```{r class}
#--- check the class of the object ---#
class(a_point)
```

You can see that it's indeed a `POINT` object. But, it's also an `sfg` object. So, `a_point` is an `sfg` object of type `POINT`. 

---

A `LINESTRING` object is represented by a sequence of points:  

```{r linestrting}
#--- collection of points in a matrix form ---#
s1 <- rbind(c(2,3),c(3,4),c(3,5),c(1,5))

#--- see what s1 looks like ---#
s1

#--- create a "LINESTRING" ---#
a_linestring <- st_linestring(s1)

#--- check the class ---#
class(a_linestring)
```

`s1` is a matrix where each row represents a point. By applying `st_linestring()` function to `s1`, you create a `LINESTRING` object. Let's see what the line looks like.

```{r plot_line}
plot(a_linestring)
```

As you can see, each pair of consecutive points in the matrix is connected by a straight line to form a line. 

---

A `POLYGON` is very similar to `LINESTRING` in the manner it is represented. 

```{r polygon_1}
#--- collection of points in a matrix form ---#
p1 <- rbind(c(0,0), c(3,0), c(3,2), c(2,5), c(1,3), c(0,0))

#--- see what p1 looks like ---#
p1

#--- create a "POLYGON" ---#
a_polygon <- st_polygon(list(p1))

#--- check the class ---#
class(a_polygon)

#--- see what it looks like ---#
plot(a_polygon)
```

Just like the `LINESTRING` object we created earlier, a `POLYGON` is represented by a collection of points. The biggest difference between them is that we need to have some positive area enclosed by lines connecting the points. To do that, you must use the same point for the first and last points to close the loop: here, it's `c(0,0)`. A `POLYGON` can have a hole in it. The first matrix of a list becomes the exterior ring, and all the subsequent matrices will be holes within the exterior ring.  

```{r polygon_hole}
#--- a hole within p1 ---#
p2 <- rbind(c(1,1), c(1,2), c(2,2), c(1,1))

#--- create a polygon with hole ---#
a_plygon_with_a_hole <- st_polygon(list(p1,p2))

#--- see what it looks like ---#
plot(a_plygon_with_a_hole)
```

---

You can create a `MULTIPOLYGON` object in a similar manner. The only difference is that you supply a list of lists of matrices, with each inner list representing a polygon. An example below: 

```{r multi_polygon}
#--- second polygon ---#
p3 <- rbind(c(4,0), c(5,0), c(5,3), c(4,2), c(4,0)) 

#--- create a multipolygon ---#
a_multipolygon <- st_multipolygon(list(list(p1,p2), list(p3)))

#--- see what it looks like ---#
plot(a_multipolygon)
```

Each of `list(p1,p2)` and `list(p3)` represents a polygon. You supply a list of these lists to the `st_multipolygon()` function to make a `MULTIPOLYGON` object.
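`MULTILINESTRING` was listed among the common geometry types but not constructed above. Here is a minimal sketch using `st_multilinestring()`, which takes a list of matrices, one per line. The coordinates and object names (`toy_l1`, `toy_l2`, `toy_multilinestring`) are made up for illustration.

```{r sketch_multilinestring}
library(sf)

#--- two lines, each a matrix of points ---#
toy_l1 <- rbind(c(2, 3), c(3, 4), c(3, 5), c(1, 5))
toy_l2 <- rbind(c(4, 4), c(5, 5), c(6, 4))

#--- create a "MULTILINESTRING" from a list of the matrices ---#
toy_multilinestring <- st_multilinestring(list(toy_l1, toy_l2))

#--- check the class ---#
class(toy_multilinestring)
```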

### Create simple feature geometry list-column (`sfc`) and simple feature (`sf`) from scratch

To make a simple feature geometry list-column (`sfc`), you can simply supply a list of `sfg` to the `st_sfc()` function as follows:

```{r gen_sfc}
#--- create an sfc ---#
sfc_ex <- st_sfc(list(a_point,a_linestring,a_polygon,a_multipolygon))
```

To create a `sf` object, you first add a `sfc` as a column to a `data.frame`.  

```{r add_sfc_to_df}
#--- create a data.frame ---#
df_ex <- data.frame(
  name=c('A','B','C','D')
)

#--- add the sfc as a column ---#
df_ex$geometry <- sfc_ex 

#--- take a look ---#
df_ex
```

At this point, it is not recognized as a `sf` by R yet.

```{r class_check}
#--- see what it looks like (this is not an sf object yet) ---#
class(df_ex)
```

You can register it as a `sf` object using `st_as_sf()`.

```{r gen_sf_yourself}
#--- let R recognize the data frame as sf ---#
sf_ex <- st_as_sf(df_ex)

#--- see what it looks like ---#
sf_ex
```

As you can see `sf_ex` is now recognized also as an `sf` object.  

```{r check_if_sf}
#--- check the class ---#
class(sf_ex)
```
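If you prefer a single step, `st_sf()` can combine attribute columns and an `sfc` directly. The sketch below uses two made-up points; `toy_sfc` and `toy_sf` are hypothetical names used only here.

```{r sketch_st_sf_one_step}
library(sf)

#--- an sfc of two points ---#
toy_sfc <- st_sfc(st_point(c(2, 1)), st_point(c(0, 3)))

#--- attributes and geometry combined in one call ---#
toy_sf <- st_sf(name = c("A", "B"), geometry = toy_sfc)

#--- check the class ---#
class(toy_sf)
```

Both routes end with the same kind of object; the two-step version above just makes it easier to see that an `sf` is a `data.frame` with a geometry list-column attached.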


## Reading and writing vector data

I claimed that you do not need ArcGIS for $99\%$ of your work as an economist. However, the vast majority of people still use ArcGIS to handle spatial data, and ArcGIS has its own system of storing spatial data^[See [here]() for the various other formats in which spatial data are stored.] called shapefile. So, chances are that your collaborators still use shapefiles. Moreover, there are many GIS data online that are available only as shapefiles. So, it is important to learn how to read and write shapefiles. 

### Reading a shape file

We can use the `st_read()` function to read a shapefile. It reads in a shape file and then turns the data into an `sf` object. Let's take a look at an example. 

```{r import_nc_shp}
#--- read a NC county boundary shape file ---#
nc_loaded <- st_read(dsn = "./Data", "nc") 
```

Typically, you have two arguments to specify for `st_read()`. The first one is `dsn`, which is basically the path to the folder that holds the shape file you want to import. The second one is `layer`, which is the name of the shapefile. Notice that you do not add the `.shp` extension to the file name: `"nc"`, not `"nc.shp"`.^[When storing a spatial dataset, ArcGIS divides the information into separate files. All of them have the same prefix, but have different extensions. We typically say we read a shape file, but we really are importing all these files including the shape file with the .shp extension. When you read those data, you just refer to the common prefix because you really are importing all the files, not just a .shp file.]

### Writing to a shape file

Writing an `sf` object as a shape file is just as easy. You use the `st_write()` function, with the first argument being the name of the `sf` object you are exporting, followed by the destination folder (`dsn`) and the name of the new shape file (`layer`). For example, the code below will export an `sf` object called `nc_loaded` as `nc.shp` (along with other supporting files). 

```{r write_nc, eval = FALSE}
st_write(nc_loaded, dsn="./Data", "nc", driver="ESRI Shapefile", append = FALSE)
```

`append = FALSE` overwrites the existing data when a file with the same name already exists. Without the option, this happens:

```{r write_nc_error, error = TRUE}
st_write(nc_loaded, dsn="./Data", "nc", driver="ESRI Shapefile")
```

### Better alternatives 

Now, if your collaborator is using ArcGIS and insists on a shapefile for his/her work, sure, you can use the above command to write a shapefile. But there is really no need to work with the shapefile system. One of the alternative data formats that is considered superior to the shapefile system is GeoPackage^[Link here], which overcomes various limitations associated with shapefiles^[see the last paragraph of [chapter 7.5 of this book](https://csgillespie.github.io/efficientR/data-carpentry.html#data-processing-with-data.table), [this blogpost](https://carto.com/blog/fgdb-gpkg/), or [this](http://switchfromshapefile.org/)]. Unlike the shapefile system, it produces only a single file with the .gpkg extension. Note that a GeoPackage file can also be easily read into ArcGIS. So, it might be worthwhile to convince your collaborators to stop using shapefiles and start using GeoPackage.  

```{r gpkg, eval = FALSE}
#--- write as a gpkg file ---#
st_write(nc, dsn = "./Data/nc.gpkg")

#--- read a gpkg file ---#
nc <- st_read("./Data/nc.gpkg")
```
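Here is a self-contained sketch of the same round trip that you can run without touching your own data folder. It writes to a temporary file, so the path (`gpkg_path`) and the layer name (`"nc_counties"`) are arbitrary choices made for this illustration. `st_layers()` lists the layers stored in the file.

```{r sketch_gpkg_roundtrip}
library(sf)

#--- the NC data that comes with the sf package ---#
nc_src <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

#--- write to a temporary GeoPackage file ---#
gpkg_path <- tempfile(fileext = ".gpkg")
st_write(nc_src, dsn = gpkg_path, layer = "nc_counties", quiet = TRUE)

#--- see what layers are in the file ---#
st_layers(gpkg_path)

#--- read it back ---#
nc_back <- st_read(gpkg_path, layer = "nc_counties", quiet = TRUE)

#--- same number of features as the original? ---#
nrow(nc_back) == nrow(nc_src)
```

Do not be surprised if the geometry column comes back under a different name than the one you wrote; the attributes and geometries themselves are what matter.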

Or better yet, if your collaborator uses R (or if it is only you who uses the data), then just save it as an rds file using `saveRDS()`, which can be of course read using `readRDS()`.

```{r save_read_nc_as_rds}
#--- save as an rds (path relative to the project root) ---#
saveRDS(nc, "./nc_county.rds")

#--- read an rds ---#
nc <- readRDS("./nc_county.rds")
```

The use of rds files is particularly attractive when the dataset is large because rds files are generally more space efficient than shape files, taking up less of your disk space. 


## Projection with a different Coordinate Reference Systems 

You often need to reproject an `sf` using a different coordinate reference system (CRS) because two or more `sf` objects must share the same CRS before you can apply geometrical operations to them or draw them on the same map. To check the current CRS of an `sf` object, you can use the `st_crs()` function. 

```{r chech_CRS}
st_crs(nc)
```

`wkt` stands for **Well Known Text**^[`sf` versions prior to 0.9 provided CRS information in the form of `proj4string`. Newer versions of `sf` present CRS in the form of `wkt` (see [this slide](https://nowosad.github.io/whyr_webinar004/#25)). You can find the reason behind this change in the same slide, starting from [here](https://nowosad.github.io/whyr_webinar004/#18).], which is one of many formats to store CRS information.^[See [here](https://spatialreference.org/ref/epsg/nad27/) for numerous other formats that represent the same CRS.] 4267 is the SRID (Spatial Reference System Identifier) defined by the European Petroleum Survey Group (EPSG) for the CRS^[You can find the CRS-EPSG number correspondence [here](http://spatialreference.org/ref/epsg/).]. 

When you transform your `sf` using a different CRS, you can use its EPSG number if the CRS has an EPSG number.^[Potential pool of CRS is infinite. Only the commonly-used CRS have been assigned EPSG SRID.] Let's transform the `sf` to `WGS 84` (another commonly used geographic coordinate system), whose EPSG number is 4326. We can use the `st_transform()` function to achieve that, with the first argument being the `sf` object you are transforming and the second being the EPSG number of the new CRS.

```{r to_4326}
#--- transform ---#
nc_wgs84 <- st_transform(nc, 4326)

#--- check if the transformation was successful ---#
st_crs(nc_wgs84)
```

Notice that `wkt` was also altered accordingly to reflect the change in CRS: datum was changed to WGS 84. Now, let's transform (reproject) the data using `NAD83 / UTM zone 17N` CRS. Its EPSG number is $26917$.^[(see [here](http://spatialreference.org/ref/epsg/nad83-utm-zone-14n/))] So, the following code does the job.

```{r to_26917}
#--- transform ---#
nc_utm17N <- st_transform(nc_wgs84, 26917)

#--- check if the transformation was successful ---#
st_crs(nc_utm17N)
```

As you can see in its CRS information, the projection system is now UTM zone 17N. 

You often need to change the CRS of a `sf` object when you interact^[e.g., spatial subsetting, joining, etc] it with another `sf` object. In such a case, you can extract the CRS of the other `sf` object using `st_crs()` and use it for transformation^[In this example, we are using the same data with two different CRS. But, you get the point.]. 

```{r to_utm17}
#--- transform ---#
nc_utm17N_2 <- st_transform(nc_wgs84, st_crs(nc_utm17N))

#--- check if the transformation was successful ---#
st_crs(nc_utm17N_2)
```
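Before interacting two `sf` objects, it can help to check programmatically whether their CRS already match. The sketch below is one way to do that, using `==` on the output of `st_crs()` and `st_is_longlat()` to tell geographic from projected coordinates. The object names (`nc_a`, `nc_b`, `nc_a_matched`) are made up for this illustration.

```{r sketch_crs_check}
library(sf)

#--- same data in two different CRS ---#
nc_a <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_b <- st_transform(nc_a, 26917)

#--- do they share the same CRS? ---#
st_crs(nc_a) == st_crs(nc_b)

#--- geographic (longitude/latitude) or projected? ---#
st_is_longlat(nc_a)
st_is_longlat(nc_b)

#--- transform only when the CRS differ ---#
if (st_crs(nc_a) != st_crs(nc_b)) {
  nc_a_matched <- st_transform(nc_a, st_crs(nc_b))
} else {
  nc_a_matched <- nc_a
}

#--- now they match ---#
st_crs(nc_a_matched) == st_crs(nc_b)
```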


<!-- However, notice that the `epsg` component of CRS is $NA$. ESRI shapefile format uses WTK (Well-known Text) format to refer to the CRS in use, which is saved in .prj file. So, if there is no corresponding SRID number for the CRS in use, the `epsg` component number would get lost when you save an `sf` object when you save it as an ESRI shapefile. This is exactly what happened to 
 -->

## Turning a data.frame of points into a `sf` 

Often times, you have a dataset with geographic coordinates as variables in a csv or other formats, which would not be recognized as a spatial dataset by R immediately when it is read into R. In this case, you need to identify which variables represent the geographic coordinates from the data set, and create an `sf` yourself. Fortunately, it is easy to do so using the `st_as_sf()` function.

```{r dataframe_to_sf}
#--- read well registration data ---#
wells <- readRDS('./Data/registration.rds') 

#--- recognize it as an sf ---#
wells_sf <- st_as_sf(wells, coords = c("longdd","latdd"))

#--- take a look at the data ---#
head(wells_sf[,1:5])
```

Note that the CRS of `wells_sf` is NA. Obviously, $R$ does not know the reference system without you telling it. We know^[Yes, YOU need to know the CRS of your data.] that the geographic coordinates in the wells data are in NAD 83 ($epsg=4269$). So, we can assign the right CRS using either `st_set_crs()` or `st_crs()`.

```{r set_crs}
#--- set CRS ---#
wells_sf <- st_set_crs(wells_sf, 4269) 

#--- or this ---#
st_crs(wells_sf) <- 4269

#--- see the change ---#
head(wells_sf[,1:5])
```
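If you already know the CRS when you create the `sf`, you can also pass it through the `crs` argument of `st_as_sf()` and skip the separate assignment. Here is a minimal sketch with a made-up `data.frame`; the coordinates in `toy_wells` are hypothetical, and only the column names mimic the wells data.

```{r sketch_st_as_sf_crs}
library(sf)

#--- a hypothetical data.frame with coordinates as variables ---#
toy_wells <- data.frame(
  wellid = 1:3,
  longdd = c(-101.5, -101.2, -100.9),
  latdd = c(40.5, 40.6, 40.4)
)

#--- create an sf and set the CRS (NAD 83) at the same time ---#
toy_wells_sf <- st_as_sf(
  toy_wells,
  coords = c("longdd", "latdd"),
  crs = 4269
)

#--- take a look ---#
toy_wells_sf
```

Note that the coordinate columns are dropped from the attributes by default once they are turned into geometry.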

## Conversion to and from sp

You can convert an `sf` object to its `sp` counterpart using `as(sf_object, "Spatial")` as in^[Before the `sf` package was introduced, `sp` was the major package that provided R with spatial data handling capability. When the `sf` package was new a few years ago, most of the spatial packages that utilize the `sp` package were not able to work with `sf` objects, and it was necessary to learn how to use the `sp` package in addition to the `sf` package. Now, most of the major spatial packages have been modified to work with `sf` objects. At least, I have not faced the need to use `sp` objects instead of `sf` objects for the past year. However, you may find instances where `sp` objects are necessary or desirable. For example, for those who run spatial econometric methods using `spdep`, creating neighbors from polygons is a bit faster using `sp` objects than using `sf` objects. In that case, it is good to know how to convert an `sf` object to an `sp` object, and vice versa.]:

```{r as_spatial}
#--- conversion ---#
wells_sp <- as(wells_sf,"Spatial")

#--- check the class of wells_sp ---#
class(wells_sp)
```

<!-- ```{r }
list(a_point, a_polygon) %>% st_sfc() %>% st_as_sf() %>% 
  as("Spatial")
```

 -->
As you can see, `wells_sp` is of class `SpatialPointsDataFrame`, points with a data frame supported by the `sp` package. The above syntax works for converting an `sf` of polygons into `SpatialPolygonsDataFrame` as well^[The function does not work for an `sf` object that consists of different geometry types (e.g., POINT and POLYGON). This is because `sp` objects do not allow different types of geometries in a single `sp` object. For example, `SpatialPointsDataFrame` consists only of points data.].     

You can revert `wells_sp` back to an `sf` object using the `st_as_sf()` function, as follows:

```{r back_to_sf}
#--- revert back to sf ---#
wells_sf <- st_as_sf(wells_sp)

#--- check the class ---#
class(wells_sf)
```

We do not cover how to use the `sp` package as the benefit of learning it has become marginal compared to when `sf` was just introduced a few years back^[For those interested in learning the `sp` package, [this website](https://rspatial.org/) is a good resource.]. Moreover, as we just saw, we can go back and forth between `sf` and `sp`. So, there is really no need to learn `sp`.

## Non-spatial transformation of sf 

An important feature of an `sf` object is that it is basically a `data.frame` with geometric information stored as a variable (column). This means that transforming an `sf` object works just like transforming a `data.frame`. Basically, everything you can do to a `data.frame`, you can do to an `sf` as well. The code below provides an example of basic operations including `select()`, `filter()`, and `mutate()` in action with an `sf` object, just to confirm that `dplyr` operations work with an `sf` object just like with a `data.frame`.   

```{r apply_dplyr}
#--- here is what the data looks like ---#
dplyr::select(wells_sf, wellid, nrdname, acres, regdate, nrdname)

#--- do some transformations ---#
wells_sf %>% 
  #--- select variables (geometry will always remain after select) ---#
  dplyr::select(wellid, nrdname, acres, regdate, nrdname) %>% 
  #--- keep only observations with acres > 30 ---#
  filter(acres > 30) %>% 
  #--- hectare instead of acre ---#
  mutate(hectare = acres * 0.404686) 
```

Now, let's try to get a summary of a variable by group using the  `group_by()` and `summarize()` functions.

```{r summary}
#--- summary by group ---#
wells_by_nrd <- wells_sf %>% 
  #--- group by nrdname ---#
  group_by(nrdname) %>% 
  #--- summarize ---#
  summarize(tot_acres = sum(acres, na.rm = TRUE))

#--- take a look ---#
wells_by_nrd
```

So, we got total acres by NRD as we expected. One interesting change is in the geometry variable: each NRD now has a `MULTIPOINT` `sfg`, which is the combination of all the wells (points) located inside the NRD, as you can see below.   

```{r map_combine}
tm_shape(wells_by_nrd) + 
  tm_symbols(col = "nrdname", size = 0.2) +
  tm_layout(
    frame = NA,
    legend.outside = TRUE,
    legend.outside.position = "bottom"
  )
```

This feature is unlikely to be of much use to us. If you would like to drop a geometry column, you can use the `st_drop_geometry()` function:

```{r drop_geometry}
#--- remove geometry ---#
wells_no_longer_sf <- st_drop_geometry(wells_by_nrd)

#--- take a look ---#
wells_no_longer_sf
``` 
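To see the geometry change on something small, here is a sketch with four made-up points in two groups. The data (`toy_pts`) and the grouping variable (`grp`) are hypothetical. After `summarize()`, the two points in each group are combined into a single `MULTIPOINT`, and `st_drop_geometry()` returns a table that is no longer an `sf`.

```{r sketch_summarize_geometry}
library(sf)
library(dplyr)

#--- four hypothetical points in two groups ---#
toy_pts <- st_as_sf(
  data.frame(
    grp = c("a", "a", "b", "b"),
    acres = c(10, 20, 30, 40),
    x = c(0, 1, 5, 6),
    y = c(0, 1, 5, 6)
  ),
  coords = c("x", "y")
)

#--- summary by group ---#
toy_by_grp <- toy_pts %>%
  group_by(grp) %>%
  summarize(tot_acres = sum(acres))

#--- geometry type after summarizing ---#
st_geometry_type(toy_by_grp)

#--- drop the geometry and check the class ---#
toy_no_geom <- st_drop_geometry(toy_by_grp)
class(toy_no_geom)
```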

Finally, `data.table` does not work as well with sf objects as `dplyr` does. 

```{r sf_to_datatable}
#--- convert an sf to data.table ---#
wells_by_nrd_dt <- data.table(wells_by_nrd)

#--- take a look ---#
wells_by_nrd_dt

#--- check the class ---#
class(wells_by_nrd_dt)
```

You see that `wells_by_nrd_dt` is no longer an sf object even though geometry still remains in the data. If you try to run `sf` operations on it, it will of course give you an error. Like this:

```{r ch3_create_buffer, error = TRUE}
st_buffer(wells_by_nrd_dt, dist = 2)
```

But, it is easy to revert it back to an `sf` object again by using the `st_as_sf()` function^[This was not the case before. Turning an `sf` object to a `data.table` object used to replace sfg with NA.]. 

```{r table_to_sf}
wells_by_nrd_sf_again <- st_as_sf(wells_by_nrd_dt)
``` 

So, this means that if you need fast data transformation, you can first turn an `sf` to a `data.table`, transform the data using the `data.table` functionality, and then revert back to `sf`^[Remember that conversions between `sp` and `sf` also take time.]. However, for most economists, the geometry variable itself is not of interest in the sense that it never enters econometric models. For most of us, the geographic information contained in the geometry variable is just a glue to tie two datasets together by geographic referencing. Once we get values of spatial variables of interest, there is no point in keeping your data as an `sf` object. Personally, whenever I no longer need to carry around the geometry variable, I immediately turn an `sf` object into a `data.table` for fast data transformation, especially when the data is large. 

## Geometrical operations

There are various geometrical operations that are particularly useful for economists. Here, some of the most commonly used geometrical operations are introduced^[For the complete list of available geometrical operations under the `sf` package, see [here](https://cran.r-project.org/web/packages/sf/vignettes/sf1.html).]. You can see the practical use of these functions in the Demonstration section.
 
### st_buffer 

`st_buffer()` creates a buffer around points, lines, or the border of polygons. 

---

Let's create buffers around points. First, we read well locations data. 

```{r points_import}
#--- read wells location data ---#
urnrd_wells_sf <- readRDS("./Data/urnrd_wells.rds") %>% 
  #--- project to UTM 14N WGS 84 ---#
  st_transform(32614)  
```

Here is the spatial distribution of the wells (Figure \@ref(fig:urnrd-wells)). 

```{r urnrd-wells, fig.cap = "Map of the wells"}
tm_shape(urnrd_wells_sf) +
  tm_symbols(col = "red", size = 0.1) +
  tm_layout(frame = NA)
```

Let's create buffers around the wells.

```{r gen_buffer_points}
#--- create a 1,600-meter (about one-mile) buffer around the wells ---#
wells_buffer <- st_buffer(urnrd_wells_sf, dist = 1600)
```

As you can see, there are a bunch of circles around the wells with a radius of $1,600$ meters (Figure \@ref(fig:buffer-points-map)).   

```{r buffer-points-map, fig.cap = "Buffers around wells"}
tm_shape(wells_buffer) +
  tm_polygons(alpha = 0) +
tm_shape(urnrd_wells_sf) +
  tm_symbols(col = "red", size = 0.1) +
  tm_layout(frame = NA)
```

A practical application of buffer creation will be seen in Chapter \@ref(Demo1).
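One detail worth knowing is that `dist` is measured in the units of the CRS (meters for UTM), and that the "circle" is really a polygon with many short sides. The sketch below buffers a single made-up point (the coordinates are hypothetical) and compares the buffer area with $\pi r^2$. The `nQuadSegs` argument of `st_buffer()` controls how many segments approximate each quarter circle.

```{r sketch_buffer_area}
library(sf)

#--- a hypothetical point in WGS 84 / UTM zone 14N (units: meters) ---#
toy_pt <- st_sfc(st_point(c(500000, 4500000)), crs = 32614)

#--- buffer with the default number of segments ---#
toy_buf <- st_buffer(toy_pt, dist = 1600)

#--- buffer with a much coarser approximation of the circle ---#
toy_buf_coarse <- st_buffer(toy_pt, dist = 1600, nQuadSegs = 4)

#--- areas of the buffers vs the area of a true circle ---#
area_fine <- as.numeric(st_area(toy_buf))
area_coarse <- as.numeric(st_area(toy_buf_coarse))
area_circle <- pi * 1600^2

c(fine = area_fine, coarse = area_coarse, circle = area_circle)
```

The default is accurate enough for most purposes; the coarser version produces fewer vertices and a visibly smaller area.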

---

We now create buffers around polygons. First, read NE county boundary data and select three counties (Chase, Dundy, and Perkins).

```{r ne_counties, echo=FALSE}
NE_counties <- readRDS("./Data/NE_county_borders.rds") %>%
  filter(NAME %in% c("Perkins", "Dundy", "Chase")) %>% 
  st_transform(32614)
```

Here is what they look like (Figure \@ref(fig:map-three-counties)):

```{r map-three-counties, fig.cap = "Map of the three counties"}
tm_shape(NE_counties) +
  tm_polygons('NAME', palette="RdYlGn", contrast=.3, title="County") +
  tm_layout(frame = NA)
```

The following code creates buffers around polygons (see the results in Figure \@ref(fig:buffer-county)):

```{r buffer_polygons}
NE_buffer <- st_buffer(NE_counties, dist = 2000)
```

```{r buffer-county, fig.cap = "Buffers around the three counties"}
tm_shape(NE_buffer) +
  tm_polygons(col='blue',alpha=0.2) +
tm_shape(NE_counties) +
  tm_polygons('NAME', palette="RdYlGn", contrast=.3, title="County") + 
  tm_layout(
    legend.outside=TRUE,
    frame=FALSE
  )
```

For example, this can be useful for identifying observations that are close enough to the border of political boundaries when you want to take advantage of spatial discontinuity of policies across adjacent political boundaries.    

### st_area 

The `st_area()` function calculates the area of polygons. 

```{r get_area}
#--- generate area by polygon ---#
(
  NE_counties <- mutate(NE_counties, area = st_area(NE_counties))
)
```

Now, as you can see below,

```{r check_class_area}
class(NE_counties$area)
```

the default class of the results of `st_area()` is `units`, which carries the unit of measurement along with the numbers and can get in the way of some numerical operations. So, let's turn it into a plain double.

```{r convert_to_numeric}
(
NE_counties <- mutate(NE_counties, area = as.numeric(area))
)
```
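
If you would rather change the unit before dropping it, the **units** package (which `sf` relies on for these objects) can do the conversion. The following is a minimal sketch under stated assumptions: `toy_square` is a hypothetical 1,000-meter by 1,000-meter polygon created only for this illustration, not one of the datasets used in this book.

```{r sketch_area_units}
library(sf)

#--- a hypothetical 1,000 m x 1,000 m square in a projected CRS (meters) ---#
toy_square <- st_sfc(
  st_polygon(list(rbind(
    c(0, 0), c(1000, 0), c(1000, 1000), c(0, 1000), c(0, 0)
  ))),
  crs = 32614
)

#--- area as a units object (square meters) ---#
toy_area <- st_area(toy_square)

#--- convert to square kilometers while keeping the units class ---#
toy_area_km2 <- units::set_units(toy_area, km^2)

#--- drop the units and keep a plain numeric value ---#
toy_area_km2_num <- as.numeric(toy_area_km2)
```

Converting first and dropping the units afterward keeps the conversion factor out of your own code, so the numeric column you end up with is in the unit you intended.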

`st_area()` is useful when you want to find an area-weighted average of characteristics after spatially joining two polygon layers using the `st_intersection()` function.
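
The following is a minimal sketch of such an area-weighted average. All the objects here are hypothetical and made up for illustration: `toy_source` holds two squares with a characteristic called `value`, and `toy_target` is a polygon that overlaps both of them unevenly.

```{r sketch_area_weighted}
library(sf)
library(dplyr)

#--- two hypothetical source polygons with a characteristic (value) ---#
toy_source <- st_sf(
  src_id = c("A", "B"),
  value = c(10, 30),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)))),
    st_polygon(list(rbind(c(10, 0), c(20, 0), c(20, 10), c(10, 10), c(10, 0)))),
    crs = 32614
  )
)

#--- a hypothetical target polygon overlapping A by 50 m^2 and B by 30 m^2 ---#
toy_target <- st_sf(
  target_id = 1,
  geometry = st_sfc(
    st_polygon(list(rbind(c(5, 0), c(13, 0), c(13, 10), c(5, 10), c(5, 0)))),
    crs = 32614
  )
)

#--- intersect the two layers (one row per overlapping piece) ---#
toy_parts <- st_intersection(toy_target, toy_source)

#--- area of each overlapping piece as a plain number ---#
toy_parts <- mutate(toy_parts, part_area = as.numeric(st_area(toy_parts)))

#--- area-weighted average of value by target polygon ---#
toy_awa <- toy_parts %>%
  st_drop_geometry() %>%
  group_by(target_id) %>%
  summarize(value_aw = sum(value * part_area) / sum(part_area))
```

In this toy case the weights are 50 and 30 square meters, so `value_aw` is $(10 \times 50 + 30 \times 30)/80 = 17.5$. Note that `st_intersection()` warns that attributes are assumed to be spatially constant, which is exactly the assumption an area-weighted average makes.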

### st_centroid

The `st_centroid()` function finds the centroid of each polygon.

```{r gen_centroids}
#--- create centroids ---#
(
NE_centroids <- st_centroid(NE_counties)
)
```

Here is the map of the output (Figure \@ref(fig:map-centroids)).

```{r map-centroids, fig.cap = "The centroids of the polygons"}
tm_shape(NE_counties) +
  tm_polygons() +
tm_shape(NE_centroids)+
  tm_symbols(size=0.5) +
tm_layout(
    legend.outside=TRUE,
    frame=FALSE
  )
```

Centroids can be useful when creating a map with labels because the centroid of a polygon tends to be a good place for its label, like this (Figure \@ref(fig:cent-label)).^[When creating maps with the ggplot2 package, you can use `geom_sf_text()` or `geom_sf_label()`, which automatically find where to put text. See some examples [here](https://yutani.rbind.io/post/geom-sf-text-and-geom-sf-label-are-coming/).]

```{r cent-label, fig.cap = "County names placed at the centroids of the counties"}
tm_shape(NE_counties) +
  tm_polygons() +
tm_shape(NE_centroids)+
  tm_text("NAME") +
tm_layout(
    legend.outside=TRUE,
    frame=FALSE
  )
```

Centroids may also be useful when you need to calculate the "distance" between polygons.
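
As a minimal sketch, the code below compares the centroid-to-centroid distance with the edge-to-edge distance for two hypothetical squares (`toy_polys` is made up for this illustration and is not part of the data used in this book). It uses `st_distance()`, which returns a matrix of pairwise distances.

```{r sketch_centroid_distance}
library(sf)

#--- two hypothetical squares in a projected CRS (meters) ---#
toy_polys <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)))),
  st_polygon(list(rbind(c(30, 0), c(40, 0), c(40, 10), c(30, 10), c(30, 0)))),
  crs = 32614
)

#--- centroids of the two squares ---#
toy_cent <- st_centroid(toy_polys)

#--- distance between the centroids ---#
dist_centroids <- as.numeric(st_distance(toy_cent)[1, 2])

#--- distance between the closest edges of the polygons ---#
dist_edges <- as.numeric(st_distance(toy_polys)[1, 2])
```

Here the centroids are 30 meters apart while the polygons themselves are only 20 meters apart, which is why "distance" between polygons needs to be defined before it is used in an analysis.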

### st_length

We can use `st_length()` to calculate the lengths of `LINESTRING` and `MULTILINESTRING` objects. When they are represented in geodetic coordinates, it calculates great circle distances.^[Great circle distance is the shortest distance between two points on the surface of a sphere (earth).] On the other hand, if they are projected and use a Cartesian coordinate system, it calculates Euclidean distances. We use U.S. railroad data for a demonstration.

```{r railroad_import}
#--- import US railroad data and keep only the first 10 lines ---#
(
a_railroad <- rail_roads <- st_read(dsn = "./Data/", layer = "tl_2015_us_rails")[1:10, ]
)

#--- check CRS ---#
st_crs(a_railroad)
```

It uses a geodetic coordinate system. Let's calculate the great circle distance of the lines.

```{r get_distance}
(
a_railroad <- mutate(a_railroad, length = st_length(a_railroad))
)
```
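
To see how the two cases compare, here is a minimal sketch that measures the same line twice: once in geodetic coordinates and once after projecting it. The line `toy_line_ll` is hypothetical (a one-degree segment along the 40th parallel, made up for this illustration), and the choice of UTM zone 14N (EPSG:32614) is an assumption that suits this location.

```{r sketch_length_compare}
library(sf)

#--- a hypothetical line in geodetic coordinates (WGS 84) ---#
toy_line_ll <- st_sfc(
  st_linestring(rbind(c(-100, 40), c(-99, 40))),
  crs = 4326
)

#--- length based on geodetic coordinates ---#
len_geodetic <- as.numeric(st_length(toy_line_ll))

#--- project to UTM zone 14N and get the Euclidean length ---#
toy_line_utm <- st_transform(toy_line_ll, 32614)
len_projected <- as.numeric(st_length(toy_line_utm))

#--- relative difference between the two lengths ---#
rel_diff <- abs(len_geodetic - len_projected) / len_geodetic
```

Both lengths are in meters and should be close to each other (roughly 85 km) because the line sits well inside the UTM zone. The gap generally grows when a projection is used far from the area it was designed for.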


<!-- ## st_distance

The `st_distance` function calculates distances between spatial objects. Its output is the matrix of distances whose $i,j$ element is the distance between the $i$th `sfg` of the first `sf` and $j$th `sfg` of the second `sf`. The following code find the distance between the first 10 wells in `urnrd_wells_sf`.

```{r st_distance}
st_distance(urnrd_wells_sf[1:10, ], urnrd_wells_sf[1:10, ])
```
 -->

<!--chapter:end:VectorDataBasics.Rmd-->