---
author: "Shashwat Kapoor, Jin Ah Kang, Theresa Choi"
output: html_document
---
## Table of Contents
- I. Introduction
- II. Data
- III. Analysis
    - i. Statistical
    - ii. Visual
- IV. Conclusion
#### I. Introduction
As the name implies, natural disasters arise from events in nature and are beyond human control. While we cannot alter the course of Mother Nature, we can still analyze the patterns of natural events and their implications.
Of these, earthquakes are among the most diverse in their effects. From undetectable tectonic plate movements to disasters with death tolls in the thousands, earthquakes leave equal parts material to study and motivation to study it.
If a government knew that a particular area was prone to earthquakes, it could avoid building a housing complex there, preventing future tragedies. If tourists knew that seismic activity at their destination spikes seasonally, they might choose a different vacation spot.
As these examples illustrate, such decisions may be made at a national level, a personal level, or anywhere in between. Both, however, are fueled by public research grounded in data analysis.
For our tutorial, we chose a dataset of earthquakes near Japan. As one of the countries most affected by earthquakes, Japan experiences around 1,500 earthquakes every year, making it a prime sample set.
#### II. Data
**Including Libraries**
```{r setup, message=FALSE, warning=FALSE}
# Data wrangling and plotting (tidyverse also loads tidyr, dplyr, ggplot2,
# tibble, and readr, which provides read_csv)
library(tidyr)
library(dplyr)
library(ggplot2)
library(tibble)
library(tidyverse)
library(askpass)
library(datasets)
library(leaflet)     # interactive maps
library(KernSmooth)  # bkde2D for binned 2D kernel density estimates
library(sp)
library(dummies)
library(Hmisc)
```
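If any of these packages are missing, they can be installed once from the R console before knitting; a minimal sketch (the list below covers the core packages this tutorial calls directly, and the chunk is not evaluated as part of the document):
```{r install, eval=FALSE}
# One-time installation of the core packages used in this tutorial
install.packages(c("tidyverse", "leaflet", "KernSmooth", "sp", "Hmisc"))
```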
**Data Scraping/Tidying**
First, we gather the raw data from our dataset. In this case, we can simply read the CSV file with `read_csv` (from `readr`, loaded as part of the tidyverse).
```{r read_data, warning=FALSE, message=FALSE}
dat <- read_csv("Japan_earthquakes.csv")
```
Next, we select the columns that we are interested in. For our purposes, we only need `time`, `latitude`, `longitude`, `depth`, `mag`, `magType`, and `place`.
```{r}
ext_dat <- dat %>%
select("time", "latitude", "longitude", "depth", "mag", "magType", "place")
ext_dat
```
Notice that the `time` column packs a lot of information into a single string, none of which is easily accessible yet. We can break this compact form into separate columns. First we make a copy of the `time` column so we can parse the date and the time of day separately, splitting on the " " delimiter. We then split the resulting date into `year`, `month`, and `day` using the "-" delimiter.
```{r}
options(dplyr.width = Inf)
ext_dat$copy <- ext_dat$time
ext_dat <- ext_dat %>%
separate(copy, sep=" ", into = c("date", "time_in_sec")) %>%
separate(date, sep="-", into = c("year", "month", "day"))
ext_dat
```
We can now convert the original `time` column into R's `Date` type, allowing R to perform date-specific computations on it. (We also rename the column from `time` to `date` to prevent confusion.)
```{r}
ext_dat <- ext_dat %>%
mutate(time=as.Date(time))
colnames(ext_dat)[colnames(ext_dat) == "time"] <- "date"
ext_dat
```
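As an aside, dplyr's `rename()` does the renaming in the same pipeline; an equivalent sketch (not evaluated here, since the renaming above has already been applied):
```{r rename_alt, eval=FALSE}
# Equivalent pipeline using dplyr::rename() instead of editing colnames()
ext_dat <- ext_dat %>%
  mutate(time = as.Date(time)) %>%
  rename(date = time)
```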
#### III-i. Analysis -- Statistical
Now that we have created a useful dataframe from our csv file, the next step is to draw conclusions from the data. One way to do this is to run various statistical analyses.
One option is to look at earthquakes by the season in which they occur.
First, we need a way to represent seasons. A (crude) approach is to break the year into three-month chunks.
Note that we will use the `month` column we created while tidying the data.
```{r create season}
# Assign each earthquake a season based on its zero-padded month string
ext_dat <- mutate(ext_dat, seasons = ifelse(month >= "03" & month <= "05", "spring",
ifelse(month >= "06" & month <= "08", "summer",
ifelse(month >= "09" & month <= "11", "fall", "winter"))))
ext_dat
```
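Nested `ifelse()` calls work, but dplyr's `case_when()` expresses the same mapping a bit more readably; an equivalent sketch (not evaluated here, since `seasons` has already been added above):
```{r seasons_case_when, eval=FALSE}
# Same season assignment written with dplyr::case_when()
ext_dat <- ext_dat %>%
  mutate(seasons = case_when(
    month %in% c("03", "04", "05") ~ "spring",
    month %in% c("06", "07", "08") ~ "summer",
    month %in% c("09", "10", "11") ~ "fall",
    TRUE ~ "winter"
  ))
```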
Now that we have the seasons, we can look at the correlation between an earthquake's depth and its magnitude, season by season. We will do this using correlation matrices. A correlation matrix is a table of correlation coefficients between sets of variables: each entry is the correlation between its row variable and its column variable. This lets us see at a glance which pairs of variables are most strongly related.
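For a quick illustration of what a correlation matrix looks like, here is one computed on the built-in `mtcars` data (from the `datasets` package loaded earlier); this is purely illustrative and unrelated to the earthquake data:
```{r cor_illustration}
# Illustration only: each cell is the correlation between the row and column
# variables, and the diagonal is always 1
cor(mtcars[, c("mpg", "wt", "hp")])
```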
For each season, we filter the dataframe down to that season, create `[season]_depth` and `[season]_mag` columns, and keep only those two columns in the seasonal dataframe.
```{r cor btwn depth and each season}
corr_spring <- ext_dat %>%
filter(seasons == "spring") %>%
mutate(spring_depth = depth, spring_mag = mag) %>%
select(spring_depth, spring_mag)
corr_summer <- ext_dat %>%
filter(seasons == "summer") %>%
mutate(summer_depth = depth, summer_mag = mag) %>%
select(summer_depth, summer_mag)
corr_fall <- ext_dat %>%
filter(seasons == "fall") %>%
mutate(fall_depth = depth, fall_mag = mag) %>%
select(fall_depth, fall_mag)
corr_winter <- ext_dat %>%
filter(seasons == "winter") %>%
mutate(winter_depth = depth, winter_mag = mag)%>%
select(winter_depth, winter_mag)
# cor() returns the correlation coefficients between depth and magnitude for each season (coefficients, not p-values)
cor(corr_spring)
cor(corr_summer)
cor(corr_fall)
cor(corr_winter)
```
As the correlation tables show, none of the seasons exhibit a strong correlation between depth and magnitude; all of the coefficients are well below 0.7, a common rule-of-thumb threshold for a strong correlation.
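(As an aside, the same per-season depth-magnitude correlations can be computed more compactly with `group_by()` and `summarize()`; an equivalent sketch, not evaluated here:)
```{r cor_by_season, eval=FALSE}
# Per-season correlation between depth and magnitude in a single pipeline
ext_dat %>%
  group_by(seasons) %>%
  summarize(depth_mag_cor = cor(depth, mag, use = "complete.obs"))
```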
To further investigate, we can also present a visual representation of the correlation tables.
```{r plotting, warning=FALSE, message=FALSE}
# Scatterplots showing the depth-magnitude relationship for each season.
# with() references the columns directly, avoiding repeated attach() calls,
# which mask one another and leave data frames on the search path.
with(corr_spring, plot(spring_mag, spring_depth, main = "Correlation Spring", pch = 19))
with(corr_summer, plot(summer_mag, summer_depth, main = "Correlation Summer", pch = 19))
with(corr_fall,   plot(fall_mag,   fall_depth,   main = "Correlation Fall",   pch = 19))
with(corr_winter, plot(winter_mag, winter_depth, main = "Correlation Winter", pch = 19))
```
While none of the relationships are linear, a slight pattern appears across the seasons: only relatively shallow earthquakes seem to reach the higher magnitudes. With the exception of summer, most earthquakes above magnitude 7 occur at depths shallower than about 300 km (the USGS-style `depth` column is reported in kilometers).
While this was a good exercise in data manipulation and correlation testing, it did not lead to a clear conclusion. We will try another analysis of magnitude that will hopefully yield more definitive findings.
Researchers have found that stronger earthquakes occur at shallower depths, while small earthquakes tend to range more in their depths. For further discussion, interested readers may find [this link](https://www.openhazards.com/faq/earthquakes-faults-plate-tectonics-earth-structure-user-submitted-questions/there-correlation) helpful.
```{r, message=FALSE, warning=FALSE}
# Magnitude vs. depth across the full dataset
with(ext_dat, plot(depth, mag, main = "mag vs depth", pch = 19))
```
Our plot fits this hypothesis well. The highest-magnitude earthquakes cluster at shallow depths, while weaker earthquakes span a much wider range of depths, with a slight secondary cluster in the 300-500 km range.
#### III-ii. Analysis -- Visual
Now that we have performed statistical analyses, we would like to present more compelling information in the form of interactive maps.
First, let's practice handling maps.
We will create an icon using the `awesomeIcons()` function. In this example, we chose a flag from the Font Awesome library.
```{r mark Japan for ref}
# Create a flag icon from the Font Awesome ("fa") library; the "fa-" prefix
# is added by the library, so we pass just "flag"
icons <- awesomeIcons(
  icon = 'flag',
  library = 'fa'
)
```
Next, we will put this icon on a map of Japan. Specifically, we want to put it at Tokyo.
A quick search will reveal Tokyo's coordinates to be 35.6762° N, 139.6503° E.
Now we will use `leaflet`, a JavaScript library for interactive maps with an R interface.
First we create a map widget by calling `leaflet()`. From there, we can simply add layers. In this introductory example, we first call `addTiles()` to add a base layer, then `addAwesomeMarkers()` with Tokyo's coordinates to pin the icon we initialized in the previous code block. Finally, we use `setView()` to set the area displayed by default.
```{r}
japan_map <- leaflet(ext_dat) %>%
addTiles() %>%
addAwesomeMarkers(lng = 139.80, lat = 35.68, icon=icons) %>%
setView(lng = 139.80, lat = 35.68, zoom = 6) #tokyo lat, lng
japan_map
```
This was a fun introduction but it doesn't show anything useful regarding earthquakes yet.
For this next example, we want to plot out earthquakes per season.
We previously created a seasons column, but another way to parse seasonal data is to split the dataframe altogether into four dataframes, one for each season.
```{r}
winter <- ext_dat[ext_dat$month %in% c("01", "02", "12"), ]
spring <- ext_dat[ext_dat$month %in% c("03", "04", "05"), ]
summer <- ext_dat[ext_dat$month %in% c("06", "07", "08"), ]
autumn <- ext_dat[ext_dat$month %in% c("09", "10", "11"), ]
```
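As an aside, base R's `split()` together with the `seasons` column created earlier produces the same four subsets as a named list in a single call; a minimal sketch (not evaluated here):
```{r split_alt, eval=FALSE}
# A named list with elements "fall", "spring", "summer", and "winter"
by_season <- split(ext_dat, ext_dat$seasons)
```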
Now we want to plot this by seasons.
We still `setView()` on Tokyo, but this example introduces three new functions.
First, `addCircleMarkers()`. This plots a circle with the given specifications (radius, color, etc.) at each point in the supplied dataframe. Two things to note are that we chose a different color for each season, and that there is a small `group = "[season]"` argument at the end. This comes in handy for the next two functions; it essentially ties the layer to a named group.
Next is `addLayersControl()`. This creates a box in the top right-hand corner with a checkbox for each season. Two options are available for `addLayersControl()`: `baseGroups` and `overlayGroups`. With `baseGroups`, the user can view only one group at a time, since only one base layer can be displayed. Groups listed in `overlayGroups`, however, can be overlaid on one another: anywhere from none to all of the layers can be shown together. For our purposes, `overlayGroups` was more apt; we wanted users to be able to use the plot flexibly, for example laying two seasons' events on top of each other and comparing them directly.
Finally, we have `hideGroup()`. As its name implies, this function hides groups from the default selection. Having all four seasons selected by default felt overwhelming and cluttered, so only one season (winter) is shown initially.
These functions are all put together in the code below.
(Note that we already created a map widget `japan_map` so we don't need to call leaflet() again.)
```{r}
japan_map %>%
addTiles() %>%
setView(lng = 139.80, lat = 35.68, zoom = 4) %>%
addCircleMarkers(data=winter, stroke = FALSE, radius = 3, color = "#0080FF", group = "winter") %>%
addCircleMarkers(data=spring, stroke = FALSE, radius = 3, color = "#29AB87", group = "spring") %>%
addCircleMarkers(data=summer, stroke = FALSE, radius = 3, color = "#999900", group = "summer") %>%
addCircleMarkers(data=autumn, stroke = FALSE, radius = 3, color = "#EA3C53", group = "autumn") %>%
addLayersControl(overlayGroups = c("winter", "spring", "summer", "autumn"),
options = layersControlOptions(collapsed = FALSE)) %>%
hideGroup("spring") %>%
hideGroup("summer") %>%
hideGroup("autumn")
```
This was a good exercise for getting familiar with leaflet's layering possibilities, but there is much more we can do with it, especially where earthquakes are concerned.
One method we can use is a heatmap.
First, we bind the longitude and latitude columns together into a two-column matrix of plot points.
```{r}
X = cbind(ext_dat$longitude, ext_dat$latitude)
head(X, n = 10)
```
Now our goal is to estimate the earthquake density at these plot points. To do this, we introduce `bkde2D`, which computes a **b**inned **k**ernel **d**ensity **e**stimate in **2D**. Essentially, it bins the points onto a regular grid and smooths each point's contribution with a kernel, so areas where many points overlap receive high density estimates. It returns a list containing the grid coordinates and a matrix of density values.
The `bandwidth` argument controls how far each point's influence is smoothed in each direction: small bandwidths give a spiky, detailed surface, while large ones give a smoother, more diffuse one. Rather than picking values by hand, we use `bw.ucv()` (from the `stats` package), which selects a bandwidth by unbiased cross-validation, once for the longitudes and once for the latitudes.
We then pass the grid vectors and density matrix to `contourLines()`, which computes the lines connecting points of equal density, i.e. the contours of the heatmap.
```{r, warning=FALSE}
# Bandwidths chosen separately for longitude and latitude by unbiased
# cross-validation (bw.ucv)
kde2d <- bkde2D(X, bandwidth = c(bw.ucv(X[,1]), bw.ucv(X[,2])))
x <- kde2d$x1   # grid of longitudes
y <- kde2d$x2   # grid of latitudes
z <- kde2d$fhat # matrix of density estimates on the grid
CL <- contourLines(x, y, z)
```
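Before layering the contours onto an interactive map, a quick base-graphics preview of the density surface is an easy sanity check; a sketch (optional, not evaluated here):
```{r density_preview, eval=FALSE}
# Optional sanity check: draw the estimated density surface directly
image(x, y, z, xlab = "longitude", ylab = "latitude",
      main = "Binned kernel density of earthquake epicenters")
contour(x, y, z, add = TRUE)
```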
Now that we have calculated the most important component, we will begin to actually plot out the heatmap.
As before, we will use the initialized map widget `japan_map` and set the default view to Tokyo.
```{r}
quake_density <- japan_map %>%
setView(lng = 139.80, lat = 35.68, zoom = 4)
```
The crucial part of this example concerns `CL`, the contour lines. We loop over every contour level, calling `addPolylines()` to draw its outline and `addPolygons()` to fill it. Because higher-density regions are enclosed by more contour levels, each successive level adds another translucent layer of red, so denser areas appear darker.
```{r}
# Use a loop index named i so we don't overwrite the grid vector x from above
for (i in seq_along(CL)) {
  quake_density <- quake_density %>%
    addPolylines(CL[[i]]$x, CL[[i]]$y, color = "#8B0000", weight = 2) %>%
    addPolygons(CL[[i]]$x, CL[[i]]$y, fillColor = "red", stroke = FALSE)
}
quake_density
```
#### IV. Conclusion
We have investigated many of the tools inside the data science toolkit in this tutorial. In order, we
- tidied the data from the earthquakes dataset
- used correlation tables to investigate the relationship between depth and magnitude
- plotted these correlations (or lack thereof)
- created an interactive map with toggles that plotted earthquakes by season
- created a heatmap of locations with higher earthquake risk
With these skills in hand, you can apply the same workflow to other datasets and other questions.
For more information on data science, we have provided the following links:
- [Kaggle](https://www.kaggle.com)
- [DataCamp](https://www.datacamp.com/home)
- [r/datascience](https://www.reddit.com/r/datascience/)
- [Andrew Ng, cofounder of Google Brain, has a course on Coursera](https://www.coursera.org/learn/machine-learning)