-
Notifications
You must be signed in to change notification settings - Fork 0
/
Lab_2.Rmd
270 lines (209 loc) · 14 KB
/
Lab_2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
---
title: "LAB 2 - Wildfire Spread Dynamics"
author: "Tanav Thanjavuru, Simran Gill, Ozkan Meric Ozcan"
date: "2024-07-23"
geometry: margin=1.5cm
output:
bookdown::pdf_document2:
toc: true
number_sections: true
---
\newpage
\setcounter{page}{1}
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r load packages, include=FALSE}
library(dplyr)
library(ggplot2)
library(corrplot)
library(hexbin)
library(gridExtra)
library(lmtest)
library(sandwich)
library(stargazer)
theme_set(theme_bw())
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
Forest fires not only represent a significant threat towards human life and property, but also affect the ecosystem. The northeast region of Portugal in particular has a history of wildfires because more than a third is covered by forests. Adding in the factor of hot and dry summers can potentially increase the risk of wildfires\footnote{The Sunday Times. "Portugal wildfires: what to expect if you’re travelling in summer 2024" (2024).}. Even though wildfires are a normal part of the renewal process of forests, the forests are getting drier and people are living closer to them which causes an increase in safety issues\footnote{National Geographic. "How to live with mega-fires? Portugal’s feral forests may hold the secret" (2019).}. This is why it is important for us to understand the behavior of wildfires, in particular the speed at which fire spreads.
With this investigation, we can continue to discuss how wildfires affect the environment along with social-economics and explore opportunities to do our part to put in place appropriate measures of safety. Along with producing strategies to minimize the risk of wildfires. This research will be in the interest of Environmental Scientists and Ecologists along with Fire Management Agencies with the goal of protecting and decreasing any threats towards human life and the ecosystem. To create a starting point for this discussion, this research will investigate the following research question:
\begin{quote}
\textit{How does Fine Fuel Moisture Code (FFMC) help us understand how quick a fire spreads in the northeast region of Portugal?}
\end{quote}
To measure fire spread, we will use Initial Spread Index (ISI). Both FFMC and ISI are key components of the Canadian Forest Fire Weather Index system which are used to assess fire behavior. To provide context, FFMC a numeric rating which measures the moisture content of forest litter such as mosses and twigs. A higher FFMC value means these fine fuels are drier. ISI is a numeric rating that estimates how quickly a fire will spread after it starts. Exploring the relationship between these two variables can provide valuable insights into what conditions cause rapid wildfire to spread. To answer this research question, we will perform a bivariate analysis be developing a regression models by iteratively applying variable transformation. Then evaluating both the statistical and practical significance of results.
# Description of the Data Source
The dataset we will be using for this research is from the UCI Machine Learning Repository which is called "Forest Fires" and can be found here: https://archive.ics.uci.edu/dataset/162/forest+fires. This data set has 517 instances with 13 feature that consists of climate and physical factors of the Montesinho natural park. The data was collected between January 2000 and December 2003. This can be used to understand forest fire behavior in northeast region of Portugal.
# Data Wrangling & Operationalization
Given the dataset contains 517 rows, we will divide it into an exploration set and a confirmation set. The exploration set, will be 30% of the dataset (155 rows), which will be used to initially explore and analyze the data. The remaining 70% (362 rows) will form the confirmation set, which will be used to build the final model and interpret the results.
Our research question has two parts, the Fine Fuel Moisture Code (FFMC) and how quickly a fire spreads. To start exploring the data set, we created a correlation heat map to see if FFMC has any correlation with other variables, even though correlation does not mean causation. From the plot, we saw there are 4 variables that have correlation, which are Initial Spread Index (ISI), Temperature, Duff Moisture Code (DMC), and the Drought Code (DC). Because we are interested in the speed of a fire spreading, we created a scatter plot for FFMC and ISI to see if there is a linear relationship. From the scatter plot called "FFMC & ISI Relationship", we found that as FFMC increases, so does ISI. Since there is a relationship here, and our research question involves fire spread-ability, we decided to select ISI as our dependent variable and FFMC as our independent variable.
From the scatter plot, we noticed there are some outliers, and decided it would be a good idea to remove those using an interquartile range. After removing the outliers, the exploration set has 151 rows and the confirmation set still has 362 rows.
```{r, echo=FALSE , results='hide' }
data <- read.csv("forestfires.csv")
# shape of the data
head(data)
```
```{r, echo=FALSE, results='hide' }
## Splitting data to exploration set and confirmation set
set.seed(123) # for reproducibility
sample_index <- sample(seq_len(nrow(data)), size = 0.3 * nrow(data))
exploration_set <- data[sample_index, ]
confirmation_set <- data[-sample_index, ]
(dim(exploration_set))
(dim(confirmation_set))
```
```{r, echo=FALSE, results='hide'}
vars <- c("FFMC", "DMC", "DC", "ISI", "temp", "RH", "wind", "rain", "area")
cor_matrix <- cor(exploration_set[vars], use = "complete.obs")
```
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.pos='!b', fig.height = 3, fig.width= 3, fig.align='center' }
c <- corrplot(cor_matrix, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45, mar=c(0,0,1,0), title = "Forest Fire Variables")
```
```{r, echo=FALSE }
p1 <- ggplot(exploration_set, aes(x = FFMC, y = ISI)) +
geom_hex() +
scale_fill_viridis_c() +
ggtitle("FFMC & ISI Relationship") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(size = 9),
axis.title = element_text(size = 7))
```
```{r, echo=FALSE, results='hide' }
# Detect Outlier in Column
detect_outlier <- function(x, threshold_multiplier = 4) {
Q1 <- quantile(x, probs = 0.25)
Q3 <- quantile(x, probs = 0.75)
IQR <- Q3 - Q1
outliers <- x > (Q3 + threshold_multiplier * IQR) | x < (Q1 - threshold_multiplier * IQR)
outliers
}
# Remove Outlier Function
remove_outlier <- function(dataframe, columns = names(dataframe)) {
for (col in columns) {
outliers <- detect_outlier(dataframe[[col]])
dataframe <- dataframe[!outliers, ]
}
dataframe
}
cleaned_exploration_set <- remove_outlier(exploration_set, c('FFMC', 'ISI'))
(dim(cleaned_exploration_set))
cleaned_confirmation_set <- remove_outlier(confirmation_set, c('FFMC', 'ISI'))
(dim(confirmation_set))
```
```{r, echo=FALSE }
p2 <- ggplot(cleaned_exploration_set, aes(x = FFMC, y = ISI)) +
geom_hex() +
scale_fill_viridis_c() +
ggtitle("Outlier Removal:FFMC & ISI Relationship") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(size = 9),
axis.title = element_text(size = 7))
```
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.pos='!b', fig.height = 2, fig.width= 6,fig.align='center' }
grid.arrange( p1, p2, ncol = 2, nrow = 1)
```
# Model Specification
```{r, echo=FALSE, results='hide'}
# First model
model1 <- lm(ISI ~ FFMC, data = cleaned_confirmation_set)
summary(model1)
```
```{r, echo=FALSE,message=FALSE , results='hide', include=FALSE}
# model 1 regression and residual plots
regression_plot <- ggplot(cleaned_confirmation_set, aes(x = FFMC, y = ISI)) +
geom_point() + # Scatter plot of the data points
geom_smooth(method = "lm", col = "blue") +
labs(title = "Regression of ISI on FFMC", x = "FFMC", y = "ISI") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(size = 9),
axis.title = element_text(size = 7))
#print(regression_plot)
cleaned_confirmation_set <- cleaned_confirmation_set %>%
mutate(fitted_values = fitted(model1),
residuals = resid(model1))
residuals_plot <- ggplot(cleaned_confirmation_set, aes(x = fitted_values, y = residuals)) +
geom_point() + # Scatter plot of residuals
geom_hline(yintercept = 0, col = "red", linetype = "dashed") + # Horizontal line at y=0
labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(size = 9),
axis.title = element_text(size = 7))
#print(residuals_plot)
#grid.arrange(regression_plot, residuals_plot, ncol = 2)
```
```{r, echo=FALSE, results='hide' }
# Second Model with Log Transformation
log_model <- lm(log(ISI) ~ FFMC, data = cleaned_confirmation_set)
summary(log_model)
```
```{r , echo=FALSE,message=FALSE , results='hide', include=FALSE}
# log model regression and residual plots
log_regression_plot <- ggplot(cleaned_confirmation_set, aes(x = FFMC, y = log(ISI))) +
geom_point() +
geom_smooth(method = "lm", col = "blue") + # Regression line
labs(title = "Log-Linear Regression of log(ISI) on FFMC", x = "FFMC", y = "log(ISI)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(size = 9),
axis.title = element_text(size = 7))
#print(log_regression_plot)
cleaned_confirmation_set <- cleaned_confirmation_set %>%
mutate(log_fitted_values = fitted(log_model),
log_residuals = resid(log_model))
log_residuals_plot <- ggplot(cleaned_confirmation_set, aes(x = log_fitted_values, y = log_residuals)) +
geom_point() + # Scatter plot of residuals
geom_hline(yintercept = 0, col = "red", linetype = "dashed") + # Horizontal line at y=0
labs(title = "Residuals vs Fitted Values for Log-Linear Model", x = "Fitted Values", y = "Residuals") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(size = 9),
axis.title = element_text(size = 7))
#print(log_residuals_plot)
#grid.arrange(log_regression_plot, log_residuals_plot, ncol = 2)
```
We started with a simple linear regression model to establish a baseline relationship between FFMC and ISI. After plotting the model, we saw a strong linear relationship between the variables. After this we applied log transformation to address skewness and improve model fit, and again saw a good line of best fit. The log transformation helped in normalizing the distribution of the dependent variable, which can lead to a better fit and more reliable statistical inference. We used log-transform for the ISI variable was based on an initial observation of skewness in the data.
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.pos='!b', fig.height = 2, fig.width= 6,fig.align='center' }
grid.arrange( regression_plot, log_regression_plot, ncol = 2, nrow = 1)
```
# Model Assuptions
The first assumption (IID) states that the data points are independent of each other and follow the same probability distribution. For our wildfire dataset, based on the correlation plot, each observation of fire spread in the dataset appears to be independent of the others. The conditions that affect fire spread (temperature, moisture, and fuel type) are assumed to be consistent across observations as well. The second assumption (BLP) states that there is a linear relationship between the independent variable(s) and the dependent variable, and that the linear model is the best predictor of the dependent variable. Based on the FFMC (Fine Fuel Moisture Code) and ISI (Initial Spread Index) plot,The relationship appears to be approximately linear.
# Model Results and Interpretation
The residuals vs fitted plot will help us judge two key model assumptions, homoscedasticity and non-linearity. We can see in the resdials plot for the simple linear model that the residuals increase with the fitted values, indicating heteroscedasticity and non-linearity. In contrast, the log-linear model's residuals appear more randomly distributed. The absence of a clear pattern suggests that the log transformation has improved the model fit by stabilizing the variance and improving the linearity.
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.pos='!b', fig.height = 2, fig.width= 6,fig.align='center' }
grid.arrange( residuals_plot, log_residuals_plot, ncol = 2, nrow = 1)
```
Our chart below shows that for our log-linear regression our coefficient for FFMC is statistically significant. Overall our model has an adjusted R$^{2}$ of 0.74. This high adjusted R$^{2}$ value shows that FFMC has strong explanatory power in our model and is a good predictor of ISI.
```{r, echo=FALSE, results='hide' }
stargazer(log_model, type = "latex", title = "Log-Linear Regression Model Results", digits = 2, single.row = TRUE)
```
\begin{table}[!htbp] \centering
\caption{Log-Linear Regression Model Results}
\label{}
\begin{tabular}{@{\extracolsep{5pt}}lc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{1}{c}{\textit{Dependent variable:}} \\
\cline{2-2}
\\[-1.8ex] & log(ISI) \\
\hline \\[-1.8ex]
FFMC & 0.14$^{***}$ (0.004) \\
Constant & $-$10.71$^{***}$ (0.40) \\
\hline \\[-1.8ex]
Observations & 355 \\
R$^{2}$ & 0.74 \\
Adjusted R$^{2}$ & 0.74 \\
Residual Std. Error & 0.25 (df = 353) \\
F Statistic & 1,022.88$^{***}$ (df = 1; 353) \\
\hline
\hline \\[-1.8ex]
\textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\end{tabular}
\end{table}
\newpage
# References
Howard, B. C. (2023, October 4). How to live with mega-fires: Portugal’s forests may hold the secret. National Geographic. https://www.nationalgeographic.com/science/article/how-to-live-with-mega-fires-portugal-forests-may-hold-secret
Sampson, L. (2024, July 18). Where are the Portugal wildfires and is it safe to travel? The Times. https://www.thetimes.com/travel/advice/where-are-the-portugal-wildfires-is-it-safe-to-travel-ttflwnmnr