The goal is to build a predictor of whether a tree tracked in San
Francisco has the legal status "DPW Maintained" (maintained by the
Department of Public Works) or some other legal status.
Get data
This is the 2020-01-28 Tidy Tuesday dataset. These data are from the
San Francisco Public Works’ Bureau of Urban Forestry.
## Rows: 192987 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): legal_status, species, address, site_info, caretaker, plot_size
## dbl (5): tree_id, site_order, dbh, latitude, longitude
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
So a legal_status of “DPW Maintained” does not imply a caretaker of
“DPW”. In fact, most trees with DPW legal status are privately taken
care of.
col_plot_legalstatus_by_caretaker <- sftrees %>%
  count(legal_status, caretaker) %>%
  add_count(caretaker, wt = n, name = "caretaker_count") %>%
  filter(caretaker_count > 50) %>%
  group_by(legal_status) %>%
  mutate(percent_legal = n / sum(n)) %>%
  ggplot(aes(percent_legal, caretaker, fill = legal_status)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.7, na.value = "grey50") +
  labs(fill = NULL,
       x = "proportion of trees in each category")
col_plot_legalstatus_by_caretaker
The glimpse() call simply makes the data print left to right. The n
column at the start shows how many rows are in the data frame; the other
named columns show how many NAs each column contains. The date and dbh
(diameter at breast height) columns have substantial shares of NAs
(64.5% and 21.7%, respectively).
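The per-column NA shares quoted above can be reproduced with a one-liner. The following is a minimal base R sketch on a made-up data frame (`toy_trees` and its values are hypothetical stand-ins, not the real tree data), so the percentages it prints are not the ones reported for this dataset.

```r
# Toy data frame standing in for the tree data (values are made up).
toy_trees <- data.frame(
  tree_id = 1:4,
  dbh     = c(10, NA, 8, 13),
  date    = as.Date(c("1956-03-02", NA, NA, NA))
)

# is.na() gives a logical matrix; colMeans() turns it into the
# fraction of missing values per column, here scaled to percent.
na_percent <- round(100 * colMeans(is.na(toy_trees)), 1)
print(na_percent)
```

Applying the same expression to the full data frame should give one percentage per column, which makes it easy to spot columns that may need imputation or removal before modelling.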
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `plot_size = parse_number(plot_size)`.
## Caused by warning:
## ! 109 parsing failures.
## row col expected actual
## 10979 -- a number TR
## 13245 -- a number CUT
## 13495 -- a number TR
## 13501 -- a number TR
## 13502 -- a number TR
## ..... ... ........ ......
## See problems(...) for more details.
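The warning above comes from `parse_number()` meeting strings that contain no number at all. As a small illustration, assuming `plot_size` holds free-text entries, the sketch below runs `readr::parse_number()` on a few hypothetical values; only "TR" and "CUT" are taken from the warning, the rest are invented for the example.

```r
library(readr)

# Hypothetical plot_size strings; "TR" and "CUT" mimic the failures reported above.
plot_size_raw <- c("3x3", "4", "Width 3ft", "TR", "CUT")

# parse_number() extracts the first number found in each string and
# returns NA (with a parsing warning) when there is none.
plot_size_num <- suppressWarnings(parse_number(plot_size_raw))
print(plot_size_num)

# Number of entries that could not be parsed.
n_failed <- sum(is.na(plot_size_num))
print(n_failed)
```

Note that a value such as "3x3" keeps only its first dimension, so the parsed column is a rough size indicator rather than an area.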
head(trees_formodel)
| tree_id | legal_status | species | site_order | site_info | caretaker | date | dbh | plot_size | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 30372 | DPW Maintained | Ulmus parvifolia :: Chinese Elm | 1 | Sidewalk: Curb side : Cutout | Private | 1956-03-02 | 10 | 3 | 37.76005 | -122.3983 |
| 30460 | DPW Maintained | Pittosporum undulatum :: Victorian Box | 1 | Sidewalk: Curb side : Cutout | Private | 1956-05-11 | 19 | 4 | 37.80074 | -122.4073 |
| 30454 | DPW Maintained | Pittosporum undulatum :: Victorian Box | 1 | Sidewalk: Curb side : Cutout | Private | 1956-05-11 | 8 | 3 | 37.80081 | -122.4057 |
| 30428 | DPW Maintained | Pittosporum undulatum :: Victorian Box | 1 | Sidewalk: Curb side : Cutout | Private | 1956-05-11 | 13 | 7 | 37.80082 | -122.4066 |
| 30468 | DPW Maintained | Melaleuca quinquenervia :: Cajeput | 2 | Sidewalk: Curb side : Cutout | Private | 1956-05-29 | 8 | 3 | 37.80061 | -122.4073 |
| 30470 | DPW Maintained | Melaleuca quinquenervia :: Cajeput | 3 | Sidewalk: Curb side : Cutout | Private | 1956-05-29 | 8 | 3 | 37.80062 | -122.4073 |
col_plot_legalstatus_by_caretaker <- trees_formodel %>%
  count(legal_status, caretaker) %>%
  add_count(caretaker, wt = n, name = "caretaker_count") %>%
  filter(caretaker_count > 50) %>%
  group_by(legal_status) %>%
  mutate(percent_legal = n / sum(n)) %>%
  ggplot(aes(percent_legal, caretaker, fill = legal_status)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.7, na.value = "grey50") +
  labs(fill = NULL,
       x = "proportion of trees in each category")
col_plot_legalstatus_by_caretaker
tune_spec <- rand_forest(
  mtry = tune(),   # number of predictors sampled at each split (tuned)
  trees = 1000,    # number of trees to start with
  min_n = tune()   # how many data points in a node to keep splitting further (tuned)
) %>%
  set_mode("classification") %>%
  set_engine("ranger")
Although this is not a regular grid of min_n and mtry (orthogonal
combinations that would allow for ceteris paribus comparisons), we can
still get an idea of what is going on. It looks like higher values of
mtry are good (above about 10) and lower values of min_n are good (below
about 10). We can get a better handle on the hyperparameters by tuning
one more time, this time using grid_regular(). Let’s set ranges of
hyperparameters we want to try (inside the dotted-line box displayed on
the 2D plot above), based on the results from our initial tune.
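To make the idea of a regular grid concrete, here is a minimal sketch using `grid_regular()` from the dials package. The ranges (`mtry` from 10 to 30, `min_n` from 2 to 8) and `levels = 5` are assumptions chosen only to be consistent with "mtry above about 10, min_n below about 10"; the ranges actually used in this analysis are not stated in the text, and the upper bound for `mtry` must not exceed the number of predictors in the model.

```r
library(dials)

# Assumed ranges, for illustration only; adjust to the box drawn on the plot.
rf_grid <- grid_regular(
  mtry(range = c(10, 30)),
  min_n(range = c(2, 8)),
  levels = 5
)

# Every mtry value is paired with every min_n value, so one
# hyperparameter can be compared while the other is held fixed.
print(rf_grid)
print(nrow(rf_grid))
```

A grid like this would typically be passed as the `grid` argument of `tune_grid()` in place of an integer number of random candidates.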
Satisfyingly, whether the caretaker is private makes a large difference,
and latitude and longitude each make a large (and approximately equal)
contribution.
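For readers who want to see where such importance scores come from, the sketch below fits a small random forest with ranger and extracts permutation importance. It is only an illustration under stated assumptions: the built-in `iris` data stands in for the tree data, and permutation importance is an assumed choice, since the text does not say which importance measure was used here.

```r
library(ranger)

set.seed(123)

# iris is a stand-in dataset; importance must be requested at fit time.
toy_fit <- ranger(Species ~ ., data = iris, importance = "permutation")

# Named numeric vector with one score per predictor, largest first.
toy_importance <- sort(toy_fit$variable.importance, decreasing = TRUE)
print(toy_importance)
```

With a parsnip specification, the same option can be forwarded to the engine via `set_engine("ranger", importance = "permutation")`.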