
Crosswalk SD fisheries managed lakes against LAGOS-US lakes #221

Open

hcorson-dosch-usgs opened this issue Oct 14, 2021 · 15 comments

@hcorson-dosch-usgs
Crosswalk this shapefile of SD fisheries managed lakes against the LAGOS-US spatial data (see #220) to see how many of those SD fisheries managed lakes are missing from LAGOS because they are classified as intermittent

@lindsayplatt

I did a quick analysis outside of the pipeline by manually downloading & unzipping the SD Managed Fisheries shapefile from Gretchen and the LAGOS-US LOCUS GDB. Then, I ran this code to evaluate:

library(sf)
library(tidyverse)

# Load the LAGOS-US lakes first so their CRS can be reused for the SD data
lagos_us_gdb <- "gis_locus_v1.0_gdb/gis_locus_v1.0.gdb"
st_layers(lagos_us_gdb) # Available layers
lagos_us <- st_read(lagos_us_gdb) # Load lake shapes by default (since it is the first layer)

sd_fisheries <- st_read("South Dakota/ManagedFisheries.shp")
sd_fisheries_transf <- st_transform(sd_fisheries, st_crs(lagos_us))
sd_fisheries_pts <- st_centroid(sd_fisheries_transf)

lagos_sd <- st_crop(lagos_us, st_bbox(sd_fisheries_transf))
lagos_sd_pts <- st_centroid(lagos_sd)

# Now check overlap
fisheries_in_lagos <- st_join(sd_fisheries_pts, lagos_sd, join = st_within)

# 259 SD fisheries lakes within the LAGOS-US lakes
sd_fisheries_in_lagos <- fisheries_in_lagos %>% 
  filter(!is.na(lagoslakeid)) 
nrow(sd_fisheries_in_lagos)

# 495 lakes that didn't match a LAGOS-US lake
fisheries_in_lagos %>% 
  filter(is.na(lagoslakeid)) %>% 
  nrow()

# 6 of the LAGOS-US lakes had multiple SD fisheries lakes within them
fisheries_in_lagos %>% 
  filter(!is.na(lagoslakeid)) %>% 
  group_by(lagoslakeid) %>% 
  summarize(n_sd_fisheries = n()) %>% 
  filter(n_sd_fisheries > 1)

# Visualize overlap
plot(st_geometry(lagos_sd))
plot(st_geometry(sd_fisheries_pts), add=TRUE, col = "red", pch = 16)
plot(st_geometry(sd_fisheries_in_lagos), add=TRUE, col = "blue", pch = 16)

[image: map of the cropped LAGOS-US lake polygons with SD fisheries centroids in red and LAGOS-matched centroids in blue]

Out of the 754 SD Managed Fisheries lakes, I found that 259 had centroids that fell within a LAGOS-US lake polygon. Six of the LAGOS-US lakes contained more than one SD Managed Fisheries centroid.

To do next: experiment with the actual join used for the point-in-polygon match. I used st_within and didn't apply any kind of buffer. There is probably more we can do there.
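One option to loosen the match, sketched here on synthetic geometries since I haven't tried it on the real data: swap `st_within` for `st_is_within_distance`, which also catches centroids that fall just outside a polygon (the 0.1-unit tolerance below is arbitrary, not a recommendation).

```r
library(sf)

# Toy data: one unit-square polygon and two points, one inside and
# one just outside its boundary.
poly <- st_sf(id = 1, geometry = st_sfc(
  st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
))
pts <- st_sf(pt = c("inside", "nearby"), geometry = st_sfc(
  st_point(c(0.5, 0.5)),  # falls within the polygon
  st_point(c(1.05, 0.5))  # 0.05 units outside; misses a strict st_within
))

# Strict point-in-polygon join vs. a distance-tolerant join
strict <- st_join(pts, poly, join = st_within)
loose  <- st_join(pts, poly, join = st_is_within_distance, dist = 0.1)

sum(!is.na(strict$id)) # 1 match
sum(!is.na(loose$id))  # 2 matches
```

On the real data the tolerance would be in the CRS's linear units (meters for most projected CRSs), so it doubles as a crude buffer without modifying the polygons.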

@lindsayplatt commented Jan 6, 2022

Redid this analysis using the code @jread-usgs pointed me to from crosswalk_poly_intersect_poly(). Copied the code from that function to use here since I am not pipelining anything yet.

First, load the data (see first code chunk for links to these data sets)

library(sf)
library(tidyverse)

lagos_us_gdb <- "gis_locus_v1.0_gdb/gis_locus_v1.0.gdb"
lagos_us <- st_read(lagos_us_gdb) # Load lake shapes by default (since it is the first layer)

sd_fisheries <- st_read("South Dakota/ManagedFisheries.shp")
sd_fisheries_transf <- st_transform(sd_fisheries, st_crs(lagos_us))

lagos_sd <- st_crop(lagos_us, st_bbox(sd_fisheries_transf))

Then, perform the polygon intersection:

# Based on code in `crosswalk_poly_intersect_poly()`

poly1_ID_name <- "SD_ID"
poly1_data <- sd_fisheries_transf %>% 
  # Treat each SD fisheries lake as unique based on the Name + StateID columns
  # There are 40 non-unique lakes if you just go by Name
  unite(site_id, Name, StateID, na.rm = TRUE) %>% 
  mutate(site_id = sprintf("SD_%s", site_id)) %>% 
  select(site_id)
poly2_data <- lagos_sd %>% 
  mutate(site_id = lagoslakeid) %>% 
  select(site_id) %>% rename(geometry = Shape)

poly2_data <- st_zm(poly2_data)

# Aggregate polys by site ID (this step likely unnecessary for NHD but keeping for generality)
poly1_agg <- poly1_data %>%
  group_by(site_id) %>%
  summarise()

poly2_agg <- poly2_data %>%
  group_by(site_id) %>%
  summarise()

# Measure areas
poly1_agg$poly1_Area <- as.numeric(st_area(poly1_agg))
poly2_agg$site_id_Area <- as.numeric(st_area(poly2_agg))

poly1_agg <- sf::st_buffer(poly1_agg, dist = 0)
poly2_agg <- sf::st_buffer(poly2_agg, dist = 0)

colnames(poly1_agg)[colnames(poly1_agg)=="site_id"] <- poly1_ID_name

intersect.sf <-  sf::st_intersection(poly1_agg, poly2_agg)
intersect.sf$Intersect_Area <- as.numeric(st_area(intersect.sf))

### Drop intersections that are <1% of poly1 area or poly2 area
intersect.sf <- intersect.sf %>%
  filter(Intersect_Area > .01*poly1_Area) %>%
  filter(Intersect_Area > .01*site_id_Area)

intersect_out <- intersect.sf %>%
  st_drop_geometry()

colnames(intersect_out)[colnames(intersect_out)=="poly1_Area"] <- paste0(poly1_ID_name, "_Area")

Finally, evaluate the intersection by visualizing and looking at counts

plot(st_geometry(lagos_sd))
plot(st_geometry(sd_fisheries_transf), add=TRUE, col = "red", pch = 16)
plot(st_geometry(intersect.sf), add=TRUE, col = "blue", pch = 16)

# How many matched lagos? How many didn't?
n_sd <- length(unique(poly1_data$site_id)) # 751 SD Lakes
n_sd_in_lagos <- length(which(unique(poly1_data$site_id) %in% unique(intersect_out$SD_ID))) # 331 unique SD fisheries map to at least one LAGOS lake
n_sd_not_in_lagos <- length(which(!unique(poly1_data$site_id) %in% unique(intersect_out$SD_ID))) # 420 SD fisheries do not map to a LAGOS lake

round(n_sd_in_lagos/n_sd*100) # 44%
round(n_sd_not_in_lagos/n_sd*100) #56%

# There are 49 SD lakes that overlap more than 1 LAGOS lake
intersect_out %>% 
  group_by(SD_ID) %>% 
  summarize(count_lagos = length(unique(site_id))) %>% 
  filter(count_lagos > 1)

# There are 14 LAGOS lakes that overlap more than 1 SD lake
intersect_out %>% 
  group_by(site_id) %>% 
  summarize(count_sd = length(unique(SD_ID))) %>% 
  filter(count_sd > 1)

[image: map of the cropped LAGOS-US lake polygons with SD fisheries polygons in red and intersecting polygons in blue]

There are 420 unique SD Fisheries lakes that do not overlap with any LAGOS-US lakes.

@lindsayplatt
Note that poly1_data has 754 observations but poly1_agg has only 751, which means 3 lakes in the SD Fisheries data shared the same Name & StateID with another lake.
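For reference, the collapsed rows could be spotted with a quick count (toy site_id values here; the real check would run on poly1_data before the group_by/summarise aggregation):

```r
library(dplyr)

# Toy stand-in for poly1_data$site_id: 5 rows collapse to 3 unique IDs
poly1_toy <- tibble::tibble(
  site_id = c("SD_a_1", "SD_a_1", "SD_b_2", "SD_b_2", "SD_c_3")
)

# Any site_id appearing more than once is collapsed by the aggregation
dups <- poly1_toy %>%
  count(site_id) %>%
  filter(n > 1)
dups # 2 duplicated IDs in this toy example
```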

@jordansread

@lindsayplatt nice work.

From your centroid analysis, 259 of the lakes fell within LAGOS polygons and with this updated poly-poly match, that number went up to 331. Still less than half 😱

I think our original question was about how LAGOS drops certain NHD high-res lakes that we're not dropping. LAGOS drops them because of field values in NHD suggesting the lakes are intermittent or aren't real lakes in other ways (see "LAGOS because they are classified as intermittent" from above). We noticed that a lot of these dropped lakes appear in the Dakotas, so it was important to know what wouldn't be in LAGOS but is still an important lake to the fisheries managers. Since LAGOS is a subset of NHDHR (you can link to it with their crosswalk, since it contains PERM_ID), perhaps it is necessary to do the SD to NHDHR crosswalk first, summarize which lakes aren't there ("X% of SD fisheries lakes didn't resolve to our NHDHR crosswalk"), and then see which of the SD lakes matched to NHDHR are lost when we connect to LAGOS ids (which could be a table join instead of a spatial analysis).
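A toy sketch of what that two-step, join-based check could look like (the tibbles and the SD_ID/PERM_ID values below are made up for illustration; only the PERM_ID and lagoslakeid column names come from the comment above):

```r
library(dplyr)

# Hypothetical SD -> NHDHR crosswalk result: SD_C never resolved to NHDHR
sd_to_nhdhr <- tibble::tibble(
  SD_ID   = c("SD_A", "SD_B", "SD_C"),
  PERM_ID = c("nhd1", "nhd2", NA)
)

# Hypothetical LAGOS link table: nhd2 was dropped by LAGOS
lagos_link <- tibble::tibble(
  lagoslakeid = c(101, 102),
  PERM_ID     = c("nhd1", "nhd9")
)

# Step 1: what % of SD fisheries lakes didn't resolve to NHDHR?
pct_unmatched <- mean(is.na(sd_to_nhdhr$PERM_ID)) * 100

# Step 2: of the NHDHR-matched SD lakes, which are lost in LAGOS?
lost_in_lagos <- sd_to_nhdhr %>%
  filter(!is.na(PERM_ID)) %>%
  left_join(lagos_link, by = "PERM_ID") %>%
  filter(is.na(lagoslakeid))
lost_in_lagos # SD_B in this toy example
```

Step 2 is a pure table join on PERM_ID, so no geometry operations are needed once the SD-to-NHDHR crosswalk exists.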

@lindsayplatt

OK, I did the polygon crosswalking using SD fisheries & NHD polygons.

library(sf)
library(tidyverse)

nhd_lakes_sf <- readRDS('1_crosswalk_fetch/out/canonical_lakes_sf.rds')
nhd_lakes_sf <- st_zm(nhd_lakes_sf)

sd_fisheries <- st_read("South Dakota/ManagedFisheries.shp")
sd_fisheries_transf <- st_transform(sd_fisheries, st_crs(nhd_lakes_sf))

sd_nhd_lakes_sf <- nhd_lakes_sf %>% 
  st_make_valid() %>% 
  st_crop(st_bbox(sd_fisheries_transf))

poly1_ID_name <- "SD_ID"
poly1_data <- sd_fisheries_transf %>% 
  # Treat each SD fisheries lake as unique based on the Name + StateID columns
  # There are 40 non-unique lakes if you just go by Name
  unite(site_id, Name, StateID, na.rm = TRUE) %>% 
  mutate(site_id = sprintf("SD_%s", site_id)) %>% 
  select(site_id) %>% 
  st_make_valid()
poly2_data <- sd_nhd_lakes_sf %>% 
  select(site_id) %>% 
  rename(geometry = Shape)

# THEN FOLLOWED THE SAME CODE STARTING WITH `poly1_agg <- `

Then, I evaluated the outputs.

# How many matched an nhd lake? How many didn't?
n_sd <- length(unique(poly1_data$site_id)) # 751 SD Lakes
n_sd_in_nhd <- length(which(unique(poly1_data$site_id) %in% unique(intersect_out$SD_ID))) # 309 unique SD fisheries map to at least one NHD lake
n_sd_not_in_nhd <- length(which(!unique(poly1_data$site_id) %in% unique(intersect_out$SD_ID))) # 442 SD fisheries do not map to an NHD lake

round(n_sd_in_nhd/n_sd*100) # 41%
round(n_sd_not_in_nhd/n_sd*100) #59%

# There are 71 SD lakes that overlap more than 1 NHD lake
intersect_out %>% 
  group_by(SD_ID) %>% 
  summarize(count_nhd = length(unique(site_id))) %>% 
  filter(count_nhd > 1)

# There are 43 NHD lakes that overlap more than 1 SD lake
intersect_out %>% 
  group_by(site_id) %>% 
  summarize(count_sd = length(unique(SD_ID))) %>% 
  filter(count_sd > 1)

So, there are 442 SD Fisheries lakes that don't match the NHD at all. This seems suspicious, since we would expect more lakes to match NHD than to match LAGOS. Is canonical_lakes_sf the appropriate lakes file to use from our pipeline @jread-usgs ?

@jordansread

I went through this separately (and differently, by adding a crosswalk to a branch on my local fork) and verified the numbers are the same.

I think I've either said or implied that we've got a similar size cut-off as LAGOS (1ha), but we actually don't. We're filtering out the smallest lakes below 4 ha (see "min_size"). This is almost certainly the reason we have fewer matched lakes.

@lindsayplatt

Thanks for going the extra step to verify this. So there are potentially 442 SD fisheries that we would want to add into our pipeline from another source, right?

@jordansread

We don't currently have a method to add lakes from another source that don't resolve to our nhd IDs, since we're using the NHDHR ID as the comprehensive data "key" for this project. For relatively low(ish) cost, we could choose to bring in select NHDHR lakes that are below the 4ha cutoff. Reducing our cutoff to 1ha (like LAGOS has) is also possible, but makes every crosswalk calculation way more expensive ⏲️

@lindsayplatt

Ah right, OK. Is it interesting that so many of these SD Fisheries lakes are < 4 ha? Trying to get a sense of that size: 4 ha is about 7.5 football fields, so not quite as small as I thought 😄
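A quick back-of-the-envelope check of that conversion (assuming an American football field of 120 yd × 53.33 yd including end zones, with 1 yd = 0.9144 m and 1 ha = 10,000 m²):

```r
# Area of one football field in hectares
field_m2 <- (120 * 0.9144) * (53.33 * 0.9144) # ~5,351 m^2
field_ha <- field_m2 / 1e4                    # ~0.535 ha

# How many fields fit in the 4 ha cutoff?
fields_in_4ha <- 4 / field_ha # ~7.5
```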

@jordansread

I think it is safe to assume we're missing their small lakes, but without a direct comparison, we don't know for sure that size is the reason they aren't matched.

@lindsayplatt

Not sure if this is useful, but doing a quick summary of the SD fisheries dataset (which has an Acres column already, though this could differ from what NHD has):

sd_fisheries %>% 
    st_drop_geometry() %>% 
    mutate(hectares = Acres/2.47105) %>% 
    summarize(n_total = n(), 
              n_above1ha = sum(hectares >= 1), 
              n_below1ha = sum(hectares < 1),
              n_above4ha = sum(hectares >= 4), 
              n_below4ha = sum(hectares < 4)) %>% t()

Result:

n_total     754
n_above1ha  384
n_below1ha  370
n_above4ha  323
n_below4ha  431

431 lakes below 4 ha is pretty similar to our missing 442, but doesn't quite account for the difference.

@jordansread

wow, the number below 1ha is 👀 . Those are pretty small to be managed for fish!

@padilla410

Hey y'all, jumping into this conversation to summarize what is going on above:

  • the SD shapefile has 754 lakes, but more than half (~60%) of those lakes are expected to get dropped from the final data sets because many are smaller than the 4 ha threshold we set.
  • the number of lakes dropped is relatively consistent based on which matching approach we use (e.g., "SD lakes --> LAGOS data set --> NHDHR" or "SD lakes --> NHDHR")

Based on all of this, I am planning to go forward with developing the crosswalk using the standard "polygon-to-polygon" matching method outlined in the doc (i.e., "SD lakes --> LAGOS --> NHDHR"). I expect to see around 330 lakes in that final data set.

@padilla410 commented Apr 7, 2022

<<comment deleted from this thread and moved here>>

@lindsayplatt

@padilla410 once again, you have really gone the extra mile and dug deeply into these complex data issues 🙏 My understanding of this issue remaining open was that I was waiting to hear from Holly Kundel if any of those <4 ha lakes in SD were actually important to keep. The outcome from her would be whether or not a workaround to keep some of them was needed. Seems like you may have uncovered a different issue related to crosswalking our SD data to get them into the pipeline.
