Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

additional functionality for MOL internal uses #236

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

matthewsrogan
Copy link

Hi Bruno, I am a data curator with the Map of Life and among my responsibilities is helping develop and apply cleaning workflows for global occurrence datasets. The bdc package mirrors an internal workflow we were developing, and so we are exploring using the bdc package as the core of our data prep pipeline for MOL and other projects within Yale’s Center for Biodiversity and Global Change. However, there are a few integral steps that we need that do not currently appear to be available within bdc. We are happy developing this extra functionality internally, but if you feel these steps have broader value, we would be happy to work with you to integrate them directly into the package.

I have written a few examples of functions that I wrote to work within the bdc pipeline. mol_roi.R is a spatial filtering step that lets us flag data using spatial layers such as our internal land layer derived from the Global Shoreline Vector. It’s basically a wrapper function with helpers for different formats of the spatial region of interest. mol_spatiotemporal_duplicate.R is a cleaning step we use to identify duplicate records prior to spatiotemporal filtering. It’s especially important when we integrate overlapping datasets, such as GBIF and California Consortium of Herbaria. mol_flagged_by_source.R is a function to flag records based on quality control attributes provided by the source (e.g. GBIF’s ‘issue’ attribute).

Other functions we need for our workflows include one to flag vagrant occurrences and geographic outliers based on the distance from the species’ range, and another to identify non-native occurrences such as those plants in gardens, animals in zoos, or domesticated species/variants.

Either way, we will continue to develop this functionality for our internal uses. But if you feel any of these steps are of value to the package’s user base, we are eager to work with you to make sure they work seamlessly with the package and that they pass all testing requirements. From our perspective, the more centralized everything is within bdc, the better.

The package is intuitive and meets our needs well, so we appreciate what you and your collaborators have done to develop it.

Cheers,
Matt Rogan

identify records outside a region of interest (mol_roi.R), identify records flagged according to a quality control attribute provided by the source (mol_flagged_by_source.R), and identify spatiotemporal duplicates (mol_spatiotemporal_duplicate.R).
@brunobrr
Copy link
Owner

Hi Matt,

My apologies for the delay in getting back to you.
Thank you very much. I'm sure these functions will be of great value to bdc's users.

bdc team will check the functions carefully, but I believe that they be incorporated into bdc seamlessly.

It will be our pleasure to invite you to be a collaborator of bdc here on GitHub and on future publications.

Cheers,
Bruno

Copy link
Collaborator

@kguidonimartins kguidonimartins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @matthewsrogan,

We are glad of your interest in the bdc and its usefulness for your research group. I think that your functions are a great contribution to bdc, and we can work on that to meet the package standards. First, we cannot include more dependencies directly in our package (for example, the terra package in your functions), but it can be solved using our internal function (see comments in the review). Also, can you include some tests to make sure your functions are returning what they must return? Although tests are not mandatory for CRAN, they ensure that our functions work as expected. Let us know if you need some help with this.

data,
flagStrings,
flagCols
){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be something like this: mol_flagged_by_source <- function(data, flagStrings, flagCols) {...}.


check_col(data, flagCols)

terms <- str_c(flagStrings, collapse = "|")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All function calls need prepended package name, for example: stringr::str_c(...). Same holds for if_any and str_detect below.

R/mol_flagged_by_source.R Show resolved Hide resolved
R/mol_roi.R Outdated
if(!file.exists(roi)) stop("The ROI filepath is invalid.")

# check proper format
if(!any(str_detect(roi,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing prepended package name in function calls here.

R/mol_roi.R Outdated

### Convert to sf/terra
if("SpatialPolygons" %in% .class2(roi)) roi <- st_as_sfc(roi)
if(class(roi) == "RasterLayer") roi <- rast(roi)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing prepended package name in function calls here.

R/mol_roi.R Outdated


unflgd <- crpd %>%
dplyr::mutate(within = st_within(.,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing prepended package name in function calls here.

R/mol_roi.R Outdated
maskValue){

### Reproject to raster CRS
dataCoords[, c("tempX", "tempY")] <- geom(project(vect(dataCoords,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing prepended package name in function calls here.

R/mol_roi.R Outdated

if(!is.na(maskValue)){
smpld <- smpld %>%
filter(value != maskValue)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing prepended package name in function calls here. filter here and in 238.

R/mol_roi.R Show resolved Hide resolved
R/mol_spatiotemporal_duplicate.R Show resolved Hide resolved
Copy link
Collaborator

@sjevelazco sjevelazco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi

I tested the mol_roi() function, and there are several errors. Also some inconsistencies with class of R object that this function is limited to, but they are widely used in bdc, for example, mol_roi is written to handle data.frame, however it must work with tibble too and return tibble objects :)

R/mol_roi.R Outdated
#' @param data data.frame. Containing geographical coordinates.
#' @param roi spatial region of interest. Can be provided as an sf, sfc,
#' SpatialPolygons, SpatialPolygonsDataFrame, rasterLayer, spatRaster,
#' or as a file path with extension #' ".shp", ".gpkg", or ".tif".
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove #' before ".shp", ".gpkg", or ".tif".
(#' or as a file path with extension ".shp", ".gpkg", or ".tif".)

R/mol_roi.R Outdated
#' @details This test identifies records outside a particular region of interest
#' defined by a raster or vector layer
#'
#' @return A data.frame containing the column ".outside_roi". Compliant
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bdc functions return tibble objects

Comment on lines +63 to +64
check_col(data, c(lat, lon))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace by
all(c(lat, lon)%in%colnames(data))

So you can remove this check_col function that belongs to a package not used in bdc

Comment on lines +59 to +61
if (!is.data.frame(data)) {
stop(deparse(substitute(data)), " is not a data.frame", call. = FALSE)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function must be flexible to handle tibble or data.frame object

R/mol_roi.R Outdated
Comment on lines 65 to 72
if(!any(c("character",
"sf",
"sfc",
"SpatialPolygons",
"RasterLayer",
"SpatRaster")) %in% class(roi)){
stop("The region of interest is not in a valid format.")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parentheses are wrong here replace by

if(!any(c("character",
"sf",
"sfc",
"SpatialPolygons",
"RasterLayer",
"SpatRaster") %in% class(roi))){
stop("The region of interest is not in a valid format.")
}

R/mol_roi.R Outdated
### converts coordinates columns to numeric
data <-
data %>%
dplyr::rowid_to_column("id_temp") %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rowied_to_column() is from tibble package, fix it

R/mol_roi.R Outdated
Comment on lines 83 to 86
dataCoords %>%
dplyr::select(dplyr::all_of(c("id_temp", lon, lat))) %>%
dplyr::filter(!is.na(.data[[lat]]),
!is.na(.data[[lon]]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where did dataCoords come from? this object was not created before in the function, please check it.
Also, the output of these codes is not being saved, fix it .

dplyr::filter(!is.na(.data[[lat]]),
!is.na(.data[[lon]]))

if(nrow(dataCoords) == 0) stop("No valid coordinates.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No filter was applied to test this,

Maybe you want to write

dataCoords <- data %>%
dplyr::select(dplyr::all_of(c("id_temp", lon, lat))) %>%
dplyr::filter(!is.na(.data[[lat]]),
!is.na(.data[[lon]]))

if(nrow(dataCoords) == 0) stop("No valid coordinates.")

R/mol_roi.R Outdated

### Convert to sf/terra
if("SpatialPolygons" %in% .class2(roi)) roi <- st_as_sfc(roi)
if(class(roi) == "RasterLayer") roi <- rast(roi)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an error here

if(class(roi) == "RasterLayer") roi <- rast(roi)
Error in if (class(roi) == "RasterLayer") roi <- rast(roi) :
the condition has length > 1

Maybe can be replaced by
class(roi)[1] == "RasterLayer"

R/mol_roi.R Outdated
Comment on lines 116 to 131
if(any(c("sf", "sfc") %in% class(roi))){

if(!any(is.na(maskValue), is.null(maskValue))){
warning("maskValue is ignored when ROI is provided as a vector layer.")
}

unflagged <- roi_sf(dataCoords,
roi,
lat,
lon)
} else(
unflagged <- roi_rast(dataCoords,
roi,
lat,
lon)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an error here

Spherical geometry (s2) switched off
although coordinates are longitude/latitude, st_union assumes that they are planar
Spherical geometry (s2) switched on
Error in dplyr::mutate():
! Problem while computing within = st_within(., roi, sparse = F)[, 1].
Caused by error in UseMethod():
! no applicable method for 'st_geometry' applied to an object of class "data.frame"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for running these checks and all of your suggestions. I hope to make all the corrections this week.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @matthewsrogan,

I would like to work on some suggested functions to turn them able to bdc. May I make pull requests to the forked repo bdc_mol?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Lucas, that would be great. I just pushed some updates I've been testing this week.

debugged raster and vector helper functions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants