sbtools data discovery #261
@aappling-usgs review/suggest edits once you return from travel!

whooey, that was a big one! there's lots to learn about sbtools & ScienceBase, i guess. many comments & ideas below.

The ScienceBase search tools can be very powerful, but lack the ability to easily recreate the search. If you want to incorporate dataset queries into a reproducible workflow, you can script them using the `sbtools` query functions. The terminology differs from the web interface slightly. Below are functions available to query the catalog:

1. `query_sb` (generic SB query)

this list is very helpful - nice to see all the options in one place.
i wonder if we can say any more about query_sb() in particular here...i think it's more flexible than the other 5, right? can it do everything that the other 5 can, and more? when would a user want to use this one instead of one of the others? do we know any of the things this one can do that the others can't? could it make sense to put this at the end of the list so that you can explain this one as a generalization of the others in the text that follows?

### Using `query_sb`

`query_sb` is the "catch-all" function for querying ScienceBase from R. It only takes one argument for specifying query parameters, `query_list`. This is an R list with specific query parameters as the list names and the user query string as the list values. See the `DESCRIPTION` section of the help file for all options (`?query_sb`).

How about **Description** instead of `DESCRIPTION`, so nobody confuses this help file section with the package DESCRIPTION file?

# search by keyword
precip_query <- list(q = 'precipitation')
precip_data <- query_sb(query_list = precip_query)

i personally prefer to avoid defining variables that only get used once and are already informatively labeled when they're used. therefore, if doing this myself, I'd convert lines 69-70 to
precip_data <- query_sb(query_list = list(q = 'precipitation'))
but i'd like to hear your case for doing it via an intermediate variable, as it currently is, and i'm fine with leaving it this way if you actively prefer it this way.
I don't think I really had a reason...I think I just did it. Thanks for pointing it out, I agree - no need to clutter the environment for something that is used once.

# search by keyword + category
precip_maps_query <- list(q = 'precipitation', browseType = "Static Map Image", sort='title')
precip_maps_data <- query_sb(query_list = precip_query)

here, `precip_query` is probably not the argument you intended (b/c `precip_maps_query` is what you just defined). possibly a minor case in point for not defining intermediate variables?

# search by keyword
precip_query <- list(q = 'precipitation')
precip_data <- query_sb(query_list = precip_query)
length(precip_data) # 50 entries, so there is likely more than 50 results

couple of typos in the comment. instead:
length(precip_data) # 20 entries, so there are likely more than 20 results

```{r}
# find data worked on in the last week
today <- Sys.time()
oneweekago <- today - (7*24*3600) # days * hrs/day * secs/hr
```

any interest in replacing this with `oneweekago <- today - as.difftime(7, units='days')`?

# find data worked on in the last week
today <- Sys.time()
oneweekago <- today - (7*24*3600) # days * hrs/day * secs/hr
recent_data <- query_sb_date(start = today, end = oneweekago)

i started by wondering whether it was OK for start to come after end, then happened on this curious behavior:
> recent_data <- query_sb_date(start = today, end = oneweekago, limit=10000)
> length(recent_data)
[1] 114
> recent_data <- query_sb_date(end = today, start = oneweekago, limit=10000)
> length(recent_data)
[1] 122
...i don't know what to make of this. datetimes do get converted to dates (https://github.com/USGS-R/sbtools/blob/master/R/query_sb_date.R#L36), maybe one of start/end is inclusive while the other is exclusive?
i guess i'd recommend demonstrating with dates rather than datetimes, given that query_sb_date simplifies to dates no matter what. the inclusive/exclusive thing is curious but probably not worth teaching here.
Tried doing this with Dates instead of datetimes, but still seeing this curious behavior...
length(query_sb_date(start = today, end = oneweekago, limit=10000))
[1] 10000
> length(query_sb_date(end = today, start = oneweekago, limit=10000))
[1] 407
> length(query_sb_date(start = as.Date(today), end = as.Date(oneweekago), limit=10000))
[1] 10000
> length(query_sb_date(end = as.Date(today), start = as.Date(oneweekago), limit=10000))
[1] 407
wow, that's a much bigger difference than I saw! i guess it's possible that >=9593 items were created exactly on `as.Date(oneweekago)`...or that when end < start, strange things happen. couldn't say which without more testing, which i'm not sure is worth our time.
that said, i didn't actually expect the Date approach to change the behavior - just to help clarify that Dates are the best precision you can hope for
Yeah, that seems like something we don't need to defend against here.
Ah I see your point in doing that. I'll update it.
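
For reference, a minimal sketch of the Date-based version discussed above, with `start` kept earlier than `end`. `date_window` is a hypothetical helper (not part of sbtools), and the commented-out call assumes `query_sb_date()` accepts `Date` objects for `start` and `end`, as in the console tests earlier in this thread.

```r
# Hypothetical helper (not part of sbtools): build a Date-precision window
# that ends on `end` and starts `days` days earlier.
date_window <- function(days, end = Sys.Date()) {
  stopifnot(is.numeric(days), days >= 0)
  list(start = end - days, end = end)
}

win <- date_window(7)

# Guard against accidentally swapping the two arguments
stopifnot(win$start <= win$end)

# With sbtools attached, the window would then be passed along as:
# recent_data <- query_sb_date(start = win$start, end = win$end)
```

Working in Dates from the start makes it explicit that day-level precision is the best the query can use, and the `stopifnot()` check keeps the example away from the unexplained `end < start` behavior.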

### Using `query_sb_datatype`

`query_sb_datatype` is used to search ScienceBase by the type of data an item is listed as. Run `sb_datatypes()` to get a list of 50 available data types.

ooh, good tip! and looks like this is also the list of possible `browseType`s? i see that `query_sb_datatype` creates a `browseType` filter, and element 46 ("Static Map Image") is what you used for `browseType` above in line 81...if you do move `query_sb()` down to below all these more specific `query_sb_...()`s, you could just refer back to this function when you do the `query_sb(browseType...)` example

```
})
```

## No results

maybe this should go above Best of Both Methods, given that it refers to the first method only?

```{r}
# search for items related to a Water Quality Portal paper DOI
query_results <- query_sb_doi(doi = '10.1002/2016WR019993')
```

lol. making good use of a dead end from your work on the `query_sb_doi` section, eh? there's no chance that this paper will ever make its way onto SB, is there?

Not that I know of!
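
A minimal sketch of how the No results section could check for an empty query. It assumes that a query with no matches comes back as a zero-length list, which has not been verified here; `summarize_results` is a hypothetical helper, not an sbtools function.

```r
# Hypothetical helper: report how many items a query returned.
# Assumes an empty result is a zero-length list (not verified against sbtools).
summarize_results <- function(results, label = "query") {
  n <- length(results)
  if (n == 0) {
    message(label, ": no results found")
  } else {
    message(label, ": ", n, " item(s) returned")
  }
  invisible(n)
}

# Stand-in for a query that matched nothing
summarize_results(list(), label = "DOI query")

# With sbtools attached, the same check would apply to a real query:
# query_results <- query_sb_doi(doi = '10.1002/2016WR019993')
# summarize_results(query_results, label = "DOI query")
```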

Updated everything based on @aappling-usgs feedback. There is a known issue with spatial data queries returning empty lists (DOI-USGS/sbtools#237), so that will need to be revisited and updated before this curriculum "goes live". I've made an issue here for that - #270.

length(atlantic_ocean)
head(atlantic_ocean)

# date query during Marco Polo's life

😺

Some of the `query_sb` function examples need some work (can't get spatial ones to work, and doi ones are causing issues). This fixes #113