sbtools data discovery #261

Merged: 15 commits, Jul 5, 2017

@lindsayplatt (Contributor) commented Jun 27, 2017

Some of the query_sb function examples need some work (can't get spatial ones to work, and doi ones are causing issues). This fixes #113

@lindsayplatt
Contributor Author

@aappling-usgs review/suggest edits once you return from travel!

Member

@aappling-usgs left a comment

whooey, that was a big one! there's lots to learn about sbtools & ScienceBase, i guess. many comments & ideas below.


The ScienceBase search tools can be very powerful, but lack the ability to easily recreate the search. If you want to incorporate dataset queries into a reproducible workflow, you can script them using the `sbtools` query functions. The terminology differs from the web interface slightly. Below are functions available to query the catalog:

1. `query_sb` (generic SB query)
Member

this list is very helpful - nice to see all the options in one place.

i wonder if we can say any more about query_sb() in particular here...i think it's more flexible than the other 5, right? can it do everything that the other 5 can, and more? when would a user want to use this one instead of one of the others? do we know any of the things this one can do that the others can't? could it make sense to put this at the end of the list so that you can explain this one as a generalization of the others in the text that follows?


### Using `query_sb`

`query_sb` is the "catch-all" function for querying ScienceBase from R. It only takes one argument for specifying query parameters, `query_list`. This is an R list with specific query parameters as the list names and the user query string as the list values. See the `DESCRIPTION` section of the help file for all options (`?query_sb`).
Member

How about **Description** instead of `DESCRIPTION`, so nobody confuses this help file section with the package DESCRIPTION file?


# search by keyword
precip_query <- list(q = 'precipitation')
precip_data <- query_sb(query_list = precip_query)
Member

i personally prefer to avoid defining variables that only get used once and are already informatively labeled when they're used. therefore, if doing this myself, I'd convert lines 69-70 to

precip_data <- query_sb(query_list = list(q = 'precipitation'))

but i'd like to hear your case for doing it via an intermediate variable, as it currently is, and i'm fine with leaving it this way if you actively prefer it this way.

Contributor Author

I don't think I really had a reason...I think I just did it. Thanks for pointing it out, I agree - no need to clutter the environment for something that is used once.


# search by keyword + category
precip_maps_query <- list(q = 'precipitation', browseType = "Static Map Image", sort='title')
precip_maps_data <- query_sb(query_list = precip_query)
Member

here, precip_query is probably not the argument you intended (b/c precip_maps_query is what you just defined). possibly a minor case in point for not defining intermediate variables?
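
For reference, a minimal sketch of the call as it was presumably intended, passing the list that was just defined. The query parameters are the ones from the excerpt; the `interactive()` guard is an assumption added here so the snippet does not hit ScienceBase when run unattended.

```r
# Query list from the excerpt: keyword + category, sorted by title
precip_maps_query <- list(q = "precipitation",
                          browseType = "Static Map Image",
                          sort = "title")

# Network call: assumes sbtools is installed and ScienceBase is reachable,
# so it is only attempted in an interactive session.
if (interactive() && requireNamespace("sbtools", quietly = TRUE)) {
  # Pass precip_maps_query, not the earlier precip_query
  precip_maps_data <- sbtools::query_sb(query_list = precip_maps_query)
  length(precip_maps_data)
}
```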

# search by keyword
precip_query <- list(q = 'precipitation')
precip_data <- query_sb(query_list = precip_query)
length(precip_data) # 50 entries, so there is likely more than 50 results
Member

couple of typos in the comment. instead:

length(precip_data) # 20 entries, so there are likely more than 20 results

```{r}
# find data worked on in the last week
today <- Sys.time()
oneweekago <- today - (7*24*3600) # days * hrs/day * secs/hr
Member

any interest in replacing this with

oneweekago <- today - as.difftime(7, units='days')

?
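
A quick sketch, in base R only, of why the two forms are interchangeable: subtracting a number from a `POSIXct` subtracts seconds, and a `difftime` in days is converted to seconds before subtraction. The variable names `by_seconds` and `by_difftime` are illustrative, not from the PR.

```r
today <- Sys.time()

# Original form: subtract one week expressed in seconds
by_seconds <- today - (7 * 24 * 3600)  # days * hrs/day * secs/hr

# Suggested form: let difftime carry the units
by_difftime <- today - as.difftime(7, units = "days")

# Both land on the same instant
isTRUE(all.equal(by_seconds, by_difftime))
```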

# find data worked on in the last week
today <- Sys.time()
oneweekago <- today - (7*24*3600) # days * hrs/day * secs/hr
recent_data <- query_sb_date(start = today, end = oneweekago)
Member

i started by wondering whether it was OK for start to come after end, then happened on this curious behavior:

> recent_data <- query_sb_date(start = today, end = oneweekago, limit=10000)
> length(recent_data)
[1] 114
> recent_data <- query_sb_date(end = today, start = oneweekago, limit=10000)
> length(recent_data)
[1] 122

...i don't know what to make of this. datetimes do get converted to dates (https://github.com/USGS-R/sbtools/blob/master/R/query_sb_date.R#L36), maybe one of start/end is inclusive while the other is exclusive?

i guess i'd recommend demonstrating with dates rather than datetimes, given that query_sb_date simplifies to dates no matter what. the inclusive/exclusive thing is curious but probably not worth teaching here.

Contributor Author

Tried doing this with Dates instead of datetimes, but still seeing this curious behavior...

> length(query_sb_date(start = today, end = oneweekago, limit=10000))
[1] 10000
> length(query_sb_date(end = today, start = oneweekago, limit=10000))
[1] 407
> length(query_sb_date(start = as.Date(today), end = as.Date(oneweekago), limit=10000))
[1] 10000
> length(query_sb_date(end = as.Date(today), start = as.Date(oneweekago), limit=10000))
[1] 407

Member

wow, that's a much bigger difference than I saw! i guess it's possible that >=9593 items were created exactly on as.Date(oneweekago)...or that when end < start, strange things happen. couldn't say which without more testing, which i'm not sure is worth our time.

that said, i didn't actually expect the Date approach to change the behavior - just to help clarify that Dates are the best precision you can hope for
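
A minimal sketch of the Date-based version discussed in this thread, with `start` as the earlier date. It assumes `query_sb_date` accepts `Date` objects, as the calls above suggest; the `interactive()` guard is an addition here so the snippet does not query ScienceBase when run unattended.

```r
# Work in Dates, since query_sb_date reduces datetimes to dates anyway
today <- Sys.Date()
oneweekago <- today - 7  # Date arithmetic is in whole days

# Keep the window chronological: start is the earlier date
stopifnot(oneweekago < today)

# Network call: assumes sbtools is installed and ScienceBase is reachable
if (interactive() && requireNamespace("sbtools", quietly = TRUE)) {
  recent_data <- sbtools::query_sb_date(start = oneweekago, end = today)
  length(recent_data)
}
```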

Contributor Author

Yeah, that seems like something we don't need to defend against here.

Ah I see your point in doing that. I'll update it.


### Using `query_sb_datatype`

`query_sb_datatype` is used to search ScienceBase by the type of data an item is listed as. Run `sb_datatypes()` to get a list of 50 available data types.
Member

ooh, good tip! and looks like this is also the list of possible browseTypes? i see that query_sb_datatype creates a browseType filter, and element 46 ("Static Map Image") is what you used for browseType above in line 81...if you do move query_sb() down to below all these more specific query_sb_...()s, you could just refer back to this function when you do the query_sb(browseType...) example

})
```

## No results
Member

maybe this should go above Best of Both Methods, given that it refers to the first method only?


```{r}
# search for items related to a Water Quality Portal paper DOI
query_results <- query_sb_doi(doi = '10.1002/2016WR019993')
Member

lol. making good use of a dead end from your work on the query_sb_doi section, eh? there's no chance that this paper will ever make its way onto SB, is there?

Contributor Author

Not that I know of!
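
A rough sketch of how a "no results" case could be checked in a script, assuming a query with no matches comes back as a zero-length list. `has_results` is a hypothetical helper for illustration, not an sbtools function, and `query_results` is stubbed with an empty list rather than fetched.

```r
# Hypothetical helper (not part of sbtools): TRUE when a query returned items
has_results <- function(results) length(results) > 0

# Stand-in for the output of a query that matched nothing
query_results <- list()

if (!has_results(query_results)) {
  message("No ScienceBase items matched this query.")
}
```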

@lindsayplatt
Contributor Author

Updated everything based on @aappling-usgs feedback. There is a known issue with spatial data queries returning empty lists (DOI-USGS/sbtools#237), so that will need to be revisited and updated before this curriculum "goes live". I've made an issue here for that - #270.

length(atlantic_ocean)
head(atlantic_ocean)

# date query during Marco Polo's life
Member

😺

@aappling-usgs aappling-usgs merged commit f9c7f9e into USGS-R:master Jul 5, 2017
@lindsayplatt lindsayplatt deleted the online-course branch August 9, 2017 22:37
Successfully merging this pull request may close these issues.

sbtools: data discovery
2 participants