-
Notifications
You must be signed in to change notification settings - Fork 16
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #261 from lindsaycarr/online-course
sbtools data discovery
- Loading branch information
Showing
6 changed files
with
828 additions
and
4 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,263 @@ | ||
--- | ||
title: "sbtools - Data discovery" | ||
date: "9999-07-01" | ||
author: "Lindsay R. Carr" | ||
slug: "sbtools-discovery" | ||
image: "img/main/intro-icons-300px/r-logo.png" | ||
output: USGSmarkdowntemplates::hugoTraining | ||
parent: Introduction to USGS R Packages | ||
weight: 2 | ||
draft: true | ||
--- | ||
|
||
```{r setup, include=FALSE, warning=FALSE, message=FALSE} | ||
library(knitr) | ||
knit_hooks$set(plot=function(x, options) { | ||
sprintf("<img src='../%s%s-%d.%s'/ title='%s'/>", | ||
options$fig.path, options$label, options$fig.cur, options$fig.ext, options$fig.cap) | ||
}) | ||
opts_chunk$set( | ||
echo=TRUE, | ||
fig.path="static/sbtools-discovery/", | ||
fig.width = 6, | ||
fig.height = 6, | ||
fig.cap = "TODO" | ||
) | ||
set.seed(1) | ||
``` | ||
|
||
Although ScienceBase is a great platform for uploading and storing your data, you can also use it to find other available data. You can do that manually by searching using the ScienceBase web interface or through `sbtools` functions. | ||
|
||
## Discovering data via web interface | ||
|
||
The most familiar way to search for data would be to use the ScienceBase search capabilities available online. You can search for any publically available data in the [ScienceBase catalog](https://www.sciencebase.gov/catalog/). Search by category (map, data, project, publication, etc), topic-based tags, or location; or search by your own key words. | ||
|
||
![ScienceBase Catalog Homepage](../static/img/sb_catalog_search.png#inline-img "search ScienceBase catalog") | ||
|
||
Learn more about the [catalog search features](www.sciencebase.gov/about/content/explore-sciencebase#2. Search ScienceBase) and explore the [advanced searching capabilities](www.sciencebase.gov/about/content/sciencebase-advanced-search) on the ScienceBase help pages. | ||
|
||
## Discovering data via sbtools | ||
|
||
The ScienceBase search tools can be very powerful, but lack the ability to easily recreate the search. If you want to incorporate dataset queries into a reproducible workflow, you can script them using the `sbtools` query functions. The terminology differs from the web interface slightly. Below are functions available to query the catalog: | ||
|
||
1. `query_sb_text` (matches title or description) | ||
2. `query_sb_doi` (use a DOI identifier) | ||
3. `query_sb_spatial` (data within or at a specific location) | ||
4. `query_sb_date` (items within time range) | ||
5. `query_sb_datatype` (type of data, not necessarily file type) | ||
6. `query_sb` (generic SB query) | ||
|
||
These functions take a variety of inputs, and all return an R list of `sbitems` (a special `sbtools` class). All of these functions default to 20 returned search results, but you can change that by specifying the argument `limit`. The `query_sb` is a generalization of the other functions, and has a number of additional query specifications: [Lucene query string](http://www.lucenetutorial.com/lucene-query-syntax.html), folder and parent items, item ids, or project status. Before we practice using these functions, make sure you load the `sbtools` package in your current R session. | ||
|
||
```{r load, message=FALSE, warning=FALSE} | ||
library(sbtools) | ||
``` | ||
|
||
### Using `query_sb_text` | ||
|
||
`query_sb_text` returns a list of `sbitems` that match the title or description fields. Use it to search authors, station names, rivers, states, etc. | ||
|
||
```{r query_sb_text} | ||
# search using a contributors name | ||
contrib_results <- query_sb_text("Robert Hirsch") | ||
head(contrib_results, 2) | ||
# search using place of interest | ||
park_results <- query_sb_text("Yellowstone") | ||
head(park_results, 2) | ||
# search using a river | ||
river_results <- query_sb_text("Rio Grande") | ||
length(river_results) | ||
head(river_results, 2) | ||
``` | ||
|
||
It might be easier to look at the results returned from queries by just looking at their titles. The other information stored in an sbitem is useful, but a little distracting when you are looking at many results. You can use `sapply` to extract the titles. | ||
|
||
```{r query_sb-sapply} | ||
# look at all titles returned from the site location query previously made | ||
sapply(river_results, function(item) item$title) | ||
``` | ||
|
||
Now you can use `sapply` to look at the titles for your returned searches instead of `head`. | ||
|
||
### Using `query_sb_doi` | ||
|
||
Use a Digital Object Identifier (DOI) to query ScienceBase. This should return only one list item, unless there is more than one ScienceBase item referencing this very unique identifier. | ||
|
||
```{r query_sb_doi} | ||
# USGS Microplastics study | ||
query_sb_doi('10.5066/F7ZC80ZP') | ||
# Environmental Characteristics data | ||
query_sb_doi('10.5066/F77W699S') | ||
``` | ||
|
||
### Using `query_sb_spatial` | ||
|
||
`query_sb_spatial` accepts 3 different methods for specifying a spatial area in which to look for data. To illustrate the methods, we are going to use the spatial extents of the Appalachian Mountains and the Continental US. | ||
|
||
```{r query_sb_spatial} | ||
appalachia <- data.frame( | ||
lat = c(34.576900, 36.114974, 37.374456, 35.919619, 39.206481), | ||
long = c(-84.771119, -83.393990, -81.256731, -81.492395, -78.417345)) | ||
conus <- data.frame( | ||
lat = c(49.078148, 47.575022, 32.914614, 25.000481), | ||
long = c(-124.722111, -67.996898, -118.270335, -80.125804)) | ||
# verifying where points are supposed to be | ||
maps::map('usa') | ||
points(conus$long, conus$lat, col="red", pch=20) | ||
points(appalachia$long, appalachia$lat, col="green", pch=20) | ||
``` | ||
|
||
The first way to query spatially is by specifying a bounding box `bbox` as an `sp` spatial data object. Visit the [`sp` package documentation](https://cran.r-project.org/web/packages/sp/vignettes/intro_sp.pdf) for more information on spatial data objects. | ||
|
||
```{r query_sb_spatial-bbox} | ||
# query by bounding box | ||
query_sb_spatial(bbox= | ||
sp::SpatialPoints(appalachia, proj4string = | ||
sp::CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"))) | ||
``` | ||
|
||
Alternatively, you can supply a vector of latitudes and a vector of longitudes using `lat` and `long` arguments. The function will automatically use the minimum and maximum from those vectors to construct a boundary box. | ||
|
||
```{r query_sb_spatial-latlong} | ||
# query by latitude and longitude vectors | ||
query_sb_spatial(long = appalachia$long, lat = appalachia$lat) | ||
query_sb_spatial(long = conus$long, lat = conus$lat) | ||
``` | ||
|
||
The last way to represent a spatial region to query ScienceBase is using a POLYGON Well-known text (WKT) object as a text string. The format is `"POLYGON(([LONG1 LAT1], [LONG2 LAT2], [LONG3 LAT3]))"`, where `LONG#` and `LAT#` are longitude and latitude pairs as decimals. See the [Open Geospatial Consortium WKT standard](http://www.opengeospatial.org/standards/wkt-crs) for more information. | ||
|
||
```{r query_sb_spatial-wkt} | ||
# query by WKT polygon | ||
wkt_coord_str <- paste(conus$long, conus$lat, sep=" ", collapse = ",") | ||
wkt_str <- sprintf("POLYGON((%s))", wkt_coord_str) | ||
query_sb_spatial(bb_wkt = wkt_str) | ||
``` | ||
|
||
### Using `query_sb_date` | ||
|
||
`query_sb_date` returns ScienceBase items that fall within a certain time range. There are multiple timestamps applied to items, so you will need to specify which one to match the range. The default queries are to look for items last updated between 1970-01-01 and today's date. See `?query_sb_date` for more examples of which timestamps are available. | ||
|
||
```{r query_sb_date} | ||
# find data worked on in the last week | ||
today <- Sys.Date() | ||
oneweekago <- today - as.difftime(7, units='days') # days * hrs/day * secs/hr | ||
recent_data <- query_sb_date(start = today, end = oneweekago) | ||
sapply(recent_data, function(item) item$title) | ||
# find data that's been created over the last year | ||
oneyearago <- today - as.difftime(365, units='days') # days * hrs/day * secs/hr | ||
recent_data <- query_sb_date(start = today, end = oneyearago, date_type = "dateCreated") | ||
sapply(recent_data, function(item) item$title) | ||
``` | ||
|
||
### Using `query_sb_datatype` | ||
|
||
`query_sb_datatype` is used to search ScienceBase by the type of data an item is listed as. Run `sb_datatypes()` to get a list of 50 available data types. | ||
|
||
```{r query_sb_datatype} | ||
# get ScienceBase news items | ||
sbnews <- query_sb_datatype("News") | ||
sapply(sbnews, function(item) item$title) | ||
# find shapefiles | ||
shps <- query_sb_datatype("Shapefile") | ||
sapply(shps, function(item) item$title) | ||
# find raster data | ||
sbraster <- query_sb_datatype("Raster") | ||
sapply(sbraster, function(item) item$title) | ||
``` | ||
|
||
## Best of both methods | ||
|
||
Although you can query from R, sometimes it's useful to look at an item on the web interface. You can use the `query_sb_*` functions and then follow that URL to view items on the web. This is especially handy for viewing maps and metadata, or to check or repair a ScienceBase item if any of the `sbtools`-based commands have failed. | ||
|
||
```{r query_sb-website} | ||
sbmaps <- query_sb_datatype("Static Map Image", limit=3) | ||
oneitem <- sbmaps[[1]] | ||
# get and open URL from single sbitem | ||
url_oneitem <- oneitem[['link']][['url']] | ||
browseURL(url_oneitem) | ||
# get and open URLs from many sbitems | ||
lapply(sbmaps, function(sbitem) { | ||
url <- sbitem[['link']][['url']] | ||
browseURL(url) | ||
return(url) | ||
}) | ||
``` | ||
|
||
### Using `query_sb` | ||
|
||
`query_sb` is the "catch-all" function for querying ScienceBase from R. It only takes one argument for specifying query parameters, `query_list`. This is an R list with specific query parameters as the list names and the user query string as the list values. See the `Description` section of the help file for all options (`?query_sb`). | ||
|
||
```{r query_sb-keywords} | ||
# search by keyword | ||
precip_data <- query_sb(query_list = list(q = 'precipitation')) | ||
length(precip_data) # 20 entries, but there is likely more than 20 results | ||
sapply(precip_data, function(item) item$title) | ||
# search by keyword, sort by last updated, and increase num results allowed | ||
precip_data_recent <- query_sb(query_list = list(q = 'precipitation', | ||
sort = 'lastUpdated', | ||
limit = 50)) | ||
length(precip_data_recent) # 50 entries, but the search criteria is the same, just sorted | ||
sapply(precip_data_recent, function(item) item$title) | ||
# search by keyword + type | ||
# Used sb_datatype() to figure out what types were allowed for "browseType" | ||
precip_maps_data <- query_sb(query_list = list(q = 'precipitation', browseType = "Static Map Image", sort='title')) | ||
sapply(precip_maps_data, function(item) item$title) | ||
``` | ||
|
||
If you want to search by more than one keyword, you should use Lucene query syntax. Visit [this page](http://www.lucenetutorial.com/lucene-query-syntax.html) for information on Lucene queries. For instance, you can have results returned that include both "flood" and "earthquake", or either "flood" or "earthquake". Current functionality requires a regular query to be specified in order for `lq` to return results. So, just include `q = ''` when executing Lucene queries (this is a known [issue](https://github.com/USGS-R/sbtools/issues/236) in `sbtools`). | ||
|
||
```{r query_sb-lucene} | ||
# search by 2 keywords (AND) | ||
hazard2and_data <- query_sb(query_list = list(q = '', lq = 'flood AND earthquake'), | ||
limit=200) | ||
length(hazard2and_data) | ||
# search by 2 keywords (OR) | ||
hazard2or_data <- query_sb(query_list = list(q = '', lq = 'flood OR earthquake'), | ||
limit=200) | ||
length(hazard2or_data) | ||
# search by 3 keywords with grouped query | ||
hazard3_data <- query_sb(query_list = | ||
list(q = '', lq = '(flood OR earthquake) AND drought'), | ||
limit=200) | ||
length(hazard3_data) | ||
``` | ||
|
||
## No results | ||
|
||
Some of your queries will probably return no results. When there are no results that match your query, the returned list will have a length of 0. | ||
|
||
```{r query_sb-empty} | ||
# search for items related to a Water Quality Portal paper DOI | ||
wqp_paper <- query_sb_doi(doi = '10.1002/2016WR019993') | ||
length(wqp_paper) | ||
head(wqp_paper) | ||
# spatial query in the middle of the Atlantic Ocean | ||
atlantic_ocean <- query_sb_spatial(long=28.790431, lat=-41.436485) | ||
length(atlantic_ocean) | ||
head(atlantic_ocean) | ||
# date query during Marco Polo's life | ||
marco_polo <- query_sb_date(start = as.Date("1254-09-15"), | ||
end = as.Date("1324-01-08")) | ||
length(marco_polo) | ||
head(marco_polo) | ||
``` |
Oops, something went wrong.