Merge pull request #261 from lindsaycarr/online-course
sbtools data discovery
aappling-usgs authored Jul 5, 2017
2 parents 34b37bf + 2afd902 commit f9c7f9e
Showing 6 changed files with 828 additions and 4 deletions.
2 changes: 1 addition & 1 deletion content/usgs-packages/geoknife_Intro.Rmd
Expand Up @@ -34,7 +34,7 @@ set.seed(1)

This lesson will explore how to find and download large gridded datasets via the R package `geoknife`. The package was created to allow easy access to data stored in the [Geo Data Portal (GDP)](https://cida.usgs.gov/gdp/), or any gridded dataset available through the [OPeNDAP](https://www.opendap.org/) protocol DAP2. `geoknife` refers to the gridded dataset as the `fabric`, the spatial feature of interest as the `stencil`, and the subset algorithm parameters as the `knife` (see below).

![geoknife terminology figure](../static/img/geoknife_summary.png "figure illustrating definitions of fabric, stencil, and knife")
![geoknife terminology figure](../static/img/geoknife_summary.png#inline-img "figure illustrating definitions of fabric, stencil, and knife")

## Lesson Objectives

5 changes: 2 additions & 3 deletions content/usgs-packages/geoknife_Intro.md
Expand Up @@ -3,20 +3,19 @@ author: Lindsay R. Carr
date: 9999-10-01
slug: geoknife-intro
title: geoknife - Introduction
draft: true
image: img/main/intro-icons-300px/r-logo.png
identifier:
menu:
main:
parent: Introduction to USGS R Packages
weight: 2
draft: true
---
Lesson Summary
--------------

This lesson will explore how to find and download large gridded datasets via the R package `geoknife`. The package was created to allow easy access to data stored in the [Geo Data Portal (GDP)](https://cida.usgs.gov/gdp/), or any gridded dataset available through the [OPeNDAP](https://www.opendap.org/) protocol DAP2. `geoknife` refers to the gridded dataset as the `fabric`, the spatial feature of interest as the `stencil`, and the subset algorithm parameters as the `knife` (see below).

![geoknife terminology figure](../static/img/geoknife_summary.png "figure illustrating definitions of fabric, stencil, and knife")
![geoknife terminology figure](../static/img/geoknife_summary.png#inline-img "figure illustrating definitions of fabric, stencil, and knife")

Lesson Objectives
-----------------
263 changes: 263 additions & 0 deletions content/usgs-packages/sbtools_Discovery.Rmd
@@ -0,0 +1,263 @@
---
title: "sbtools - Data discovery"
date: "9999-07-01"
author: "Lindsay R. Carr"
slug: "sbtools-discovery"
image: "img/main/intro-icons-300px/r-logo.png"
output: USGSmarkdowntemplates::hugoTraining
parent: Introduction to USGS R Packages
weight: 2
draft: true
---

```{r setup, include=FALSE, warning=FALSE, message=FALSE}
library(knitr)
knit_hooks$set(plot=function(x, options) {
sprintf("<img src='../%s%s-%d.%s' title='%s'/>",
options$fig.path, options$label, options$fig.cur, options$fig.ext, options$fig.cap)
})
opts_chunk$set(
echo=TRUE,
fig.path="static/sbtools-discovery/",
fig.width = 6,
fig.height = 6,
fig.cap = "TODO"
)
set.seed(1)
```

Although ScienceBase is a great platform for uploading and storing your data, you can also use it to find other available data. You can do that manually by searching using the ScienceBase web interface or through `sbtools` functions.

## Discovering data via web interface

The most familiar way to search for data is to use the ScienceBase search capabilities available online. You can search for any publicly available data in the [ScienceBase catalog](https://www.sciencebase.gov/catalog/). Search by category (map, data, project, publication, etc.), topic-based tags, or location; or search by your own keywords.

![ScienceBase Catalog Homepage](../static/img/sb_catalog_search.png#inline-img "search ScienceBase catalog")

Learn more about the [catalog search features](https://www.sciencebase.gov/about/content/explore-sciencebase#2.%20Search%20ScienceBase) and explore the [advanced searching capabilities](https://www.sciencebase.gov/about/content/sciencebase-advanced-search) on the ScienceBase help pages.

## Discovering data via sbtools

The ScienceBase web search tools are powerful, but a search run through the web interface is hard to recreate exactly. If you want to incorporate dataset queries into a reproducible workflow, you can script them using the `sbtools` query functions. The terminology differs slightly from the web interface. The following functions are available to query the catalog:

1. `query_sb_text` (matches title or description)
2. `query_sb_doi` (use a DOI identifier)
3. `query_sb_spatial` (data within or at a specific location)
4. `query_sb_date` (items within time range)
5. `query_sb_datatype` (type of data, not necessarily file type)
6. `query_sb` (generic SB query)

These functions take a variety of inputs, and all return an R list of `sbitems` (a special `sbtools` class). By default, each function returns at most 20 search results, but you can change that with the `limit` argument. `query_sb` is a generalization of the other functions and accepts a number of additional query specifications: [Lucene query string](http://www.lucenetutorial.com/lucene-query-syntax.html), folder and parent items, item ids, or project status. Before we practice using these functions, make sure you load the `sbtools` package in your current R session.

```{r load, message=FALSE, warning=FALSE}
library(sbtools)
```

### Using `query_sb_text`

`query_sb_text` returns a list of `sbitems` that match the title or description fields. Use it to search authors, station names, rivers, states, etc.

```{r query_sb_text}
# search using a contributor's name
contrib_results <- query_sb_text("Robert Hirsch")
head(contrib_results, 2)
# search using place of interest
park_results <- query_sb_text("Yellowstone")
head(park_results, 2)
# search using a river
river_results <- query_sb_text("Rio Grande")
length(river_results)
head(river_results, 2)
```

It might be easier to look at the results returned from queries by just looking at their titles. The other information stored in an sbitem is useful, but a little distracting when you are looking at many results. You can use `sapply` to extract the titles.

```{r query_sb-sapply}
# look at all titles returned from the river query made previously
sapply(river_results, function(item) item$title)
```
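
If you want the titles to always come back as a character vector, `vapply` is a stricter alternative to `sapply`. The sketch below is only an illustration: `mock_results` is a hypothetical list of plain R lists standing in for real query results (actual `sbitem` objects carry many more fields), so it runs without a connection to ScienceBase.

```r
# Hypothetical stand-in for query results: each element mimics only the
# `title` field of an sbitem
mock_results <- list(
  list(title = "Rio Grande streamflow"),
  list(title = "Rio Grande water quality")
)

# vapply() checks that every title is a single character string
titles <- vapply(mock_results, function(item) item$title, character(1))
titles

# with an empty result list, vapply() still returns a character vector
no_titles <- vapply(list(), function(item) item$title, character(1))
length(no_titles)
```

The same pattern should work on a list returned by any of the `query_sb_*` functions, as long as every item has a `title`.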

Now you can use `sapply` to look at the titles for your returned searches instead of `head`.

### Using `query_sb_doi`

Use a Digital Object Identifier (DOI) to query ScienceBase. Because a DOI is a unique identifier, this should return a list with only one item, unless more than one ScienceBase item references the same DOI.

```{r query_sb_doi}
# USGS Microplastics study
query_sb_doi('10.5066/F7ZC80ZP')
# Environmental Characteristics data
query_sb_doi('10.5066/F77W699S')
```

### Using `query_sb_spatial`

`query_sb_spatial` accepts three different ways of specifying the spatial area in which to look for data. To illustrate them, we are going to use the spatial extents of the Appalachian Mountains and the Continental US.

```{r query_sb_spatial}
appalachia <- data.frame(
lat = c(34.576900, 36.114974, 37.374456, 35.919619, 39.206481),
long = c(-84.771119, -83.393990, -81.256731, -81.492395, -78.417345))
conus <- data.frame(
lat = c(49.078148, 47.575022, 32.914614, 25.000481),
long = c(-124.722111, -67.996898, -118.270335, -80.125804))
# verifying where points are supposed to be
maps::map('usa')
points(conus$long, conus$lat, col="red", pch=20)
points(appalachia$long, appalachia$lat, col="green", pch=20)
```

The first way to query spatially is by specifying a bounding box `bbox` as an `sp` spatial data object. Visit the [`sp` package documentation](https://cran.r-project.org/web/packages/sp/vignettes/intro_sp.pdf) for more information on spatial data objects.

```{r query_sb_spatial-bbox}
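# query by bounding box; SpatialPoints() reads the first column as x (longitude)
# and the second as y (latitude), so put the columns in that order first
query_sb_spatial(bbox =
  sp::SpatialPoints(appalachia[, c("long", "lat")], proj4string =
    sp::CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")))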
```

Alternatively, you can supply a vector of latitudes and a vector of longitudes using the `lat` and `long` arguments. The function will automatically use the minimum and maximum from those vectors to construct a bounding box.

```{r query_sb_spatial-latlong}
# query by latitude and longitude vectors
query_sb_spatial(long = appalachia$long, lat = appalachia$lat)
query_sb_spatial(long = conus$long, lat = conus$lat)
```
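
To see which extent such a query covers, you can compute the minimum and maximum of each vector yourself. This is a rough sketch of the idea described above, not the internal `sbtools` code; the name `bbox_corners` is made up for illustration.

```r
# Appalachian points from earlier in the lesson
appalachia_long <- c(-84.771119, -83.393990, -81.256731, -81.492395, -78.417345)
appalachia_lat <- c(34.576900, 36.114974, 37.374456, 35.919619, 39.206481)

# the bounding box is defined by the extremes of each coordinate
bbox_corners <- c(
  xmin = min(appalachia_long), ymin = min(appalachia_lat),
  xmax = max(appalachia_long), ymax = max(appalachia_lat)
)
bbox_corners
```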

The last way to represent a spatial region to query ScienceBase is using a POLYGON Well-known text (WKT) object as a text string. The format is `"POLYGON(([LONG1 LAT1], [LONG2 LAT2], [LONG3 LAT3]))"`, where `LONG#` and `LAT#` are longitude and latitude pairs as decimals. See the [Open Geospatial Consortium WKT standard](http://www.opengeospatial.org/standards/wkt-crs) for more information.

```{r query_sb_spatial-wkt}
# query by WKT polygon
wkt_coord_str <- paste(conus$long, conus$lat, sep=" ", collapse = ",")
wkt_str <- sprintf("POLYGON((%s))", wkt_coord_str)
query_sb_spatial(bb_wkt = wkt_str)
```
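
WKT polygons are conventionally written as a closed ring: the coordinate pairs are listed in order around the boundary and the first pair is repeated at the end. The `conus` points above are not listed in boundary order and the ring is not closed; this lesson does not confirm whether ScienceBase tolerates that. The sketch below shows one way to build a closed rectangular ring in base R. `make_wkt_bbox` is a hypothetical helper, not part of `sbtools`.

```r
# Hypothetical helper (not part of sbtools): build a closed rectangular
# WKT polygon from the extent of longitude and latitude vectors
make_wkt_bbox <- function(long, lat) {
  xs <- range(long)
  ys <- range(lat)
  # walk the corners in order and repeat the first corner to close the ring
  ring_long <- c(xs[1], xs[2], xs[2], xs[1], xs[1])
  ring_lat <- c(ys[1], ys[1], ys[2], ys[2], ys[1])
  coords <- paste(ring_long, ring_lat, sep = " ", collapse = ",")
  sprintf("POLYGON((%s))", coords)
}

closed_wkt <- make_wkt_bbox(long = c(-124.7, -67.9), lat = c(25, 49.1))
closed_wkt
```

The resulting string could then be passed to `query_sb_spatial` through the `bb_wkt` argument in the same way as `wkt_str`.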

### Using `query_sb_date`

`query_sb_date` returns ScienceBase items that fall within a certain time range. Items carry multiple timestamps, so you need to specify which one to match against the range using `date_type`. By default, the function looks for items last updated between 1970-01-01 and today's date. See `?query_sb_date` for examples of the available timestamps.

```{r query_sb_date}
# find data worked on in the last week
today <- Sys.Date()
oneweekago <- today - as.difftime(7, units='days')
# start is the earlier date and end is the later date
recent_data <- query_sb_date(start = oneweekago, end = today)
sapply(recent_data, function(item) item$title)
# find data that's been created over the last year
oneyearago <- today - as.difftime(365, units='days')
recent_data <- query_sb_date(start = oneyearago, end = today, date_type = "dateCreated")
sapply(recent_data, function(item) item$title)
```
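
The defaults described above run from an earlier date (`start`) to a later one (`end`), so it can help to build the range in one place. A minimal sketch, assuming that ordering; `date_window` is a hypothetical helper, not an `sbtools` function, and the fixed end date is only there to make the example reproducible.

```r
# Hypothetical helper (not part of sbtools): return a start/end pair
# covering the last `n_days` days, with the earlier date first
date_window <- function(n_days, end = Sys.Date()) {
  stopifnot(n_days >= 0)
  list(start = end - n_days, end = end)
}

last_week <- date_window(7, end = as.Date("2017-07-05"))
last_week$start
last_week$end
```

You could then call `query_sb_date(start = last_week$start, end = last_week$end)`.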

### Using `query_sb_datatype`

`query_sb_datatype` is used to search ScienceBase by the type of data an item is listed as. Run `sb_datatypes()` to get a list of 50 available data types.

```{r query_sb_datatype}
# get ScienceBase news items
sbnews <- query_sb_datatype("News")
sapply(sbnews, function(item) item$title)
# find shapefiles
shps <- query_sb_datatype("Shapefile")
sapply(shps, function(item) item$title)
# find raster data
sbraster <- query_sb_datatype("Raster")
sapply(sbraster, function(item) item$title)
```

## Best of both methods

Although you can query from R, sometimes it's useful to look at an item on the web interface. Each item returned by the `query_sb_*` functions stores the URL of its ScienceBase page, which you can extract and open in a browser. This is especially handy for viewing maps and metadata, or to check or repair a ScienceBase item if any of the `sbtools`-based commands have failed.

```{r query_sb-website}
sbmaps <- query_sb_datatype("Static Map Image", limit=3)
oneitem <- sbmaps[[1]]
# get and open URL from single sbitem
url_oneitem <- oneitem[['link']][['url']]
browseURL(url_oneitem)
# get and open URLs from many sbitems
lapply(sbmaps, function(sbitem) {
url <- sbitem[['link']][['url']]
browseURL(url)
return(url)
})
```

### Using `query_sb`

`query_sb` is the "catch-all" function for querying ScienceBase from R. It only takes one argument for specifying query parameters, `query_list`. This is an R list with specific query parameters as the list names and the user query string as the list values. See the `Description` section of the help file for all options (`?query_sb`).

```{r query_sb-keywords}
# search by keyword
precip_data <- query_sb(query_list = list(q = 'precipitation'))
length(precip_data) # 20 entries, but there are likely more than 20 results
sapply(precip_data, function(item) item$title)
# search by keyword, sort by last updated, and increase num results allowed
precip_data_recent <- query_sb(query_list = list(q = 'precipitation',
                                                 sort = 'lastUpdated'),
                               limit = 50)
length(precip_data_recent) # 50 entries; the search criteria are the same, just sorted
sapply(precip_data_recent, function(item) item$title)
# search by keyword + type
# Used sb_datatypes() to figure out what types were allowed for "browseType"
precip_maps_data <- query_sb(query_list = list(q = 'precipitation', browseType = "Static Map Image", sort='title'))
sapply(precip_maps_data, function(item) item$title)
```

If you want to search by more than one keyword, you should use Lucene query syntax. Visit [this page](http://www.lucenetutorial.com/lucene-query-syntax.html) for information on Lucene queries. For instance, you can have results returned that include both "flood" and "earthquake", or either "flood" or "earthquake". Current functionality requires a regular query to be specified in order for `lq` to return results. So, just include `q = ''` when executing Lucene queries (this is a known [issue](https://github.com/USGS-R/sbtools/issues/236) in `sbtools`).

```{r query_sb-lucene}
# search by 2 keywords (AND)
hazard2and_data <- query_sb(query_list = list(q = '', lq = 'flood AND earthquake'),
limit=200)
length(hazard2and_data)
# search by 2 keywords (OR)
hazard2or_data <- query_sb(query_list = list(q = '', lq = 'flood OR earthquake'),
limit=200)
length(hazard2or_data)
# search by 3 keywords with grouped query
hazard3_data <- query_sb(query_list =
list(q = '', lq = '(flood OR earthquake) AND drought'),
limit=200)
length(hazard3_data)
```
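
When the keywords come from a vector, the Lucene string can be assembled with `paste` rather than typed by hand. The sketch below only builds the string; `lucene_group` is a hypothetical helper, not part of `sbtools`, and it assumes single-word keywords (multi-word phrases would need quoting in Lucene syntax).

```r
# Hypothetical helper (not part of sbtools): join keywords with a Lucene
# boolean operator and wrap the group in parentheses
lucene_group <- function(keywords, operator = c("AND", "OR")) {
  operator <- match.arg(operator)
  sprintf("(%s)", paste(keywords, collapse = paste0(" ", operator, " ")))
}

hazards <- lucene_group(c("flood", "earthquake"), "OR")
lq_string <- paste(hazards, "AND drought")
lq_string
```

The result can be supplied as the `lq` entry of `query_list`, alongside `q = ''` as in the examples above.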

## No results

Some of your queries will probably return no results. When there are no results that match your query, the returned list will have a length of 0.

```{r query_sb-empty}
# search for items related to a Water Quality Portal paper DOI
wqp_paper <- query_sb_doi(doi = '10.1002/2016WR019993')
length(wqp_paper)
head(wqp_paper)
# spatial query in the middle of the Atlantic Ocean (about 28.8 N, 41.4 W)
atlantic_ocean <- query_sb_spatial(long=-41.436485, lat=28.790431)
length(atlantic_ocean)
head(atlantic_ocean)
# date query during Marco Polo's life
marco_polo <- query_sb_date(start = as.Date("1254-09-15"),
end = as.Date("1324-01-08"))
length(marco_polo)
head(marco_polo)
```
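
In a scripted workflow it is worth checking for an empty result before using it, since indexing into a zero-length list will fail. A minimal sketch using plain lists in place of real query results; `summarize_results` is a hypothetical helper, not part of `sbtools`.

```r
# Hypothetical helper (not part of sbtools): summarize a list of results,
# treating a zero-length list as "nothing found"
summarize_results <- function(results) {
  if (length(results) == 0) {
    return("No items matched the query.")
  }
  sprintf("%d item(s) matched; first title: %s",
          length(results), results[[1]]$title)
}

empty_summary <- summarize_results(list())
one_summary <- summarize_results(list(list(title = "Example item")))
empty_summary
one_summary
```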