
Commit

updated sst example
cboettig committed Dec 22, 2023
1 parent cab5e6c commit 463871f
Showing 4 changed files with 14 additions and 18 deletions.
4 changes: 2 additions & 2 deletions _freeze/contents/sst/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "3791317cfa5a2112149b2a238a966935",
"hash": "48fdcef8873a460364694658be75d91a",
"result": {
"markdown": "---\ntitle: 'EarthData strategies'\n\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nknitr::opts_chunk$set(message=FALSE, warning=FALSE)\n```\n:::\n\n\n\n\nIn this example, we have over 13K distinct points where turtles have been sampled over many years, and we wish to extract the sea surface temperature at each coordinate point.\nThis task is somewhat different than ones we have considered previously, because instead of extracting data from a continuous cube in space & time, we have only discrete points in space and time we wish to access. Downloading files representing entirety of space or continuous ranges in time is thus particularly inefficient. Here we will try and pluck out only the sample points we need.\n\n\nThis design is a nice chance to illustrate workflows that depend directly on URLs, without leveraging the additional metadata that comes from STAC. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(earthdatalogin)\nlibrary(rstac)\nlibrary(tidyverse)\nlibrary(stars)\nlibrary(tmap)\nlibrary(gdalcubes)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# handle earthdata login\nedl_netrc()\n```\n:::\n\n\nWe begin by reading in our turtle data from a spreadsheet and converting the latitude/longitude columns to a spatial vector (spatial points features) object using `sf`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nturtles <- \n read_csv(\"https://raw.githubusercontent.com/nmfs-opensci/NOAAHackDays-2024/main/r-tutorials/data/occ_all.csv\",\n show_col_types = FALSE) |> \n st_as_sf(coords = c(\"decimalLongitude\", \"decimalLatitude\"))\n\nst_crs(turtles) <- 4326 # lon-lat coordinate system\n\n\ndates <- turtles |> distinct(date) |> pull(date)\n```\n:::\n\n\nWe have 13222 data points, including 1695 unique dates that stretch from 2004-12-14 to 2019-10-17.\n\nLet's take a quick look at the distribution of the data (coloring by date range):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Quick plot of the turtle data\npal <- tmap::tm_scale_ordinal(5)\ntm_basemap(\"CartoDB.DarkMatter\") + \n tm_shape(turtles) + tm_dots(\"date\", fill.scale = pal)\n```\n\n::: {.cell-output-display}\n![](sst_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\n\nSearching for NASA sea-surface-temperature data is somewhat tricky, since the search interfaces take only date ranges, not specific dates. The SST data is organized as one netcdf file with global extent for each day, so we'll have to request all URLs in a date range and then filter those down to the URLs matching the dates we need. \n\nNASA's STAC search is much slower and more prone to server errors than most STAC engines. 
NASA provides its own search API, which is substantially faster and more reliable, though limited to 2000 results per call.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlast_date <- as.character(max(turtles$date))\nresp <- edl_search(short_name = \"MUR-JPL-L4-GLOB-v4.1\",\n temporal = c(\"2015-01-01\", last_date))\nurls <- edl_extract_urls(resp)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Subset urls to only those dates that are found in turtles data please\nurl_dates <- as.Date(gsub(\".*(\\\\d{8})\\\\d{6}.*\", \"\\\\1\", urls), format=\"%Y%m%d\")\nurls <- urls[ url_dates %in% dates ]\n\n# If we didnt' search the full turtle history:\nurl_dates <- as.Date(gsub(\".*(\\\\d{8})\\\\d{6}.*\", \"\\\\1\", urls), format=\"%Y%m%d\")\nmini_turtle <- turtles |> filter(date %in% url_dates)\n```\n:::\n\n\nOkay, we have 924 urls now from which to extract temperatures at our observed turtle coordinates.\n\n## A partial approach, via stars:\n\nThis approach works on a subset of URLs, unfortunately stars is not particularly robust at reading in large numbers of URLs\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsome_urls <- urls[1:50]\nsome_dates <- as.Date(gsub(\".*(\\\\d{8})\\\\d{6}.*\", \"\\\\1\", some_urls), format=\"%Y%m%d\")\n# If we test with a subset of urls, we need to test with a subset of turtles too!\ntiny_turtle <- mini_turtle |> filter(date %in% some_dates)\n\nbench::bench_time({ \n sst <- read_stars(paste0(\"/vsicurl/\", some_urls), \"analysed_sst\", \n #along = list(time = some_dates), ## \n quiet=TRUE)\n st_crs(sst) <- 4326 # Christ, someone omitted CRS from metadata\n # before we can extract on dates, we need to populate this date information\n sst <- st_set_dimensions(sst, \"time\", values = some_dates)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nprocess real \n 6.84s 1.32m \n```\n:::\n\n```{.r .cell-code}\nbench::bench_time({\n turtle_temp <- st_extract(sst, tiny_turtle, time_column = \"date\")\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nprocess real \n 9.71s 2.13m \n```\n:::\n:::\n\n\n\n## gdalcubes -- A more scalable solution\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(gdalcubes)\ngdalcubes_set_gdal_config(\"GDAL_NUM_THREADS\", \"ALL_CPUS\")\ngdalcubes_options(parallel = TRUE)\n```\n:::\n\n\nAccess to NASA's EarthData collection requires an authentication.\nThe `earthdatalogin` package exists only to handle this! \nUnlike `sf`, `terra` etc, the way `gdalcubes` calls `gdal` \ndoes not inherit global environmental variables, so this helper\nfunction sets the configuration.\n \n\n::: {.cell}\n\n```{.r .cell-code}\nearthdatalogin::with_gdalcubes()\n```\n:::\n\n\n\nUnfortunately, NASA's netcdf files lack some typical metadata regarding projection and extent (bounding box) of the data. Some tools are happy to ignore this, just assuming a regular grid, but because GDAL supports explicitly spatial extraction, it wants to know this information. Nor is this information even provided in the STAC entries! Oh well -- here we provide it rather manually using GDAL's \"virtual dataset\" prefix-suffix syntax (e.g. note the `a_srs=OGC:CRS84`), so that GDAL does not complain that the CRS (coordinate reference system) is missing. Additional metadata such as the timestamp for each image is always included in a STAC entry and so can be automatically extracted by `stac_image_collection`. 
(`stars` is more forgiving about letting us tack this information on later.)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvrt <- function(url) {\n prefix <- \"vrt://NETCDF:/vsicurl/\"\n suffix <- \":analysed_sst?a_srs=OGC:CRS84&a_ullr=-180,90,180,-90\"\n paste0(prefix, url, suffix)\n}\n```\n:::\n\n\n\nNow we're good to go. We create the VRT versions of the URLs to define the cube. We can then extract sst data at the point coordinates given by turtle object.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbench::bench_time({\n \ncube <- gdalcubes::stack_cube(vrt(urls), datetime_values = url_dates)\nsst_df <- cube |> extract_geom(mini_turtle, time_column = \"date\")\n\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nprocess real \n 6.73s 2.86m \n```\n:::\n:::\n\n\nThe resulting `data.frame` has the NASA value for SST matching the time and space noted noted in the data. The NetCDF appears to encodes temperatures to two decimal points of accuracy by using integers with a scale factor of 100 (integers are more compact to store than floating points), so we have to convert these. There are also what looks like some spurious negative values that may signal missing data. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# re-attach the spatial information\nturtle_sst <- \n mini_turtle |> \n tibble::rowid_to_column(\"FID\") |>\n inner_join(sst_df, by=\"FID\") |> \n # NA fill and convert to celsius\n mutate(x1 = replace_na(x1, -32768),\n x1 = case_when(x1 < -300 ~ NA, .default = x1),\n nasa_sst = (x1 + 27315) * 0.001)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npal <- tmap::tm_scale_continuous(5, values=\"hcl.blue_red\")\ntm_basemap(\"CartoDB.DarkMatter\") + \n tm_shape(turtle_sst) + tm_dots(\"nasa_sst\", fill.scale = pal)\n```\n\n::: {.cell-output-display}\n![](sst_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::",
"markdown": "---\ntitle: 'EarthData strategies'\n\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nknitr::opts_chunk$set(message=FALSE, warning=FALSE)\n```\n:::\n\n\n\n\nIn this example, we have over 13K distinct points where turtles have been sampled over many years, and we wish to extract the sea surface temperature at each coordinate point.\nThis task is somewhat different than ones we have considered previously, because instead of extracting data from a continuous cube in space & time, we have only discrete points in space and time we wish to access. Downloading files representing entirety of space or continuous ranges in time is thus particularly inefficient. Here we will try and pluck out only the sample points we need.\n\n\nThis design is a nice chance to illustrate workflows that depend directly on URLs, without leveraging the additional metadata that comes from STAC. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(earthdatalogin)\nlibrary(rstac)\nlibrary(tidyverse)\nlibrary(stars)\nlibrary(tmap)\nlibrary(gdalcubes)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# handle earthdata login\nedl_netrc()\n```\n:::\n\n\nWe begin by reading in our turtle data from a spreadsheet and converting the latitude/longitude columns to a spatial vector (spatial points features) object using `sf`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nturtles <- \n read_csv(\"https://raw.githubusercontent.com/nmfs-opensci/NOAAHackDays-2024/main/r-tutorials/data/occ_all.csv\",\n show_col_types = FALSE) |> \n st_as_sf(coords = c(\"decimalLongitude\", \"decimalLatitude\"))\n\nst_crs(turtles) <- 4326 # lon-lat coordinate system\n\n\ndates <- turtles |> distinct(date) |> pull(date)\n```\n:::\n\n\nWe have 13222 data points, including 1695 unique dates that stretch from 2004-12-14 to 2019-10-17.\n\nLet's take a quick look at the distribution of the data (coloring by date range):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Quick plot of the turtle data\npal <- tmap::tm_scale_ordinal(5)\ntm_basemap(\"CartoDB.DarkMatter\") + \n tm_shape(turtles) + tm_dots(\"date\", fill.scale = pal)\n```\n\n::: {.cell-output-display}\n![](sst_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\n\nSearching for NASA sea-surface-temperature data is somewhat tricky, since the search interfaces take only date ranges, not specific dates. The SST data is organized as one netcdf file with global extent for each day, so we'll have to request all URLs in a date range and then filter those down to the URLs matching the dates we need. \n\nNASA's STAC search is unfortunately much slower and more prone to server errors than most STAC engines at present. NASA provides its own search API, which is substantially faster and more reliable, but does not provide metadata in a widely recognized standard. 
Here we will get 1000s of URLs, covering every day in this nearly 15 year span.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurls <- edl_search(short_name = \"MUR-JPL-L4-GLOB-v4.1\",\n temporal = c(min(turtles$date), max(turtles$date)))\n```\n:::\n\n\nNow we subset URLs to only those dates that are found in turtles data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl_dates <- as.Date(gsub(\".*(\\\\d{8})\\\\d{6}.*\", \"\\\\1\", urls), format=\"%Y%m%d\")\nurls <- urls[ url_dates %in% dates ]\n```\n:::\n\n\nOkay, we have 1695 URLs now from which to extract temperatures at our observed turtle coordinates.\n\n## A partial approach, via stars:\n\nThis approach works on a subset of URLs, unfortunately stars is not particularly robust at reading in large numbers of URLs\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsome_urls <- urls[1:50]\nsome_dates <- as.Date(gsub(\".*(\\\\d{8})\\\\d{6}.*\", \"\\\\1\", some_urls), format=\"%Y%m%d\")\n# If we test with a subset of urls, we need to test with a subset of turtles too!\ntiny_turtle <- turtles |> filter(date %in% some_dates)\n\nbench::bench_time({ \n sst <- read_stars(paste0(\"/vsicurl/\", some_urls), \"analysed_sst\", \n #along = list(time = some_dates), ## \n quiet=TRUE)\n st_crs(sst) <- 4326 # Christ, someone omitted CRS from metadata\n # before we can extract on dates, we need to populate this date information\n sst <- st_set_dimensions(sst, \"time\", values = some_dates)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nprocess real \n 5.58s 1.16m \n```\n:::\n\n```{.r .cell-code}\nbench::bench_time({\n turtle_temp <- st_extract(sst, tiny_turtle, time_column = \"date\")\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nprocess real \n 15.73s 4.58m \n```\n:::\n:::\n\n\n\n## gdalcubes -- A more scalable solution\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(gdalcubes)\ngdalcubes_set_gdal_config(\"GDAL_NUM_THREADS\", \"ALL_CPUS\")\ngdalcubes_options(parallel = TRUE)\n```\n:::\n\n\nAccess to NASA's EarthData collection requires an authentication.\nThe `earthdatalogin` package exists only to handle this! \nUnlike `sf`, `terra` etc, the way `gdalcubes` calls `gdal` \ndoes not inherit global environmental variables, so this helper\nfunction sets the configuration.\n \n\n::: {.cell}\n\n```{.r .cell-code}\nearthdatalogin::with_gdalcubes()\n```\n:::\n\n\n\nUnfortunately, NASA's netcdf files lack some typical metadata regarding projection and extent (bounding box) of the data. Some tools are happy to ignore this, just assuming a regular grid, but because GDAL supports explicitly spatial extraction, it wants to know this information. Nor is this information even provided in the STAC entries! Oh well -- here we provide it rather manually using GDAL's \"virtual dataset\" prefix-suffix syntax (e.g. note the `a_srs=OGC:CRS84`), so that GDAL does not complain that the CRS (coordinate reference system) is missing. Additional metadata such as the timestamp for each image is always included in a STAC entry and so can be automatically extracted by `stac_image_collection`. (`stars` is more forgiving about letting us tack this information on later.)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvrt <- function(url) {\n prefix <- \"vrt://NETCDF:/vsicurl/\"\n suffix <- \":analysed_sst?a_srs=OGC:CRS84&a_ullr=-180,90,180,-90\"\n paste0(prefix, url, suffix)\n}\n```\n:::\n\n\n\nNow we're good to go. We create the VRT versions of the URLs to define the cube. 
We can then extract sst data at the point coordinates given by turtle object.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndatetime <- as.Date(gsub(\".*(\\\\d{8})\\\\d{6}.*\", \"\\\\1\", urls), format=\"%Y%m%d\")\ncube <- gdalcubes::stack_cube(vrt(urls), datetime_values = datetime)\n\nbench::bench_time({\n \nsst_df <- cube |> extract_geom(turtles, time_column = \"date\")\n\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nprocess real \n 8.65s 4.51m \n```\n:::\n:::\n\n\nThe resulting `data.frame` has the NASA value for SST matching the time and space noted noted in the data. The NetCDF appears to encodes temperatures to two decimal points of accuracy by using integers with a scale factor of 100 (integers are more compact to store than floating points), so we have to convert these. There are also what looks like some spurious negative values that may signal missing data. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# re-attach the spatial information\nturtle_sst <- \n turtles |> \n tibble::rowid_to_column(\"FID\") |>\n inner_join(sst_df, by=\"FID\") |> \n # NA fill and convert to celsius\n mutate(x1 = replace_na(x1, -32768),\n x1 = case_when(x1 < -300 ~ NA, .default = x1),\n nasa_sst = (x1 + 27315) * 0.001)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npal <- tmap::tm_scale_continuous(5, values=\"hcl.blue_red\")\ntm_basemap(\"CartoDB.DarkMatter\") + \n tm_shape(turtle_sst) + tm_dots(\"nasa_sst\", fill.scale = pal)\n```\n\n::: {.cell-output-display}\n![](sst_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::",
"supporting": [
"sst_files"
],
Binary file modified _freeze/contents/sst/figure-html/unnamed-chunk-13-1.png
Binary file modified _freeze/contents/sst/figure-html/unnamed-chunk-14-1.png
28 changes: 12 additions & 16 deletions contents/sst.qmd
@@ -60,28 +60,21 @@ tm_basemap("CartoDB.DarkMatter") +

Searching for NASA sea-surface-temperature data is somewhat tricky, since the search interfaces take only date ranges, not specific dates. The SST data is organized as one netcdf file with global extent for each day, so we'll have to request all URLs in a date range and then filter those down to the URLs matching the dates we need.

-NASA's STAC search is much slower and more prone to server errors than most STAC engines. NASA provides its own search API, which is substantially faster and more reliable, though limited to 2000 results per call.
+NASA's STAC search is unfortunately much slower and more prone to server errors than most STAC engines at present. NASA provides its own search API, which is substantially faster and more reliable, but does not provide metadata in a widely recognized standard. Here we will get 1000s of URLs, covering every day in this nearly 15 year span.

```{r}
-last_date <- as.character(max(turtles$date))
-resp <- edl_search(short_name = "MUR-JPL-L4-GLOB-v4.1",
-                   temporal = c("2015-01-01", last_date))
-urls <- edl_extract_urls(resp)
+urls <- edl_search(short_name = "MUR-JPL-L4-GLOB-v4.1",
+                   temporal = c(min(turtles$date), max(turtles$date)))
```
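
As an editorial aside (not part of the diff): the change above suggests that `edl_search()` now returns granule URLs directly, so the separate `edl_extract_urls()` step is dropped. A minimal sanity check, assuming that return type (an inference from this diff, not verified against the `earthdatalogin` documentation), might look like:

```r
# Assumption: edl_search() now returns a character vector of granule URLs
stopifnot(is.character(urls))
head(urls, 3)   # e.g. ".../20150101090000-JPL-L4_GHRSST-...-MUR-GLOB-...fv04.1.nc" (illustrative)
length(urls)    # roughly one URL per day in the requested range
```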

+Now we subset URLs to only those dates that are found in turtles data.

```{r}
-# Subset urls to only those dates that are found in turtles data please
url_dates <- as.Date(gsub(".*(\\d{8})\\d{6}.*", "\\1", urls), format="%Y%m%d")
urls <- urls[ url_dates %in% dates ]
-# If we didnt' search the full turtle history:
-url_dates <- as.Date(gsub(".*(\\d{8})\\d{6}.*", "\\1", urls), format="%Y%m%d")
-mini_turtle <- turtles |> filter(date %in% url_dates)
```
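
A further aside (again, not part of the diff): the `gsub()` pattern used for the date parsing is dense, so here is a small sketch of what it extracts, using a hypothetical MUR-style filename (the real URLs follow a similar `YYYYMMDDHHMMSS-...` pattern, but this string is illustrative only):

```r
# Hypothetical filename for illustration
u <- "20150101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"

# Capture the 8-digit date (YYYYMMDD) and discard the trailing 6-digit time (HHMMSS)
gsub(".*(\\d{8})\\d{6}.*", "\\1", u)
#> "20150101"

as.Date(gsub(".*(\\d{8})\\d{6}.*", "\\1", u), format = "%Y%m%d")
#> "2015-01-01"
```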

-Okay, we have `r length(urls)` urls now from which to extract temperatures at our observed turtle coordinates.
+Okay, we have `r length(urls)` URLs now from which to extract temperatures at our observed turtle coordinates.

## A partial approach, via stars:

@@ -92,7 +85,7 @@ This approach works on a subset of URLs, unfortunately stars is not particularl
some_urls <- urls[1:50]
some_dates <- as.Date(gsub(".*(\\d{8})\\d{6}.*", "\\1", some_urls), format="%Y%m%d")
# If we test with a subset of urls, we need to test with a subset of turtles too!
-tiny_turtle <- mini_turtle |> filter(date %in% some_dates)
+tiny_turtle <- turtles |> filter(date %in% some_dates)
bench::bench_time({
sst <- read_stars(paste0("/vsicurl/", some_urls), "analysed_sst",
@@ -143,10 +136,13 @@ Now we're good to go. We create the VRT versions of the URLs to define the cube.


```{r}
+datetime <- as.Date(gsub(".*(\\d{8})\\d{6}.*", "\\1", urls), format="%Y%m%d")
+cube <- gdalcubes::stack_cube(vrt(urls), datetime_values = datetime)
bench::bench_time({
-cube <- gdalcubes::stack_cube(vrt(urls), datetime_values = url_dates)
-sst_df <- cube |> extract_geom(mini_turtle, time_column = "date")
+sst_df <- cube |> extract_geom(turtles, time_column = "date")
})
```
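
One more aside: the structure of `sst_df` is worth checking before the join that follows. Assuming `extract_geom()` returns one row per matched feature/date keyed by `FID`, with the band stored as `x1` (an assumption consistent with the `inner_join(sst_df, by = "FID")` and `mutate(x1 = ...)` below, not something stated in this diff), a quick check might be:

```r
# Expected columns (assumption): FID, time, x1 (integer-scaled analysed_sst)
head(sst_df)
dplyr::n_distinct(sst_df$FID)   # how many turtle records matched a granule date
```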
@@ -157,7 +153,7 @@ The resulting `data.frame` has the NASA value for SST matching the time and spac
```{r}
# re-attach the spatial information
turtle_sst <-
-  mini_turtle |>
+  turtles |>
tibble::rowid_to_column("FID") |>
inner_join(sst_df, by="FID") |>
# NA fill and convert to celsius
