
Add guidance on how to best use STAC and Zarr together to the Cloud-Optimized Geospatial Formats Guide #134

maxrjones opened this issue Feb 10, 2025 · 1 comment

@maxrjones (Collaborator):

Duplicate of NASA-IMPACT/veda-odd#81 at @wildintellect's request

@gadomski's talk at the Pangeo showcase prompted a great question and subsequent discussion about how best to use Zarr with STAC. Questions about using Zarr with STAC come up often, so guidance on this topic would really help the migration to cloud-native file formats. I think it would be great to use notes from today's discussion as the starting point for a PR to the Cloud-Optimized Geospatial Formats Guide. The PR would add a framework for deciding when and how to use Zarr with STAC. We could then share the PR with the broader STAC, Pangeo, and CNG communities for comments, feedback, and additions.

@jsignell (Collaborator):

Ok here is what I have thought through so far:

STAC (SpatioTemporal Asset Catalogs) is a specification for defining and searching any type of data that has spatial and temporal dimensions. STAC has seen significant adoption in the earth observation community. Zarr is a specification for storing groups of cloud-optimized arrays, and has been adopted by the earth modeling community (led by Pangeo). Both STAC and Zarr offer a flexible nested structure with arbitrary metadata at a variety of levels -- for STAC: catalog, collection, item, asset; for Zarr: group, array (and maybe, in the future, chunk via icechunk/kerchunk?). This flexibility has contributed to their popularity, but can also make it hard to tell what each is designed to be particularly good at.

Comparison table

| STAC | Zarr |
| --- | --- |
| for data with spatial and temporal dimensions | for groups of arrays |
| supports arbitrary metadata for catalogs, collections, items, assets | supports arbitrary metadata for groups, arrays |
| good at search and discovery | good at filtering within a dataset |
| searching returns items | filtering returns chunks |
| stores metadata separately from data | stores metadata alongside data (except when virtualized) |
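The "filtering returns chunks" distinction can be made concrete. Here is a minimal sketch (not any real library's internals; the chunk sizes are made up) of how a chunked store like Zarr maps a requested slice onto chunk indices, so a reader only fetches the chunks it needs:

```python
# Illustrative sketch: how a Zarr-style reader maps a requested slice onto
# chunk indices, so that "filtering returns chunks" rather than whole files.

def chunks_for_slice(start, stop, chunk_size):
    """Return the chunk indices covering the half-open range [start, stop)."""
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))

# A time axis of 365 steps stored in chunks of 100:
# reading steps 90..210 only touches chunks 0, 1, and 2.
print(chunks_for_slice(90, 210, 100))
```

A real reader repeats this per dimension and takes the cross product of chunk indices, but the core arithmetic is this simple.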

Note: this mostly concerns the STAC spec and the STAC API spec, but STAC also has extensions, which provide an ontology of domain-specific metadata as well as a mechanism for adding new metadata fields. This is loosely analogous to CF conventions.
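To make the extension mechanism concrete, here is a hand-written, illustrative STAC item (all hrefs and values are made up) that describes a Zarr store using the datacube extension's `cube:dimensions` field:

```python
import json

# Hand-written illustration (all values are made up): a minimal STAC item
# describing a Zarr store with the datacube extension's "cube:dimensions".
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        "https://stac-extensions.github.io/datacube/v2.2.0/schema.json"
    ],
    "id": "example-zarr",
    "geometry": None,
    "properties": {
        "datetime": None,
        "start_datetime": "2020-01-01T00:00:00Z",
        "end_datetime": "2020-12-31T00:00:00Z",
        "cube:dimensions": {
            "time": {"type": "temporal",
                     "extent": ["2020-01-01T00:00:00Z", "2020-12-31T00:00:00Z"]},
            "lon": {"type": "spatial", "axis": "x", "extent": [-180, 180]},
            "lat": {"type": "spatial", "axis": "y", "extent": [-90, 90]},
        },
    },
    "assets": {
        "data": {"href": "s3://bucket/example.zarr",
                 "roles": ["data"]},
    },
    "links": [],
}

print(json.dumps(item, indent=2)[:60])
```

The point is that the datacube extension lets a catalog describe the internal dimensions of an array store without the client having to open it.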

Some additional thoughts:

  • STAC makes it easy to separate the concerns of who can search and find data from who can read data. This is useful when you want to allow access only via defined interfaces (like titiler or tipg).
  • In order to read a subset of a file using STAC, you always need to touch that file -- for instance, you need to read the header of a COG. This is different from a virtualized Zarr dataset, where you can access a specific chunk directly.
  • Zarr is good for when data tends to be aligned.
    • It is an antipattern to put all your data into one Zarr store (Tom Nicholas).
    • Put things in one icechunk store that you expect to update together (icechunk docs).
  • STAC is good for level 1 and 2 data (Pete Gadomski).

Interface

People familiar with xarray will likely want to get into xarray as quickly as possible and won't mind doing more filtering once they are there.

People more familiar with STAC will likely want to use STAC tooling to do searching, so it might make sense to push STAC filters down into Zarr (maybe using a tool like xpublish?).

So there should be good tooling to go back and forth between STAC and Zarr.
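As a rough sketch of what "pushing STAC filters down into Zarr" could mean, the snippet below translates STAC search parameters (a bbox and a datetime interval) into xarray-style `.sel()` keyword arguments. The dimension names (`lon`, `lat`, `time`) are assumptions about the target dataset; real tooling would also have to handle CRS, descending coordinates, open-ended intervals, and so on:

```python
# Sketch: translate STAC search parameters into xarray-style .sel() kwargs.
# Dimension names ("lon", "lat", "time") are assumptions about the dataset.

def stac_filter_to_sel(bbox=None, datetime=None):
    sel = {}
    if bbox is not None:
        west, south, east, north = bbox
        sel["lon"] = slice(west, east)
        sel["lat"] = slice(south, north)
    if datetime is not None:
        start, end = datetime.split("/")
        sel["time"] = slice(start, end)
    return sel

kwargs = stac_filter_to_sel(bbox=[-10, 40, 5, 55],
                            datetime="2020-06-01/2020-06-30")
# With a real dataset you would then call ds.sel(**kwargs).
print(kwargs)
```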

Here is what exists so far:

Accessing Zarr with STAC-like patterns

Think of this as an n-dimensional data cube of regularly gridded model output.

For data producers:

  • If you have an xarray object you can write out STAC to represent it (xstac)
    • Exposes variables to STAC using datacube extension
    • Includes kwargs that let xarray know how to open the dataset (xarray extension)
    • Includes ideas for how to store kerchunk in STAC
      • Comment from Tom Augspurger: we want to avoid JSON strings inside the pgstac as much as possible so that we can take advantage of pgstac's dehydration
  • FUTURE: (once STAC spec v1.1.0 is fully incorporated in tooling) Take advantage of inheritance to reduce duplication at the item and asset levels.

For data consumers:

  • If the zarr dataset is already cataloged in STAC following xstac conventions, ideally you can go straight to xarray using xpystac backend
  • If the zarr dataset is not cataloged in STAC I don't think anything currently exists. If we want to pursue this there are two paths: translate STAC requests on the fly directly into xarray, or do a one-time step similar to virtualization to generate an ad-hoc static STAC object.

Accessing STAC with xarray patterns

Think of this as a bunch of COGs at the same place over a period of time.

For data producers:

  • Take advantage of roles to make it clear which assets should be included in a data cube.
  • FUTURE: Potentially provide a virtual reference file to make the intended stacking explicit

For data consumers:

  • If the dataset is array-like, ideally you can go straight to xarray using the xpystac backend or odc-stac or stackstac directly.
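A conceptual sketch of what these stacking tools do with a pile of COG items: pick the assets marked with the `data` role and order them by item datetime to form the time axis of the cube. The item dicts below are made up for illustration; odc-stac and stackstac do far more (reprojection, chunking, lazy loading), but asset roles driving the selection is the key idea:

```python
# Conceptual sketch of stacking STAC items into a cube's time axis:
# select assets by role and sort by item datetime. Items are made up.

def stacking_order(items, role="data"):
    """Return (datetime, href) pairs for the cube's time axis, sorted."""
    layers = []
    for item in items:
        for asset in item["assets"].values():
            if role in asset.get("roles", []):
                layers.append((item["properties"]["datetime"], asset["href"]))
    return sorted(layers)

items = [
    {"properties": {"datetime": "2020-01-02T00:00:00Z"},
     "assets": {"b1": {"href": "s3://bucket/t2.tif", "roles": ["data"]},
                "thumb": {"href": "s3://bucket/t2.png", "roles": ["thumbnail"]}}},
    {"properties": {"datetime": "2020-01-01T00:00:00Z"},
     "assets": {"b1": {"href": "s3://bucket/t1.tif", "roles": ["data"]}}},
]
print(stacking_order(items))
```

Note the thumbnail asset is skipped: this is why making roles explicit (as suggested above for data producers) matters.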

Storing results:

This is a newer area of development. The core idea is that instead of repeatedly querying you could store the results for easy access.

  • Store the result of a searching STAC:
    • stac-geoparquet (using stacrs)
    • FUTURE: (maybe if array-like): icechunk or kerchunk (using virtualizarr STAC reader + xpystac)
  • Store the result of filtering Zarr:
    • icechunk or kerchunk (using virtualizarr)
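A stdlib-only sketch of the stac-geoparquet idea: flatten the items returned by a STAC search into one row per item so the result can be stored and re-read without re-querying. Real stac-geoparquet writes GeoParquet with a proper geometry column; this only shows the flattening step, on made-up items:

```python
# Sketch of the stac-geoparquet idea: one row per STAC item, with item
# properties promoted to columns. Items below are made up for illustration.

def items_to_rows(items):
    rows = []
    for item in items:
        row = {"id": item["id"], "geometry": item.get("geometry")}
        row.update(item.get("properties", {}))
        rows.append(row)
    return rows

items = [
    {"id": "scene-1", "geometry": None,
     "properties": {"datetime": "2020-01-01T00:00:00Z", "eo:cloud_cover": 10}},
    {"id": "scene-2", "geometry": None,
     "properties": {"datetime": "2020-01-02T00:00:00Z", "eo:cloud_cover": 80}},
]
rows = items_to_rows(items)
print(rows[0]["id"], rows[0]["eo:cloud_cover"])
```

Once flattened like this, the search result is a table: it can be filtered by column (e.g. cloud cover) without touching the STAC API again.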

Based on the following discussions:

Other Comments:

STAC is also an extremely verbose representation of this metadata for regularly gridded datasets. Virtualization with kerchunk or VirtualiZarr is often a much more efficient representation of the individual underlying files, and xarray over a virtual dataset provides an excellent "query" experience.

Sean Harkins

Same theme that comes up often: where is the line between the external metadata that belongs in a catalog and the internal structure of a dataset? Zarr, for instance, has a hierarchy with metadata at every level -- should it be duplicated in a catalog for the sake of searchability? Sometimes...

Martin Durant

@jsignell jsignell self-assigned this Feb 12, 2025