Add ability to upload STAC collections and STAC items to a STAC API #16

Closed
15 tasks done
JohanKJSchreurs opened this issue Feb 12, 2024 · 1 comment
JohanKJSchreurs commented Feb 12, 2024

Add ability to upload the created STAC collections and STAC items to a STAC API

We will use this for the HRL VPP data in this ticket: Open-EO/openeo-geopyspark-driver#460

Breakdown of requirements

Input side

  • Tool needs to read products from OpenSearch / terracatalogueclient.
    • Tool can read these products piecemeal, because the API limits the number of products returned by one query.
  • Tool needs to create a STAC collection based on the metadata of the OpenSearch collection.
  • Tool needs to map the products to STAC Items for the collection.
    • To create the STAC items, it also needs to group the bands (products) that belong to one STAC item as assets of that item.
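
The grouping step in the last bullet could be sketched roughly as follows. This is only an illustration: the `Product` fields, the band names and the grouping key (tile plus start time) are assumptions made for the example, not the actual HR VPP metadata model or the tool's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Product:
    """Hypothetical, simplified view of one OpenSearch product (one band)."""

    product_id: str
    tile: str
    start: datetime
    band: str
    href: str


def group_products_into_items(products):
    """Group single-band products that share the same tile and start time.

    Returns a dict mapping an item ID to {band name: href}, i.e. the
    assets that would end up in one STAC item.
    """
    items = defaultdict(dict)
    for product in products:
        # Assumed grouping key: products of one item share tile and start time.
        item_id = f"{product.tile}_{product.start:%Y%m%d}"
        items[item_id][product.band] = product.href
    return dict(items)


# Made-up example data: two bands for one tile/date, one band for another tile.
products = [
    Product("p1", "T31UFS", datetime(2021, 1, 1), "band_a", "https://example.com/p1.tif"),
    Product("p2", "T31UFS", datetime(2021, 1, 1), "band_b", "https://example.com/p2.tif"),
    Product("p3", "T31UES", datetime(2021, 1, 1), "band_a", "https://example.com/p3.tif"),
]
grouped = group_products_into_items(products)
```

Here `grouped` holds two entries: one item with two assets and one item with a single asset. Each entry could then be turned into a STAC item with one asset per band.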

Some Technical Points

Because stac-catalog-builder is intended to be reasonably generic, its structure must support two types of input:

  • A directory of asset files (currently only GeoTIFF supported).
  • The OpenSearch input via terracatalogueclient.

And similarly, on the output side it should support creating both:

  • Static STAC collections/catalogs, as the output for a set of GeoTIFFs.
  • Collections and items uploaded to a STAC API. See the section below for details: Uploading to the STAC API, for HR VPP.

Output side, Uploading to the STAC API, for HR VPP:

  • Tool can authenticate via OIDC to connect to the STAC-API.
  • Tool can read the settings for authentication from a file or environment variables to protect sensitive info.
  • Tool can POST and PUT a STAC collection to the STAC API.
    • Tool can add the necessary info to the collection that allows STAC-API to authorize who can read/write the collection.
  • Tool can POST and PUT STAC items to the STAC API:
    • Implemented but not tested yet.
  • Should have, to keep the upload fast enough: tool can also upload STAC items in bulk via the corresponding STAC API extension.
    • Technically it would work without this, but it would probably be slow.
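
As a rough sketch of the output-side requirements above, the snippet below shows how settings could be read from environment variables, how the choice between POST and PUT could be made, and how items could be split into batches for a bulk upload. The environment variable names, function names and batch size are hypothetical. The URL paths follow the STAC API Transaction extension; whether the target STAC API exposes exactly these endpoints is an assumption. No HTTP request or OIDC flow is performed here.

```python
import os


def load_settings(env=os.environ):
    """Read connection settings from environment variables.

    The variable names are made up for this sketch.
    """
    return {
        "api_url": env["STAC_API_URL"].rstrip("/"),
        "client_id": env["OIDC_CLIENT_ID"],
        "client_secret": env["OIDC_CLIENT_SECRET"],
    }


def collection_request(api_url, collection, exists):
    """Return (method, url): PUT updates an existing collection, POST creates one."""
    if exists:
        return "PUT", f"{api_url}/collections/{collection['id']}"
    return "POST", f"{api_url}/collections"


def item_request(api_url, collection_id, item, exists):
    """Return (method, url) for creating or updating one STAC item."""
    base = f"{api_url}/collections/{collection_id}/items"
    if exists:
        return "PUT", f"{base}/{item['id']}"
    return "POST", base


def chunked(items, size):
    """Yield successive batches of at most `size` items, e.g. for a bulk upload."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Example with an explicit dict instead of the real environment.
settings = load_settings({
    "STAC_API_URL": "https://stac.example.com/",
    "OIDC_CLIENT_ID": "my-client",
    "OIDC_CLIENT_SECRET": "not-a-real-secret",
})
create = collection_request(settings["api_url"], {"id": "hrvpp"}, exists=False)
update = item_request(settings["api_url"], "hrvpp", {"id": "item-1"}, exists=True)
batches = list(chunked([{"id": f"item-{n}"} for n in range(5)], size=2))
```

Keeping the secrets in environment variables (or in a file outside version control) means they never need to appear in the collection configuration itself.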
JohanKJSchreurs self-assigned this Feb 13, 2024
JohanKJSchreurs added a commit that referenced this issue Feb 20, 2024
…nd save intermediate vpp metadata to geoparquet
JohanKJSchreurs added a commit that referenced this issue Feb 21, 2024
…w collections: saving 10k to 100k STAC items without linking to collection
JohanKJSchreurs added a commit that referenced this issue Feb 22, 2024
We retrieve products from VPP piecemeal, because there is a limit on the number of products you can get in one query.
The client doesn't have full support for pagination, so we adapt our query to limit the number of products per query.
This divides the total temporal extent of the collection into small enough time slots.
But some products have a long enough period (start/end datetime) that they span several of the time slots we request.
That led to the products being received multiple times.

Solution:
We now check if we already have the product and skip it when it is a duplicate.
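
A minimal sketch of the idea described in this commit message, under stated assumptions: `query` is a hypothetical stand-in for the catalogue client call, products are plain dicts with an `id`, and the slot length is arbitrary. It is not the code from the commit.

```python
from datetime import datetime, timedelta


def time_slots(start, end, step):
    """Yield consecutive (slot_start, slot_end) pairs covering [start, end)."""
    current = start
    while current < end:
        slot_end = min(current + step, end)
        yield current, slot_end
        current = slot_end


def fetch_all(query, start, end, step):
    """Query each time slot and skip products that were already received."""
    seen_ids = set()
    products = []
    for slot_start, slot_end in time_slots(start, end, step):
        for product in query(slot_start, slot_end):
            if product["id"] in seen_ids:
                continue  # duplicate: the product spans more than one slot
            seen_ids.add(product["id"])
            products.append(product)
    return products


# Made-up products; "long" spans two of the 10-day slots requested below.
PRODUCTS = [
    {"id": "short", "start": datetime(2021, 1, 2), "end": datetime(2021, 1, 3)},
    {"id": "long", "start": datetime(2021, 1, 5), "end": datetime(2021, 1, 20)},
]


def fake_query(slot_start, slot_end):
    """Return every product whose period overlaps the requested slot."""
    return [p for p in PRODUCTS if p["start"] < slot_end and p["end"] > slot_start]


found = fetch_all(fake_query, datetime(2021, 1, 1), datetime(2021, 2, 1), timedelta(days=10))
```

Without the `seen_ids` check, the product named `long` would appear twice in `found`, once for each slot it overlaps.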
JohanKJSchreurs added a commit that referenced this issue Feb 27, 2024
…nd save intermediate vpp metadata to geoparquet
JohanKJSchreurs added a commit that referenced this issue Feb 27, 2024
…w collections: saving 10k to 100k STAC items without linking to collection
JohanKJSchreurs added a commit that referenced this issue Feb 27, 2024
We retrieve products from VPP piecemeal, because there is a limit on the number of products you can get in one query.
The client doesn't have full support for pagination, so we adapt our query to limit the number of products per query.
This divides the total temporal extent of the collection into small enough time slots.
But some products have a long enough period (start/end datetime) that they span several of the time slots we request.
That led to the products being received multiple times.

Solution:
We now check if we already have the product and skip it when it is a duplicate.
JohanKJSchreurs added a commit that referenced this issue Feb 28, 2024
JohanKJSchreurs added a commit that referenced this issue Mar 5, 2024
JohanKJSchreurs added a commit that referenced this issue Mar 5, 2024
JohanKJSchreurs added a commit that referenced this issue Mar 6, 2024
JohanKJSchreurs added a commit that referenced this issue Mar 6, 2024
@JohanKJSchreurs

Implemented by the following two PRs:
