replace placeholders with extended variant concept in DataCatalog #965

Open
DirkEilander opened this issue May 29, 2024 · 3 comments
Labels
Enhancement New feature or request V1

Comments

@DirkEilander
Contributor

Kind of request

Changing existing functionality

Enhancement Description

  • Placeholders allow multiple sources to be defined in a single, short data catalog item. Each source gets its own unique name, but all parameters except the file paths must be the same.
  • Variants allow for sources with different versions or providers. Other variant dimensions are conceivable, e.g. data resolution or model (e.g. CMIP).
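
To make the placeholder idea concrete, here is a minimal sketch of how one templated item could expand into several named sources. It is not HydroMT's implementation: `expand_placeholders`, the `path` field, and the `era5_{variable}` example are assumptions made up for illustration.

```python
from itertools import product


def expand_placeholders(name, source, placeholders):
    """Expand one templated catalog item into several concrete sources.

    Hypothetical helper: only the name and the path are templated,
    every other parameter is shared by all expanded sources.
    """
    keys = list(placeholders)
    expanded = {}
    for values in product(*(placeholders[k] for k in keys)):
        mapping = dict(zip(keys, values))
        item = dict(source)  # shallow copy; shared parameters stay identical
        item["path"] = source["path"].format(**mapping)
        expanded[name.format(**mapping)] = item
    return expanded


# Illustrative values only, not taken from a real catalog.
sources = expand_placeholders(
    "era5_{variable}",
    {"data_type": "RasterDataset", "path": "/data/era5/{variable}.nc"},
    {"variable": ["precip", "temp"]},
)
```

This shows the limitation named above: every expanded source inherits exactly the same parameters, so small per-file differences cannot be expressed.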

Use case

We should discuss whether we want to merge these concepts to offer a simpler interface for users.

Additional Context

No response

@DirkEilander DirkEilander added the Enhancement (New feature or request) and Needs refinement (issue still needs refinement) labels May 29, 2024
@DirkEilander DirkEilander added this to the v1.0 beta milestone May 29, 2024
@Jaapel
Contributor

Jaapel commented Jun 6, 2024

Suggested yaml:

my_nice_source:
  data_type: DataFrame  # not editable by variants
  ...
  variant_keys:
    - metadata.provider
    - metadata.crs
    - driver.filesystem
  variants:
    - uri: s3://bucket/key1/key2.json  # required for all variants.
      metadata:
        crs: 4326
        provider: organisation1
      driver:
        filesystem: s3
    - uri: /mnt/p/cooldata.json
      metadata:
        crs: 90002
        provider: organization2
      driver:
        filesystem: local
      default_variant: True

Here, variant_keys are the keys that uniquely identify a variant; each of them must be present in every variant definition. Dots in variant_keys denote nested fields. Other fields, such as uri, can overwrite the source definition. If no variant is requested, the default variant is used, which is flagged by the default_variant key. All variants must be of the same data type, so data_type cannot be overwritten; all other fields can.
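
The rules above could be sketched roughly as follows, assuming the yaml has already been parsed into a plain dict. The helper names (`get_nested`, `deep_merge`, `expand_variants`) and the shared `driver.name` value are hypothetical and only serve the illustration; none of this is existing HydroMT API.

```python
from copy import deepcopy


def get_nested(mapping, dotted_key):
    """Look up 'a.b' as mapping['a']['b']."""
    value = mapping
    for part in dotted_key.split("."):
        value = value[part]
    return value


def deep_merge(base, override):
    """Return a copy of base with override merged in recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def expand_variants(source):
    """Turn one source entry into a list of fully resolved variants."""
    base = {k: v for k, v in source.items() if k not in ("variants", "variant_keys")}
    resolved = []
    for variant in source["variants"]:
        if "data_type" in variant:
            raise ValueError("data_type cannot be overwritten by a variant")
        if "uri" not in variant:
            raise ValueError("every variant needs a uri")
        for key in source["variant_keys"]:
            try:
                get_nested(variant, key)
            except (KeyError, TypeError):
                raise ValueError(f"variant is missing variant key '{key}'") from None
        resolved.append(deep_merge(base, variant))
    return resolved


source = {
    "data_type": "DataFrame",
    "driver": {"name": "pandas"},  # assumed shared setting, for illustration
    "variant_keys": ["metadata.provider", "driver.filesystem"],
    "variants": [
        {
            "uri": "s3://bucket/key1/key2.json",
            "metadata": {"provider": "organisation1"},
            "driver": {"filesystem": "s3"},
        },
        {
            "uri": "/mnt/p/cooldata.json",
            "metadata": {"provider": "organization2"},
            "driver": {"filesystem": "local"},
            "default_variant": True,
        },
    ],
}
variants = expand_variants(source)
default = next(v for v in variants if v.get("default_variant"))
```

Merging recursively means a variant that only sets `driver.filesystem` keeps the other driver settings of the source definition.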

@DirkEilander
Contributor Author

DirkEilander commented Jun 7, 2024

Also discussed: DataCatalog._sources should become a dictionary of lists holding all variants (instead of the current nested dict), from which we find the requested variant by filtering. To request a specific variant, a dictionary with the source name and the variant keys with their associated values is passed to the data_like argument of DataCatalog.get_rasterdataset (and similar methods), see below. If no unique variant is found, an error is raised.

da = data_catalog.get_rasterdataset(
    data_like={"source": "my_nice_source", "metadata.crs": 4326},
    ...
)
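
A minimal sketch of the filtering step, assuming the sources are stored as a plain dict of lists of already resolved variants. `select_variant`, `lookup`, and the `sources` dict are hypothetical stand-ins for illustration, not the actual DataCatalog internals.

```python
_MISSING = object()


def lookup(mapping, dotted_key):
    """Return the value at a dotted key, or _MISSING if any part is absent."""
    value = mapping
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return _MISSING
        value = value[part]
    return value


def select_variant(sources, data_like):
    """Pick exactly one variant of a source by filtering on the given keys."""
    query = dict(data_like)
    name = query.pop("source")
    candidates = sources[name]
    if query:
        matches = [
            v for v in candidates
            if all(lookup(v, key) == wanted for key, wanted in query.items())
        ]
    else:
        # No variant keys given: fall back to the flagged default variant.
        matches = [v for v in candidates if v.get("default_variant")]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one variant of '{name}', found {len(matches)}"
        )
    return matches[0]


# Stand-in for DataCatalog._sources as a dictionary of lists.
sources = {
    "my_nice_source": [
        {"uri": "s3://bucket/key1/key2.json", "metadata": {"crs": 4326}},
        {"uri": "/mnt/p/cooldata.json", "metadata": {"crs": 90002}, "default_variant": True},
    ]
}
picked = select_variant(sources, {"source": "my_nice_source", "metadata.crs": 4326})
fallback = select_variant(sources, {"source": "my_nice_source"})
```

Raising on both zero and multiple matches keeps ambiguous requests from silently returning an arbitrary variant.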

In addition to the yaml format above, which specifies variant_keys that are existing keys of the data source, it should also be possible to define new keys. These can already be added to metadata in the current setup, but we could also create a dedicated variant field in DataSource. I suggest that keys specified in the variant field don't need a section prefix, so that data requests like the one above stay short.

my_nice_source:
  variant_keys:
    - name
  variants:
    - uri: s3://bucket/key1/key2.json
      variant:
        name: key2
    - uri: /mnt/p/cooldata.json
      variant:
        name: cooldata
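
One way the prefix-free keys could be handled is to normalize request keys before filtering: bare keys are read from the variant field, dotted keys are kept as given. This is only a sketch under that assumption; `normalize_key` is a hypothetical helper.

```python
def normalize_key(key, section="variant"):
    """Prefix bare keys with the variant section; keep dotted keys as given."""
    return key if "." in key else f"{section}.{key}"


data_like = {"source": "my_nice_source", "name": "cooldata"}
query = {normalize_key(k): v for k, v in data_like.items() if k != "source"}
```

With this rule a request for `name` resolves to `variant.name`, while `metadata.crs` still addresses the metadata section explicitly.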

@DirkEilander DirkEilander changed the title from "discuss possible merge of placeholder and variant concepts in DataCatalog" to "replace placeholders with extended variant concept in DataCatalog" Jun 7, 2024
@DirkEilander DirkEilander removed the Needs refinement issue still needs refinement label Jun 7, 2024
@DirkEilander
Contributor Author

@hboisgon We would also like to get your feedback on this issue. With this new variant concept I think we have a single but flexible way to define multiple variants of the same source (before, we had variant, alias and placeholder). For the cmip6 model archive it would require a longer catalog yaml file, but it gives more flexibility to accommodate small differences in format between files.

4 participants