Cordex extension and trying to build on higher level abstractions #64

huard · 2024-10-11T18:22:53Z

I added a CMIP6-CORDEX extension and implementation, trying to create base classes that would simplify the addition of other extensions.

This simplifies the implementation somewhat, but you'll see there is still some boilerplate code on the implementation side that we could do without.

The main change is that I created a generic THREDDSCatalogDataModel. Extensions then only have to define the data model for their properties, and how to construct a unique ID. If a jsonschema is provided, then it will be used to validate the incoming data. I've disabled the validation done at the STAC extension level for now (see below).

I've struggled a bit with the role of the jsonschema here. In climate science, this is not a very popular tool. Even if scientific schemas appeared, we'd have to embed them into a STAC-specific schema. You'll see that I've created a schema directory with the CORDEX schema for global attributes, but this is not a STAC schema per se. A STAC schema would embed those attributes into a properties object, accompanied by a type. I didn't know how to embed one schema into another, which is why I disabled the extension schema validation.

To try it:

stac-populator run Ouranos_CMIP6-CORDEX http://localhost:8880/stac https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/disk2/ouranos/CORDEX/CMIP6/DD/NAM-12/OURANOS/MPI-ESM1-2-LR/ssp370/r1i1p1f1/CRCM5/v1-r1/day/tas/v20231208/catalog.html

If you think this kind of abstraction is useful, I could port those changes to the CMIP6 case in another PR.

fmigneault

About the "embedded schema", I think an extension could be defined as follows (written in YAML for brevity; convert it to JSON before applying it):

type: object
required:
  - type
  - properties
properties:
  type: 
    const: Feature
  properties:
    $ref: "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"

Then, you can reuse the JSON schema on its own or as a STAC extension definition.
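A minimal sketch of that wrapping step in Python, assuming the attribute schema sits at the path quoted in the YAML. The function name `wrap_as_stac_extension` is hypothetical, and the sketch only builds the wrapper document; how the `$ref` gets resolved depends on the validator and its base URI.

```python
import json

# Path of the standalone attribute schema, as quoted in the YAML above.
ATTRS_SCHEMA = "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"


def wrap_as_stac_extension(attrs_schema_ref: str) -> dict:
    """Build an item-level schema whose `properties` member defers to another schema."""
    return {
        "type": "object",
        "required": ["type", "properties"],
        "properties": {
            "type": {"const": "Feature"},
            "properties": {"$ref": attrs_schema_ref},
        },
    }


stac_schema = wrap_as_stac_extension(ATTRS_SCHEMA)
# JSON form of the same document, ready to be written next to the attribute schema.
stac_schema_json = json.dumps(stac_schema, indent=2)
```

The attribute schema itself is left untouched, so it can still be used on its own.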

As for the PR itself, I have a strong impression that THREDDSCatalogDataModel is essentially trying to accomplish what the "helpers" were trying to do (but they were missing some interface to connect the dots).

It is a bit hard to analyze the code path with all the abstractions involved. So, if I misinterpreted something in my comments, please let me know.

other todos

Need to add Ouranos_CMIP6-CORDEX to the table in the README.
Update changelog

.gitignore

STACpopulator/extensions/base.py

fmigneault · 2024-10-11T20:52:27Z

STACpopulator/extensions/base.py

+        uri =  cls._schema_uri.default
+        if uri is not None:
+            schema = json.load(open(uri))


Could be improved with requests file-handler, allowing either local or remote URI, but not "blocking" for the PR.

I was unsure how to deal with references within a schema if it was not local.

It's possible with the jsonschema library that we're using to have the library resolve remote references with requests. See the example here (which uses httpx not requests but the idea is the same).

I agree that this can be for another PR though
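As a rough illustration of the "local or remote URI" idea, here is a sketch that uses only the standard library (`urllib`) as a stand-in for a requests-based handler. The function name `load_schema` is hypothetical, and the sketch does not address resolving `$ref` entries inside the schema.

```python
import json
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def load_schema(uri: str) -> dict:
    """Load a JSON schema from a local path, a file:// URI, or an http(s) URL."""
    parsed = urlparse(uri)
    if parsed.scheme in ("http", "https", "file"):
        # urlopen handles http, https and file URLs; the context manager closes the handle.
        with urlopen(uri) as response:
            return json.load(response)
    # Anything else is treated as a plain filesystem path.
    with Path(uri).open(encoding="utf-8") as fh:
        return json.load(fh)
```

Unlike `json.load(open(uri))`, both branches close the underlying handle once the schema has been read.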

STACpopulator/extensions/base.py

fmigneault · 2024-10-11T21:02:29Z

STACpopulator/extensions/base.py

+    # List of properties not meant to be validated by json schema.
+    _schema_exclude: list[str] = PrivateAttr([])


Can't the model_config be used for that?

class Model(DataModel):
    model_config = ConfigDict(
        populate_by_name=True,
        extra="ignore",
        fields={"field-to-exclude": {"exclude": True}},
    )

Otherwise, reuse the same PrivateAttr approach, and filter by annotation/field-type?

I don't think so, because those are fields I wanted to exclude from the schema validation, but not from the model dump. I was thinking of a case where the schema strictly prohibits extra attributes, but I realize this might be a rare corner case.
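To make the distinction concrete, here is a small sketch with plain dictionaries: the excluded key is dropped only from what is sent to the validator and stays in the dumped data. The function name `schema_payload` and the `local_note` field are hypothetical.

```python
def schema_payload(dumped: dict, schema_exclude: list[str]) -> dict:
    """Return the subset of a dumped model that should be checked against the JSON schema."""
    return {key: value for key, value in dumped.items() if key not in schema_exclude}


# Illustrative dumped model: `local_note` is assumed not to be described by the schema.
dumped = {"domain_id": "NAM-12", "local_note": "added by the populator"}
to_validate = schema_payload(dumped, schema_exclude=["local_note"])
```

A schema that forbids additional properties would then accept `to_validate`, while `dumped` still carries the extra field into the STAC item.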

STACpopulator/populator_base.py

fmigneault · 2024-10-11T21:46:10Z

STACpopulator/implementations/Ouranos_CMIP6-CORDEX/add_CORDEX6.py

+    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
+        dm = self.data_model.from_data(item_data)
+        return dm.stac_item()


Looks like this could be directly in STACpopulatorBase since it only refers to data_model overridden by the class. Especially if extensions are generalized, this might become redundant across implementations.

However, I'm noticing here that we are still limited to a single extension. If I want to define a dataset that uses both datacube and Cordex6DataModel properties, I have to create yet another populator and define create_stac_item with my own custom set of operations.

What we might need instead is a list of helper extensions that are applied to the given data.
The pattern is very consistent.

For example, CMIP6populator and CORDEX_STAC_Populator could have:

class CMIP6populator(STACpopulatorBase):
    item_helpers = [CMIP6Helper, DatacubeHelper, THREDDSHelper]

class CORDEX_STAC_Populator(STACpopulatorBase):
    item_helpers = [Cordex6Helper]

And then, we would have:

class STACpopulatorBase:
    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
        item = pystac.Item(...)
        for helper_cls in self.item_helpers:
            # Each entry of item_helpers is a helper class, instantiated with the item data.
            helper = helper_cls(item_data)
            item = helper.apply(item)
        return item

Where each helper has something along the lines of:

def apply(self, item: pystac.Item) -> pystac.Item:
    dc_ext = DatacubeExtension.ext(item, add_if_missing=True)
    dc_ext.apply(dimensions=dc_helper.dimensions, variables=dc_helper.variables)
    return dc_ext.item

# or

def apply(self, item: pystac.Item) -> pystac.Item:
    valid_data = Cordex6DataModel(self.item_data)
    valid_json = json.loads(valid_data.model_dump_json(by_alias=True))
    item.properties.update(valid_json)
    return item

Using this "helper" approach, you wouldn't need to define all the boiler-plate code for a typical "stac extension classes". What apply() does is up to the helper.
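The helper pattern described above can be sketched without any STAC dependency. In this sketch a plain dictionary stands in for `pystac.Item`, and the helper classes, the `PopulatorSketch` class and the shape of `item_data` are all hypothetical.

```python
from typing import Any


class PropertiesHelper:
    """Hypothetical helper: copies the loader's attributes into the item properties."""

    def __init__(self, item_data: dict[str, Any]) -> None:
        self.item_data = item_data

    def apply(self, item: dict[str, Any]) -> dict[str, Any]:
        item["properties"].update(self.item_data["attributes"])
        return item


class AccessUrlHelper:
    """Hypothetical helper: records the access URLs as assets."""

    def __init__(self, item_data: dict[str, Any]) -> None:
        self.item_data = item_data

    def apply(self, item: dict[str, Any]) -> dict[str, Any]:
        for name, url in self.item_data["access_urls"].items():
            item["assets"][name] = {"href": url}
        return item


class PopulatorSketch:
    # Helper classes, in the order they should be applied.
    item_helpers = [PropertiesHelper, AccessUrlHelper]

    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
        # A plain dict stands in for pystac.Item in this sketch.
        item = {"type": "Feature", "id": item_name, "properties": {}, "assets": {}}
        for helper_cls in self.item_helpers:
            item = helper_cls(item_data).apply(item)
        return item


item = PopulatorSketch().create_stac_item(
    "demo",
    {
        "attributes": {"domain_id": "NAM-12"},
        "access_urls": {"HTTPServer": "https://example.org/demo.nc"},
    },
)
```

A populator that needs a different combination of extensions would only change the `item_helpers` list.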

I like this idea, will look into it and come back with questions.

Note that THREDDSCatalogDataModel automatically applies the datacube and thredds extensions.

One issue I see with this is that the extension helpers have different __init__ requirements. So either the helpers know how to parse the input data, or the object instantiating them provides that logic.

We could have an in-between solution.
The item_helpers list could define instances rather than type references:

item_helpers = [
    HelperWithoutArg(),
    THREDDSHelper(["<url>"]),
]

Anything that can be supplied at init would be created right away, and the STAC item objects would be obtained during the apply(item) call.

I don't think there are cases where a helper would lack the references it needs, which would limit this approach, but this needs investigating.

Not sure I follow. You need the data to create instances of the helpers.

What we could do is something like this:

@classmethod
def from_data(cls, data):
    """Instantiate class from data provided by THREDDS Loader."""
    # This is where we match the Loader's output to the STAC item and extensions inputs. If we had multiple
    # loaders, that's probably the only thing that would be different between them.
    return cls(
        data=data,
        start_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_start"],
        end_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_end"],
        geometry=ncattrs_to_geometry(data),
        bbox=ncattrs_to_bbox(data),
        properties=data["attributes"],
    )

@model_validator(mode="before")
@classmethod
def datacube_helper(cls, data):
    """Validate the DataCubeHelper."""
    data["datacube"] = DataCubeHelper(data['data'])
    return data

@model_validator(mode="before")
@classmethod
def thredds_helper(cls, data):
    """Instantiate the THREDDSHelper."""
    data["thredds"] = THREDDSHelper(data['data']["access_urls"])
    return data

fmigneault · 2024-10-11T22:01:09Z

STACpopulator/implementations/Ouranos_CMIP6-CORDEX/add_CORDEX6.py

+    data_model = Cordex6DataModel
+    item_geometry_model = None  # Unnecessary, but kept for consistency


This is defined for the CMIP6populator:

class CMIP6populator(STACpopulatorBase):
    item_properties_model = CMIP6Properties
    item_geometry_model = GeoJSONPolygon

And data_model = Cordex6DataModel basically offers:

Cordex6DataModel.properties == CordexCmip6  # -> just like CMIP6Properties
Cordex6DataModel -> THREDDSCatalogDataModel.geometry  # -> just like item_geometry_model

I'm wondering whether these attributes duplicate each other's intended use?

Yes, because I didn't want to break the CMIP extension and implementation just yet. My idea was to try to generalize the CORDEX example, get a sense of where this is going, and once we're happy, then bring the changes to CMIP6.

STACpopulator/extensions/cordex6.py

fmigneault · 2024-10-11T22:04:41Z

STACpopulator/extensions/cordex6.py

+# This is generated using datamodel-codegen + manual edits
+class CordexCmip6(DataModel):


Since the model is generated from the schema, why is the @model_validator needed to load and validate the JSON schema?

I'm not seeing the subtlety from static code analysis.

It's a slimmed-down version of the schema without the actual controlled vocabulary (CV) validation. The schema includes enums with the CVs, while the pydantic DataModel does not.

This is a question I struggled with. I felt it didn't make a lot of sense to duplicate the jsonschema validation in pydantic. On the other hand, relying only on the schema and not even seeing the attributes in the code felt obscure and not admin friendly. So I thought it would be useful to have a pydantic DataModel layer where you can add attributes to the data model, and exclude some that are in the schema but you don't want in the STAC item.

I wonder if that is a limitation of datamodel-codegen, or whether there is an option to enable it?
Normally, enums can be expressed with the Literal type in pydantic.

I think it makes sense to have the DataModel auto-generated from the schema to provide the attributes. That is easier to work with for users who know Python but are less familiar with JSON schema.

Yes, it's definitely possible. I just didn't include the Literals in the python code.
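For illustration, here is how a controlled vocabulary can be carried by a `Literal` annotation. Pydantic validates `Literal` fields natively; this sketch uses a standard-library dataclass as a stand-in so that it runs without pydantic. The class name and the vocabulary values are illustrative only, not taken from the CORDEX schema.

```python
from dataclasses import dataclass, fields
from typing import Literal, get_args, get_origin

# Illustrative vocabulary only; the real CVs are listed in the CORDEX JSON schema.
DomainId = Literal["NAM-12", "EUR-12"]


@dataclass
class CordexAttrsSketch:
    domain_id: DomainId

    def __post_init__(self) -> None:
        # Check every Literal-annotated field against its allowed values.
        for field in fields(self):
            if get_origin(field.type) is Literal:
                value = getattr(self, field.name)
                allowed = get_args(field.type)
                if value not in allowed:
                    raise ValueError(f"{field.name}={value!r} is not one of {allowed}")


attrs = CordexAttrsSketch(domain_id="NAM-12")
```

With the vocabulary in the annotation, the allowed values are visible in the Python code as well as in the schema.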

Makefile

mishaschwartz · 2024-10-17T12:38:07Z

STACpopulator/extensions/base.py

+        uri =  cls._schema_uri.default
+        if uri is not None:
+            schema = json.load(open(uri))


It's possible with the jsonschema library that we're using to have the library resolve remote references with requests. See the example here (which uses httpx not requests but the idea is the same).

I agree that this can be for another PR though

STACpopulator/extensions/base.py

mishaschwartz · 2024-10-17T14:04:19Z

STACpopulator/extensions/base.py

+    start_datetime: datetime
+    end_datetime: datetime
+
+    extensions: list = []


Is there a reason these can't be a list of the extensions themselves? Why have this as a list of strings referring to class attributes?

Well, I'm not sure then how we would pass the relevant attributes to each helper. I'll try to see if I can do something about this.

I think it is better to let the helpers add them automatically, as they might change over time (e.g.: new extension version with modified attributes).

Sorry, I'll try to explain a bit better (or maybe I'm not understanding the issue).

Right now we do something like this:

class BaseSTAC:
    ...
    def stac_item(self) -> "pystac.Item":
        ...
        for ext in self.extensions:
            getattr(self, ext).apply(item)

class THREDDSCatalogDataModel(BaseSTAC):
    ...
    properties: ExtensionHelper
    datacube: DataCubeHelper
    thredds: THREDDSHelper
    ...
    extensions: list = ["properties", "datacube", "thredds"]

Why can't we do this?

class BaseSTAC:
    ...
    def stac_item(self) -> "pystac.Item":
        ...
        for ext in self.extensions:
            ext.apply(item)

class THREDDSCatalogDataModel(BaseSTAC):
    ...
    extensions: list = [ExtensionHelper, DataCubeHelper, THREDDSHelper]

Because of data ingestion.

For example, the THREDDS extension needs to be passed THREDDSHelper(data['data']["access_urls"])

I currently manage this with model_validator:

@model_validator(mode="before")
@classmethod
def thredds_helper(cls, data):
    """Instantiate the THREDDSHelper."""
    data["thredds"] = THREDDSHelper(data['data']["access_urls"])
    return data

I'm not sure how we'd do that with your proposal without adding some obscure magic.
What I've now done is automatically detect extensions from the annotation (if it's a Helper subclass). Hope that's ok for now.
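One way such annotation-based detection could look, as a sketch only: the class names are stand-ins, and the method name `helper_fields` is hypothetical rather than the one used in the PR.

```python
from typing import get_type_hints


class Helper:
    """Stand-in base class for extension helpers."""


class DataCubeHelper(Helper):
    pass


class THREDDSHelper(Helper):
    pass


class ModelSketch:
    start_datetime: str
    datacube: DataCubeHelper
    thredds: THREDDSHelper

    @classmethod
    def helper_fields(cls) -> list[str]:
        """Names of the attributes whose annotation is a Helper subclass."""
        return [
            name
            for name, hint in get_type_hints(cls).items()
            if isinstance(hint, type) and issubclass(hint, Helper)
        ]


detected = ModelSketch.helper_fields()
```

Adding a new helper attribute to the model is then enough for it to be picked up, with no separate list of names to maintain.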

I actually misread the question: I took extensions to mean STAC.Item.extensions (i.e., the URIs of the applied extensions). I think we should rename the attribute, because the current name is very confusing. It should be helpers, to make clear that it refers to the helpers, which have extended capabilities for applying the STAC extensions (and sometimes non-extension attributes).

Ideally, we should have something like:

helpers: list[Type[Helper]] = [ExtensionHelper, DataCubeHelper, THREDDSHelper]

Because only classes of the helpers are used (not instances), they should be able to receive the item data dynamically for the apply() call.

If the data source is needed, the Helper base class could have it as a required argument for apply() or in its __init__(), whichever makes more sense.
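A sketch of the variant where the list holds helper classes and the data source is passed to `apply()`. A plain dictionary stands in for `pystac.Item`, and the class names and the layout of `data` are hypothetical.

```python
from typing import Any


class Helper:
    """Stand-in base class: helpers are listed as classes and receive the data in apply()."""

    @classmethod
    def apply(cls, item: dict[str, Any], data: dict[str, Any]) -> dict[str, Any]:
        raise NotImplementedError


class THREDDSHelperSketch(Helper):
    @classmethod
    def apply(cls, item: dict[str, Any], data: dict[str, Any]) -> dict[str, Any]:
        # The helper has to know where the loader stores the access URLs.
        for name, url in data["access_urls"].items():
            item["assets"][name] = {"href": url}
        return item


class ModelSketch:
    helpers: list[type[Helper]] = [THREDDSHelperSketch]

    def stac_item(self, data: dict[str, Any]) -> dict[str, Any]:
        # A plain dict stands in for pystac.Item in this sketch.
        item: dict[str, Any] = {"type": "Feature", "properties": {}, "assets": {}}
        for helper_cls in self.helpers:
            item = helper_cls.apply(item, data)
        return item


item = ModelSketch().stac_item({"access_urls": {"HTTPServer": "https://example.org/demo.nc"}})
```

In this shape, each helper reads directly from the layout of `data`, which is the trade-off to weigh against instantiating helpers in the model.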

Renamed _extensions to _helpers.

I don't see how I could implement your proposal, without hard-coding the data ingestion logic into the helpers themselves, which would couple them tightly with the Loader, which I thought we should avoid.

Minor nitpick.

I think the BaseSTAC class should define something like:

class BaseSTAC(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def helpers(cls) -> list[Type[Helper]]:
        raise NotImplementedError

This way, any derived class, and any smart IDE, flags right away that helpers must be overridden.

I think helpers should be used instead of _helpers because it is part of the "public" interface of that class, for anyone that derives a new implementation from it.
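A small sketch of how the abstract classmethod behaves, using stand-in classes (`WithHelpers` and `WithoutHelpers` are hypothetical names):

```python
import abc


class Helper:
    """Stand-in for the helper base class."""


class DataCubeHelper(Helper):
    pass


class BaseSTAC(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def helpers(cls) -> list[type[Helper]]:
        raise NotImplementedError


class WithHelpers(BaseSTAC):
    @classmethod
    def helpers(cls) -> list[type[Helper]]:
        return [DataCubeHelper]


class WithoutHelpers(BaseSTAC):
    # helpers() is not overridden, so this class cannot be instantiated.
    pass


declared = WithHelpers().helpers()
```

Note that `abc` enforces the override at instantiation time: creating a `WithoutHelpers` instance raises `TypeError`, while defining the class does not.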

STACpopulator/extensions/datacube.py

STACpopulator/implementations/Ouranos_CMIP6-CORDEX/__init__.py

STACpopulator/extensions/thredds.py

STACpopulator/implementations/Ouranos_CMIP6-CORDEX/add_CORDEX6.py

tests/test_cordex.py

huard · 2024-10-17T19:52:42Z

I've added a mechanism to embed the schema into a STAC item schema, which I save in /tmp for now. This was just for testing, so we can disable that feature for now. I'll be travelling starting tomorrow, so I won't be able to work on this much.

tests/data/cordex6_ncml.json

tests/data/cordex6_raw.json

huard added 9 commits October 3, 2024 11:36

typos

3c85233

work on extension abstraction and cordex example

f1c5683

removed json from .gitignore. Add CORDEX6 json schema

c0bdeb2

merge

3a4b6ef

embedding datacube and thredds extension in the base logic

b7d9b75

got it to work

98c6420

cordex implementation

33d78b9

add missing item_geometry_model

0212008

get the cordex extension to work with the stac-populator cli.

04d94df

huard requested a review from fmigneault as a code owner October 11, 2024 18:22

huard added 3 commits October 11, 2024 14:34

add some notes and comments

ac0cec2

clean-up

03e5738

remove break

8cceeab

fmigneault reviewed Oct 11, 2024

huard added 9 commits October 15, 2024 11:09

suggestions from review

8b583bb

added apply method to extension helpers

8db0847

include schemas in installation source

df35431

Put generic STAC item logic into BaseSTAC class

c6e68e8

docstring

d76fdd8

added xscen extension. automated the creation of an alias generator from the _prefix attribute

adb1065

added license_type to xscen schema. Test over ncml

fd49161

update README and CHANGES.

08ccd42

Merge branch 'xscen_extension' into cordex_extension

2fb0eb6

mishaschwartz reviewed Oct 17, 2024

huard added 5 commits October 17, 2024 13:45

suggestions from Mischa

312694c

need to pass properties dict

a847c3f

suggestions from review

72a8271

upgrade numpy type conversion function

4817d66

add extensions automatically

2b43022

fmigneault reviewed Oct 18, 2024

huard added 2 commits October 18, 2024 09:15

rename _extensions to _helpers

118810c

format test data jsons

d1d838e

mishaschwartz mentioned this pull request Oct 21, 2024

code cleanup #66

Open
