Cordex extension and trying to build on higher level abstractions #64

Open · wants to merge 28 commits into master

Conversation

@huard (Collaborator) commented Oct 11, 2024:

I added a CMIP6-CORDEX extension and implementation, trying to create base classes that would simplify the addition of other extensions.

This simplifies the implementation a bit, but you'll see that there is still some boilerplate code on the implementation side that we could do without.

The main change is that I created a generic THREDDSCatalogDataModel. Extensions then only have to define the data model for their properties, and how to construct a unique ID. If a jsonschema is provided, then it will be used to validate the incoming data. I've disabled the validation done at the STAC extension level for now (see below).
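For illustration, here is roughly what an extension might look like under this design (a sketch only; the import path, class names, field names, and the uid hook below are hypothetical, not the actual code in this PR):

from pydantic import BaseModel

from STACpopulator.extensions.base import THREDDSCatalogDataModel  # assumed import path


# Hypothetical property model for a new extension; field names are illustrative only.
class MyExtensionAttrs(BaseModel):
    activity_id: str
    source_id: str


# Sketch of an extension built on the generic base class described above.
class MyExtensionDataModel(THREDDSCatalogDataModel):
    properties: MyExtensionAttrs

    @property
    def uid(self) -> str:
        # Hypothetical hook: constructing a unique item ID is the extension's responsibility.
        return f"{self.properties.activity_id}_{self.properties.source_id}"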

I've struggled a bit with the role of the jsonschema here. In climate science, this is not a very popular tool. Even if scientific schemas appeared, we'd have to embed them into a STAC-specific schema. You'll see that I've created a schema directory with the CORDEX schema for global attributes, but this is not a STAC schema per se. A STAC schema would embed those attributes into a properties object, accompanied by a type. I didn't know how to embed one schema into another, which is why I disabled the extension schema validation.

To try it:

stac-populator run Ouranos_CMIP6-CORDEX http://localhost:8880/stac https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/disk2/ouranos/CORDEX/CMIP6/DD/NAM-12/OURANOS/MPI-ESM1-2-LR/ssp370/r1i1p1f1/CRCM5/v1-r1/day/tas/v20231208/catalog.html

If you think this kind of abstraction is useful, I could port those changes to the CMIP6 case in another PR.

@fmigneault (Collaborator) left a comment:

About the "embedded schema": I think an extension could be defined as follows (using YAML for brevity, but converted to JSON when applying it):

type: object
required:
  - type
  - properties
properties:
  type: 
    const: Feature
  properties:
    $ref: "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"

Then, you can reuse the JSON schema on its own or as STAC extension definition.
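For example, a quick way to produce the JSON form of that wrapper (a sketch, assuming the same relative $ref path as in the YAML above) could be:

import json

# Sketch: wrap the global-attributes schema inside a STAC-item-level schema,
# reusing the relative $ref path from the YAML example above.
wrapper = {
    "type": "object",
    "required": ["type", "properties"],
    "properties": {
        "type": {"const": "Feature"},
        "properties": {
            "$ref": "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"
        },
    },
}
print(json.dumps(wrapper, indent=2))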

As for the PR itself, I have a strong impression that THREDDSCatalogDataModel is essentially trying to accomplish what the "helpers" were trying to do (but is missing some interface to connect the dots).

It is a bit hard to analyze the code path with all the abstractions involved. So, if I misinterpreted something in my comments, please let me know.

Other TODOs:

  • Add Ouranos_CMIP6-CORDEX to the table in the README.
  • Update the changelog.

.gitignore (resolved)
STACpopulator/extensions/base.py (outdated; resolved)
Comment on lines +78 to +80
uri = cls._schema_uri.default
if uri is not None:
schema = json.load(open(uri))
Collaborator:

Could be improved with requests file-handler, allowing either local or remote URI, but not "blocking" for the PR.
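Something along these lines, for instance (a sketch only; not blocking, as noted):

import json
from urllib.parse import urlparse

import requests


def load_schema(uri: str) -> dict:
    """Load a JSON schema from either a local path or a remote http(s) URI (sketch)."""
    if urlparse(uri).scheme in ("http", "https"):
        resp = requests.get(uri, timeout=30)
        resp.raise_for_status()
        return resp.json()
    with open(uri, encoding="utf-8") as f:
        return json.load(f)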

@huard (author):

I was unsure how to deal with references within a schema if it was not local.

Contributor:

With the jsonschema library that we're already using, it's possible to have the library resolve remote references with requests. See the example here (which uses httpx rather than requests, but the idea is the same).

I agree that this can be left for another PR, though.
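A sketch of what that could look like with requests (assuming jsonschema ≥ 4.18 and its referencing registry; schema and instance are placeholders for an already-loaded schema and the data to validate):

import requests
from jsonschema import Draft202012Validator
from referencing import Registry, Resource


def retrieve_remote(uri: str) -> Resource:
    # Called by the registry for any $ref URI it does not already know about.
    response = requests.get(uri, timeout=30)
    response.raise_for_status()
    return Resource.from_contents(response.json())


registry = Registry(retrieve=retrieve_remote)
validator = Draft202012Validator(schema, registry=registry)
validator.validate(instance)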

STACpopulator/extensions/base.py (outdated; resolved)
Comment on lines 68 to 69
# List of properties not meant to be validated by json schema.
_schema_exclude: list[str] = PrivateAttr([])
Collaborator:

Can't the model_config be used for that?

class Model(DataModel):
    model_config = ConfigDict(
        populate_by_name=True,
        extra="ignore",
        fields={"field-to-exclude": {"exclude": True}},
    )

Otherwise, reuse the same PrivateAttr approach, and filter by annotation/field-type?

@huard (author):

I don't think so, because those were fields I wanted to exclude from the schema validation, but not from the model dump. I was thinking of a case where the schema strictly prohibits extra attributes, but I realize this might be a very narrow corner case.
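For reference, a sketch of the intended behaviour as described above (model, schema, and the jsonschema call are placeholders; only _schema_exclude comes from the PR):

import jsonschema

# Sketch: fields listed in _schema_exclude stay in the model dump (and thus in the
# STAC item), but are removed from the payload that jsonschema validates, in case
# the schema forbids additional properties.
payload = model.model_dump(exclude=set(model._schema_exclude))  # `model` is a DataModel instance
jsonschema.validate(instance=payload, schema=schema)            # `schema` loaded from _schema_uri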

STACpopulator/populator_base.py (outdated; resolved)
Comment on lines 31 to 33
def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
dm = self.data_model.from_data(item_data)
return dm.stac_item()
Collaborator:

Looks like this could live directly in STACpopulatorBase, since it only refers to data_model, which is overridden by the class. Especially if extensions are generalized, this might become redundant across implementations.

However, I'm noticing here that we are still limited to a single extension. If I want to define a dataset that uses both datacube and Cordex6DataModel properties, I have to create yet another populator and define create_stac_item with my own custom set of operations.

What we might need instead is a list of helper extensions that apply onto the given data.
The pattern is very consistent.

For example, CMIP6populator and CORDEX_STAC_Populator could have:

class CMIP6populator(STACpopulatorBase):
    item_helpers = [CMIP6Helper, DatacubeHelper, THREDDSHelper]

class CORDEX_STAC_Populator(STACpopulatorBase):
    item_helpers = [Cordex6Helper]

And then, we would have:

class STACpopulatorBase:
    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
        item = pystac.Item(...)
        for helper_cls in self.item_helpers:
            helper = helper_cls(item_data)
            item = helper.apply(item)
        return item

Where each helper has something along the lines of:

def apply(self, item: pystac.Item) -> pystac.Item:
    dc_ext = DatacubeExtension.ext(item, add_if_missing=True)
    dc_ext.apply(dimensions=self.dimensions, variables=self.variables)
    return dc_ext.item

# or 

def apply(self, item: pystac.Item) -> pystac.Item:
    valid_data = Cordex6DataModel(self.item_data)
    valid_json = json.loads(valid_data.model_dump_json(by_alias=True))
    item.properties.update(valid_json)
    return item

Using this "helper" approach, you wouldn't need to define all the boilerplate code for typical "STAC extension" classes. What apply() does is up to the helper.
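The base interface implied by these examples could be as small as the following (a sketch, not part of the PR):

from abc import ABC, abstractmethod

import pystac


class Helper(ABC):
    """Minimal interface: each helper knows how to apply itself to a STAC item."""

    @abstractmethod
    def apply(self, item: pystac.Item) -> pystac.Item:
        ...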

@huard (author):

I like this idea, will look into it and come back with questions.

@huard (author):

Note that THREDDSCatalogDataModel automatically applies the datacube and thredds extensions.

@huard (author):

One issue I see with this is that the extension helpers have different __init__ requirements. So either the helpers know how to parse the input data, or the object instantiating them provides that logic.

Collaborator:

We could have an in-between solution.
The item_helpers list could define instances rather than type references:

item_helpers = [
  HelperWithoutArg(),
  THREDDSHelper(["<url>"]),
]

Anything that can be supplied at init would be created right away, and the STAC item objects would be obtained during the apply(item) call.

I don't think there are any cases where the helpers would be missing references that would limit this approach, but that remains to be investigated...

@huard (author):

Not sure I follow. You need the data to create instances of the helpers.

@huard (author):

What we could do is something like this:

    @classmethod
    def from_data(cls, data):
        """Instantiate class from data provided by THREDDS Loader.
        """
        # This is where we match the Loader's output to the STAC item and extensions inputs. If we had multiple
        # loaders, that's probably the only thing that would be different between them.
        return cls(data=data,
                   start_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_start"],
                   end_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_end"],
                   geometry=ncattrs_to_geometry(data),
                   bbox=ncattrs_to_bbox(data),
                   properties=data["attributes"],
                   )

    @model_validator(mode="before")
    @classmethod
    def datacube_helper(cls, data):
        """Validate the DataCubeHelper."""
        data["datacube"] = DataCubeHelper(data['data'])
        return data

    @model_validator(mode="before")
    @classmethod
    def thredds_helper(cls, data):
        """Validate the DataCubeHelper."""
        data["thredds"] = THREDDSHelper(data['data']["access_urls"])
        return data

Comment on lines 28 to 29
data_model = Cordex6DataModel
item_geometry_model = None # Unnecessary, but kept for consistency
Collaborator:

This is defined for the CMIP6populator:

class CMIP6populator(STACpopulatorBase):
    item_properties_model = CMIP6Properties
    item_geometry_model = GeoJSONPolygon

And data_model = Cordex6DataModel basically offers:

Cordex6DataModel.properties == CordexCmip6  #  -> just like CMIP6Properties
Cordex6DataModel -> THREDDSCatalogDataModel.geometry  # -> just like item_geometry_model

I'm wondering if there's any duplication of the intended use of these properties?

@huard (author):

Yes, because I didn't want to break the CMIP extension and implementation just yet. My idea was to try to generalize the CORDEX example, get a sense of where this is going, and once we're happy, then bring the changes to CMIP6.

STACpopulator/extensions/cordex6.py (outdated; resolved)
Comment on lines 13 to 14
# This is generated using datamodel-codegen + manual edits
class CordexCmip6(DataModel):
Collaborator:

Since the model is generated from the schema, why is the @model_validator needed to load and validate the JSON schema?

I'm not seeing the subtlety from static code analysis.

@huard (author):

It's a slimmed-down version of the schema without the actual CV (controlled vocabulary) validation. The schema includes enums with the CVs, while the pydantic DataModel does not.

This is a question I struggled with. I felt it didn't make a lot of sense to duplicate the jsonschema validation in pydantic. On the other hand, relying only on the schema and not even seeing the attributes in the code felt obscure and not admin-friendly. So I thought it would be useful to have a pydantic DataModel layer where you can add attributes to the data model, and exclude some that are in the schema but that you don't want in the STAC item.

Collaborator:

I wonder if that is an issue with datamodel-codegen, or an option that needs to be provided to it?
Normally, the enums should be possible using the Literal type with pydantic.

I think it makes sense to have the DataModel auto-generated from the schema to provide the attributes. It's easier to manipulate for users who are comfortable with Python but not so much with JSON schema.
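For example, a controlled vocabulary can be expressed directly in the generated model (the field name and values below are purely illustrative, not the actual CORDEX-CMIP6 CV):

from typing import Literal

from pydantic import BaseModel


class CordexAttrsSketch(BaseModel):
    # Hypothetical CV field: pydantic rejects any value outside the Literal set,
    # mirroring the enum in the JSON schema.
    frequency: Literal["1hr", "6hr", "day", "mon"]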

@huard (author):

Yes, it's definitely possible. I just didn't include the Literals in the Python code.

Makefile (outdated; resolved)
STACpopulator/extensions/base.py (outdated; resolved)
start_datetime: datetime
end_datetime: datetime

extensions: list = []
@mishaschwartz (Contributor) commented Oct 17, 2024:

Is there a reason these can't be a list of the extensions themselves? Why have this as a list of strings referring to class attributes?

@huard (author):

Well, I'm not sure then how we would pass the relevant attributes to each helper. I'll try to see if I can do something about this.

Collaborator:

I think it is better to let the helpers add them automatically, as they might change over time (e.g.: new extension version with modified attributes).

Contributor:

Sorry, I'll try to explain a bit better (or maybe I'm not understanding the issue).

Right now we do something like this:

class BaseSTAC:
    ...
    def stac_item(self) -> "pystac.Item":
        ...
        for ext in self.extensions:
            getattr(self, ext).apply(item)

class THREDDSCatalogDataModel(BaseSTAC):
    ...
    properties: ExtensionHelper
    datacube: DataCubeHelper
    thredds: THREDDSHelper
    ...
    extensions: list = ["properties", "datacube", "thredds"]

Why can't we do this?

class BaseSTAC:
    ...
    def stac_item(self) -> "pystac.Item":
        ...
        for ext in self.extensions:
            ext.apply(item)

class THREDDSCatalogDataModel(BaseSTAC):
    ...
    extensions: list = [ExtensionHelper, DataCubeHelper, THREDDSHelper]

@huard (author):

Because of data ingestion.

For example, the THREDDS extension needs to be passed THREDDSHelper(data['data']["access_urls"])

I currently manage this with a model_validator:

    @model_validator(mode="before")
    @classmethod
    def thredds_helper(cls, data):
        """Instantiate the THREDDSHelper."""
        data["thredds"] = THREDDSHelper(data['data']["access_urls"])
        return data

I'm not sure how we'd do that with your proposal without adding some obscure magic.
What I've now done is automatically detect extensions from the annotations (when the field type is a Helper subclass). Hope that's OK for now.
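For reference, a minimal sketch of that annotation-based detection (assuming pydantic v2 and a Helper base class; not the exact code in the PR):

from pydantic import BaseModel


class Helper:  # stand-in for the actual Helper base class
    ...


class BaseSTAC(BaseModel):
    @property
    def _helpers(self) -> list[str]:
        # Names of fields whose declared type is a Helper subclass.
        return [
            name
            for name, field in type(self).model_fields.items()
            if isinstance(field.annotation, type) and issubclass(field.annotation, Helper)
        ]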

@fmigneault (Collaborator) commented Oct 18, 2024:

I was actually misinterpreting the question, thinking of extensions as the STAC Item extensions (i.e., the URIs of the applied extensions). I think we should rename the attribute, because that is very confusing. It should be helpers, to highlight the use of helpers that have extended capabilities for applying the STAC extensions (and sometimes non-extension attributes).

Ideally, we should have something like:

helpers: list[Type[Helper]] = [ExtensionHelper, DataCubeHelper, THREDDSHelper]

Because only classes of the helpers are used (not instances), they should be able to receive the item data dynamically for the apply() call.

If the data source is needed, the Helper base class could have it as a required argument for apply() or in its __init__(), whichever makes more sense.

@huard (author):

Renamed _extensions to _helpers.

I don't see how I could implement your proposal without hard-coding the data ingestion logic into the helpers themselves, which would couple them tightly to the Loader, which I thought we should avoid.

Collaborator:

Minor nitpick.

I think the BaseSTAC class should define something like:

import abc
from typing import Type

class BaseSTAC(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def helpers(cls) -> list[Type[Helper]]:
        raise NotImplementedError

This way, any derived class (and any smart IDE) will flag right away that helpers must be overridden.

I think helpers should be used instead of _helpers because it is part of the "public" interface of that class, for anyone that derives a new implementation from it.
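A derived implementation would then only need to provide the list, e.g. (a sketch, reusing the helper classes discussed above):

class Cordex6DataModel(BaseSTAC):
    @classmethod
    def helpers(cls) -> list[Type[Helper]]:
        return [ExtensionHelper, DataCubeHelper, THREDDSHelper]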

STACpopulator/extensions/datacube.py (resolved)
STACpopulator/extensions/thredds.py (resolved)
tests/test_cordex.py (outdated; resolved)
@huard (author) commented Oct 17, 2024:

I've added a mechanism to embed the schema into a STAC item schema, which I save in /tmp for now. This was just for testing; we can disable that feature for now. I'll be travelling starting tomorrow, so I won't be able to work on this much.

tests/data/cordex6_ncml.json (outdated; resolved)
tests/data/cordex6_raw.json (outdated; resolved)
@mishaschwartz mentioned this pull request Oct 21, 2024.