Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding marble specific fields to item properties #30

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dchandan
Copy link
Collaborator

@dchandan dchandan commented Nov 9, 2023

This PR adds two properties to a STAC item, and in doing so helps describe aspects of the data or the STAC catalog that are useful on the Marble network: mrbl:is_replica and mrbl:host_node.

mrbl:host_node, specifies the name of the Marble node on which the data is stored. Valid values for this (validity is not checked in this PR) are the values of the keys in the Marble node registry.

mrbl:is_replica is set to false in the STAC entry for an item when both the item and the catalog are on mrbl:host_node, and is true in any replicated STAC entries on any other node in the network (the purpose is very similar to the ESGF replica attribute, except on Marble the data is not replicated, only the catalog).

The purpose of this PR is to get a sense of what others think about this, and to solicit useful feedback. "Implementation" itself is not critical for this PR, ideally, down the line, we would write a "Marble STAC extension" to add these properties and/or other future Marble related properties. Right now, we can just do it in a hack-ish way.

Copy link
Collaborator

@fmigneault fmigneault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to merge #28 first.
On follow PRs, tests will have to pass.
I add to patch some tests. Expected features/support were still not working.

@@ -6,7 +6,7 @@ CATALOG = https://daccs.cs.toronto.edu/twitcher/ows/proxy/thredds/catalog/datase
# CATALOG = https://daccs.cs.toronto.edu/twitcher/ows/proxy/thredds/catalog/datasets/CMIP6/CMIP/AS-RCEC/catalog.html

testcmip6:
python $(IMP_DIR)/CMIP6_UofT/add_CMIP6.py $(STAC_HOST) $(CATALOG)
python $(IMP_DIR)/CMIP6_UofT/add_CMIP6.py $(STAC_HOST) $(CATALOG) PAVICS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this command is becoming too cluttered with optional arguments.
Can you add a --host-node option or similar?
I think it should use the URL of the STAC_HOST by default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can streamline this.

item_data["attributes"]["host_node"] = self.host_node

item = STAC_item_from_metadata(iid, item_data, self.item_properties_model, self.item_geometry_model)
# try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove debugging code

@@ -173,6 +187,7 @@ def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any])
parser = argparse.ArgumentParser(prog="CMIP6 STAC populator")
parser.add_argument("stac_host", type=str, help="STAC API address")
parser.add_argument("thredds_catalog_URL", type=str, help="URL to the CMIP6 THREDDS catalog")
parser.add_argument("host_node", type=str, help="Name of the Marble node on which the data resides")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @fmigneault that this could be a flagged option. If we don't make this optional then we can never use this implementation to populate a stac instance that is not part of the Marble network.

However, I think that we should add some validation checks here as well if --host_node is specified:

  • check that there exists a node with the same name as "host_node" in the registry
  • check that the registry contains both a thredds and stac service for that node
  • check that the "stac_host" argument matches the url of the stac service in the registry
  • check that the "thredds_catalog_host" argument matches the url of the thredds service in the registry

If any of these conditions are false, we should raise an error with an appropriate message.

Comment on lines +83 to +84
is_replica: bool = Field(..., serialization_alias="mrbl:is_replica")
host_node: str = Field(..., serialization_alias="mrbl:host_node")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These attributes would apply to every dataset hosted on the node, so I suggest they should be defined inside a Parent class that every collection inherits from.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could also avoid the mrbl shortcut, and simply have marble explicitly.
It is not really that much a long name to justify shortening it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we should use marble.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These attributes would apply to every dataset hosted on the node, so I suggest they should be defined inside a Parent class that every collection inherits from.

I understood your intention and I think the proposal can be useful. But I didn't understand (or can't quite visualize) your proposal for how to do that. Could you please elaborate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class inherits from STACItemProperties, which would be common to all implementations, correct ? So I'd add those two attributes in that class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right I see.

Comment on lines +156 to +157
# with the data residing on that node. Replica STAC items on other nodes in the Marble network will be
# populated using a different app, and there the is_replica flag will be True.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is meant exactly by replica here. In the ESGF world, this has to do with who mirrors from whom. Will MARBLE include replication of datasets?

Also, more generally, if you are going to curate copies of ESGF files that you aggregate yourself, how are you going to determine whether one of the files making the aggregation has been updated, and this update needs to be propagated to your own version?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you say that replica items would be populated by another app. Why would that be? The replica is going to be a netCDF file sitting on a THREDDS server.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do you mean the replica of a STAC item (just the catalog entry, not the file) ? In that case, I think we need to work out how STAC catalogs are going to be shared from one node to another.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is meant exactly by replica here. In the ESGF world, this has to do with who mirrors from whom. Will MARBLE include replication of datasets?

Not, datasets will not be replicated. We don't have enough space on the network. What I meant is that we have plans to ensure that STAC entries from one node are visible in the STAC browser of another node. As an example, at some point the Ouranos STAC catalog will also have a "CMIP6-UofT" collection whose entries be (near) duplicates of the entries on UofT's STAC catalog for the "CMIP6-UofT" (all the properties will be same and crucially, access links will be same), but marble:host_node property will be UofT and marble:is_replica property will be true. Similarly, UofT will have replica entries for catalogs on Ouranos. This way, a user can query any single node on the network to get info on data across the network.

Also, more generally, if you are going to curate copies of ESGF files that you aggregate yourself, how are you going to determine whether one of the files making the aggregation has been updated, and this update needs to be propagated to your own version?

It is my understanding that when files are updated on ESGF, they are marked with a new version number, existing files are never overridden. We should follow the same approach. The aggregations will be for all files in a single version. If, for example, one of the files in say an EC-Earth dataset (which come in 165 individual files) are updated, then it will be up to the manager of the node to decide if he/she wants to create a new aggregation using the updated file. My approach for UofT node would be to not touch do that unless the changes are substantial. Let me know if I didn't understand your question correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do you mean the replica of a STAC item (just the catalog entry, not the file) ? In that case, I think we need to work out how STAC catalogs are going to be shared from one node to another.

Yes just STAC items. We will be doing this by completing the implementation of STACLoader. This will crawl a STAC catalog, get the JSON data, modify it as needed (for example to change marble:is_replica) and post it to another STAC catalog.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider for mirrors or "replica": https://github.com/stac-extensions/alternate-assets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants