Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Dataset __eq__ and set_manager #512

Merged
merged 6 commits into from
Feb 2, 2024

Conversation

aaraney
Copy link
Member

@aaraney aaraney commented Jan 30, 2024

Reintroduce Dataset's __eq__ method. This was lost during the pydantic refactor. The current behavior is incorrect for several reasons. The most glaring being Dataset's must have the same manager_uuid field values.

dmod.core -- 0.13.0

Additions

  • Dataset.set_manager method. This replaces setting a manager instance as if it were a property.

Changes

  • Reintroduce Dataset's __eq__ method. This was lost during the pydantic refactor. The current behavior is incorrect for several reasons. The most glaring being Dataset's must have the same manager_uuid field values.

dmod.modeldata -- 0.9.4

Changes

  • Use dmod.core.Dataset's set_manager api.

@aaraney aaraney marked this pull request as draft January 30, 2024 18:59
@aaraney
Copy link
Member Author

aaraney commented Jan 30, 2024

Pretty sure I missed one place where a setter should be used now.

Edit: found the place and fixed it.

@aaraney aaraney marked this pull request as ready for review January 30, 2024 19:01
Copy link
Contributor

@robertbartel robertbartel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few things I have a few questions about.

and self.category == other.category
and self.created_on == other.created_on
and self.data_domain == other.data_domain
and self.dataset_type == other.dataset_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several properties that are not examined here. For some, like manager_uuid, I think this makes sense. For others, like uuid, my first reaction is that it doesn't makes sense to omit it. And for still others, like derived_from, I'm not sure. (FWIW, I think those that are optional like uuid should be optionally checked, with a None value not disqualifying from equality.)

Any particular reason why some of these weren't added?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing up these questions! I had similar questions myself. I went back in the git history to the first point before the pydantic refactor work and copied and pasted the __eq__ implementation from that point. I thought that would at least be a good starting point.

I am not sure if we should compare uuid's or not? I lean toward yes, but I could hear the argument for no. Our idea of a dataset has slightly changed since pre-pydantic refactor as we now use, more or less, aggregate datasets for everything but forcing data for a simulation. I think that change in thinking was probably why they were not compared in the previous version of the code when the thinking was to determine suitable existing datasets that could be used to satisfy a given simulation request? Is that fair and what are your thoughts?

Copy link
Contributor

@robertbartel robertbartel Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our idea of a dataset has slightly changed since pre-pydantic refactor as we now use, more or less, aggregate datasets for everything but forcing data for a simulation.

I don't think our abstract idea has change; we just evolved another more practical data format.

Here's what I suggest:

  • expires: test both are in the future that neither has an explicit expire time value in the past (i.e., None is allowed)
    • disagreeing about how long data is good for doesn't imply disagreeing on whether your are talking about the same data, as long as everyone agrees the data is still good
  • derived_from: optionally test for equality
    • different explicit origins implies different data, but lack of information about origin does not
  • derivations: ignore
    • doesn't really say anything about "this" data
  • last_updated: optionally test for equality
    • if one side says things have updated more recently than the other, chances are these are not truly talking about the same data
  • uuid: ??? ... for now, optionally test for equality, but this needs its own issue for follow-up
    • I thought more about this, and I'm less sure of what to do
    • does it make sense to have a universally unique identifier if it is not universally unique (i.e., if there are ever two non-equal objects with the same value, even if the value is None)?
    • let's not allow this PR to get stuck here, but we might need to do more regarding this attribute to make the design consistent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay if expires is None on both datasets?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I should have clarified, but I think that's fine also. None implies the dataset is not temporary. Certainly if both are None, this fits with equality. And I don't think disagreeing on whether the dataset is temporary should necessarily imply looking at different data (again, as long as the expire time hasn't passed yet).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying!

python/lib/core/dmod/core/dataset.py Show resolved Hide resolved
----------
value: DatasetManager | None
The new value for ::attribute:`manager`.
"""
self._manager = value if isinstance(value, DatasetManager) else None

if value is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to do this to make sure manager_uuid also gets properly unset.

Suggested change
if value is not None:
if value is None:
self.manager_uuid = None
else:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out, @robertbartel. Just pushed up a fix. I slightly altered the possible behavior, but I think its for the best. LMK if you think otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, so what about the case when you are initializing from a deserialized json message that includes manager_uuid? This boils down to is the invariant that an instance's manager and manager_uuid fields are always either None or some valid value? Or can you have a manager_uuid field without a manager being set? I think the answer if for sure the latter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the previous behavior pre-pydantic was to leave the manager_uuid field if you set the manager with a None value (or any non-covariant DatasetManager type).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed up a change to only modify manager_uuid if a DatasetManager type is passed as the value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, so what about the case when you are initializing from a deserialized json message that includes manager_uuid?

Assuming we keep the manager_uuid attribute (I'll come back to this), we need to support this case. The motive for including the manager_uuid field is that we can't serialize the manager, so instead we serialize a way to identify the right manager.

This boils down to is the invariant that an instance's manager and manager_uuid fields are always either None or some valid value?

Not strictly always. Practically, but not strictly. E.g., not immediately after deserialization.

FWIW the previous behavior pre-pydantic was to leave the manager_uuid field if you set the manager with a None value (or any non-covariant DatasetManager type).

Right, I thought earlier in the discussion for the PR that this was an oversight. The only reason to support setting the manager to None, beyond when we are first initializing the object, would be to unset the manager. And we aren't properly unsetting the manager if the Dataset, when serialized, still reflects said manager's uuid. For now though, leave this as you have it; I'm starting to think it more a consequence of a problem in the design than an oversight.

I don't want us to get further caught up here, so I'll open a separate issue, but I think we should examine whether it is really appropriate/necessary/useful to serialize the manager's uuid with a dataset. Manager instance uuids end up being dynamically generated each time, and the service only permits one manager per DatasetType (i.e., there's only ever one OBJECT_STORE type manager at a time, which gets an arbitrary uuid, so if we know a Dataset is OBJECT_STORE, maybe we don't need to and shouldn't keep a manager uuid separately from a manager reference).

@aaraney
Copy link
Member Author

aaraney commented Feb 2, 2024

@robertbartel, can you give this one quick last glance? I think we are just about ready to merge.

@aaraney
Copy link
Member Author

aaraney commented Feb 2, 2024

Just rebased to master and force pushed after the merge of #461.

@robertbartel robertbartel merged commit ecec45d into NOAA-OWP:master Feb 2, 2024
4 of 8 checks passed
@robertbartel robertbartel added the maas MaaS Workstream label Feb 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maas MaaS Workstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants