Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refscan reports validation error when has_calibration value is id of something other than a CalibrationInformation #26

Open
aclum opened this issue Oct 10, 2024 · 9 comments
Assignees
Labels
upstream-issue Cause of issue is in an upstream dependency

Comments

@aclum
Copy link

aclum commented Oct 10, 2024

this slot uses an any_of range where a string is allowed so nmdc:wfnom-11-0mzdja70.1 for example should not show up as a violoation.

@aclum
Copy link
Author

aclum commented Oct 10, 2024

This should be valid
image


Edit: @eecavanna added this "light" copy of the same screenshot (the "light" copy was achieved by switching the GitHub UI to the "daytime" theme). I suspect the original screenshot has a transparent background.
image

@eecavanna
Copy link
Collaborator

Here's how I think that slot is defined in nmdc-schema version v11.0.1:

  has_calibration:
    name: has_calibration
    description: a calibration instance associated with a process
    notes:
    - '...'
    from_schema: https://w3id.org/nmdc/nmdc
    any_of:
    - range: CalibrationInformation
    - range: string

Source: nmdc_materialized_patterns.yaml. I redacted the notes value to make the snippet more concise.

Refscan ignores ranges that aren't classes.

@eecavanna eecavanna changed the title bug with validation of has_calibration Refscan reports validation error when has_calibration value is id of something other than a CalibrationInformation Oct 10, 2024
@eecavanna eecavanna added the upstream-issue Cause of issue is in an upstream dependency label Oct 10, 2024
@aclum
Copy link
Author

aclum commented Oct 10, 2024

what you have for the schema for this has_calibration is correct.
What we talked about in the infra meeting was having records which can't be validated output to a separate file that those which are incorrect.

@eecavanna
Copy link
Collaborator

eecavanna commented Oct 26, 2024

Maybe related (🤷): microbiomedata/nmdc-schema#2139

I'm still seeing this as a modeling issue. I think the schema authors want people to be able to use this slot to both (a) refer to a CalibrationInformation instance and (b) to store arbitrary strings. I don't understand why they would use one slot for both of those things, as opposed to using two slots: has_calibration and some_strings. I suspect the current modeling was not done the way it was, on purpose.

@aclum
Copy link
Author

aclum commented Oct 28, 2024

This is on a path to resolution but the information needs to be moved off of WorkflowExeuction subclasses to CalibrationInformation. The linked ticket you mention is what will address this.

Short answer is that refscan should not ignore ranges that aren't classes. A few weeks ago we discussed separating out records which can't be validated (b/c of issues like this) vs values which do not have referential integrity.

@eecavanna
Copy link
Collaborator

eecavanna commented Oct 28, 2024

Short answer is that refscan should not ignore ranges that aren't classes.

I don't see how that can be generalized without refscan trying to interpret every String-ranged field of every document as a potential reference. In other words, I don't know how refscan can determine that https://microbiomedata.github.io/nmdc-schema/has_calibration/ (whose effective range is String | CalibrationInformation) was designed to contain a reference (to anything other than a CalibrationInformation instance), but — for example — https://microbiomedata.github.io/nmdc-schema/ecosystem/ (whose effective range is String) was not.

@eecavanna
Copy link
Collaborator

eecavanna commented Oct 28, 2024

A few weeks ago we discussed separating out records which can't be validated (b/c of issues like this) vs values which do not have referential integrity.

I could add some kind of --skip-references (or --ignore-references) CLI option that refscan users could use to tell refscan not to bother checking the referential integrity of specific references. I don't know exactly what the syntax would be (I think it would include the "source document ID" and the "field name"). That way, the user is consciously opting out of some checking, rather than relying on refscan to try to infer things about the modelers' intentions.

@aclum
Copy link
Author

aclum commented Oct 31, 2024

I don't see how that can be generalized without refscan trying to interpret every String-ranged field of every document as a potential reference. In other words, I don't know how refscan can determine that https://microbiomedata.github.io/nmdc-schema/has_calibration/ (whose effective range is String | CalibrationInformation) was designed to contain a reference (to anything other than a CalibrationInformation instance), but — for example — https://microbiomedata.github.io/nmdc-schema/ecosystem/ (whose effective range is String) was not.

My thought is you would do the inverse, any slot where String is part of the allowable range, either as the range or with an any_of range, you would report that you can't validate those records.

FWIW Brynn and Katherine are fixing this shortly so we could also just leave this for now as it is not a use case we expect to see going forward.

@eecavanna
Copy link
Collaborator

eecavanna commented Oct 31, 2024

Ah, gotcha!

I found the PR where they are fixing this. It's microbiomedata/nmdc-schema#2235

That PR's description includes this statement:

It also fixes the schema so the range of has_calibration is only CalibrationInformation (removes any_of and string range)

I'll leave refscan as is and then re-run the scan (with the new schema version) once the schema has undergone the changes being introduced in that PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upstream-issue Cause of issue is in an upstream dependency
Development

No branches or pull requests

2 participants