add extra_slots metamodel slot #205

Merged 7 commits on Oct 21, 2024

Conversation

@sneakers-the-rat
Contributor

sneakers-the-rat commented Sep 26, 2024

Related to:

Many modeling frameworks allow one to specify how to handle extra values provided to an instance of a class. By default, linkml forbids all additional data in the generated frameworks where that can be configured. In addition to declaring whether or not extra data is allowed, it would also be nice to constrain what type that extra data can be. There are no other existing linkml metamodel items that could be used for this that i am aware of, but please let me know if we can repurpose something existing here.

Options

Option 1: Single slot

(current contents of PR, simplified)

  extra_slots:
    ifabsent: false
    any_of:
      - range: boolean
      - range: anonymous_slot_expression

Option 2: Class slot

If we want to avoid doing any_of in the metamodel, we could also do something like this:

slots:
  extra_slots:
    range: ExtraSlotsExpression
  allowed:
    range: boolean

classes:
  ExtraSlotsExpression:
    mixins:
      - expression
    slots:
      - allowed
      - slot_expression

I'm not sure if there is a general "allowed" slot, ctrl+f isn't finding one, but seems better than making a single-purpose "extra_slots_allowed" slot. Maybe slot_expression should be anonymous_slot_expression here but that seems like a lot to type when defining a schema lol, idk if that is incorrect semantically.

this creates one awkward case that is syntactically possible but semantically impossible (the last row):

| allowed | slot_expression | valid |
|---------|-----------------|-------|
| True    | null            | True  |
| False   | null            | True  |
| True    | present         | True  |
| False   | present         | False |

This option also means we can't set ifabsent: false on allowed, because a slot_expression should be able to be specified without explicitly setting allowed: true, imo. So the "default allowed: false" behavior becomes: extra slots are not allowed if allowed: false, or if allowed: null && slot_expression: null.
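A minimal sketch of the rule described above, assuming the Option 2 shape (an `allowed` boolean plus an optional `slot_expression`). The function names are hypothetical and are not part of linkml; `None` stands in for a null/absent value.

```python
from typing import Optional


def is_valid_declaration(allowed: Optional[bool], slot_expression: Optional[dict]) -> bool:
    """Only `allowed: false` combined with a slot_expression is contradictory."""
    return not (allowed is False and slot_expression is not None)


def extras_permitted(allowed: Optional[bool], slot_expression: Optional[dict]) -> bool:
    """Extra slots are rejected when allowed is false, or when both fields are unset."""
    if not is_valid_declaration(allowed, slot_expression):
        raise ValueError("allowed: false cannot be combined with a slot_expression")
    if allowed is False:
        return False
    if allowed is None and slot_expression is None:
        return False
    # Either allowed is true, or a slot_expression was given on its own.
    return True
```

With this reading, `extras_permitted(None, {"range": "string"})` is true without writing `allowed: true`, which is the reason `allowed` cannot simply default to false.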

Examples

the examples in the PR and the prior issue give examples of expected use, but for the sake of recordkeeping:

Allow all extra slots:

Single Slot:

MyClass:
  extra_slots: true

Class Slot:

MyClass:
  extra_slots:
    allowed: true

JSON Schema

assuming "$schema": "https://json-schema.org/draft/2020-12/schema" for all these, and adding a dummy "foo" slot because it's just empty otherwise

{
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": true
}

Pydantic

from pydantic import BaseModel, ConfigDict

class MyModel(BaseModel):
    model_config = ConfigDict(extra='allow')

Allow no extra slots (default)

Single Slot:

MyClass:
  extra_slots: false

or undefined

MyClass:
  # nothing

Class Slot:

MyClass:
  extra_slots:
    allowed: false

or undefined

MyClass:
  # nothing

JSON Schema

{
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": false
}

Pydantic

(in reality we wouldn't set this, because it's both pydantic's default and also the default in the ConfiguredBaseClass, but for illustration...)

from pydantic import BaseModel, ConfigDict

class MyModel(BaseModel):
    model_config = ConfigDict(extra='forbid')

Constrain extra slots by slot expression

Simple Types

Only allow additional strings

Single Slot:

MyClassA:
  extra_slots:
    range: string

Class Slot:

MyClassA:
  extra_slots:
    # not required, but equivalent
    # allowed: true
    slot_expression:
      range: string

JSON Schema

additionalProperties also accepts a schema object...

{
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": { "type": "string" }
}

Pydantic

from pydantic import BaseModel, ConfigDict, Field

class MyModel(BaseModel):
    __pydantic_extra__: dict[str, str] = Field(init=False)
    model_config = ConfigDict(extra='allow')

Unions

Single Slot:

MyClass:
  extra_slots:
    any_of:
      - range: string
      - range: integer

Class Slot:

MyClass:
  extra_slots:
    slot_expression:
      any_of:
        - range: string
        - range: integer

JSON Schema

{
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": { "anyOf": [ {"type": "string"}, {"type": "integer"} ] }
}

Pydantic

from pydantic import BaseModel, ConfigDict, Field

class MyModel(BaseModel):
    __pydantic_extra__: dict[str, str | int] = Field(init=False)
    model_config = ConfigDict(extra='allow')

Class Ranges

Allow extra slots if they are instances of the class SecondClass

Single Slot:

MyClass:
  extra_slots:
    range: SecondClass

Class Slot:

MyClass:
  extra_slots:
    slot_expression:
      range: SecondClass

JSON Schema

{
  "$defs": {
    "SecondClass": {
      "additionalProperties": false,
      "properties":
      {
        "bar": { "type": "integer" }
      }
    }
  },
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": {
    "$ref": "#/$defs/SecondClass"
  }
}

Pydantic

from pydantic import BaseModel, ConfigDict, Field

class SecondClass(BaseModel):
    bar: int

class MyModel(BaseModel):
    __pydantic_extra__: dict[str, SecondClass] = Field(init=False)
    model_config = ConfigDict(extra='allow')
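The JSON Schema examples above all follow one pattern, which can be sketched as a small translation function. This is an illustration for the single-slot form only, not the linkml JSON Schema generator: `TYPE_MAP`, `range_to_schema`, and `additional_properties` are hypothetical names, the type mapping is an assumption, and slots such as `multivalued` are ignored.

```python
from typing import Union

# Assumed mapping from LinkML type names to JSON Schema types; anything
# not listed here is treated as a class name and emitted as a $ref.
TYPE_MAP = {"string": "string", "integer": "integer", "boolean": "boolean", "float": "number"}


def range_to_schema(expr: dict) -> dict:
    """Translate a (simplified) slot expression into a JSON Schema fragment."""
    if "any_of" in expr:
        return {"anyOf": [range_to_schema(e) for e in expr["any_of"]]}
    rng = expr["range"]
    if rng in TYPE_MAP:
        return {"type": TYPE_MAP[rng]}
    return {"$ref": f"#/$defs/{rng}"}


def additional_properties(extra_slots: Union[bool, dict, None]) -> Union[bool, dict]:
    """Compute the additionalProperties value for the single-slot form."""
    if extra_slots is None or extra_slots is False:
        return False  # default: no extra slots
    if extra_slots is True:
        return True  # allow anything
    return range_to_schema(extra_slots)
```

For example, `additional_properties({"range": "SecondClass"})` yields the `$ref` used in the class range example, and an undefined `extra_slots` yields `false`.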

Discussion

Ambiguity

The two major points of ambiguity that i can see:

  • interpreting this as defining constraints on extra slots defined in child classes (seems relatively low likelihood, but possible). Added a note in the description for this property to head that off
  • interpreting some properties of an anonymous slot expression as being about the requiredness or cardinality of extra slots (seems more likely). eg. one might think they could set maximum_cardinality to limit the number of extra slots that could be provided, or make providing extra slots required. Added examples to clarify these behaviors - I am not sure if there is a circumstance where we would want to support that, but if we did then we could turn this into a class so someone could specify something like this, which would be backwards compatible with any anonymous_slot_expression definitions that happen in the meantime:
extra_slots:
  extra_params:
    maximum_cardinality: 5
  # the rest of the anonymous slot expression..
  range: string
  multivalued: true
  maximum_cardinality: 3

to mean "there can be at most 5 extra slots that are lists of strings at most length 3"
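To make the two levels concrete, here is a rough sketch of how a validator might read the hypothetical `extra_params` example: one limit applies to the set of extra slots as a whole, the other to each extra slot's value. `check_extras` and its parameters are made up for illustration, and the string/list checks are hard-coded to match the example.

```python
def check_extras(instance: dict, declared: set, max_extra_slots: int, max_items: int) -> list:
    """Return a list of problems found among the undeclared keys of `instance`."""
    extras = {k: v for k, v in instance.items() if k not in declared}
    problems = []
    # Level 1: a limit on how many extra slots exist at all (extra_params).
    if len(extras) > max_extra_slots:
        problems.append(f"{len(extras)} extra slots, at most {max_extra_slots} allowed")
    # Level 2: the slot expression, applied to each extra slot separately.
    for name, value in extras.items():
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{name}: expected a list of strings")
        elif len(value) > max_items:
            problems.append(f"{name}: {len(value)} items, at most {max_items} allowed")
    return problems
```

Called as `check_extras(data, {"foo"}, 5, 3)`, this corresponds to "at most 5 extra slots that are lists of strings at most length 3".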

Naming

"Extra slots" might be a bit too specific, are there other cases where we would want to allow/deny extra things in a domain? @sierra-moxon brings up the name "closed" which has currency in closed-world/open-world parlance of RDF and formal logic circles, but is less obvious to your average data modeler.


This is high priority for me since it's blocking final implementation of nwb-linkml, so if we can make relatively short work of this then i'll implement it for pydanticgen, pythongen, and json schema gen as thanks for the quickness.

@sneakers-the-rat
Contributor Author

i am told that there is prior art here? https://mapping-commons.github.io/sssom/spec-model/

@turbomam
Contributor

This makes me nervous but I agree 100% that many users want it, and if we are going to allow it, this seems like a step in the right direction.

I never would have thought of specifying the maximum cardinality of extra slots, but I suppose it wouldn't hurt. I would be interested in seeing some examples of the limit solving a problem.

I'm also interested to hear how SSSOM provides a solution for this that LinkML could follow.

@sneakers-the-rat do you think your implementation could limit extra slots to scalar key/value pairs? Maybe constraining the range to string would do that, in the sense that the extra slot's value could never be an instance of some class.

@sierra-moxon
Member

sierra-moxon commented Sep 26, 2024

Are 5 and 3 swapped in your example maximum_cardinality slots?

extra_slots:
  extra_params:
    maximum_cardinality: 5
  # the rest of the anonymous slot expression..
  range: string
  multivalued: true
  maximum_cardinality: 3

This says to me: "a total of 3 extra slots, with 5 parameters each"?

@sierra-moxon
Member

Also - how does it play with the closed metamodel component?

@sneakers-the-rat
Contributor Author

This makes me nervous but I agree 100% that many users want it

I get it. I often find myself of several minds working on linkml, and there's this recurring 3-way tension between how we think schemas should be modeled, what is possible to express, and what is feasible to implement. I think allowing arbitrary extra items probably scores low on the "should" scale (why not add those things to the model?), though that's not always true for eg. property-centric frameworks, but very high on the "expressiveness" scale, and relatively high on the implementability scale for those frameworks where it's possible. In this case i'm modeling a schema/format/standard for which constraints on arbitrary extra fields are a core part of the format. I personally would have probably designed it differently, but it exists and it would be nice to be able to express it.

SSSOM

Don't know it, but yes as always let's integrate with prior art if possible. I am partly making this PR to try and get the ball rolling on this rather than trying to say "this is definitely what we should do and i have considered all options" because the other issues without a PR on the table were languishing.

do you think your implementation could limit extra slots to scalar key/value pairs? Maybe constraining the range to string would do that, in the sense that the extra slot's value could never be an instance of some class.

I think that allowing the slot range is relatively important, and if we don't add it we'll want to do so later. the two examples i was giving in the prior issue were JSON Schema and Pydantic, both of which allow you to set what would be partial/anonymous slot schemas in linkml for extra properties. specifically allowing class ranges in extra is important because you might want to say "any additional properties must be some instance of this class or any of its subclasses" where you can't naturally anticipate what each of those subclasses might be, but you can specify some general abstract form in a parent class. This would also be important to be able to do in pattern-based or structurally typed frameworks.
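As a small illustration of the "instance of this class or any of its subclasses" case, here is a sketch using plain dataclasses rather than generated linkml models. `ThirdClass` and `invalid_extras` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class SecondClass:
    bar: int


@dataclass
class ThirdClass(SecondClass):
    # A subclass the schema author did not need to anticipate.
    baz: str = ""


def invalid_extras(extras: dict, allowed_class: type) -> list:
    """Return the names of extra slots whose values are not instances of allowed_class."""
    return sorted(name for name, value in extras.items() if not isinstance(value, allowed_class))
```

Because `isinstance` accepts subclasses, `ThirdClass` values pass a check against `SecondClass` without being listed anywhere.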

the version in this PR doesn't get us all the way into abstractionland (eg. one can't set a pattern constraint on what additional field names could be, i don't think), but does allow us the expressiveness to generate into those frameworks.

Are 5 and 3 swapped in your example maximum_cardinality slots?

haha yes see this is the exact ambiguity i'm talking about. The AnonymousSlotExpression would be defining the pattern to validate each additional property present in the data, rather than all additional properties.

So by itself,

MyClass:
  extra_slots:
    range: string
    multivalued: true
    maximum_cardinality: 3

would behave like

MyClass:
  attributes:
    extra_1:
      range: string
      multivalued: true
      maximum_cardinality: 3
    extra_2:
      range: string
      multivalued: true
      maximum_cardinality: 3
    # ...

which is why i added this example to clarify that explicitly.
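The "each, not all" reading can be sketched as an expansion step: every undeclared key in the data is treated as if it had been declared as an attribute with a copy of the `extra_slots` expression. `implied_attributes` is a hypothetical helper, not a linkml or schemaview function.

```python
import copy


def implied_attributes(instance: dict, declared: set, extra_slots: dict) -> dict:
    """Build the attribute definitions that the extra keys of `instance` imply."""
    return {
        name: copy.deepcopy(extra_slots)  # each extra slot gets its own copy
        for name in instance
        if name not in declared
    }
```

Under this reading, `maximum_cardinality: 3` limits the length of each extra list and says nothing about how many extra slots there may be.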

The other way would be mixing levels, where it would become ambiguous whether we were talking about all or each - eg if multivalued: false, does that mean we can only have one additional property in the data? if we were to do something like an extra_params to parameterize all extras, then i think we would want to make a separate class with a specific set of metamodel slots to support like *_cardinality, etc. because reusing anonymous_slot_expression would make for too many metamodel slots that would be meaningless/ambiguous in that context.

Also - how does it play with the closed metamodel component?

i'm not finding this, can you link me to the definition?

@sierra-moxon
Member

I got it from the linked issue, linkml/linkml#1404. It looks like it's a flag to a specific generator, but my question is still bugging me a bit: do we have to consider that flag and how it behaves with this change?

domain: class_definition
ifabsent: false
any_of:
- range: boolean
Member

My preference is to avoid any_of in the metamodel (we also had this discussion in Berkeley :-). One reason is that it means the metamodel can't be mapped to a relational model without bespoke transformations (thus breaking eg Ben's Django editing workflow). It could also impede mapping of linkml to other target languages in the future. I know this seems like favoring implementers over users, but in this case I think the user may be better served by a more explicit way to declare unrestrictedness?

Contributor Author

updated OP with another possible example that avoids this :)

@sierra-moxon
Member

from LinkML dev call - @sneakers-the-rat - would you mind coming next week to discuss at our next dev call? (or we can schedule a one-off if this is urgent).

@sneakers-the-rat
Contributor Author

I would think that flag gets deprecated in favor of specifying it in the schema, with ample transition time where it basically behaves like extra_slots=True/False in the meantime? Seems like if we make it a property in a schema we should take out generator-conditional behavior. Maybe we also want to add this as a schema-level property as well if someone wants to set it for all classes at once? I could implement that in schemaview if we want.

@sneakers-the-rat
Contributor Author

Had a few minutes before needing to run, but updated OP with a class slot option that avoids any_of in the metamodel. I'll come back and add json schema and pydantic examples later

@cmungall
Member

I favor the 2nd option. It also makes it easier for frameworks to declare their conformance profile (e.g. json-schema can declare it supports extra_slots.allowed but not extra_slots.slot_expression).
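One way such a conformance profile could be checked, sketched under the assumption that a profile is just a set of dotted slot paths a generator claims to support. `unsupported_features` and the profile format are hypothetical; linkml does not define them here.

```python
def unsupported_features(extra_slots: dict, supported: set) -> list:
    """List the extra_slots sub-slots a schema uses that a generator does not support."""
    used = {f"extra_slots.{key}" for key, value in extra_slots.items() if value is not None}
    return sorted(used - supported)
```

A generator whose profile is `{"extra_slots.allowed"}` would report `extra_slots.slot_expression` for any class that constrains its extra slots, and nothing for a class that only sets `allowed`.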

If you like we can proceed with extra_slots.allowed in its own PR if it's easier to chunk this up. It should be uncontroversial both in terms of utility-vs-complexity and its semantics (as is evidenced by presence in both json-schema and pydantic). In fact we are scheduled for a new release soon so this part could easily go in.

For constraining the expression part, I would like to make sure we are future proofing for future changes. See my comments here linkml/linkml#2241 (comment) on the common use case of allowing additional slots if their names match certain patterns. I am not sure how obscure a use case this is (or how obscure the ability to semi-constrain additional slots is). It would be good to compile examples outside of NWB/HDMF where this is allowed.

We also need to decide on inheritance semantics. What if I want to declare at a base class level that any extra slots should be integers, but I will leave it to subclasses to decide whether this is switched on? I think it's easier to permit this kind of behavior than forbid it. The semantics here should be monotonic, i.e. you can progressively constrain, not relax.

@sneakers-the-rat
Contributor Author

the common use case of allowing additional slots if their names match certain patterns

This is related to the "level shifting" problem discussed above re: whether max_cardinality refers to each of the additional slots implied by the data or all of the slots, i.e. whether it means "each extra slot can have max cardinality 5" or "there can be maximum 5 additional slots." Splitting up the extra expression allows this subtlety to be resolved naturally, by adding an additional slot that conditions things like what the extra slots can be named, their cardinality, etc., separate from declaring conditions for the provided slots themselves.

What if I want to declare at a base class level that any extra slots should be integers, but I will leave it to subclasses to decide whether this is switched on? I think it's easier to permit this kind of behavior than forbid it. The semantics here should be monotonic, i.e you can progressively constrain not relax.

This is related to the previous discussion in #2241 on the difference between specifying additional slots provided by the data and additional slots defined in inheriting classes. Re: monotonicity, since there isn't a general way this is enforced/declared in the metamodel or schemaview, I would think that it would take on the same conventional status as the rest of linkml: it is currently technically possible to override parent class declarations for all but min/max value last I checked, but the conventional expectation is monotonicity.

@sneakers-the-rat
Contributor Author

Updated OP with examples in JSON schema and pydantic

@cmungall
Member

cmungall commented Oct 1, 2024

Just a general note: we have been discussing this proposal in relation to pydantic and json-schema, but shacl has an analogous mechanism sh:closed.

See https://www.w3.org/TR/shacl/#ClosedConstraintComponent

(I don't know if we also want to have something like shacl's sh:ignoredProperties here too)

@sneakers-the-rat
Contributor Author

Is there a way to express constraints for extra unspecified properties as well? Or would that just be adding a union onto the general property constraint? I see the ignoredProperties constraint, but that looks like it's just limiting to a (list of) URIs for slots/props

@sierra-moxon
Member

@sneakers-the-rat - thank you very much for the additional doc here! Option 2 is terrific - and now I understand it! :D

@sneakers-the-rat
Contributor Author

OK i have updated the impl here to be the class slot. I reused range_expression since it was the only slot i could find with a matching range, and it feels like an appropriate usage, semantically.

How does this look?
