Intermicrobial interaction modelling decisions #1

cpauvert · 2023-06-15T13:52:36Z

This is a rolling issue collecting the pros and cons of modelling decisions, especially how to model relationships (see docs), as well as the assumptions made during the development of the MIIID metadata schema.

0. Model all properties from the Perspective paper as strings

Pros:

Easy to model
Can accommodate n-ary relationships
Easy-ish as a TSV

Cons:

No mapping of items/enums to WikiData or ontologies
No constraints or validation (except with convoluted regexes)
Not FAIR, in the sense that the schema does not know what each element is
Burden on the user for formatting input

Status: not considered

1. microbial `Participant` as a separate class and participants is a `slot` accepting multiple `Participant`

Pros:

Name and taxonomic identifiers (as well as future properties) belong to the same object which reduces errors during data input.
Makes more sense in a modelling context for mapping for instance (https://linkml.io/linkml/schemas/uris-and-mappings.html#mappings)
Can accommodate n-ary relationships

Cons:

Conversion from YAML -> TSV -> YAML does not preserve data integrity (Inlining does not preserve data integrity #2)
Will prove difficult for integration with DataHarmonizer
Inlining this nested structure in a TSV will make it hard for people to contribute
Inlining also prevents conversion to other formats that are not tree-like

Status: Tried as a first approach. Superseded by (3)
Commit: 9a2016a
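
The round-trip concern listed in the cons can be illustrated with a minimal Python sketch. It does not use LinkML tooling and is not taken from issue #2: the record layout, the `|` separator and the helper names `flatten` and `unflatten` are assumptions made for illustration. It shows one hypothetical way that naively flattening nested participants into TSV-like cells can lose information, here when a name happens to contain the separator.

```python
from dataclasses import dataclass

SEP = "|"  # hypothetical in-cell separator for multivalued fields


@dataclass(frozen=True)
class Participant:
    name: str
    tax_id: int


def flatten(participants):
    """Flatten nested participants into two TSV-like cells."""
    return {
        "participants_name": SEP.join(p.name for p in participants),
        "participants_tax_id": SEP.join(str(p.tax_id) for p in participants),
    }


def unflatten(row):
    """Rebuild participants from the two cells (naive inverse of flatten)."""
    names = row["participants_name"].split(SEP)
    tax_ids = row["participants_tax_id"].split(SEP)
    return [Participant(n, int(t)) for n, t in zip(names, tax_ids)]


clean = [Participant("Escherichia coli", 562), Participant("Bacillus subtilis", 1423)]
tricky = [Participant("strain A|B", 562), Participant("Bacillus subtilis", 1423)]

# The clean record survives the round trip; the tricky one does not.
clean_ok = unflatten(flatten(clean)) == clean
tricky_ok = unflatten(flatten(tricky)) == tricky
print(clean_ok, tricky_ok)
```

Escaping or quoting the separator would address this particular case, but every consumer of the TSV would then have to agree on the same convention.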

2. Model interaction using the biolink

Pros:

Reusing existing schema and being part of larger association schema
So all pros of (1)

Cons:

"Only" Subject-Predicate-Object, meaning no n-ary relationships can be modelled (Document n-ary relationship design patterns biolink/biolink-model#566)

Status: not considered yet because of complexity

3. participants is a `slot` accepting multivalued names, tax_id

Pros:

Easier to model
Each value of a multivalued slot has a homogeneous range
Can accommodate n-ary relationships

Cons:

Name and taxonomic identifiers (as well as future properties) DO NOT belong to the same object which ~~reduces~~ increases errors during data input.

Status: ~~considered~~ implemented
Commit: 10fadb6
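
Because names and tax_id live in two parallel multivalued slots, nothing in the structure itself ties the i-th name to the i-th identifier. Below is a minimal sketch of a consistency check one might run on input. The dictionary layout, the keys `names` and `tax_id` as used here, and the function name `pair_participants` are hypothetical and not part of the MIIID schema.

```python
def pair_participants(record):
    """Pair names with tax_ids by position, refusing mismatched lengths."""
    names = record.get("names", [])
    tax_ids = record.get("tax_id", [])
    if len(names) != len(tax_ids):
        raise ValueError(
            f"{len(names)} names but {len(tax_ids)} tax_ids: cannot pair them"
        )
    return list(zip(names, tax_ids))


good = {"names": ["Escherichia coli", "Bacillus subtilis"], "tax_id": [562, 1423]}
bad = {"names": ["Escherichia coli", "Bacillus subtilis"], "tax_id": [562]}

pairs = pair_participants(good)

try:
    pair_participants(bad)
    bad_rejected = False
except ValueError:
    bad_rejected = True
```

A length check only catches omissions. Two values entered in the wrong order would still pass, which is the input-error risk listed in the cons.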


cpauvert · 2023-06-15T15:23:45Z

Regarding tax_id.
It is defined as an integer at the moment but corresponds to the NCBI tax_id. Should I use https://biolink.github.io/biolink-model/docs/OrganismTaxon.html or the ontology NCBITaxon? Could be related to linkml/linkml#1112

Tried it first as an ontology, following https://linkml.io/linkml/schemas/enums.html#dynamic-enums. See e914dd4 in branch taxid-as-ontology. I am not sure how to implement a query to the ontology, and the tooling has yet to come (blog).

Tried to encode tax_id as a type, especially because there is also a Wikidata property that I could map to. See 068c60f in branch taxid-as-type. But I am still unsure whether the user should (A) input an integer that the model knows is an NCBI TaxID or (B) prefix the id with the correct namespace, e.g. NCBITax:2. I'm puzzled, especially because the conversion to ttl then does not expand the NCBI prefix, for instance.
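
The (A)/(B) question can be made concrete with a small sketch that accepts either form and normalises it. This is plain Python rather than LinkML and does not reproduce how LinkML expands prefixes when generating ttl. The prefix name `NCBITaxon`, its OBO-style PURL base and the function name `normalise_tax_id` are assumptions chosen for illustration.

```python
PREFIXES = {
    # Assumed prefix map; OBO-style PURL base for NCBITaxon.
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def normalise_tax_id(value, default_prefix="NCBITaxon"):
    """Return (curie, uri) for a bare integer (A) or a prefixed id (B)."""
    text = str(value).strip()
    if ":" in text:
        prefix, local = text.split(":", 1)
    else:
        # Option (A): a bare integer is assumed to be an NCBI taxonomy id.
        prefix, local = default_prefix, text
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    if not local.isdigit():
        raise ValueError(f"taxonomy id must be numeric: {local!r}")
    return f"{prefix}:{local}", PREFIXES[prefix] + local


from_int = normalise_tax_id(2)
from_curie = normalise_tax_id("NCBITaxon:2")
```

Both inputs end up as the same CURIE and URI, so under this sketch the choice between (A) and (B) is mostly about how much the user has to type and how much the schema has to assume.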

cpauvert · 2023-07-25T16:27:04Z

Further modelling and technical questions (there was no time to present them at the NMDC x NFDI4Microbiota meeting of 2023-07-25):

@cmungall as you were interested to have a closer look! Suggestions/Feedback would be much appreciated!

A. Modelling

A1 How to encode drop-out experiments with a focal strain?

Drop-out experiments are hard to model with the current model structure.
Example: one strain affects many different strains at the same time (Carlström et al., Nat Ecol Evol 2019)
Could this be modelled as a hypergraph? See the LinkML Community discussion of 2023-07-20 for the possibility of hypergraphs in LinkML
Could take inspiration from previous attempts to model gene knock-outs in gene network experiments.
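
As a rough illustration of the hypergraph idea, a drop-out experiment can be written as a single hyperedge linking one focal strain to the set of strains it affects, rather than as many pairwise records. The class and field names below are hypothetical, and the strain names are placeholders rather than data from the cited study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DropOutEdge:
    """One hyperedge: removing `focal` changes all strains in `affected`."""
    focal: str
    affected: frozenset
    community: frozenset  # full community the focal strain was dropped from


def to_pairwise(edge):
    """Expand the hyperedge into (focal, other) pairs, losing the grouping."""
    return sorted((edge.focal, other) for other in edge.affected)


edge = DropOutEdge(
    focal="strain_1",
    affected=frozenset({"strain_2", "strain_3", "strain_4"}),
    community=frozenset(
        {"strain_1", "strain_2", "strain_3", "strain_4", "strain_5"}
    ),
)
pairs = to_pairwise(edge)
```

The pairwise expansion is what a Subject-Predicate-Object model would store. It keeps each effect but drops the fact that they were observed together in one experiment.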

A2 What is the cardinality of multi-method paper?

Take a study with a large association network of 50 strains and 200 associations, with co-culture growth outcomes for a handful of them (5 species). How should this dataset be described? Should we have:
- one description – one paper. As inclusive as possible, with the 50 strains and some with additional cultivation data
- one description – one method. Split, with one description for the association network and one description for the co-culture.
- one description – one subset of the network, as enriched as possible
No approach is perfect.

B. Technical questions

B1 How to handle missing data (INSDC)? See #4

B2 How to constrain values to an ontology (via the dynamic enums)?

Lack of tooling for dynamic enums at the moment (doc)
Solved temporarily with a regex that constrains the ontology term + identifier, because of the current lack of dynamic enum tooling (#8)

B3 How to describe a slot with the human-readable ontology term and the machine-readable ontology number?

like for ENVO during submission (maybe you have had experience with this @turbomam ?)
Basically, the aim is to get the best of both worlds: human-readable and machine-actionable.
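
One convention that gives both is a single string of the form `label [PREFIX:identifier]`, which a regex can check and split. The sketch below is an assumption about what such a pattern could look like, not the regex used in the schema or in #8. The pattern, the function name `parse_term` and the example value are illustrative only.

```python
import re

# Hypothetical pattern: free-text label, then a CURIE in square brackets.
TERM_PATTERN = re.compile(
    r"^(?P<label>[^\[\]]+?)\s*\[(?P<prefix>[A-Za-z]+):(?P<local_id>\d+)\]$"
)


def parse_term(value):
    """Split 'label [PREFIX:id]' into (label, curie), or None if malformed."""
    match = TERM_PATTERN.match(value.strip())
    if match is None:
        return None
    curie = f"{match.group('prefix')}:{match.group('local_id')}"
    return match.group("label"), curie


parsed = parse_term("terrestrial biome [ENVO:00000446]")
rejected = parse_term("terrestrial biome ENVO:00000446")
```

A regex like this only checks the shape of the value. It cannot confirm that the label and the identifier refer to the same ontology term, which is what dynamic enums would be needed for.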
