Intermicrobial interaction modelling decisions #1

Open
cpauvert opened this issue Jun 15, 2023 · 2 comments

cpauvert commented Jun 15, 2023

This is a rolling issue listing the pros and cons of modelling decisions, especially how to model relationships (see docs), as well as the assumptions made during the development of the MIIID metadata schema.

0. Model all properties from the Perspective paper as strings

Pros:

  • Easy to model
  • Can accommodate n-ary relationships
  • Easy-ish as a TSV

Cons:

  • No mapping of items/enums to Wikidata or ontologies
  • No constraints or validation (except with convoluted regexes)
  • Not FAIR, in the sense that the schema does not know what each element is
  • Burden on the user to format the input

Status: not considered

1. Microbial Participant is a separate class, and participants is a slot accepting multiple Participant objects

Pros:

Cons:

  • Conversion from YAML -> TSV -> YAML does not preserve data integrity (see #2, "Inlining does not preserve data integrity")
  • Will prove difficult for integration with DataHarmonizer
  • Inlining this nested structure in a TSV will make it hard for people to contribute
  • Inlining also prevents conversion to other formats that are not tree-like

Status: Tried as a first approach. Superseded by (3)
Commit: 9a2016a
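
To make the data-integrity concern concrete, here is a minimal Python sketch. It does not reproduce how the LinkML converters inline nested objects: the `flatten` and `unflatten` helpers, the `|` separator, and the example values are all assumptions made for illustration. It only shows how a list of participant objects, once spread over per-field cells, may not be rebuilt correctly when one participant lacks a value.

```python
# Hypothetical helpers: NOT the LinkML converters, only an illustration.
SEP = "|"

def flatten(interaction):
    """Spread a list of participant objects into one TSV-like cell per field."""
    parts = interaction["participants"]
    return {
        "names": SEP.join(p["name"] for p in parts),
        # A missing tax_id is simply skipped, as a naive exporter might do.
        "tax_ids": SEP.join(str(p["tax_id"]) for p in parts if "tax_id" in p),
    }

def unflatten(row):
    """Rebuild participant objects by pairing cells position by position."""
    names = row["names"].split(SEP)
    tax_ids = row["tax_ids"].split(SEP) if row["tax_ids"] else []
    parts = []
    for i, name in enumerate(names):
        part = {"name": name}
        if i < len(tax_ids):
            part["tax_id"] = int(tax_ids[i])
        parts.append(part)
    return {"participants": parts}

original = {
    "participants": [
        {"name": "strain A"},                 # taxon unknown
        {"name": "strain B", "tax_id": 562},  # illustrative values
    ]
}
round_tripped = unflatten(flatten(original))
print(round_tripped == original)  # the tax_id ends up on the wrong participant
```

In this sketch the identifier of "strain B" is silently reassigned to "strain A" after the round trip, which is the kind of loss a flat format makes hard to detect.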

2. Model interactions using the Biolink Model

Pros:

  • Reuses an existing schema and becomes part of a larger association schema
  • So all pros of (1)

Cons:

Status: not considered yet because of complexity

3. participants is described by multivalued slots (names, tax_id)

Pros:

  • Easier to model
  • Each value of a multivalued slot has a homogeneous range
  • Can accommodate n-ary relationships

Cons:

  • Names and taxonomic identifiers (as well as future properties) DO NOT belong to the same object, which increases the risk of errors during data input.

Status: considered and implemented
Commit: 10fadb6
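
The con of option 3 can be sketched in Python as follows. The function name `pair_participants` and the numeric identifiers are placeholders invented for this example, not part of the MIIID schema. The idea is that names and identifiers live in parallel lists, so a length check can catch a missing value but nothing can catch two values entered in the wrong order.

```python
def pair_participants(names, tax_ids):
    """Zip two parallel multivalued slots back into (name, tax_id) pairs."""
    if len(names) != len(tax_ids):
        raise ValueError(f"{len(names)} names but {len(tax_ids)} tax_ids")
    return list(zip(names, tax_ids))

# Placeholder identifiers (101, 202), not real NCBI taxonomy IDs.
good = pair_participants(["strain A", "strain B"], [101, 202])

# Same values entered in the wrong order: accepted without complaint.
swapped = pair_participants(["strain A", "strain B"], [202, 101])

# A missing identifier is the only error that can be detected.
try:
    pair_participants(["strain A", "strain B"], [101])
    length_error = None
except ValueError as err:
    length_error = str(err)

print(good)
print(swapped)
print(length_error)
```

With a separate Participant object (option 1), the name and identifier would travel together, which is the trade-off this option gives up for easier modelling.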


cpauvert commented Jun 15, 2023

Regarding tax_id: it is currently defined as an integer but corresponds to an NCBI taxonomy identifier. Should I use https://biolink.github.io/biolink-model/docs/OrganismTaxon.html or the NCBITaxon ontology? This could be related to linkml/linkml#1112

I first tried modelling it as an ontology, following https://linkml.io/linkml/schemas/enums.html#dynamic-enums (see e914dd4 in branch taxid-as-ontology), but I am not sure how to implement a query to the ontology, and the tooling has yet to come (blog).

I then tried to encode tax_id as a type, especially because there is also a Wikidata property that I could map to (see 068c60f in branch taxid-as-type). I am still unsure whether the user should (A) input an integer that the model knows to be an NCBI TaxID or (B) prefix the id with the correct namespace, e.g. NCBITax:2. I'm puzzled, especially because the conversion to ttl then does not expand the NCBI prefix.
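
As a way to compare options (A) and (B), here is a small Python sketch that accepts either form and normalises it to a CURIE before expanding it to a full IRI. This is not how LinkML handles types or prefixes: `to_curie`, `expand`, and `PREFIX_MAP` are hypothetical names, and the sketch assumes the OBO prefix `NCBITaxon` rather than the `NCBITax` spelling used above.

```python
# Assumed prefix map; the OBO PURL base for the NCBITaxon ontology.
PREFIX_MAP = {"NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_"}

def to_curie(value, default_prefix="NCBITaxon"):
    """Accept a bare integer (option A) or a prefixed id (option B)."""
    if isinstance(value, bool):  # bool is a subclass of int, reject it
        raise TypeError("tax_id must be an integer or a CURIE string")
    if isinstance(value, int):
        if value <= 0:
            raise ValueError("tax_id must be a positive integer")
        return f"{default_prefix}:{value}"
    prefix, sep, local = str(value).partition(":")
    if not sep or prefix not in PREFIX_MAP or not local.isdigit():
        raise ValueError(f"not a recognised taxon CURIE: {value!r}")
    return f"{prefix}:{int(local)}"

def expand(curie):
    """Expand a CURIE to a full IRI using the prefix map."""
    prefix, _, local = curie.partition(":")
    return PREFIX_MAP[prefix] + local

print(to_curie(2))
print(to_curie("NCBITaxon:2"))
print(expand(to_curie(2)))
```

Under these assumptions both input styles end up as the same identifier, so the choice between (A) and (B) becomes a question of user convenience rather than of what the data can express.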

@cpauvert

Further modelling and technical questions (there was no time to present them at the NMDC x NFDI4Microbiota meeting of 2023-07-25):

@cmungall, as you were interested in having a closer look: suggestions and feedback would be much appreciated!

A. Modelling

A1 How to encode drop-out experiments with a focal strain?

  • Hard to model drop-out experiments with current model structure.
  • Example: one strain affects many different strains at the same time (Carlström et al., Nat Ecol Evol 2019)
  • Could be modelled as a hypergraph? See the LinkML Community discussion of 2023-07-20 for the possibility of hypergraphs in LinkML
  • Could take inspiration from previous attempts to model gene knock-outs in gene network experiments.
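
The hypergraph idea can be sketched in Python as follows. The `DropOutEffect` class and the `to_pairwise` helper are hypothetical and not part of the current schema. A single record links one focal strain to all strains it affects in the same experiment, and the helper shows what a pairwise encoding would keep.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DropOutEffect:
    """One hyperedge: a focal strain and every strain it affects at once."""
    focal: str            # strain removed from the community
    affected: frozenset   # strains whose outcome changes as a result

def to_pairwise(effect):
    """Decompose the hyperedge into independent (focal, other) pairs."""
    return sorted((effect.focal, other) for other in effect.affected)

# Illustrative strain names only.
effect = DropOutEffect(
    focal="strain A",
    affected=frozenset({"strain B", "strain C", "strain D"}),
)
pairs = to_pairwise(effect)
print(len(pairs))
print(pairs)
```

The pairwise list keeps who affects whom but no longer states that the three effects were observed together in one drop-out experiment, which is the information a hyperedge would preserve.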

A2 What is the cardinality of a multi-method paper?

  • Take a study with a large association network of 50 strains and 200 associations, with co-culture growth outcomes for a handful of them (5 species). How to describe this dataset? Should we have:
    • one description – one paper. As inclusive as possible, with the 50 strains and some with additional cultivation data
    • one description – one method. Split, with one description for the association network and one for the co-culture.
    • one description – one subset of the network as enriched as possible
  • No approach is perfect

B. Technical questions

B1 How to handle missing data (INSDC)? See #4

B2 How to constrain values to an ontology (via dynamic enums)?

B3 How to describe a slot with the human-readable ontology term and the machine-readable ontology number?

  • As for ENVO terms during submission (maybe you have had experience with this, @turbomam?)
  • Basically, the aim is to get the best of both worlds: human-readable and machine-actionable.
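
One possible convention, assumed here for illustration, is to store both parts in a single value of the form `label [PREFIX:ID]` and to split it when needed. The Python sketch below uses a hypothetical `parse_term` helper and a placeholder term that is not meant to be a real ENVO entry.

```python
import re

# Assumed pattern: free-text label, then a CURIE in square brackets.
TERM_PATTERN = re.compile(r"^(?P<label>.+?)\s*\[(?P<curie>[A-Za-z]+:\d+)\]$")

def parse_term(value):
    """Split 'label [PREFIX:ID]' into (label, curie)."""
    match = TERM_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"expected 'label [PREFIX:ID]', got {value!r}")
    return match.group("label"), match.group("curie")

# Placeholder label and identifier, for illustration only.
label, curie = parse_term("example biome [ENVO:00000000]")
print(label)
print(curie)
```

The label stays readable for submitters while the identifier can be checked against the ontology, although nothing in this sketch verifies that the label actually matches the identifier.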
