Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible issues with the latest SynGO model batch #55

Closed
kltm opened this issue Jan 17, 2018 · 22 comments
Closed

Possible issues with the latest SynGO model batch #55

kltm opened this issue Jan 17, 2018 · 22 comments

Comments

@kltm
Copy link
Member

kltm commented Jan 17, 2018

From #54
Looking at http://noctua-dev.berkeleybop.org/editor/graph/gomodel:SYNGO_1371 as an example, apparent issues with SynGO models:

  • Contributors are supplied in the form SynGO:SynGO-bagni, rather than the specced ORCID. AFAIK, there is no resolver for this.
  • Edges between individuals contain no date (required), contributor (required), or providedBy (optional, soon required) information.
  • Evidence individuals are shared between edges. This is not part of the current or future model (where publications may be shared, but not the evidence).
  • The model-level providedBy appears to be the string SynGO-VU, rather than a CURIE or URL.
  • The model titles are not human readable, making (re)discovery and use possibly difficult.
  • While not necessarily an issue, I think it may be worthwhile considering the use of special group (providedBy) identifiers for bulk import model data sets so that end users can filter for their inclusion and exclusion. These SynGO models would only be reliable filterable on their current title. I would recommend a specific or additional providedBy.
  • For this model, there is no occurs_in evidence.
  • I have not tracked it down yet, but there is something about the evidence in this model on the part_of that prevents it from rendering correctly. It may be related to one of the issues above.

Tagging @cmungall @dosumis

@kltm
Copy link
Member Author

kltm commented Jan 17, 2018

Apparently some of these were in progress:

#43 (comment)
geneontology/syngo2lego#1

@dosumis
Copy link
Contributor

dosumis commented Jan 18, 2018

HI Seth,

Will try to get these fixed today - at least for cases where this is under my control:

  • Contributors are supplied in the form SynGO:SynGO-bagni, rather than the specced ORCID. AFAIK, there is no resolver for this.

    • SynGO have previously promised to provide ORCIDs. @ftwkoopmans Do you have these now? They don't appear to be in the JSON provided.
  • Edges between individuals contain no date (required), contributor (required), or providedBy (optional, soon required) information.

    • @kltm Just to check: In addition to the date and contributor on the evidence on each edge (object property assertion), you want a dc:date {some date}, dc:contributor { some source} as annotation axioms on the each edge (object property assertion) ? Let me know ASAP as very easy to add and re-run conversion.
  • Evidence individuals are shared between edges. This is not part of the current or future model (where publications may be shared, but not the evidence).

    • All evidence individuals now generated separately for each edge.
  • The model-level providedBy appears to be the string SynGO-VU, rather than a CURIE or URL.
    @ftwkoopmans - Maybe change this to a link to the new SynGO web site? Is this public yet?

  • The model titles are not human readable, making (re)discovery and use possibly difficult.

    • They are designed so that SynGO curators can move easily between their curation tool and Noctua. I guess we could potentially overload the title field with SynGO model number + short title, but I don't think SynGO has the text for this. @ftwkoopmans ?
  • While not necessarily an issue, I think it may be worthwhile considering the use of special group (providedBy) identifiers for bulk import model data sets so that end users can filter for their inclusion and exclusion. These SynGO models would only be reliable filterable on their current title. I would recommend a specific or additional providedBy.

    • Makes sense. Just need to agree on one.
  • For this model, there is no occurs_in evidence.

    • Fixed, but note that it's potentially misleading as there's one set of evidence per model and not all evidence for primary annotation will apply for occurs_in edge.
  • I have not tracked it down yet, but there is something about the evidence in this model on the part_of that prevents it from rendering correctly. It may be related to one of the issues above.

    • Might be down to shared individual. Let's see if fixing this or edge annotations fixes the issue.

@dosumis
Copy link
Contributor

dosumis commented Jan 18, 2018

@kltm - Looking at a randomly chosen model (gomodel-59a9023b00000151) it looks like edges without evidence have the additional date and contributor annotations, but those with evidence do not. Is this correct?

@dosumis
Copy link
Contributor

dosumis commented Jan 18, 2018

Re-running with evidence individuals fixed. Consistent with previous comment, I haven't added annotations directly to edges that are redundant with those on evidence individuals attached to those edges. I'll upload these models to a new branch/pull request. I have a branch edit that adds these additional annotation axioms so can switch quickly if necessary.

@kltm
Copy link
Member Author

kltm commented Jan 19, 2018

Sorry to run back through your notes this way; I wish there was threading, maybe we should split these out into separate issues?

For this model, there is no occurs_in evidence.

Fixed, but note that it's potentially misleading as there's one set of evidence per model and not all evidence for primary annotation will apply for occurs_in edge.

Yes, this is interesting--I hadn't fully realized that is what was going on here. We currently have no way of stating evidence for a model or individuals in it. In fact, it was a conscious choice that we tried and backed away from early on. I would hazard that it would be better to only have evidence where is was certain and leave "blanks" rather than have over-zealous application of evidence (which could cause oddities later on). @cmungall , any thoughts here?

While not necessarily an issue, I think it may be worthwhile considering the use of special group
(providedBy) identifiers for bulk import model data sets so that end users can filter for their inclusion >> and exclusion. These SynGO models would only be reliable filterable on their current title. I would >>recommend a specific or additional providedBy.

Makes sense. Just need to agree on one.

I believe that the only thing we'd need for this is something in groups.yaml (I believe you may already have it here: https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml#L34) and either replace or add in addition to the current geneontology.org.

The model titles are not human readable, making (re)discovery and use possibly difficult.

They are designed so that SynGO curators can move easily between their curation tool and Noctua. I guess we could potentially overload the title field with SynGO model number + short title, but I don't think SynGO has the text for this. @ftwkoopmans ?

I noticed that there was text in the model-level comment annotation. Could that possibly be re-used or shortened?

Edges between individuals contain no date (required), contributor (required), or providedBy (optional, soon required) information.

@kltm Just to check: In addition to the date and contributor on the evidence on each edge (object property assertion), you want a dc:date {some date}, dc:contributor { some source} as annotation axioms on the each edge (object property assertion) ? Let me know ASAP as very easy to add and re-run conversion.

I believe that is correct: every object property assertion (edge) and individual within a model (and the model as well) gets date, contributor, and providedBy. There should be nothing without these annotations.

Generally tagging/invoking @cmungall to check the modeling above.

@cmungall
Copy link
Member

As I understand each syngo model contains exactly one GPAD assertion, so we take "one set of evidence per model" as being equivalent to a set of evidence on the "primary" edge in the model.

For title how about generating based on species + gene + primary term, plus the id?

@dosumis
Copy link
Contributor

dosumis commented Jan 19, 2018

As I understand each syngo model contains exactly one GPAD assertion, so we take "one set of evidence per model" as being equivalent to a set of evidence on the "primary" edge in the model.

One primary term & one evidence set per syngo model, but there can be multiple extensions.
It is possible for there to be multiple syngo models in one noctua model (generated so that can be linked manually later), but in that case there will be one evidence set for each syngo model. Given that, should I remove evidence from the extension?

For title how about generating based on species + gene + primary term, plus the id?

Seems reasonable & easy to implement. @ftwkoopmans - any opinion? Model IDs will still follow SynGO (although full URI will have additional UUID).

@dustine32
Copy link
Contributor

I was able to clone this project and run it on command-line sbt using the provided SynGO_export_test.json file as input. Here's my attempt to see whether the most recent code addresses these issues:

Date and contributor missing on edges
These are on individuals but not axioms. Should be easy to fix.

Evidence missing on occurs_in from extensions
Looks like David fixed this.

“contributor” - pulling username (“SynGO-*”) from json instead of ORCID. Waiting on SynGO to provide ORCID’s in export file. Or provide username-to-ORCID mapping?

“providedBy” - Still using "SynGO-VU", which is hard-coded in syngo2lego. Need to pick something that’s existing in groups.yaml. This could get added to groups.yaml but still need to format into URI (like “www.geneontology.org”).

Model titles - Currently SYNGO_###. From @cmungall:
“For title how about generating based on species + gene + primary term, plus the id?”
So format would be: {Species}{UniProtID}{goTerm}SYNGO{###}?
Ex. “Mouse_Q9EQZ7_GO:0016082_SYNGO_113”

Evidence references on part_of edge won’t display right:
As mentioned in syngo2lego issue #1, I think this is due to evidence individuals using “source” to specify PMID instead of SEPIO:0000124 term. Although, the same set of evidence (and thus individuals) display correctly on the enabled_by edge. Also, the regenerated model displays part_of references correctly on my server even though the “source” property is still used for PMID. Not sure yet what change fixed this. In 1/18/18 commit to SimpleModel.scala, it looks like David switched from reusing the same axiom for different edges to creating distinct ones (by calling gen_annotations).
I can change it to use SEPIO anyways, making sure it doesn’t break again after.

For the example model, I wasn't able to find input json containing SYNGO_1371. Is this in a repository somewhere?

Thanks!

-Dustin

@kltm
Copy link
Member Author

kltm commented Mar 15, 2018

@dustine32 I'm happy to look at any model PR--feel free to put a PR on dev and we can try and get those up.
Unfortunately, I'm not sure of the location of any upstream input JSON, you'd probably have to ask @dosumis ?

dustine32 added a commit to dustine32/noctua-models that referenced this issue Mar 16, 2018
@dosumis
Copy link
Contributor

dosumis commented Mar 16, 2018

Hi @dustine32,

Thanks for working on this. I may not have updated the source JSON file on the repo. Will fix that.
Don't see any pull request yet, but more than happy to review.

@thomaspd
Copy link

I was talking to the SynGO group earlier this week, and they plan to submit another bolus of annotations in the next couple of weeks.

@dustine32
Copy link
Contributor

@dosumis Thanks! I haven't put in a PR to syngo2lego yet because I've yet to change any code. I was more just making sure that I could run it and wrap my head around the current state of things. I did put a PR on noctua-models/dev so @kltm could take a look at some of syngo2lego's output when I ran it. It looked like some of the issues were fixed when comparing models currently in noctua-models/dev and the regenerated models.

@thomaspd Great! I wonder if the file source for these annotations is in the same JSON format (could've even fixed the ORCID issue themselves!).

@dosumis
Copy link
Contributor

dosumis commented Mar 16, 2018

@dosumis Thanks! I haven't put in a PR to syngo2lego yet because I've yet to change any code. I was more just making sure that I could run it and wrap my head around the current state of things. I did put a PR on noctua-models/dev so @kltm could take a look at some of syngo2lego's output when I ran it. It looked like some of the issues were fixed when comparing models currently in noctua-models/dev and the regenerated models.

Yep. IIRC , I fixed everything everything except model titles and the SEPIO thing (which I wasn't aware of, did this change?)

@thomaspd Great! I wonder if the file source for these annotations is in the same JSON format (could've even fixed the ORCID issue themselves!).

Should be same JSON. There's also a quick check of the json they provide via json schema.
@thomaspd - would be good to emphasise that we want orcids this time.

@thomaspd
Copy link

thomaspd commented Mar 16, 2018

OK, now I see it's the contributor that needs to be the orcid. I will talk to them.

@dosumis
Copy link
Contributor

dosumis commented Mar 16, 2018

Simple validation of their json using json_schema is here: https://github.com/geneontology/syngo2lego_data_conversion

travis checks any JSON dropped into the json folder against the schema in spec.

orcid goes in model.username, which is where they put initials now.

Last dataset provided by VU is here: https://github.com/geneontology/syngo2lego_data_conversion/blob/master/json/SynGO_export_2017-12-18_production.json

@dustine32
Copy link
Contributor

Rad! Thanks @dosumis and @thomaspd !

@kltm
Copy link
Member Author

kltm commented Mar 23, 2018

@dustine32 Looking at latest model batch, basing primarily on http://noctua-dev.berkeleybop.org/editor/graph/gomodel:SYNGO_1371 :

Great things:

  • All models seem to be annotated with ORCIDs
  • On cursory inspection, all models seem to have date, providedBy (as https://syngo.vu.nl), and contributor
  • There is now occurs_in evidence
  • For whatever reason from before, the evidence now "folds" correctly (likely next item)
  • There are 10 evidence statements each on three edges and there are now 30 evidence individuals--this would imply that the evidence modelling is now "corrected"

Things that can still be worked on?

  • Model titles are still uninformative; while the SynGO models are read-only at this point, it might be nice for something that's a little more searchable/readable in the title; I think @cmungall sketched out something simple for this

I think this is great and hope we can get these into production soon--these are a huge improvement in every way and I think easily meet all the required criteria for a "production" model.
Thank you!

@kltm
Copy link
Member Author

kltm commented Mar 23, 2018

Assuming no action on the title, I guess the next step would be a PR against the master branch. Once opened, the merge would have to be timed so that we could take the production site offline for a few minutes to rebuild.

dustine32 added a commit to dustine32/noctua-models that referenced this issue Mar 24, 2018
@dustine32
Copy link
Contributor

Thanks much @kltm ! Just put in the PR to master. But now that I look at the title issue again, I suppose I may not have officially asked @ftwkoopmans what his opinion on title generation was. @thomaspd @huaiyumi I know this came up in our meeting this week as well. @cmungall 's suggestion from above was:

For title how about generating based on species + gene + primary term, plus the id?

@kltm
Copy link
Member Author

kltm commented Mar 24, 2018

@dustine32 A title would be an awesome bow on the PR package. I'd hope that anything would be an improvement on the current.

@dustine32
Copy link
Contributor

@lpalbou Hey Laurent, here's the ticket that tracked some things needing fixing in the SynGO models. Let me know what field ("oboInOwl" something) and what value should be added to the model files and I can plug away. An example would help for sure.

kltm added a commit that referenced this issue Apr 27, 2018
Updated and new SynGO models for issue #55
@kltm
Copy link
Member Author

kltm commented Jul 18, 2018

@dustine32 thinks this is cleared, as do I.

@kltm kltm closed this as completed Jul 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants