Switch pipeline to read and output gaf 2.2 #211
Comments
Tagging @dustine32 @vanaukenk |
@dustine32 Once the feb release goes out, we should also switch our PAINT upstreams to point to your 2.2 files. |
@dougli1sqrd Alright if I edit a rolling checklist into your top comment? |
Sure thing |
So a quick look at the ontobio validate script (the main part of the pipeline parsing and "megastep", or as Seth calls it, the "kernel") I think all we would need to do is tell the And at this point the pipeline GAF parsing logic is agnostic to GAF version. As the file is read it looks for the But I believe at first glance that if we set (or we could parameterize with a command line arg) the writer version to gaf 2.2, we'll be done, for some definition of done. |
@kltm I am so excited to start pointing to those PAINT GAF 2.2 files! |
Excitement mounts. |
@dougli1sqrd @dustine32 (@vanaukenk) While I have not "finalized" the release (in process), it is now done and frozen, so there is almost no possibility that we'll need to use the current code base for a redo. To give our QC/QA and downstreams as much time as possible to see what we're doing this month, please go ahead and update things to a GAF 2.2 stance. |
On the way! |
@kltm I just switched the symlink on the PAINT server to point to the GAF 2.2 files. No changes to the datasets/paint.yaml file should be needed. |
The changes that make GAF 2.2 the default output are present in the newly released ontobio 2.3.0 |
I'm now updating the checklist at the top with current blocking tickets. |
@dougli1sqrd Clarifying question from @ukemi : Will I would assume "yes" for the paint one, as that is coming in from internal processes, not sure about the prediction file. Although, technically speaking, neither of these are "products"... |
Ooh, the predictions are still generated by owltools. Owltools doesn't speak gaf 2.2 does it? Since it's a change in requirements on the qualifier, maybe owltools won't notice? |
Hi @dougli1sqrd. |
Just double checked this. We actually get the PAINT annotations from That file will be changing, correct? |
Yeah that file will be changing. Here's the GAF 2.2 spec: http://geneontology.org/docs/go-annotation-file-gaf-format-2.2/ It's pretty simple, really. It's just the qualifier field that is changing. |
Thanks @dougli1sqrd. It is a minor change, but it's important for us to allow for it and be able to parse the new qualifiers correctly. So we have to know if it will actually change. What about the prediction file? |
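For anyone updating a loader for the new qualifier column, here is a minimal sketch of a check. It assumes that in GAF 2.2 the qualifier holds exactly one relation, optionally preceded by `NOT|`. The relation list below is my reading of the spec page linked above and should be verified against it; `check_qualifier` is a hypothetical helper, not part of ontobio.

```python
# Assumption: relation names as read from the GAF 2.2 spec page; verify
# against the spec before relying on this list.
ALLOWED_RELATIONS = {
    "enables",
    "contributes_to",
    "involved_in",
    "acts_upstream_of",
    "acts_upstream_of_positive_effect",
    "acts_upstream_of_negative_effect",
    "acts_upstream_of_or_within",
    "acts_upstream_of_or_within_positive_effect",
    "acts_upstream_of_or_within_negative_effect",
    "located_in",
    "part_of",
    "is_active_in",
    "colocalizes_with",
}


def check_qualifier(qualifier: str) -> bool:
    """Return True if a qualifier column value looks like valid GAF 2.2."""
    parts = qualifier.split("|")
    # An optional leading NOT negates the relation that follows it.
    if parts[0] == "NOT":
        parts = parts[1:]
    # Exactly one relation must remain, and it must be a known one.
    return len(parts) == 1 and parts[0] in ALLOWED_RELATIONS
```

An empty qualifier, which was acceptable in earlier GAF versions, fails this check.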
They'll show up today, hopefully. The standard predictions look okay I think? Here's a sample:
|
This doesn't look like a gaf. It looks like the prediction mapping file. Here is the top of the current file. It's not zipped. My bad. !gaf-version: 2.0 |
Ah I was looking in the wrong place. Here's what the currently running snapshot made (the top):
|
Thanks Eric! So this one is still in gaf2.0. We will not change our load. |
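A quick way to confirm which version a file declares is to read its `!gaf-version:` header line, as in the snippet quoted above. This is only a sketch: `declared_gaf_version` is a hypothetical helper, and it assumes header lines start with `!` and come before any annotation line.

```python
from typing import Iterable, Optional


def declared_gaf_version(lines: Iterable[str]) -> Optional[str]:
    """Return the version named in the !gaf-version header, or None."""
    for line in lines:
        if not line.startswith("!"):
            # First non-header line: the header block is over.
            break
        if line.lower().startswith("!gaf-version:"):
            return line.split(":", 1)[1].strip()
    return None
```

For a gzipped file, the lines can come from `gzip.open(path, "rt")`.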
Checking the WB files (input GAF2.2, output GAF2.2, output GPAD1.2), annotations look okay except for the ones with annotation extensions which are missing in the output files. I assume that issue is being fixed with this PR: geneontology/go-site#1618 so will continue to check other annotations until that fix percolates through. |
@vanaukenk It looks like a |
Thanks @kltm |
I've come across two other issues, one of which may be outside the scope of GAF2.2, but I'll put them both here, just in case.
Update: after talking with @ukemi , I'd like to confirm exactly what the PAINT source file is that is used to go into production. Maybe the changes I noted above are because I'm looking at the wrong source file. Thx; I'll continue testing..... |
Almost finished testing the WB files. |
Testing MGI files I find this annotation in the src file: But I cannot find this in the mgi gaf in the annotations file. |
Note that this annotation originates in the mgi_predictions file: MGI MGI:2137630 Pkmyt1 GO:0018215 PMID:21873635 IBA PANTHER:PTN000113601|ZFIN:ZDB-GENE-050301-2|PomBase:SPBC36B7.09|MGI:MGI:1341830|TAIR:locus:2024780|dictyBase:DDB_G0272837|UniProtKB:Q9LX30|MGI:MGI:1353449|PomBase:SPAC20G4.03c|PomBase:SPAC222.07c|PomBase:SPCC18B5.03|RGD:70883|UniProtKB:C6KTB8|WB:WBGene00006988|FB:FBgn0037327|MGI:MGI:1353448|ZFIN:ZDB-GENE-080422-1|FB:FBgn0040298|RGD:70884|UniProtKB:A0A1D8PQT9|PomBase:SPBC660.14|FB:FBgn0011737|UniProtKB:A0A0B4KHX7|UniProtKB:Q9BQI3|SGD:S000003723|SGD:S000002691|MGI:MGI:1353427|UniProtKB:Q9P2K8|UniProtKB:Q8IL26|UniProtKB:P19525|WB:WBGene00003970|MGI:MGI:103075|UniProtKB:Q9NZJ5 P protein kinase, membrane associated tyrosine/threonine 1 Myt1 protein taxon:10090 20200807 GOC |
is this an inference annotation? |
That is, could the field be getting truncated on loading, so that it ends up looking like another annotation with fewer items in the field? |
It IS in the MGI source file, it IS NOT in the GOC output file. It IS in the prediction (inference) file. I suspect that part of the processing on the GOC side is to prevent tail-eating by stripping all PAINT annotations from the MGI file and then injecting them back as part of the GOC pipeline. The predictions that are based on PAINT are being stripped, but are not reinjected. Is the pipeline stripping based on PMID? |
Give me a few minutes to check the wiki |
The problem is not on the MGI side. |
By 'MGI source file', do you mean the one MGI supplies? If there is a PAINT annotation in it, I thought it gets stripped. I wonder whether the stripping uses the PMID (Gaudet paper) or GO_Central. |
I also notice that PAINT annotations in the MGI file have the MGI reference for the PAINT paper, MGI:MGI:6201960, but this is not injected as part of the GO pipeline, so it is missing in the file provided by the GOC. I guess this is ok, but should be noted here as technically a discrepancy. |
@hdrabkin, you are correct. If the pipeline used both the PMID and the provider to distinguish PAINT annotations then it could distinguish those directly from PAINT versus those from predictions based on the PAINT annotations. PAINT gets GO_Central and the predictions get GOC in the provider field. lines 86 and 87 versus lines 110 and 111. |
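The distinction described here could be sketched as below, assuming tab-separated GAF lines with the reference in column 6 and the provider (Assigned_By) in column 15. `classify_paint_line` is a hypothetical helper for illustration only; it is not how the GO pipeline or the MGI load currently works.

```python
# Reference carried by the IBA annotation quoted earlier in this thread.
PAINT_REFERENCE = "PMID:21873635"


def classify_paint_line(line: str) -> str:
    """Label a GAF line as 'paint', 'prediction', or 'other'."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        return "other"  # too short to carry an Assigned_By column
    references = cols[5].split("|")  # column 6: DB:Reference
    assigned_by = cols[14]           # column 15: Assigned_By
    if PAINT_REFERENCE not in references:
        return "other"
    if assigned_by == "GO_Central":
        return "paint"       # directly from PAINT
    if assigned_by == "GOC":
        return "prediction"  # inferred from a PAINT annotation
    return "other"
```

Checking both fields means stripping by PMID alone would no longer discard the predictions along with the direct PAINT annotations.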
yep here is how MGI pulls them
So I don't see that MGI looks for GO_Central vs GOC when MGI loads them. |
Yes, but this is the other direction, this is for our load. We will also need to consider the provider to distinguish the annotations that are directly from PAINT versus those from prediction. Unless of course the GO pipeline injected the prediction annotations into the main file and we took both PAINT and the predictions from it and no longer loaded the file from the products directory (hint hint). |
So at MGI what we call the 'GO/CFP (Component, Function and Process)' load would be rolled into the 'GO/PAINT' load and we would get both the PAINT and prediction annotations from http://snapshot.geneontology.org/annotations/mgi.gaf.gz. |
No, we have a separate load for GO/CFP |
Right! What I'm saying is that IMO the best solution would be to roll all this into one load from the full file of mouse annotations (Noctua too). In other words, rather than MGI doing three separate loads from the GOC (PAINT, Predictions, Noctua), we would have a one-stop shop. |
For gorule-0000061 implementation: |
@dougli1sqrd is gorule-0000061 implemented for the Thu Mar 4 00:01:38 PST 2021 snapshot build? |
The headers in the GAF2.2 files produced by the pipeline don't conform to our specs :-) Here is what's in our spec (and most groups have been very good about this formatting in the src files): generated-by: database listed in dbxrefs.yaml But here is what's in the annotation file produced: !Generated by GO Central |
Also note that the date format in the header is not the same as the date format in the annotation data (presence and absence of hyphens). We recently 'fixed' this in our file. |
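Until the two formats agree, a consumer can accept both. A minimal sketch, assuming the only two forms in play are `YYYY-MM-DD` and `YYYYMMDD`; `parse_gaf_date` is a hypothetical helper.

```python
from datetime import date, datetime


def parse_gaf_date(text: str) -> date:
    """Parse a date written with or without hyphens."""
    text = text.strip()
    for fmt in ("%Y-%m-%d", "%Y%m%d"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date: {text!r}")
```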
I noticed this morning that most of our CC annotations from Noctua will be filtered or flagged because they use the part_of relation for all CC annotations. Recently this has changed to use located_in for cellular anatomical structures and part_of for protein complexes. We will need to update all the models that were made using the former standards in order for the annotations to be up to the new annotation practice. |
Any way this can be computationally automated? |
I see this not only with MGI models, but in SynGO annotations as well. |
'Any way this can be computationally automated?' |
We will have to update the relations in GO-CAM models computationally, along with some other relation changes. |
It doesn't look like rule 61 is being implemented correctly. For example, an input annotation to root cellular component that uses 'located in' is not being repaired to 'is active in'. Source WB GAF: GOC-produced GAF: GOC-produced GPAD: |
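The repair expected in this example can be written down as a small sketch. It covers only the case described in this comment (root cellular component annotated with 'located in'), not the whole of rule 61. `repair_root_cc_qualifier` is a hypothetical name, and the underscore spellings assume the GAF form of the relations.

```python
ROOT_CC = "GO:0005575"  # cellular_component root term


def repair_root_cc_qualifier(go_id: str, qualifier: str) -> str:
    """Return the qualifier expected after repair for the root CC case."""
    if go_id == ROOT_CC and qualifier == "located_in":
        return "is_active_in"
    # Anything else is left untouched by this sketch.
    return qualifier
```

A regression test for the pipeline could assert that no output line pairs `GO:0005575` with `located_in`.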
Oh! Okay thanks for pointing this out! I'll check it out. |
closing, as 2.2 has been made for several years at this point |
Once the Feb release goes out, switch the pipeline to consume and produce gaf 2.2 instead of 2.1.
This ticket will contain any updates and comments concerning tweaks, code updates, tests, etc. that verify the pipeline is working in gaf 2.2.
Checklist
Once we have:
Then:
master (@dougli1sqrd) (done by default in go-site and ontobio)
snapshot and release (@kltm) (done by default in go-site and ontobio)
Current outstanding blocking issues:
Test:
failed, fix in progress above