Switch pipeline to read and output gaf 2.2 #211

dougli1sqrd · 2021-02-01T22:59:14Z

kltm · 2021-02-01T23:00:07Z

Tagging @dustine32 @vanaukenk

kltm · 2021-02-01T23:00:39Z

@dustine32 Once the feb release goes out, we should also switch our PAINT upstreams to point to your 2.2 files.

kltm · 2021-02-01T23:01:15Z

@dougli1sqrd Alright if I edit a rolling checklist into your top comment?

dougli1sqrd · 2021-02-01T23:02:06Z

Sure thing

dougli1sqrd · 2021-02-01T23:26:11Z

So from a quick look at the ontobio validate script (the main part of the pipeline parsing and "megastep", or as Seth calls it, the "kernel"), I think all we would need to do is tell the GafWriter to be version 2.2, and then we'd be outputting all gaf 2.2 from the pipeline.

And at this point the pipeline GAF parsing logic is agnostic to GAF version. As the file is read, it looks for the gaf-version string and figures out what type of line to expect. If there is no version, it currently just falls back to a default version (presently 2.1) and attempts to proceed. This behavior can be changed, as well as which version is the default.

But I believe at first glance that if we set (or we could parameterize with a command line arg) the writer version to gaf 2.2, we'll be done, for some definition of done.
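
For readers following along, the version detection described above could look roughly like this. It is a minimal, self-contained sketch and not ontobio's actual code: the function name `detect_gaf_version` and its `default` parameter are hypothetical.

```python
from typing import Iterable


def detect_gaf_version(lines: Iterable[str], default: str = "2.1") -> str:
    """Return the version declared in a '!gaf-version:' header, else `default`.

    Hypothetical helper for illustration; it is not part of ontobio.
    """
    for line in lines:
        if not line.startswith("!"):
            # First non-header line reached: no version was declared.
            break
        if line.lower().startswith("!gaf-version:"):
            return line.split(":", 1)[1].strip()
    return default
```

Changing the fallback would then just mean passing a different `default`, for example `detect_gaf_version(lines, default="2.2")`.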

dustine32 · 2021-02-01T23:27:33Z

@kltm I am so excited to start pointing to those PAINT GAF 2.2 files!

kltm · 2021-02-02T00:00:29Z

Excitement mounts.
I've created a basic checklist at the top. Please add items there as you think of them.

kltm · 2021-02-03T21:30:40Z

@dougli1sqrd @dustine32 (@vanaukenk) While I have not "finalized" the release (in process), it is now done and frozen, so there is almost no possibility that we'll need to use the current code base for a redo. To give our QC/QA and downstreams as much time as possible to see what we're doing this month, please go ahead and update things to a GAF 2.2 stance.

dougli1sqrd · 2021-02-03T22:12:27Z

On the way!

dustine32 · 2021-02-03T23:09:12Z

@kltm I just switched the symlink on the PAINT server to point to the GAF 2.2 files. No changes to the datasets/paint.yaml file should be needed.

dougli1sqrd · 2021-02-04T22:35:32Z

The changes to output GAF 2.2 by default are present in the newly released ontobio 2.3.0.

kltm · 2021-02-05T19:38:49Z

I'm now updating the checklist at the top with current blocking tickets.

kltm · 2021-02-08T19:20:57Z

@dougli1sqrd Clarifying question from @ukemi :

Will
http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf.gz
http://snapshot.geneontology.org/products/annotations/paint_mgi.gaf
also be GAF 2.2?

I would assume "yes" for the paint one, as that is coming in from internal processes, not sure about the prediction file. Although, technically speaking, neither of these are "products"...

dougli1sqrd · 2021-02-08T20:03:37Z

Ooh, the predictions are still generated by owltools. Owltools doesn't speak gaf 2.2 does it? Since it's a change in requirements on the qualifier, maybe owltools won't notice?

ukemi · 2021-02-08T20:10:32Z

Hi @dougli1sqrd.
These are two of the three files we pick up from the GOC in our weekly loads. If they are moving to gaf2.2, we need to change how we parse them by the end of this week.
ping @loricorbani @hdrabkin

ukemi · 2021-02-08T20:34:57Z

Just double checked this. We actually get the PAINT annotations from
http://snapshot.geneontology.org/annotations/mgi.gaf.gz

That file will be changing, correct?

dougli1sqrd · 2021-02-09T20:29:38Z

Yeah that file will be changing.

Here's the GAF 2.2 spec: http://geneontology.org/docs/go-annotation-file-gaf-format-2.2/

It's pretty simple, really. It's just the qualifier field that is changing.
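
For downstream parsers, handling the changed qualifier field might look something like the sketch below. It assumes the 2.2 qualifier (column 4) holds exactly one relation, optionally preceded by `NOT` and separated by `|`; please check the spec linked above for the authoritative rules. The helper name `split_qualifier` is hypothetical.

```python
from typing import Tuple


def split_qualifier(qualifier: str) -> Tuple[bool, str]:
    """Split a GAF 2.2 qualifier (column 4) into (negated, relation).

    Hypothetical helper. Raises ValueError when there is not exactly one
    relation, e.g. for an empty field or a bare 'NOT'.
    """
    parts = [p for p in qualifier.strip().split("|") if p]
    negated = "NOT" in parts
    relations = [p for p in parts if p != "NOT"]
    if len(relations) != 1:
        raise ValueError(f"expected exactly one relation, got {qualifier!r}")
    return negated, relations[0]
```

For example, `split_qualifier("NOT|involved_in")` gives `(True, "involved_in")`, while an empty qualifier raises an error instead of being silently accepted.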

ukemi · 2021-02-09T20:42:18Z

Thanks @dougli1sqrd. It is a minor change, but it's important for us to allow for it and be able to parse the new qualifiers correctly. So we have to know if it will actually change. What about the prediction file?
http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf.gz
Do you know when the 2.2 files will show up in snapshot?

dougli1sqrd · 2021-02-09T20:52:25Z

They'll show up today, hopefully.

The standard predictions look okay I think? Here's a sample:

GO:0018215	protein phosphopantetheinylation	AnnotationPropagation	P	GO:0070737	protein-glycine ligase activity, elongating	IDA
GO:0018215	protein phosphopantetheinylation	AnnotationPropagation	P	GO:0032452	histone demethylase activity	IBA
GO:0016577	histone demethylation	AnnotationPropagation	P	GO:0032452	histone demethylase activity	IBA
GO:0007186	G protein-coupled receptor signaling pathway	AnnotationPropagation	P	GO:0004930	G protein-coupled receptor activity	IBA
GO:0050911	detection of chemical stimulus involved in sensory perception of smell	AnnotationPropagation	P	GO:0004984	olfactory receptor activity	IBA
GO:0006357	regulation of transcription by RNA polymerase II	AnnotationPropagation	P	GO:0000981	DNA-binding transcription factor activity, RNA polymerase II-specific	IBA
GO:0050911	detection of chemical stimulus involved in sensory perception of smell	AnnotationPropagation	P	GO:0004984	olfactory receptor activity	IBA
GO:0018215	protein phosphopantetheinylation	AnnotationPropagation	P	GO:0004222	metalloendopeptidase activity	IBA
GO:0006508	proteolysis	AnnotationPropagation	P	GO:0004222	metalloendopeptidase activity	IBA

ukemi · 2021-02-09T21:00:58Z

This doesn't look like a gaf. It looks like the prediction mapping file. Here is the top of the current file. It's not zipped. My bad.

dougli1sqrd · 2021-02-09T21:09:12Z

Ah I was looking in the wrong place. Here's what the currently running snapshot made (the top):

!gaf-version: 2.0
! 
! Date: 2021/02/09
! 
!  Used ontologies and versions (optional)
! 	go/extensions/go-gaf	go/releases/2021-02-02/extensions/go-gaf.owl
! 
!  Generated predictions
! 
MGI	MGI:101762	Elk3		GO:0006357	PMID:21873635	IBA	PANTHER:PTN000218930|UniProtKB:P41161|UniProtKB:P41970|UniProtKB:P19419|MGI:MGI:1350926|UniProtKB:Q15723|MGI:MGI:107180|UniProtKB:P15036|MGI:MGI:95554|MGI:MGI:99253|UniProtKB:Q06546|UniProtKB:Q9NZC4|UniProtKB:P28324|UniProtKB:P11308|UniProtKB:P43268|MGI:MGI:1341168|UniProtKB:P41212|MGI:MGI:1101781|FB:FBgn0000567|FB:FBgn0003118|UniProtKB:P50548|MGI:MGI:98282|UniProtKB:Q9Y603|MGI:MGI:109336|MGI:MGI:1335079|FB:FBgn0000097|UniProtKB:P50549|RGD:628860|UniProtKB:Q01892|UniProtKB:P32519|UniProtKB:P78545|UniProtKB:P14921|UniProtKB:Q99607	P	ELK3, member of ETS oncogene family	D430049E23Rik|Erp|Net|Sap-2	protein	taxon:10090	20170228	GOC		
MGI	MGI:101765	Cdk5		GO:0006468	PMID:21873635	IBA	PANTHER:PTN000623091|dictyBase:DDB_G0272813|FB:FBgn0013762|RGD:70486|UniProtKB:P06493|PomBase:SPAC2F3.15|PomBase:SPAC23H4.17c|TAIR:locus:2011761|UniProtKB:O94921|SGD:S000006365|RGD:2319|UniProtKB:Q00534|RGD:621124|MGI:MGI:88351|WB:WBGene00000405|ZFIN:ZDB-GENE-081022-110|MGI:MGI:88357|ZFIN:ZDB-GENE-010131-2|PomBase:SPCC16C4.11|PomBase:SPBC11B10.09|UniProtKB:P11802|dictyBase:DDB_G0288677|SGD:S000001622|PomBase:SPBC19F8.07|MGI:MGI:104772|UniProtKB:P24941|UniProtKB:P50750|SGD:S000005963|UniProtKB:P61075|UniProtKB:Q8IJQ1|FB:FBgn0019949|CGD:CAL0000191263|PomBase:SPBC32H8.10|SGD:S000005952|FB:FBgn0005640|RGD:621120|UniProtKB:A0A1D8PDA6|UniProtKB:Q00646|FB:FBgn0263237|SGD:S000000364|UniProtKB:C9K505|RGD:70514|FB:FBgn0004106	P	cyclin-dependent kinase 5	Crk6	protein	taxon:10090	20201206	GOC		
MGI	MGI:101765	Cdk5		GO:0018215	PMID:21873635	IBA	PANTHER:PTN000623091|dictyBase:DDB_G0272813|FB:FBgn0013762|RGD:70486|UniProtKB:P06493|PomBase:SPAC2F3.15|PomBase:SPAC23H4.17c|TAIR:locus:2011761|UniProtKB:O94921|SGD:S000006365|RGD:2319|UniProtKB:Q00534|RGD:621124|MGI:MGI:88351|WB:WBGene00000405|ZFIN:ZDB-GENE-081022-110|MGI:MGI:88357|ZFIN:ZDB-GENE-010131-2|PomBase:SPCC16C4.11|PomBase:SPBC11B10.09|UniProtKB:P11802|dictyBase:DDB_G0288677|SGD:S000001622|PomBase:SPBC19F8.07|MGI:MGI:104772|UniProtKB:P24941|UniProtKB:P50750|SGD:S000005963|UniProtKB:P61075|UniProtKB:Q8IJQ1|FB:FBgn0019949|CGD:CAL0000191263|PomBase:SPBC32H8.10|SGD:S000005952|FB:FBgn0005640|RGD:621120|UniProtKB:A0A1D8PDA6|UniProtKB:Q00646|FB:FBgn0263237|SGD:S000000364|UniProtKB:C9K505|RGD:70514|FB:FBgn0004106	P	cyclin-dependent kinase 5	Crk6	protein	taxon:10090	20201206	GOC		
MGI	MGI:101765	Cdk5		GO:0051726	PMID:21873635	IBA

ukemi · 2021-02-10T13:33:05Z

Thanks Eric! So this one is still in gaf2.0. We will not change our load.

vanaukenk · 2021-02-16T15:36:49Z

Checking the WB files (input GAF2.2, output GAF2.2, output GPAD1.2), annotations look okay except for the ones with annotation extensions which are missing in the output files.

I assume that issue is being fixed with this PR: geneontology/go-site#1618

so will continue to check other annotations until that fix percolates through.

kltm · 2021-02-17T19:14:54Z

@vanaukenk It looks like a snapshot has passed through the pipeline.

vanaukenk · 2021-02-18T14:38:24Z

Thanks @kltm
I'll do some more QC checks later today.

vanaukenk · 2021-02-18T21:12:45Z

@kltm @dougli1sqrd @dustine32

I've come across two other issues, one of which may be outside the scope of GAF2.2, but I'll put them both here, just in case.

If groups submit annotations to root node that don't use the default root relations as defined in the spec, i.e. 'involved_in' for BP; 'enables' for MF; and 'is_active_in' for CC, it doesn't look like we're repairing those annotations. Can we do that?
It looks like some information originating from the PAINT GAF2.2 source file is not carried forward to the production GAF2.2 or is transformed in a way that I'm not sure makes sense. See columns L, M, and N in lines 21 and 22; 42 and 43; and 46 and 47, of my test spreadsheet. This might be outside of the GAF2.2 testing, but I wasn't sure why the information in column L doesn't go into the production file, why a PTN is used as a synonym in column M, and why protein gets transformed to gene_product in column N. I can put this into a separate ticket, if need be.

Update: after talking with @ukemi , I'd like to confirm exactly what the PAINT source file is that is used to go into production. Maybe the changes I noted above are because I'm looking at the wrong source file.

Thx; I'll continue testing.....

vanaukenk · 2021-02-19T00:34:36Z

Almost finished testing the WB files.
Right now, I think the only other thing we'll need to discuss is whether we also want to repair any relations for IEA annotations. I'd like to discuss this with @pgaudet as GOA is a major source of IEAs for many groups and we need to make sure they're okay with whatever we decide to do.

ukemi · 2021-03-03T17:40:13Z

But I cannot find this in the mgi gaf in the annotations file.

ukemi · 2021-03-03T17:54:11Z

Note that this annotation originates in the mgi_predictions file:

hdrabkin · 2021-03-03T18:37:17Z

Is this an inference annotation?
This might be treated as a duplicate when we try to get them into MGI (it's a PAINT annotation). I think we have a hard time loading such a long 'inferred from' field.

hdrabkin · 2021-03-03T18:38:20Z

That is, the field gets truncated on loading, and it might result in it looking like another annotation with fewer items in the field?

ukemi · 2021-03-03T18:41:17Z

It IS in the MGI source file, it IS NOT in the GOC output file. It IS in the prediction (inference) file. I suspect that part of the processing on the GOC side is to prevent tail-eating by stripping all PAINT annotations from the MGI file and then injecting them back as part of the GOC pipeline. The predictions that are based on PAINT are being stripped, but are not reinjected. Is the pipeline stripping based on PMID?

hdrabkin · 2021-03-03T18:42:45Z

Give me a few minutes to check the wiki

ukemi · 2021-03-03T18:43:10Z

The problem is not on the MGI side.

hdrabkin · 2021-03-03T18:45:47Z

By 'MGI source file' do you mean the one MGI supplies? If there is a PAINT annotation in it, I thought it's stripped; does the pipeline use the PMID (Gaudet paper) or GO_Central, I wonder?

ukemi · 2021-03-03T18:46:07Z

I also notice that PAINT annotations in the MGI file have the MGI reference for the PAINT paper, MGI:MGI:6201960, but this is not injected as part of the GO pipeline, so it is missing in the file provided by the GOC. I guess this is ok, but should be noted here as technically a discrepancy.

ukemi · 2021-03-03T18:50:40Z

@hdrabkin, you are correct. If the pipeline used both the PMID and the provider to distinguish PAINT annotations then it could distinguish those directly from PAINT versus those from predictions based on the PAINT annotations. PAINT gets GO_Central and the predictions get GOC in the provider field.
See my spreadsheet here:
https://docs.google.com/spreadsheets/d/1kf9mvxMmY-zapsHQK9qfRt7O9OV0dc6Daes5TpyrEf0/edit#gid=712573202

lines 86 and 87 versus lines 110 and 111.
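
To make the idea concrete, distinguishing the two kinds of rows by reference plus provider could be sketched like this. It assumes tab-separated GAF rows with the reference in column 6 and the provider (assigned_by) in column 15; the function name `classify_paint_row` is hypothetical, and this is not how the pipeline currently works.

```python
PAINT_REFERENCE = "PMID:21873635"  # the PAINT paper reference


def classify_paint_row(line: str) -> str:
    """Return 'paint', 'prediction', or 'other' for one GAF row.

    Hypothetical helper: direct PAINT rows carry GO_Central as provider,
    predictions based on PAINT carry GOC.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        return "other"
    references = cols[5].split("|")
    assigned_by = cols[14]
    if PAINT_REFERENCE not in references:
        return "other"
    if assigned_by == "GO_Central":
        return "paint"
    if assigned_by == "GOC":
        return "prediction"
    return "other"
```

Stripping only the rows classified as `paint` would leave the prediction rows in place.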

hdrabkin · 2021-03-03T18:55:45Z

yep here is how MGI pulls them

Source the configuration file to establish the environment. (this is the mgi.gaf.gz in snapshot)
Create annotation load (sw:annotload) input file from the GO/PAINT mirror_ftp file.
rows where field 6 (DB:Reference(s)) == PMID:21873635 are processed (the Gaudet paper)
rows where MGI:xxx is of type "gene"
rows where field 8 (With (or) From) contain Panther IDs are processed
Call the annotation loader.
Call the inferred-from cache update

So I don't see that MGI looks for GO_Central vs GOC when MGI loads them.
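
The row selection in the steps above can be sketched as follows. This is only an illustration, not the MGI loader: `select_paint_rows` is a hypothetical name, and the `gene_ids` set stands in for MGI's own lookup of which markers are of type "gene".

```python
from typing import Iterable, Iterator, List, Set


def select_paint_rows(lines: Iterable[str], gene_ids: Set[str]) -> Iterator[List[str]]:
    """Yield the columns of rows that pass the three filters listed above."""
    for line in lines:
        if line.startswith("!"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a full annotation row
        if cols[5] != "PMID:21873635":
            continue  # field 6 must be the Gaudet paper
        if cols[1] not in gene_ids:
            continue  # marker must be of type "gene" (assumed lookup)
        if "PANTHER:" not in cols[7]:
            continue  # field 8 must contain Panther IDs
        yield cols
```

Note that nothing here looks at the provider column, which matches the observation that the load does not distinguish GO_Central from GOC.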

ukemi · 2021-03-03T18:59:25Z

Yes, but this is the other direction, this is for our load. We will also need to consider the provider to distinguish the annotations that are directly from PAINT versus those from prediction. Unless of course the GO pipeline injected the prediction annotations into the main file and we took both PAINT and the predictions from it and no longer loaded the file from the products directory (hint hint).

ukemi · 2021-03-03T19:02:42Z

So at MGI what we call the 'GO/CFP (Component, Function and Process)' load would be rolled into the 'GO/PAINT' load and we would get both the PAINT and prediction annotations from http://snapshot.geneontology.org/annotations/mgi.gaf.gz.

hdrabkin · 2021-03-03T19:10:11Z

No, we have a separate load for GO/CFP
user name = "GOC"
uses reference/J: from the GOC input file
http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf <<< we pull from here.

ukemi · 2021-03-03T19:27:24Z

Right! What I'm saying is that IMO the best solution would be to roll all this into one load from the full file of mouse annotations (Noctua too). So in other words, rather than do three separate loads from the GOC (PAINT, Predictions, Noctua), we have a one-stop shop. We being MGI.

dougli1sqrd · 2021-03-04T04:57:10Z

For gorule-0000061 implementation:
biolink/ontobio#533

vanaukenk · 2021-03-05T18:04:30Z

@dougli1sqrd is gorule-0000061 implemented for the Thu Mar 4 00:01:38 PST 2021 snapshot build?
QCing the WB files would suggest it's not, so I just wanted to make sure before I do any more testing.
Thx.

vanaukenk · 2021-03-05T19:02:47Z

@dougli1sqrd @kltm

The headers in the GAF2.2 files produced by the pipeline don't conform to our specs :-)

Here is what's in our spec (and most groups have been very good about this formatting in the src files):

generated-by: database listed in dbxrefs.yaml
date-generated: YYYY-MM-DD or YYYY-MM-DDTHH:MM

But here is what's in the annotation file produced:

!Generated by GO Central
!
!Date Generated by GOC: 2021-03-05
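
A check for the two header lines from the spec could be sketched as below. It assumes each header line starts with `!` followed by the key shown in the spec; `missing_headers` is a hypothetical name, and the sketch does not verify that the generated-by value is actually listed in dbxrefs.yaml.

```python
import re
from typing import List

GENERATED_BY = re.compile(r"^!generated-by:\s*\S+")
# Accepts YYYY-MM-DD or YYYY-MM-DDTHH:MM, as quoted from the spec above.
DATE_GENERATED = re.compile(
    r"^!date-generated:\s*\d{4}-\d{2}-\d{2}(T\d{2}:\d{2})?\s*$"
)


def missing_headers(header_lines: List[str]) -> List[str]:
    """Return the names of the spec headers not found in `header_lines`."""
    missing = []
    if not any(GENERATED_BY.match(h) for h in header_lines):
        missing.append("generated-by")
    if not any(DATE_GENERATED.match(h) for h in header_lines):
        missing.append("date-generated")
    return missing
```

Run against the three header lines currently produced by the pipeline, this reports both headers as missing.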

ukemi · 2021-03-05T19:04:54Z

Also note that the date format in the header is not the same as the date format in the annotation data (presence and absence of hyphens). We recently 'fixed' this in our file.

ukemi · 2021-03-08T15:28:05Z

I noticed this morning that most of our CC annotations from Noctua will be filtered or flagged because they use the part_of relation for all CC annotations. Recently this has changed to use located_in for cellular anatomical structures and part_of for protein complexes. We will need to update all the models that were made using the former standards in order for the annotations to be up to the new annotation practice.

hdrabkin · 2021-03-08T15:29:44Z

Any way this can be computationally automated?

ukemi · 2021-03-08T15:30:09Z

I see this not only with MGI models, but in SynGO annotations as well.

ukemi · 2021-03-08T15:31:11Z

'Any way this can be computationally automated?'
I hope so. It would be a lot of stuff to do by hand.

vanaukenk · 2021-03-08T15:36:50Z

We will have to update the relations in GO-CAM models computationally, along with some other relation changes.

vanaukenk · 2021-03-08T16:09:31Z

It doesn't look like rule 61 is being implemented correctly.

For example, an input annotation to root cellular component that uses 'located in' is not being repaired to 'is active in'.

Source WB GAF:
WB WBGene00000245 bca-1 located_in GO:0005575 GO_REF:0000015 ND C T13C5.5 gene taxon:6239 20090611 UniProt

GOC-produced GAF:
WB WBGene00000245 bca-1 located_in GO:0005575 GO_REF:0000015 ND C T13C5.5 gene taxon:6239 20090611 UniProt

GOC-produced GPAD:
WB WBGene00000245 located_in GO:0005575 GO_REF:0000015 ECO:0000307 20090611 UniProt
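
For reference, the repair I would expect for root-node annotations can be sketched like this. It is a minimal illustration using the default root relations mentioned earlier in this thread, not the ontobio implementation of the rule; `repair_root_qualifier` is a hypothetical name, and keeping a leading `NOT` is an assumption.

```python
from typing import List

# Default relations for annotations to the three root terms.
ROOT_RELATIONS = {
    "GO:0008150": "involved_in",   # biological_process root
    "GO:0003674": "enables",       # molecular_function root
    "GO:0005575": "is_active_in",  # cellular_component root
}


def repair_root_qualifier(cols: List[str]) -> List[str]:
    """Return a copy of the GAF 2.2 columns with the root relation repaired.

    Rows that are not annotated to a root term are returned unchanged.
    """
    repaired = list(cols)
    expected = ROOT_RELATIONS.get(cols[4])
    if expected is None:
        return repaired
    negated = "NOT" in cols[3].split("|")  # assumption: keep negation as-is
    repaired[3] = ("NOT|" if negated else "") + expected
    return repaired
```

With the bca-1 row above as input, the qualifier column would come back as `is_active_in` rather than `located_in`.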

dougli1sqrd · 2021-03-08T21:35:58Z

Oh! Okay thanks for pointing this out! I'll check it out.

suzialeksander · 2024-07-12T23:53:59Z

closing, as 2.2 has been made for several years at this point

dougli1sqrd self-assigned this Feb 1, 2021

dougli1sqrd mentioned this issue Feb 3, 2021

threading gaf output version number through validate and small bugfix in data output biolink/ontobio#515

Merged

dougli1sqrd mentioned this issue Feb 4, 2021

update ontobio to 2.3.0 for GAF 2.2 output by default in pipeline geneontology/go-site#1613

Merged

kltm added the bug (A: showstopper) label Feb 25, 2021

suzialeksander closed this as completed Jul 12, 2024
