Ensembl IEA pipeline #5411

Open
ValWood opened this issue Aug 12, 2024 · 14 comments

Comments

@ValWood
Contributor

ValWood commented Aug 12, 2024

This Ensembl pipeline is probably one of the largest sources of incorrect IEA annotation.

One major issue that could be fixed is that the relationships appear to be ignored.

So in this case

UniProtKB:P40630 | Tfam | acts_upstream_of_or_within | GO:0033108    mitochondrial respiratory chain complex assembly | ECO:0000315   IMP | PMID:11259653 | MGI:MGI:1860962 more... | 10090 Mus musculus | MGI | occurs_in (EMAPA:16105)

is instantiated in human as

UniProtKB:Q00059 | TFAM | involved_in | GO:0033108    mitochondrial respiratory chain complex assembly | ECO:0000265   IEA | GO_REF:0000107 | UniProtKB:P40630 more... | 9606 Homo sapiens | Ensembl

This is VERY bad.

I'm not sure who manages this pipeline these days, but can we ask that only involved_in, enables, part_of, and located_in annotations are transferred?
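To illustrate the restriction requested above, here is a minimal sketch of a relation filter applied to source annotations before orthology-based transfer. The record layout, the `transferable` function, and the second example record (with a placeholder GO ID) are all assumptions made for illustration; this is not the Ensembl pipeline's actual code.

```python
from dataclasses import dataclass

# Relations proposed in this issue as acceptable for automatic transfer.
ALLOWED_RELATIONS = frozenset({"involved_in", "enables", "part_of", "located_in"})


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified annotation record (not a full GAF/GPAD row)."""
    gene_product: str
    relation: str
    go_term: str
    evidence: str


def transferable(annotations):
    """Return only the source annotations whose relation is in the allowed set."""
    return [a for a in annotations if a.relation in ALLOWED_RELATIONS]


source = [
    # The mouse Tfam annotation quoted in this issue: should NOT be transferred.
    Annotation("UniProtKB:P40630", "acts_upstream_of_or_within", "GO:0033108", "IMP"),
    # Made-up record with a placeholder GO ID, included only to show a kept row.
    Annotation("UniProtKB:P40630", "enables", "GO:0000000", "IDA"),
]

kept = transferable(source)
for a in kept:
    print(a.gene_product, a.relation, a.go_term)
```

With a filter like this, the acts_upstream_of_or_within annotation is dropped rather than being re-labelled as involved_in on the target gene product.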

Although, since we now have PAINT, and PAINT transfers only the core annotations avoiding indirect, do we still need this pipeline? It's clearly a large source of erroneous propagation. I see a lot, but I don't report them because there is no way to get these fixed on a case-by-case basis.

@pgaudet
@alexsign
@thomaspd
@cmungall

@alexsign

@ValWood Hi Val, I run the EnsEMBL IEA pipeline. I don't think I can stop the pipeline, because EnsEMBL relies on the annotation data from it. It most likely has different coverage and runs on a different set of organisms compared to PAINT. However, if we all agree on specific rules and restrictions, I would be happy to make changes to it.

@alexsign

@ValWood By the way, where do you see it? I checked QuickGO and cannot find this human annotation.

@ValWood
Contributor Author

ValWood commented Aug 12, 2024

@alexsign

@ValWood sorry I missed it.

@ValWood
Contributor Author

ValWood commented Aug 12, 2024

It's easy to miss, there are too many ;)

This might not be possible, but it would be nice if QuickGO could display EXP first (that would not have helped you here though, but it would be better if evidence codes were grouped!)

@alexsign

@ValWood Should I propagate the "acts_upstream_of_or_within" relation to the Q00059 annotation, or not touch annotations with this kind of relation at all? We are also not propagating annotation extensions, which I'm not sure is always possible.

@ValWood
Contributor Author

ValWood commented Aug 12, 2024

I think it should not be propagated at all, but this decision should have input from others. I suggest GO-managers discuss this and suggest a solution.
val

@cmungall
Member

FWIW, MGI has 23k IBAs where an acts_X is the sole MGI support

E.g.

getgaf mgi | egrep '\tMGI:96605' | grep 0007160
MGI	MGI:96605	Itga6	involved_in	GO:0007160	GO_REF:0000033	IBA	PANTHER:PTN001144284|UniProtKB:P20701|UniProtKB:P13612|UniProtKB:P06756|MGI:MGI:96605|UniProtKB:P53708|UniProtKB:P56199|UniProtKB:A0A0G2K470|UniProtKB:O75578|MGI:MGI:96601|UniProtKB:Q9UKX5|UniProtKB:P17301	P	integrin alpha 6	Cd49f|5033401O05Rik	protein_coding_gene	taxon:10090	20211108	GO_Central
MGI	MGI:96605	Itga6	acts_upstream_of_or_within	GO:0007160	PMID:16365040	IMP		P	integrin alpha 6	Cd49f|5033401O05Rik	protein_coding_gene	taxon:10090	20060706	MGI	occurs_in(EMAPA:17922)

This suggests that MGI under-uses involved_in. We know this because, when many MODs moved to using relations, the more conservative interpretation was applied to historic annotations.

Putting my query here for my own notes:

SELECT
    u.gene,
    a.ontology_class_ref,
    a.*
FROM
    gaf_association a,
    UNNEST(a.with_or_from_list) AS u(gene),
    acts_upstream au
WHERE
    a.evidence_type = 'IBA'
    AND a.qualifiers = 'involved_in'
    AND u.gene LIKE 'MGI%'
    AND au.db_object_id = truncate_prefix(u.gene)
    AND au.ontology_class_ref = a.ontology_class_ref
    AND NOT EXISTS (
        SELECT 1
        FROM involved_in ii
        WHERE ii.db_object_id = truncate_prefix(u.gene)
        AND ii.ontology_class_ref = a.ontology_class_ref
        AND ii.evidence_type != 'IBA'
    )
    ;
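
For readers without access to the underlying database, the logic of the query above can be sketched in Python over in-memory records. This is an illustration under stated assumptions: `truncate_prefix` is assumed to turn `MGI:MGI:96605` into `MGI:96605`, the `acts_upstream` view is assumed to hold annotations whose qualifier starts with `acts_upstream_of`, the dictionary field names are invented, and the with/from list is shortened from the Itga6 GAF row shown earlier.

```python
def truncate_prefix(curie: str) -> str:
    """Assumed behaviour of the SQL helper: 'MGI:MGI:96605' -> 'MGI:96605'."""
    return curie.split(":", 1)[1]


def iba_supported_only_by_acts_upstream(annotations):
    """Find IBA involved_in annotations whose MGI with/from gene has an
    acts_upstream_of* annotation to the same term and no non-IBA involved_in one."""
    acts_upstream = {
        (a["id"], a["term"])
        for a in annotations
        if a["qualifier"].startswith("acts_upstream_of")
    }
    involved_non_iba = {
        (a["id"], a["term"])
        for a in annotations
        if a["qualifier"] == "involved_in" and a["evidence"] != "IBA"
    }
    hits = []
    for a in annotations:
        if a["evidence"] != "IBA" or a["qualifier"] != "involved_in":
            continue
        for gene in a["with_from"]:
            if not gene.startswith("MGI"):
                continue
            key = (truncate_prefix(gene), a["term"])
            if key in acts_upstream and key not in involved_non_iba:
                hits.append((gene, a["term"], a["id"]))
    return hits


# The two Itga6 rows from the GAF excerpt above, reduced to the fields used here.
rows = [
    {"id": "MGI:96605", "qualifier": "involved_in", "term": "GO:0007160",
     "evidence": "IBA",
     "with_from": ["PANTHER:PTN001144284", "UniProtKB:P20701",
                   "MGI:MGI:96605", "MGI:MGI:96601"]},
    {"id": "MGI:96605", "qualifier": "acts_upstream_of_or_within",
     "term": "GO:0007160", "evidence": "IMP", "with_from": []},
]

hits = iba_supported_only_by_acts_upstream(rows)
print(hits)
```

Here only `MGI:MGI:96605` is reported: it has an acts_upstream_of_or_within annotation to GO:0007160 and no experimental involved_in annotation, whereas `MGI:MGI:96601` has no matching row in this small sample.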
    

@ValWood
Contributor Author

ValWood commented Aug 13, 2024

For PAINT the transfer probably isn't such a problem, as it is curated by @marcfeuermann.
Perhaps this information could be used by MGI to elevate these annotations to 'involved_in'.

The Ensembl pipeline is transferring everything, and so it defeats the purpose of distinguishing involved_in from causally_upstream_of_or_within (and makes it difficult to recommend that prediction pipelines not do the same). Although it also provides another way to game the system...

@cmungall
Member

> Perhaps this information could be used by MGI to elevate these annotations to 'involved_in'

Definitely! And also training for AI that can help us with the manual promotion

> The Ensembl pipeline is transferring everything, and so it defeats the purpose of distinguishing involved_in from causally_upstream_of_or_within (and makes it difficult to recommend that prediction pipelines not do the same). Although it also provides another way to game the system...

Yes, auto-propagating these and promoting the predicate is the worst scenario. If the predicate were used consistently, then dropping these annotations during propagation would make sense, since these kinds of downstream effects are less likely to be preserved(?). It may still be a good idea to drop them altogether, but in some cases, e.g. MGI, the dropping may be partly arbitrary. Maybe not such a big deal, as we get such good propagation from PAINT.

@ValWood
Contributor Author

ValWood commented Aug 19, 2024

Here is an example where lots of off-target annotations are transferred to human:
https://www.ebi.ac.uk/QuickGO/annotations?geneProductId=Q8BH31

A transmembrane transporter (probably chloride) becomes annotated to glycolytic process, glycolipid metabolic process, autophagy, apoptotic process, mitochondrion organization, lysosome organization, gene expression, TORC1 signaling, cellular respiration, motor behavior, and many, many, many more.

@pgaudet
Contributor

pgaudet commented Aug 20, 2024

We should get some input from @ukemi and @LiNiMGI, but AFAIK the PAINT and GO-CAM MGI annotations use 'involved in', while others use 'acts upstream of or within'. @alexsign's suggestion to only use 'involved in' for IEA annotations seems very reasonable; it would certainly reduce false positives.

@vanaukenk
Contributor

I agree about getting input from MGI curators.

What we really want is meaningful gp2term relations that users can employ to better understand and filter annotations, but when we started requiring a gp2term value, it wasn't feasible for groups to manually review all their existing BP annotations (where I think this issue most manifests, although also to some extent with CC).

Outlining, and then hopefully implementing, a computational strategy to help select appropriate gp2term relations for manual annotations seems the most promising way to move forward with this. Then restricting automatically propagated annotations to source annotations using the more granular gp2term relations seems reasonable.

@ukemi
Contributor

ukemi commented Aug 28, 2024

MGI annotations that are from causal GO-CAM models use involved_in if the MF that the gene product enables is part of the BP. For standard annotations and legacy annotations, the more general GP-to-GO-BP relation was used, since we couldn't confidently call the involved_in.
I think that trying to deepen those relations is a laudable goal, but any automated method would need to be sanity checked by a biologist. I would also be concerned that the deepening might not be supported by the evidence and reference given in the annotation. What does the evidence in a conventional GO annotation cover? If it is both the GP-to-term assignment and the association of a GP with a BP then annotation evidence would often be insufficient.
My 2c is that from making causal GO-CAM models, GP to BP relationships are the result of taking the entire model into consideration and require thought about which MFs are 'in' and which MFs are 'out'.
