Generate RO-Biolink predicate mappings based on a particular Biolink model #104

Open
wants to merge 39 commits into
base: master

gaurav (Member) commented Jul 30, 2023

Adds scripts/generate_ro_biolink_mapping.sc, a Scala CLI script for generating a list of mappings between RDF predicates and Biolink predicates, drawn from three sources:

  1. The Biolink model (https://github.com/biolink/biolink-model/blob/68d4e3d7612275d0d7e832a9919bf8666e1d5fde/biolink-model.yaml)
  2. The Biolink model's predicate mappings file (https://github.com/biolink/biolink-model/blob/68d4e3d7612275d0d7e832a9919bf8666e1d5fde/predicate_mapping.yaml)
  3. A few manual annotations from cam-kp-api PR 640.

These are written into the ro-to-biolink-predicate-mappings.tsv file (which I've included in this PR). If you want to see all the predicate mappings (not just the RO/GOREL ones), they are in ro-to-biolink-predicate-mappings-all.tsv (https://github.com/ExposuresProvider/cam-pipeline/blob/e1d6dd063c43de31ac736dbd0ce1ee57008f64fc/ro-to-biolink-predicate-mappings-all.tsv).

This file is then used by scripts/kg_edges.dl to add "qualifiers" to kg.tsv. This does seem to work currently, producing output like:

GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-9645460     infores:go-cam
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-9645460     infores:go-cam  {"biolink:object_direction_qualifier":"upregulated"}
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-937042      infores:go-cam
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-937042      infores:go-cam  {"biolink:object_direction_qualifier":"upregulated"}
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-983168      infores:go-cam
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-983168      infores:go-cam  {"biolink:object_direction_qualifier":"upregulated"}
GO:0004842      biolink:regulates       GO:0004674      http://model.geneontology.org/62b4ffe300004589  infores:go-cam
GO:0004842      biolink:regulates       GO:0004674      http://model.geneontology.org/62b4ffe300004589  infores:go-cam  {"biolink:object_direction_qualifier":"upregulated"}
[...]
GO:0022857	biolink:affects	CHEBI:641	http://model.geneontology.org/5d29221b00001552	infores:go-cam	{"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
GO:0051640	biolink:affects	GO:0140494	http://model.geneontology.org/5ee8120100001898	infores:go-cam	{"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
GO:0031503	biolink:affects	ComplexPortal:CPX-532	http://model.geneontology.org/5df932e000000551	infores:go-cam	{"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
GO:0034504	biolink:affects	MGI:MGI:3036269	http://model.geneontology.org/5df932e000003298	infores:go-cam	{"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
GO:0016197	biolink:affects	GO:0005770	http://model.geneontology.org/5ee8120100000250	infores:go-cam	{"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
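
For reference, here is a minimal sketch of how the final column in the sample above could be assembled. It is written in Python rather than the script's Scala, and the function name, argument names, and key order are assumptions inferred only from the sample rows, not taken from the script itself.

```python
import json

# Assumed key order, inferred from the sample output: qualified predicate,
# then aspect, then direction. The (argument name, Biolink key) pairs are
# illustrative and not taken from the real script.
QUALIFIER_KEYS = [
    ("qualified_predicate", "biolink:qualified_predicate"),
    ("aspect", "biolink:object_aspect_qualifier"),
    ("direction", "biolink:object_direction_qualifier"),
]


def qualifier_column(qualified_predicate=None, aspect=None, direction=None):
    """Build the '||'-separated qualifier column; unset qualifiers are skipped."""
    values = {
        "qualified_predicate": qualified_predicate,
        "aspect": aspect,
        "direction": direction,
    }
    parts = [
        # Compact separators reproduce the no-whitespace JSON seen in kg.tsv.
        json.dumps({key: values[name]}, separators=(",", ":"))
        for name, key in QUALIFIER_KEYS
        if values[name] is not None
    ]
    return "||".join(parts)
```

Calling `qualifier_column("biolink:causes", "transport", "increased")` yields the three-entry form shown in the `biolink:affects` rows, and `qualifier_column(direction="upregulated")` yields the single-entry form shown in the `biolink:regulates` rows.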

Things to do:

  • I came up with a hacky JSON export because I couldn't get .asJson from Circe to work. Help?
  • We have some redundant predicates between ro-to-biolink-local-mappings.tsv and ro-to-biolink-predicate-mappings.tsv -- any examples in the original list should be deleted so that only the qualified predicate is used.
  • We should check ro-to-biolink-local-mappings.tsv for any predicates that have been deleted -- we can temporarily add those directly to scripts/generate_ro_biolink_mapping.sc, but eventually we should get those into the Biolink model.
  • We need to talk to Evan about how to transform kg.tsv into a format that he can import into ORION -- that final column is currently a custom format (JSONL, except with "||" instead of newlines to separate entries). We could transform that into a full JSON list fairly easily if we need to.
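
As a rough illustration of the last item, the "||"-separated column could be turned into a regular JSON list along these lines. This is a Python sketch, not part of the PR; the function names are hypothetical, and it assumes kg.tsv is tab-separated with the qualifiers in an optional sixth column and that "||" never occurs inside a qualifier value.

```python
import json


def qualifiers_to_json_list(column):
    """Convert a '||'-separated run of JSON objects into one JSON array string."""
    if not column.strip():
        return "[]"
    # Assumes '||' never appears inside a JSON string value.
    entries = [json.loads(part) for part in column.split("||")]
    return json.dumps(entries)


def convert_line(line):
    """Rewrite the optional sixth column of one kg.tsv line as a JSON list."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) > 5:
        columns[5] = qualifiers_to_json_list(columns[5])
    return "\t".join(columns)
```

Rows without a qualifier column pass through unchanged, so the conversion can be applied to every line of the file.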

This PR also adds the command for generating ro-to-biolink-predicate-mappings.tsv, although at the moment this will never be run, as the GitHub repo includes the predicate mappings file.

WIP: will close #95 once implemented.

gaurav added 30 commits July 30, 2023 00:31
Also deleted ro-to-biolink-mappings.tsv, which contains all the
mappings.
Also added ro-to-biolink-predicate-mappings-all.tsv.
gaurav (Member, Author) commented Aug 28, 2023

@balhoff I've now added checks that (1) look for duplication between the local mappings file and the generated predicate mappings file, and (2) look for Biolink predicates that are not present in the Biolink model. So far, I'm just printing out the PredicateMappings of concern (which are based on the predicate mappings file generated as part of the Biolink model), so unfortunately the output isn't very readable. Here's what it looks like right now, with 15 warnings:

01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ -- Found 15 mapping warnings:
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:increases_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(increases secretion of),Some(secretion),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_secretion_of)),None,None,None), PredicateMappingRow(Some(increases secretion of),Some(secretion),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:increases_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(increases splicing of),Some(splicing),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_splicing_of, CTD:increases_RNA_splicing)),None,None,None), PredicateMappingRow(Some(increases splicing of),Some(splicing),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:affects_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(affects secretion of),Some(secretion),None,biolink:affects,None,Some(Set(CTD:affects_secretion_of)),None,Some(Set(CTD:affects_export)),None), PredicateMappingRow(Some(affects secretion of),Some(secretion),None,biolink:affects,None,Some(Set(CTD:affects_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:decreases_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(decreases secretion of),Some(secretion),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_secretion_of)),None,None,None), PredicateMappingRow(Some(decreases secretion of),Some(secretion),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:decreases_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(decreases splicing of),Some(splicing),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_splicing_of, CTD:decreases_RNA_splicing)),None,None,None), PredicateMappingRow(Some(decreases splicing of),Some(splicing),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:affects_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(affects splicing of),Some(splicing),None,biolink:affects,None,Some(Set(CTD:affects_splicing_of)),None,None,None), PredicateMappingRow(Some(affects splicing of),Some(splicing),None,biolink:affects,None,Some(Set(CTD:affects_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps RO:0002212 to multiple Biolink terms: List(PredicateMappingRow(Some(entity negatively regulates entity),None,Some(downregulated),biolink:regulates,None,Some(Set(RO:0002212, RO:0002449)),None,None,None), PredicateMappingRow(Some(process negatively regulates process),None,Some(downregulated),biolink:regulates,None,Some(Set(RO:0002212)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps RO:0002313 to multiple Biolink terms: List(PredicateMappingRow(None,None,None,biolink:affects,None,None,None,None,Some(Set(RO:0002313))), PredicateMappingRow(Some(increases transport of),Some(transport),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_transport_of)),None,None,Some(HashSet(RO:0002313, GAMMA:transporter, RO:0002340, GAMMA:carrier, RO:0002345))))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:increases_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(increases secretion of),Some(secretion),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_secretion_of)),None,None,None), PredicateMappingRow(Some(increases secretion of),Some(secretion),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:increases_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(increases splicing of),Some(splicing),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_splicing_of, CTD:increases_RNA_splicing)),None,None,None), PredicateMappingRow(Some(increases splicing of),Some(splicing),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:affects_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(affects secretion of),Some(secretion),None,biolink:affects,None,Some(Set(CTD:affects_secretion_of)),None,Some(Set(CTD:affects_export)),None), PredicateMappingRow(Some(affects secretion of),Some(secretion),None,biolink:affects,None,Some(Set(CTD:affects_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:decreases_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(decreases secretion of),Some(secretion),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_secretion_of)),None,None,None), PredicateMappingRow(Some(decreases secretion of),Some(secretion),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:decreases_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(decreases splicing of),Some(splicing),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_splicing_of, CTD:decreases_RNA_splicing)),None,None,None), PredicateMappingRow(Some(decreases splicing of),Some(splicing),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:affects_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(affects splicing of),Some(splicing),None,biolink:affects,None,Some(Set(CTD:affects_splicing_of)),None,None,None), PredicateMappingRow(Some(affects splicing of),Some(splicing),None,biolink:affects,None,Some(Set(CTD:affects_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps RO:0002212 to multiple Biolink terms: List(PredicateMappingRow(Some(entity negatively regulates entity),None,Some(downregulated),biolink:regulates,None,Some(Set(RO:0002212, RO:0002449)),None,None,None), PredicateMappingRow(Some(process negatively regulates process),None,Some(downregulated),biolink:regulates,None,Some(Set(RO:0002212)),None,None,None))

We can ignore the CTD mappings since we currently don't export those at all.

However, it looks like the following terms are duplicated:

  • RO:0002212 ("negatively regulates"): mapped to two pre-existing terms in the Biolink v3.5.4 predicate_mapping.yaml file, both of which map to biolink:regulates (direction: downregulated). However, we should figure out how to handle this in code so that we don't add it to the output file twice (which is what we currently do).
    • Suggested action: the duplication is unnecessary in this case, since it's only there to handle some previous Biolink predicates that have been deprecated. We should eliminate all perfect duplicates (i.e. where they map to the same Biolink predicate).
  • RO:0002313 ("transports or maintains localization of"): mapped to biolink:affects (aspect: transport, direction: increased, qualified predicate: causes) by the predicate mappings file, but to biolink:affects in the local mappings file.
    • Suggested action: we should remove RO:0002313 from the local mappings file.
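
One possible way to tell the two cases apart (perfect duplicates versus genuinely conflicting mappings) is sketched below. It is in Python rather than the script's Scala, and the `Mapping` record and function names are hypothetical simplifications of the PredicateMappingRow values seen in the log, not the script's actual types.

```python
from collections import defaultdict
from typing import FrozenSet, NamedTuple, Optional


class Mapping(NamedTuple):
    # Hypothetical, simplified stand-in for a PredicateMappingRow.
    predicate: str
    aspect: Optional[str]
    direction: Optional[str]
    qualified_predicate: Optional[str]
    sources: FrozenSet[str]


def targets_by_source(rows):
    """Index each source CURIE by the distinct Biolink targets it maps to."""
    index = defaultdict(set)
    for row in rows:
        target = (row.predicate, row.aspect, row.direction, row.qualified_predicate)
        for curie in row.sources:
            # A set collapses perfect duplicates (identical targets).
            index[curie].add(target)
    return dict(index)


def conflicts(rows):
    """Return only the source CURIEs that map to more than one distinct target."""
    return {c: t for c, t in targets_by_source(rows).items() if len(t) > 1}


rows = [
    # The two RO:0002212 rows map to the same target, so they collapse.
    Mapping("biolink:regulates", None, "downregulated", None,
            frozenset({"RO:0002212", "RO:0002449"})),
    Mapping("biolink:regulates", None, "downregulated", None,
            frozenset({"RO:0002212"})),
    # The two RO:0002313 rows differ in qualifiers, so they conflict.
    Mapping("biolink:affects", None, None, None, frozenset({"RO:0002313"})),
    Mapping("biolink:affects", "transport", "increased", "biolink:causes",
            frozenset({"RO:0002313"})),
]
```

With these rows, `conflicts(rows)` reports only RO:0002313, which matches the two suggested actions: the RO:0002212 duplication disappears once identical targets are collapsed, while RO:0002313 needs a manual decision.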

gaurav (Member, Author) commented Aug 28, 2023

I've deleted RO:0002313 from local mappings in 797ff28.

gaurav marked this pull request as ready for review October 6, 2023 15:28
gaurav requested a review from balhoff October 6, 2023 15:28
gaurav (Member, Author) commented Nov 6, 2023

Hi @balhoff -- just wanted to poke you to review this PR. If you need help incorporating it into the changes you've made to re-add CTD, please let me know.

gaurav (Member, Author) commented Feb 25, 2024

Hi @balhoff -- just wanted to poke you to review this PR. If you need help incorporating it into the changes you've made to re-add CTD, please let me know.


Successfully merging this pull request may close these issues.

map RO terms to qualified Biolink predicates