
possibilities to extend content for OBODB UniProt #172

Closed
realmarcin opened this issue Mar 21, 2024 · 5 comments · Fixed by #173

Comments

@realmarcin

Hi @cthoyt ,

Greetings from LBNL!

We would like to ingest microbial protein function from UniProt for KG-Microbe and a host-associated microbiome KG. This will also serve the UniProt protein ingest for KG-Hub KGs. We started by using the UniProt REST API to download a few fields for all proteins from our microbial taxon set (~100k).

Here is the repo for our UniProt2S3 download jenkins pipeline:
https://github.com/Knowledge-Graph-Hub/uniprot2s3

We would be happy with just a minimal set of UniProt API fields, which covers the semantic namespaces for CHEBI, GO, EC, and Rhea:
fields: ["organism_id", "id", "accession", "protein_name", "ec", "ft_binding", "go", "xref_proteomes", "rhea", "reviewed"]
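For illustration, a request for these fields might be assembled as in the sketch below. It assumes the public UniProtKB search endpoint at `rest.uniprot.org` and an `organism_id:` query; the helper name `build_search_url`, the page size, and the example taxon ID (562, E. coli) are hypothetical choices, not taken from the uniprot2s3 pipeline.

```python
from urllib.parse import urlencode

# Return fields listed in this issue.
FIELDS = [
    "organism_id", "id", "accession", "protein_name", "ec",
    "ft_binding", "go", "xref_proteomes", "rhea", "reviewed",
]

# Assumption: the public UniProtKB search endpoint.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(taxon_id: int, size: int = 500) -> str:
    """Build a TSV search URL for one taxon (hypothetical helper)."""
    params = {
        "query": f"organism_id:{taxon_id}",  # restrict to a single taxon
        "fields": ",".join(FIELDS),          # comma-separated return fields
        "format": "tsv",
        "size": size,                        # results per page
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # 562 is used only as an example taxon ID.
    print(build_search_url(562))
```

The sketch only builds the URL and makes no network call, so it can be checked offline before being wired into a downloader.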

This data will then be a perfect complement to your obo-db-ingest Rhea resource, which we already found very easy to use. If this same process could work for UniProt data, a lot of people would be happy, as it's challenging to get it otherwise (e.g., we started exploring DAT files, which would require bespoke parsing).

Would you consider adding these few extra fields (minimal set) to your [UniProtGetter](https://github.com/biopragmatics/pyobo/blob/78b34bc85cccae4ec7a47ba777eed37130c4e48e/src/pyobo/sources/uniprot/uniprot.py#L25C7-L25C20) class?

@hrshdhgd @cmungall @bsantan

@cthoyt
Member

cthoyt commented Mar 21, 2024

Hi @realmarcin, this is definitely possible. What are the relationships you want to use for each field?

@realmarcin
Author

Here is our schema diagram -- all Biolink conformant. Let me know if you have any thoughts or if it looks good!
Screen Shot 2024-03-21 at 2 21 31 PM

And here is the slide in case text is helpful:
https://docs.google.com/presentation/d/1VIT06ROr-WusqJuvya8rj8kpUvLbH0gY-E66JVwYD5Y/edit#slide=id.g26c476712ee_1_0

@cthoyt
Member

cthoyt commented Mar 22, 2024

Thanks @realmarcin for the share, but PyOBO uses RO relations wherever possible. Luckily, RO has a high overlap with Biolink, so most of what's in this diagram can be translated to RO.

Also, it would be helpful if you could explain what each of the fields you want is. I don't know what ft_binding or xref_proteomes are, what kind of data is in them, or how I should use them.

@realmarcin
Author

realmarcin commented Mar 23, 2024

Hi @cthoyt -- here is a gdoc with metadata and explanation of the different fields in the UniProt API request. Let me know if this answers your questions.
https://docs.google.com/document/d/1OEZvDgGu1xOvHRTUDWEvbz3bGFrp4s_u6qXx8y35ZGk/edit?usp=sharing

cthoyt added a commit that referenced this issue Mar 24, 2024
@cmungall

cmungall commented Mar 25, 2024

@realmarcin - I don't think it is biolink conformant, but no worries :-)

Can I simplify the ask here?

The existing pyobo and obo-db-ingest for UniProt is very useful. But it's hardcoded to fetch reviewed (Swiss-Prot) entries only. A number of groups have written duplicative ingest code for UniProt - using DAT files, using SPARQL, etc. I think we should converge on pyobo. But I am told that the REST call doesn't scale for including, say, all GCRPs. If we can solve the general strategy, then I think we can make it such that people can get the nodes and edges they need (many of which should not be put in the OBO, see biopragmatics/obo-db-ingest#13)
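As background on the scaling point, large UniProt REST result sets are typically walked page by page, with each response carrying a `Link` header that points at the next page. The sketch below shows only the header-parsing step; the helper name `next_page_url` and the regular expression are assumptions for illustration, not code from pyobo, and whether pagination is enough for something the size of all GCRPs is not settled in this thread.

```python
import re
from typing import Optional

# Matches the URL of the entry marked rel="next" in an HTTP Link header.
_NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the next-page URL from a Link header, or None on the last page.

    Hypothetical helper: a caller would fetch a page, pass the response's
    Link header here, and repeat until None is returned.
    """
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None
```

A download loop built on this would request the first page, process its rows, and keep following `next_page_url(...)` until it returns `None`.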

cthoyt added a commit that referenced this issue Apr 1, 2024