possibilities to extend content for OBODB UniProt #172
Comments
Hi @realmarcin, this is definitely possible. What are the relationships you want to use for each field?
Here is our schema diagram -- all biolink conformant. Let me know if you have any thoughts or if it looks good! And here is the slide in case the text is helpful:
Thanks @realmarcin for the share, but PyOBO is using RO relations wherever possible. Luckily, RO has a high overlap with Biolink, so most of what's in this diagram can be translated to RO. Also, it would be helpful if you could explain what each of the fields you want is. I don't know what ft_binding or xref_proteomes are, what kind of data is in them, or how I should use them.
Hi @cthoyt -- here is a gdoc with metadata and explanation of the different fields in the UniProt API request. Let me know if this answers your questions.
@realmarcin - I don't think it is biolink conformant, but no worries :-) Can I simplify the ask here? The existing pyobo and obo-db-ingest for UniProt is very useful, but it's hardcoded to get reviewed (Swiss-Prot) entries only. A number of groups have written duplicative ingest code for UniProt, using DAT files, SPARQL, etc. I think we should converge on pyobo. But I am told that the REST call doesn't scale to including, say, all GCRPs. If we can solve the general strategy, then I think we can make it such that people can get the nodes and edges they need (many of which should not be put in the OBO, see biopragmatics/obo-db-ingest#13).
Hi @cthoyt ,
Greetings from LBNL!
We would like to ingest microbial protein function from UniProt for KG-Microbe and a host-associated microbiome KG. This will also serve as the UniProt protein ingest for KG-Hub KGs. We started by using the UniProt REST API to download a few fields for all proteins from our microbial taxon set (~100k).
Here is the repo for our UniProt2S3 download Jenkins pipeline:
https://github.com/Knowledge-Graph-Hub/uniprot2s3
We would be happy with just a minimal set of UniProt API fields, which covers the semantic namespaces for CHEBI, GO, EC, and Rhea:
fields: ["organism_id", "id", "accession", "protein_name", "ec", "ft_binding", "go", "xref_proteomes", "rhea", "reviewed"]
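For illustration, here is a minimal sketch of how a per-taxon request for these fields could be assembled against the UniProt REST search endpoint. The helper name `build_uniprot_url`, the page size of 500, and the example taxon ID are assumptions made for this sketch; this is not the code used in uniprot2s3 or pyobo.

```python
from urllib.parse import urlencode

# UniProtKB search endpoint of the UniProt REST API.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"

# The minimal field set listed above.
FIELDS = [
    "organism_id", "id", "accession", "protein_name", "ec",
    "ft_binding", "go", "xref_proteomes", "rhea", "reviewed",
]


def build_uniprot_url(taxon_id, fields=FIELDS, size=500):
    """Build a TSV search URL for one NCBI taxon ID (hypothetical helper).

    The page size is an assumed value; this only constructs the URL and
    performs no network request.
    """
    params = {
        "query": f"organism_id:{taxon_id}",
        "fields": ",".join(fields),
        "format": "tsv",
        "size": size,
    }
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # 562 is the NCBI taxon ID for Escherichia coli, used only as an example.
    print(build_uniprot_url(562))
```

Looping such a request over ~100k taxa is where the scaling question comes in, so this sketch only shows the shape of a single query.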
This data will then be a perfect complement to your obo-db-ingest Rhea resource, which we already found very easy to use. If this same process could work for UniProt data, a lot of people would be happy, as it's challenging to get it otherwise (e.g., we started exploring DAT files, which would require bespoke parsing).
Would you consider adding these few extra fields (minimal set) to your [UniProtGetter](https://github.com/biopragmatics/pyobo/blob/78b34bc85cccae4ec7a47ba777eed37130c4e48e/src/pyobo/sources/uniprot/uniprot.py#L25C7-L25C20) class?
@hrshdhgd @cmungall @bsantan