Ensembl transcript ENST00000617537.5 sequence is genomic not cdna #75
Comments
Well, that's very surprising. I'll need to investigate. FWIW, UCSC appears to have fallen into the same issue:
Okay, found it. Not a bug. Ensembl itself returns this sequence:
(+1 char for the newline) Also, the code path that you're using is bioutils.seqfetcher, via hgvs dataproviders. I double checked seqrepo and it's fine:
@reece the Ensembl REST API endpoint https://rest.ensembl.org/documentation/info/sequence_id has an argument
If you look, I'm using So should SeqRepo use the cdna rather than genomic? That looks like how it's done with RefSeq?
This issue is totally unrelated to SeqRepo. SeqFetcher is part of bioutils. I've just moved this issue from biocommons.seqrepo to bioutils.
The underlying cause is that an ENST id refers to a family of sequences, not a single sequence. The cDNA sequence is what I consider to be the main sequence, and that's what's in SeqRepo. It's also what's shown on Ensembl web pages for a given ENST, and it's what corresponds to RefSeq transcripts. It is very unfortunate that a single id refers to three sequences.
Honestly, this all makes me think that we should remove support for fetching ENSTs with SeqFetcher (here) since it can't be done unambiguously. It's better not to have this support than to try to chase Ensembl's intentions with identifiers. Alternatively, we could assume cdna and add type=cdna to the HTTP GET request.
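For illustration, the "assume cdna and add type=cdna" alternative could look roughly like the sketch below. This is not the bioutils implementation: the function names `ensembl_sequence_url` and `fetch_ensembl_sequence` are hypothetical, and the sketch only assumes the Ensembl REST `/sequence/id/:id` endpoint with its `type` parameter and a `text/plain` response. How versioned ids are handled is not addressed here.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def ensembl_sequence_url(ac, seq_type="cdna"):
    """Build a /sequence/id URL that always states the sequence type.

    Hypothetical helper: passing type explicitly avoids relying on the
    server-side default for transcript ids.
    """
    query = urlencode({"type": seq_type})
    return f"{ENSEMBL_REST}/sequence/id/{quote(ac)}?{query}"


def fetch_ensembl_sequence(ac, seq_type="cdna", timeout=30):
    """Fetch the sequence as plain text (requires network access)."""
    req = Request(
        ensembl_sequence_url(ac, seq_type),
        headers={"Content-Type": "text/plain"},
    )
    with urlopen(req, timeout=timeout) as resp:
        # Strip the trailing newline so len() reflects the base count.
        return resp.read().decode("ascii").strip()


if __name__ == "__main__":
    print(ensembl_sequence_url("ENST00000617537"))
```

With the type stated in the URL, the length of the returned string can be compared directly against the length shown on the Ensembl transcript page.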
OK with you moving the issue to wherever you think is best. I've re-opened it as I think it's an open question what to do about ENSTs. Hope that's ok.
I'm fine with you reopening it if you think it needs more discussion. The whole point of seqfetcher is to create a simple interface for fetching a sequence for a given identifier (and just the identifier). Identifiers by definition should map to only one value. ENSTs are not actually identifiers at all in this sense. So, my hot take is that we should just remove support for ENSTs from seqfetcher. If we're going to leave this open, would you please suggest what outcome you would like to see from this issue?
Just pitching in: I'd be a little bummed if we lost support for Ensembl transcripts. Could we just assume cDNA and log a |
Yep, we could assume cDNA. That's definitely the best solution.
I also vote for cdna by default. My main goal is for Ensembl HGVS to resolve correctly or, failing that, not at all (better not to support it).
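A minimal sketch of the "cdna by default" policy discussed above, under stated assumptions: the helper name `choose_sequence_type` is hypothetical, the set of type values is taken from the Ensembl REST sequence endpoint, and logging at warning level is an assumption since the thread does not settle what should be logged.

```python
import logging

logger = logging.getLogger("ensembl_default_type")

# Type values accepted by the Ensembl REST sequence endpoint.
VALID_TYPES = ("genomic", "cdna", "cds", "protein")


def choose_sequence_type(ac, seq_type=None):
    """Pick the sequence type to request for an accession.

    An explicit type always wins. For ENST ids with no type given,
    fall back to cdna and log that an assumption was made. For other
    ids, return None so the caller leaves the server default alone.
    """
    if seq_type is not None:
        if seq_type not in VALID_TYPES:
            raise ValueError(f"unknown sequence type: {seq_type!r}")
        return seq_type
    if ac.startswith("ENST"):
        # Assumed log level; the point is that the default is visible.
        logger.warning("No sequence type given for %s; assuming cdna", ac)
        return "cdna"
    return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(choose_sequence_type("ENST00000617537"))
```

Keeping the default in one small function means the choice is made in a single place and shows up in logs, rather than depending on whatever the remote API returns when no type is supplied.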
https://asia.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000136250;r=7:36512941-36724494;t=ENST00000617537
The web page reports that the sequence is 2385 bases long
Ensembl API is in agreement:
SeqRepo returns a much longer sequence: