Non-human genomes in HGVS/UTA #598
Comments
Hi @sachalau : Your request was very clear. Thank you for raising this issue so thoughtfully! You've analyzed the problem well. These are the barriers:
Finally, let me say that loading UTA is not easy. I desperately want to overhaul and generalize it in order to support more species and custom builds, but there's no financial support for that (yet).
Thanks for your answer. Here is an example I have been working on: a single gene from my organism of interest, defined in GenBank (gbk) format. The reference is NC_000962.3.
I created a test/ folder in loading/data/ with two very simple files, exonset.gz:
and txinfo.gz:
However, I'm stuck at loading them into the DB because I can't install the "uta" utility scripts used for loading, so the make load-XXX targets fail. I think there are some Python 2/3 compatibility problems that I can't resolve. Also, before tackling the codon translation problem, I need a few more pieces of information to understand what needs to be done for bacteria, given that the only difference is in the start codons.
For comparison, in snpEff, since you supply the full set of start codons, all mutations in the first codon are reported as p.(Met1?), although in the Sequence Ontology some could be classified as start_lost and others as start_retained_variant (http://sequenceontology.org/browser/current_svn/term/SO:0002019). I realize that HGVS probably never had to handle these cases because humans only have one start codon, so my problem regarding the genetic code is irrelevant at the moment...
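The start_lost versus start_retained_variant distinction can be sketched in a few lines of Python. This is only an illustration, not how snpEff or hgvs behave: the function name, the "no_change" label, and both start codon sets are hypothetical, and the caller is expected to supply whichever set applies to the organism.

```python
def classify_first_codon_change(ref_codon, alt_codon, start_codons):
    """Classify a substitution in the first codon using Sequence Ontology terms.

    start_codons is an assumption supplied by the caller; it is not read
    from any tool or translation table.
    """
    ref, alt = ref_codon.upper(), alt_codon.upper()
    starts = {codon.upper() for codon in start_codons}
    if ref not in starts:
        raise ValueError(f"{ref} is not a start codon in the given set")
    if alt == ref:
        return "no_change"  # hypothetical label, not a Sequence Ontology term
    return "start_retained_variant" if alt in starts else "start_lost"


# Illustrative sets only: ATG alone, and ATG plus two commonly cited
# bacterial alternatives.
SINGLE_START = {"ATG"}
EXAMPLE_BACTERIAL_STARTS = {"ATG", "GTG", "TTG"}

print(classify_first_codon_change("ATG", "GTG", SINGLE_START))
print(classify_first_codon_change("ATG", "GTG", EXAMPLE_BACTERIAL_STARTS))
```

With a single-start set, every change to the first codon is a start loss; with a larger set, the same nucleotide change can keep a valid start, which is the case p.(Met1?) does not distinguish.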
Hi @sachalau: Please try pulling UTA again. @andreasprlic and I have been loading new data and came across similar issues. We're using Python 3.7 and I think HEAD should now work.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.
This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.
Hello @reece,
Following issue #569, I was wondering whether you had any news about incorporating other genome references into UTA and then using it for HGVS validation with this package.
At the moment I'm using snpEff to annotate my variants and HGVS for Parse() and IntrinsicValidation(). However, I'm wondering whether I could use UTA/HGVS directly, because for some variants snpEff gives incorrect results.
Regarding my particular bacterial genome of interest, one difference in the annotation is that it has no transcript data. In the GenBank/GFF you go directly from a "gene" with a "locus_tag" as accession to a CDS for the protein with "NP_" accessions. For non-coding genes, the accession for the gene and for the transcript ("rRNA", "tRNA" or "ncRNA" in the GenBank/GFF) is identical.
I guess one trick would be to use the same accession for the transcript and the gene, as an intermediate between gene and CDS. It would be a good enough approximation of reality for prokaryotic genomes.
The alignment with Splign would then be trivial, but I suspect that UTA needs the Splign output files for loading, so it might still need to be performed. One problem, however, is that the accession for the transcript would not be formatted as "NM_" but as a locus tag, which theoretically won't be allowed by HGVS, although locus tags are stable.
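A minimal sketch of that trick, assuming a gene with no introns: the locus tag doubles as the transcript accession, the single exon is the gene span, and the CDS covers the whole pseudo-transcript. The class and field names are hypothetical and are not UTA's schema; the locus tag and coordinates are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneFeature:
    locus_tag: str
    start: int   # 0-based, inclusive genomic coordinate
    end: int     # 0-based, exclusive genomic coordinate
    strand: int  # +1 or -1


@dataclass(frozen=True)
class PseudoTranscript:
    tx_ac: str        # locus tag reused as the transcript accession
    alt_ac: str       # accession of the genomic reference
    strand: int
    exons: tuple      # genomic (start, end) pairs; one pair for prokaryotes
    cds_start_i: int  # CDS start within the transcript
    cds_end_i: int    # CDS end within the transcript


def make_pseudo_transcript(gene, reference_ac):
    """Build a single-exon transcript record whose CDS spans the whole gene."""
    if gene.strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    if gene.end <= gene.start:
        raise ValueError("gene end must be greater than gene start")
    length = gene.end - gene.start
    return PseudoTranscript(
        tx_ac=gene.locus_tag,
        alt_ac=reference_ac,
        strand=gene.strand,
        exons=((gene.start, gene.end),),
        cds_start_i=0,
        cds_end_i=length,
    )


# Hypothetical gene on the reference mentioned in this thread.
gene = GeneFeature(locus_tag="LOCUS_0001", start=1000, end=1300, strand=1)
tx = make_pseudo_transcript(gene, "NC_000962.3")
print(tx)
```

Because there is one exon and no gaps, the transcript-to-genome alignment is an ungapped match over the gene span, which is why a real aligner run would add no information here.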
The last point is that although the human and bacterial genetic codes are almost identical, more start codons are allowed in bacterial genomes. Compare for humans:
and bacteria:
Bacteria have 3 additional start codons.
To sum up, to use a bacterial genome with UTA/HGVS I would need to:
My estimate is that 1) and 2) would not be too hard (could I use the documentation provided here to load data: https://pythonhosted.org/uta/db_loading.html ?) but 3) might be, because I suspect that the way start codons are handled is coded somewhere in hgvs or in bioutils.sequences.
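To make the start codon question concrete, here is a minimal sketch of a translation routine in which the set of start codons is a parameter. It is not the code in hgvs or bioutils.sequences; the function name and the example start set are assumptions. The codon table is the standard one, and the only organism-specific behavior is that a first codon found in the start set is translated as M.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds, start_codons=frozenset({"ATG"})):
    """Translate a CDS, reading the first codon as M if it is a start codon.

    start_codons is supplied by the caller (an assumption of this sketch).
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    protein = []
    for offset in range(0, len(cds), 3):
        codon = cds[offset:offset + 3]
        if offset == 0 and codon in start_codons:
            protein.append("M")  # initiator codon
        else:
            protein.append(CODON_TABLE[codon])
    return "".join(protein)


# Illustrative start set only.
EXAMPLE_BACTERIAL_STARTS = frozenset({"ATG", "GTG", "TTG"})

print(translate_cds("GTGAAATAA"))
print(translate_cds("GTGAAATAA", EXAMPLE_BACTERIAL_STARTS))
```

Keeping the start set as an argument means internal GTG or TTG codons still translate as V and L, and only the initiator position changes between organisms.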
Does that make sense to you?
Thanks for any help you can provide and for all the good work already done on Python HGVS, which is very valuable.
(I hope my message still makes sense after the multiple edits)