modified RNA residues not working correctly #9

Open
marcom opened this issue Jun 1, 2022 · 3 comments

Comments

@marcom

marcom commented Jun 1, 2022

Hi,
I wanted to say thank you for making this alignment viewer, I really like it!

I have had some problems viewing MSAs of RNAs with modified residues.
The problem seems to be that with all the modified RNA residues that exist, the sequences contain a lot of unusual characters.

A database of modified RNA residues can be found here:
https://iimcb.genesilico.pl/modomics/modifications

There seem to be two problems, as far as I can see:

  • unicode characters, e.g. °
>tdbR00000266|Bos_taurus|9913|Leu|°AA
-GUCAGLAUGLCMGAGU--GGDCPAAGGCLCCAGACU°AAKPPCUGGJ-CPCC---GUAU----GGAGG-?GUGGGTPCG"AUCCCACUUCUGACACCA

Error message:

ERROR when loading FASTA: Not all input sequences are the same length. Expected sequence length 99, seq "tdbR00000266|Bos_taurus|9913|Leu|°AA" has length 100.

  • autodetection trying to interpret sequences as amino acids
>tdbR00000270|Avian_myeloblastosis_virus|11866|Met|BAU
-GCCUCCUUALCGCAGDA-GGN--AGCGCRPCAGPCUBAU6APCUGAAG-------------------7D??UGAGTPCG"ACCUCAGAGGGGGCACCA

Error message:

ERROR when loading FASTA: Sequence "tdbR00000270|Avian_myeloblastosis_virus|11866|Met|BAU" cannot be understood as amino acids.
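
The lengths in the first error (99 expected, 100 found) would be consistent with the length being measured in UTF-8 bytes rather than in characters, since ° takes two bytes. That is an assumption about the loader, not something confirmed here. A minimal Rust sketch of the difference, with hypothetical helper names `byte_len` and `char_len`:

```rust
/// The first example sequence from above, as a raw string so the
/// embedded `"` needs no escaping.
const SEQ: &str = r#"-GUCAGLAUGLCMGAGU--GGDCPAAGGCLCCAGACU°AAKPPCUGGJ-CPCC---GUAU----GGAGG-?GUGGGTPCG"AUCCCACUUCUGACACCA"#;

/// Length of the UTF-8 encoding in bytes (what `str::len` reports).
fn byte_len(seq: &str) -> usize {
    seq.len()
}

/// Number of Unicode scalar values, which is closer to the number
/// of alignment columns than the byte length is.
fn char_len(seq: &str) -> usize {
    seq.chars().count()
}

fn main() {
    // '°' (U+00B0) is encoded as two bytes, so the two counts differ by one.
    println!("bytes: {}, chars: {}", byte_len(SEQ), char_len(SEQ));
}
```

Counting scalar values is still not the same as counting terminal columns, but it does separate the encoding question from the display question.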

My suggestion would be to have additional command-line options to

  • allow unicode
  • force dna, rna, aa mode without autodetect

Not sure how to deal with it in visualisation. I guess one color for any nonstandard RNA base would be ok.
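
To make the suggestion concrete, here is a rough Rust sketch of what "force a mode, otherwise autodetect" and "one color class for nonstandard bases" could look like. All names (`Alphabet`, `choose_alphabet`, `ColorClass`, `rna_color_class`) and the autodetection heuristic are made up for illustration and are not taken from the viewer's code:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Alphabet {
    Dna,
    Rna,
    AminoAcid,
}

/// Hypothetical, deliberately naive check for unmodified nucleotides and gaps.
fn is_plain_nucleotide(b: u8) -> bool {
    matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T' | b'U' | b'N' | b'-')
}

/// Hypothetical autodetection: anything outside the plain nucleotide set
/// is guessed to be amino acids, which is how modified RNA gets misread.
fn autodetect(seq: &[u8]) -> Alphabet {
    if seq.iter().all(|&b| is_plain_nucleotide(b)) {
        if seq.iter().any(|&b| b.to_ascii_uppercase() == b'U') {
            Alphabet::Rna
        } else {
            Alphabet::Dna
        }
    } else {
        Alphabet::AminoAcid
    }
}

/// A mode forced on the command line wins; autodetection is only the fallback.
fn choose_alphabet(forced: Option<Alphabet>, seq: &[u8]) -> Alphabet {
    forced.unwrap_or_else(|| autodetect(seq))
}

#[derive(Debug, PartialEq, Eq)]
enum ColorClass {
    Standard,
    Gap,
    Nonstandard,
}

/// In RNA mode, every symbol that is not A, C, G, U or a gap shares one class.
fn rna_color_class(b: u8) -> ColorClass {
    match b.to_ascii_uppercase() {
        b'A' | b'C' | b'G' | b'U' => ColorClass::Standard,
        b'-' => ColorClass::Gap,
        _ => ColorClass::Nonstandard,
    }
}

fn main() {
    let seq = b"GCCUCCUUALCGCAGDA";
    println!("autodetected: {:?}", choose_alphabet(None, seq));
    println!("forced: {:?}", choose_alphabet(Some(Alphabet::Rna), seq));
    println!("class of 'L': {:?}", rna_color_class(b'L'));
}
```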

What do you think? I could try to make the changes if you agree.

Data source:

Data was downloaded from tRNAdb:

http://trna.bioinf.uni-leipzig.de/DataOutput/Search

On the Search Database page, choose tRNA sequences and then press the search database button. This will return RNA sequences with modified residues. You can save them by selecting the select all sequences of search checkbox on the search results page, and then choosing Download alignment in the drop-down button next to the checkbox.

@jakobnissen
Owner

Thanks!

So, I'm torn on this. Allowing non-ASCII characters introduces a lot of complexity - in particular when it comes to figuring out the textwidth, and hence how to display it, but also for coloring, and sequence validation.
I do this for the sequence names since one can reasonably expect non-ASCII characters in those names, but I'm not sure it's a good idea for the sequences themselves.
Like so many other problems with FASTA, it comes down to the fact that there is no agreed-upon definition of what a FASTA record is. If I allow some Unicode characters, which ones? Saying "all of them" is not feasible, because correctly dealing with arbitrary Unicode textwidth is a nightmare. If I don't, surely someone will just invent another character to put in a FASTA record.

So my response is: No, this is too complicated, sorry. I'll keep this issue open though to think more about it.

The error messages could be improved, though, to make it clearer what is happening.

@marcom
Author

marcom commented Jun 2, 2022

Sorry I put multiple things in one issue. If you want, I can open two separate issues.
I'm also offering to implement these things, but wanted to discuss with you first.

Disregarding the unicode question, my suggestion would be:

  • allow any printable non-whitespace ASCII character as amino acid/nucleobase
    • either gated behind a --nonstd command-line option or print warnings on nonstandard things so one can still easily detect strange characters in sequences
    • maybe a character histogram mode in the viewer so one can also see anything unusual
  • don't color the nonstandard characters
  • allow the user to explicitly choose amino acid/nucleobase mode for the colors (and other things?)
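
A possible shape for the first two points, as a Rust sketch under my own assumptions: `u8::is_ascii_graphic` from the standard library covers exactly the printable, non-whitespace ASCII range, and the nonstandard symbols are counted so they can be reported as warnings (or shown as a histogram). The function names and the choice of A, C, G, U and `-` as "standard" are only illustrative:

```rust
use std::collections::BTreeMap;

/// Accept any printable, non-whitespace ASCII byte (0x21..=0x7E).
fn is_allowed(b: u8) -> bool {
    b.is_ascii_graphic()
}

/// Illustrative definition of "standard" for RNA mode.
fn is_standard_rna(b: u8) -> bool {
    matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'U' | b'-')
}

/// Returns `Err(offset)` for the first byte that is not allowed at all,
/// otherwise a count of each allowed but nonstandard symbol.
fn check_sequence(seq: &[u8]) -> Result<BTreeMap<char, usize>, usize> {
    let mut nonstandard = BTreeMap::new();
    for (i, &b) in seq.iter().enumerate() {
        if !is_allowed(b) {
            return Err(i);
        }
        if !is_standard_rna(b) {
            *nonstandard.entry(b as char).or_insert(0) += 1;
        }
    }
    Ok(nonstandard)
}

fn main() {
    let seq = br#"-GCCUCCUUALCGCAGDA-GGN--AGCGCRPCAGPCUBAU6APCUGAAG"#;
    match check_sequence(seq) {
        Ok(counts) => {
            for (symbol, n) in &counts {
                eprintln!("warning: nonstandard symbol {:?} occurs {} time(s)", symbol, n);
            }
        }
        Err(offset) => eprintln!("error: disallowed byte at offset {}", offset),
    }
}
```

Printing warnings by default and silencing them with the option (or the other way around) is a policy choice that this sketch leaves open.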

Re: autodetection / nonstandard ASCII nucleobases

Even after removing/modifying sequences with Unicode, I wasn't able to view the alignment:

>tdbR00000270|Avian_myeloblastosis_virus|11866|Met|BAU
-GCCUCCUUALCGCAGDA-GGN--AGCGCRPCAGPCUBAU6APCUGAAG-------------------7D??UGAGTPCG"ACCUCAGAGGGGGCACCA

This is pure ASCII and fails with the error message:

ERROR when loading FASTA: Sequence "tdbR00000270|Avian_myeloblastosis_virus|11866|Met|BAU" cannot be understood as amino acids.

Real-world RNA (and DNA to a lesser extent) has quite a few modified bases, and modern sequencing techniques such as nanopore sequencing are starting to detect them routinely, so this is going to become more common.

Re: unicode

The problem is that there aren't enough ASCII characters to represent all modified RNA residues with one ASCII letter.

Not sure what the pragmatic solution is. I agree all of Unicode is overkill.

Maybe any printable grapheme that can be displayed with the same width as an ASCII character. That's what makes sense in the context of multiple sequence alignments IMHO.

There is a Rust crate that can answer this question:
https://github.com/unicode-rs/unicode-width

Re: FASTA format

In my very humble opinion, the most liberal specification of the FASTA format seems the most useful to me:

  • there are newlines and > symbols, and everything else is up to the user

  • FASTA files can be in ASCII or in Unicode, but the user has to tell the program, and the default is ASCII
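
As a sketch of how little such a liberal definition needs, here is a small Rust parser under these assumptions: a line starting with `>` opens a record, every other line is appended to the current sequence, and non-ASCII sequence data is rejected unless the caller opts in. The name `parse_fasta` and the `allow_unicode` parameter are hypothetical:

```rust
/// Parse FASTA text into (header, sequence) pairs.
/// With `allow_unicode == false`, non-ASCII sequence lines are an error;
/// headers are not restricted in either mode.
fn parse_fasta(text: &str, allow_unicode: bool) -> Result<Vec<(String, String)>, String> {
    let mut records: Vec<(String, String)> = Vec::new();
    for (lineno, line) in text.lines().enumerate() {
        if let Some(header) = line.strip_prefix('>') {
            // A '>' line starts a new record with an empty sequence.
            records.push((header.to_string(), String::new()));
        } else if let Some((_, seq)) = records.last_mut() {
            if !allow_unicode && !line.is_ascii() {
                return Err(format!("line {}: non-ASCII sequence data", lineno + 1));
            }
            // Sequence lines are concatenated; trailing whitespace is dropped.
            seq.push_str(line.trim_end());
        }
        // Lines before the first '>' are ignored in this sketch.
    }
    Ok(records)
}

fn main() {
    let text = ">seq1\nACGU\nGGCC\n>seq2\nACU°AAKP\n";
    println!("{:?}", parse_fasta(text, true));
    println!("{:?}", parse_fasta(text, false));
}
```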

In any case, an alignment viewer that can deal with "strange" FASTA files will always be more useful than a viewer that can't.

@jakobnissen
Owner

jakobnissen commented Jun 2, 2022

Reasonable suggestions. So:

  • Have non-standard (i.e. IUPAC ambiguous) bases/aa raise an error, unless --nonstd is passed
  • There is already the -a flag to force parsing as aa; a -n flag for nucleotide could be added as well.
  • Allow all single-char symbols with a textwidth of 1. This is NOT bulletproof, since it's ultimately up to the terminal to decide how to render unicode characters, and a character can absolutely have a textwidth of 1, yet take up 2 columns. So that opens up the possibility for nasty rendering bugs.
    • This implies sequences should be stored in an enum of either Vec<u8> if ASCII, or Vec<char> if not. It won't be good to have 4x the memory usage for ASCII sequences by always storing them as char.
  • W.r.t. the histogram viewer, that seems like feature creep to me. I'm not yet convinced.
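
The storage idea could be sketched roughly as follows. This is only an illustration, not the actual implementation: the type name `Residues` and its methods are hypothetical, and the textwidth-of-1 check is left out because it would need an external crate.

```rust
use std::mem::size_of;

/// Sequence storage: one byte per residue for ASCII input,
/// one `char` (four bytes) per residue otherwise.
#[derive(Debug, PartialEq, Eq)]
enum Residues {
    Ascii(Vec<u8>),
    Unicode(Vec<char>),
}

impl Residues {
    /// Pick the compact representation whenever the input allows it.
    fn new(s: &str) -> Self {
        if s.is_ascii() {
            Residues::Ascii(s.as_bytes().to_vec())
        } else {
            Residues::Unicode(s.chars().collect())
        }
    }

    /// Number of residues, independent of the representation.
    fn len(&self) -> usize {
        match self {
            Residues::Ascii(v) => v.len(),
            Residues::Unicode(v) => v.len(),
        }
    }

    /// Residue at column `i`, if any.
    fn get(&self, i: usize) -> Option<char> {
        match self {
            Residues::Ascii(v) => v.get(i).map(|&b| b as char),
            Residues::Unicode(v) => v.get(i).copied(),
        }
    }
}

fn main() {
    println!("u8: {} byte, char: {} bytes", size_of::<u8>(), size_of::<char>());
    let seq = Residues::new("CU°AAK");
    println!("{} residues, column 2 = {:?}", seq.len(), seq.get(2));
}
```

Both variants index by column in constant time, so rendering code would not need to care which one it holds.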

@jakobnissen jakobnissen mentioned this issue Oct 17, 2022