modified RNA residues not working correctly #9

marcom · 2022-06-01T20:23:25Z

Hi,
i wanted to say thank you for making this alignment viewer, i really like it!

I have had some problems viewing MSAs of RNAs with modified residues.
The problem seems to be that with all the modified RNA residues that exist, the sequences contain a lot of unusual characters.

A database of modified RNA residues can be found here:
https://iimcb.genesilico.pl/modomics/modifications

There seem to be two problems as far as i can see

unicode characters, e.g. °

>tdbR00000266|Bos_taurus|9913|Leu|°AA
-GUCAGLAUGLCMGAGU--GGDCPAAGGCLCCAGACU°AAKPPCUGGJ-CPCC---GUAU----GGAGG-?GUGGGTPCG"AUCCCACUUCUGACACCA

Error message:

ERROR when loading FASTA: Not all input sequences are the same length. Expected sequence length 99, seq "tdbR00000266|Bos_taurus|9913|Leu|°AA" has length 100.
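
The off-by-one in this message is consistent with the length being counted in bytes rather than characters, since `°` (U+00B0) takes two bytes in UTF-8. That the viewer counts bytes is an assumption here, not something confirmed in this thread. A minimal Rust sketch of the difference, where `byte_len` and `char_len` are hypothetical helper names and not functions from the viewer:

```rust
/// Length in UTF-8 bytes, which is what `str::len` reports.
/// (Hypothetical helper, for illustration only.)
fn byte_len(seq: &str) -> usize {
    seq.len()
}

/// Length in Unicode scalar values (`char`s).
/// (Hypothetical helper, for illustration only.)
fn char_len(seq: &str) -> usize {
    seq.chars().count()
}
```

For a sequence fragment such as `U°A`, the two helpers disagree by one, which matches the 99 versus 100 pattern in the error above.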

autodetection trying to interpret sequences as amino-acids

>tdbR00000270|Avian_myeloblastosis_virus|11866|Met|BAU
-GCCUCCUUALCGCAGDA-GGN--AGCGCRPCAGPCUBAU6APCUGAAG-------------------7D??UGAGTPCG"ACCUCAGAGGGGGCACCA

Error message:

ERROR when loading FASTA: Sequence "tdbR00000270|Avian_myeloblastosis_virus|11866|Met|BAU" cannot be understood as amino acids.

My suggestion would be to have additional command-line options to:

- allow unicode
- force dna, rna, aa mode without autodetect

Not sure how to deal with it in visualisation. I guess one color for any nonstandard RNA base would be ok.

What do you think? I could try and make the changes if you agree

Data source:

Data was downloaded from tRNAdb:

http://trna.bioinf.uni-leipzig.de/DataOutput/Search

On the "Search Database" page, choose "tRNA sequences" and then press the "search database" button. This will return RNA sequences with modified residues. You can save them by ticking the "select all sequences of search" checkbox on the search results page, and then choosing "Download alignment" in the drop-down button next to the checkbox.


jakobnissen · 2022-06-02T07:55:28Z

Thanks!

So, I'm torn on this. Allowing non-ASCII characters introduces a lot of complexity - in particular when it comes to figuring out the textwidth, and hence how to display it, but also for coloring, and sequence validation.
I do this for the sequence names since one can reasonably expect non-ASCII characters in those names, but I'm not sure it's a good idea for the sequences themselves.
Like so many other problems with FASTA, it comes down to the fact that there is no agreed-upon definition of what a FASTA record is. If I allow some Unicode characters, which ones? Saying "all of them" is not feasible, because correctly dealing with arbitrary Unicode textwidth is a nightmare. If I don't, surely someone will just invent another character to put in a FASTA record.

So my response is: No, this is too complicated, sorry. I'll keep this issue open though to think more about it.

The error messages could be improved, though, to make it clearer what is happening.

marcom · 2022-06-02T09:40:20Z

Sorry i put multiple things in one issue, if you want to i can open two separate issues.
I'm also offering to implement these things, but wanted to discuss with you first.

Disregarding the unicode question, my suggestion would be:

- allow any printable non-whitespace ASCII character as amino acid/nucleobase
  - either gated behind a --nonstd command-line option or print warnings on nonstandard things so one can still easily detect strange characters in sequences
  - maybe a character histogram mode in the viewer so one can also see anything unusual
- don't color the nonstandard characters
- allow the user to explicitly choose amino acid/nucleobase mode for the colors (and other things?)
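
A rough Rust sketch of what the first two points could look like, assuming sequences are handled as raw bytes. All function names here are hypothetical and not taken from the viewer's code; the "standard" set is deliberately reduced to A, C, G, U and the gap character for illustration:

```rust
use std::collections::BTreeMap;

/// Hypothetical check: accept any printable, non-whitespace ASCII byte
/// (`!` through `~`), which is what `u8::is_ascii_graphic` tests.
fn is_allowed_symbol(b: u8) -> bool {
    b.is_ascii_graphic()
}

/// Hypothetical helper: the four standard RNA bases plus the gap character.
fn is_standard_rna(b: u8) -> bool {
    matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'U' | b'-')
}

/// Count each nonstandard symbol in a sequence, so unusual characters
/// can be reported as warnings or shown as a histogram.
/// Returns `Err` with the first byte that is not printable ASCII.
fn nonstandard_histogram(seq: &[u8]) -> Result<BTreeMap<u8, usize>, u8> {
    let mut counts = BTreeMap::new();
    for &b in seq {
        if !is_allowed_symbol(b) {
            return Err(b);
        }
        if !is_standard_rna(b) {
            *counts.entry(b).or_insert(0) += 1;
        }
    }
    Ok(counts)
}
```

The returned map could drive either the warnings or a histogram view, and a renderer could leave any byte that fails `is_standard_rna` uncolored.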

Re: autodetection / nonstandard ASCII nucleobases

Even after removing/modifying sequences with unicode, i wasn't able to view the alignment:

>tdbR00000270|Avian_myeloblastosis_virus|11866|Met|BAU
-GCCUCCUUALCGCAGDA-GGN--AGCGCRPCAGPCUBAU6APCUGAAG-------------------7D??UGAGTPCG"ACCUCAGAGGGGGCACCA

This is pure ASCII and fails with the error message:

ERROR when loading FASTA: Sequence "tdbR00000270|Avian_myeloblastosis_virus|11866|Met|BAU" cannot be understood as amino acids.

Real world RNA (and DNA to a lesser extent) has quite a few modified bases and modern sequencing techniques such as nanopore sequencing are starting to detect them routinely, so this is something that is going to become more common.

Re: unicode

The problem is that there aren't enough ASCII characters to represent all modified RNA residues with one ASCII letter.

Not sure what the pragmatic solution is. I agree all of Unicode is overkill.

Maybe any printable grapheme that can be displayed with the same width as an ASCII character. That's what makes sense in the context of multiple sequence alignments IMHO.

There is a rust crate that can answer this question:
https://github.com/unicode-rs/unicode-width

Re: FASTA format

In my very humble opinion, the most liberal specification of the FASTA format seems the most useful to me:

- there are newlines and > symbols, and everything else is up to the user
- FASTA files can be in ASCII or in Unicode, but the user has to tell the program, and the default is ASCII
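
To make the liberal reading concrete, here is a minimal Rust sketch of a parser that only gives meaning to newlines and `>`. It is an illustration under that assumption, not the viewer's parser, and `parse_fasta` is a hypothetical name:

```rust
/// Split FASTA text into (header, sequence) pairs.
/// Only newlines and a leading `>` are interpreted; every other
/// character is passed through untouched.
fn parse_fasta(text: &str) -> Vec<(String, String)> {
    let mut records: Vec<(String, String)> = Vec::new();
    for line in text.lines() {
        if let Some(header) = line.strip_prefix('>') {
            // A new record starts at each header line.
            records.push((header.trim_end().to_string(), String::new()));
        } else if let Some((_, seq)) = records.last_mut() {
            // Sequence lines are concatenated onto the current record.
            seq.push_str(line.trim_end());
        }
        // Lines before the first header are ignored in this sketch.
    }
    records
}
```

Validation, alphabet detection and width checks would then be separate steps applied to the returned sequences.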

In any case, an alignment viewer that can deal with "strange" FASTA files will always be more useful than a viewer that can't.

jakobnissen · 2022-06-02T10:42:57Z

Reasonable suggestions. So:

- Make non-standard (i.e. IUPAC ambiguous) bases/aa an error, unless --nonstd is passed.
- There is already the -a flag to force parsing as aa; -n for nucleotide could be added as well.
- Allow all single-char symbols with a textwidth of 1. This is NOT bulletproof, since it's ultimately up to the terminal to decide how to render Unicode characters, and a character can absolutely have a textwidth of 1, yet take up 2 columns. So that opens up the possibility for nasty rendering bugs.
  - This implies sequences should be stored in an enum of either Vec<u8> if ASCII, or Vec<char> if not. It won't be good to have 4x the memory usage for ASCII sequences by always storing them as char.
- W.r.t. the histogram viewer, that seems like feature creep to me. I'm not yet convinced.
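
The storage idea could be sketched roughly as follows. The type and method names (`Sequence`, `from_text`, `len`) are hypothetical and only illustrate the enum split; they are not existing code in the viewer:

```rust
/// Hypothetical storage type: one byte per symbol for pure-ASCII
/// sequences, four bytes per symbol (`char`) only when needed.
#[derive(Debug, PartialEq)]
enum Sequence {
    Ascii(Vec<u8>),
    Unicode(Vec<char>),
}

impl Sequence {
    /// Pick the compact representation whenever the text is pure ASCII.
    fn from_text(s: &str) -> Self {
        if s.is_ascii() {
            Sequence::Ascii(s.as_bytes().to_vec())
        } else {
            Sequence::Unicode(s.chars().collect())
        }
    }

    /// Number of symbols, i.e. alignment columns, in either variant.
    fn len(&self) -> usize {
        match self {
            Sequence::Ascii(v) => v.len(),
            Sequence::Unicode(v) => v.len(),
        }
    }
}
```

With this split, the length check that compares sequences against each other would count symbols in both variants, so a sequence containing `°` would no longer appear one column longer than its ASCII neighbors. It does not address the terminal rendering concern above.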

jakobnissen mentioned this issue Oct 17, 2022

allow dots #10

Closed
