Investigate support of full IUPAC ambiguous nucleotide support + followup actions #3563

Open
4 tasks
corneliusroemer opened this issue Jan 21, 2025 · 3 comments

Comments

@corneliusroemer
Contributor

Loculus/Pathoplexus docs currently (somewhat implicitly) state that only ACTGN and - are valid bases.

  • Check whether we actually do allow all ambiguous characters
  • If yes, fix the docs
  • If no, document lack of support more broadly, e.g. in sequence submission guides
  • If no, consider whether we want to support full IUPAC in the future

Based on a comment in

the Loculus docs state we don't support ambiguous characters

Here for example: https://pathoplexus.org/docs/how-to/search-sequences-website
A nucleotide mutation has the format <position><base> or <base_ref><position><base>. A <base> can be one of the four nucleotides A, T, C, and G. It can also be - for deletion and N for unknown. For example, if the reference sequence is A at position 23, both 23T and A23T will yield the same results.


And same here:

https://loculus.org/for-users/search-sequences/#nucleotide-mutations-and-insertions

This might just be a docs error, and we support ambiguous characters fine. Or LAPIS doesn't support them but everything up to LAPIS does. Not sure; we should find out and decide what to do. I'll make a separate issue.

Originally posted by @corneliusroemer in #3560

@theosanderson
Member

theosanderson commented Jan 21, 2025

Well, this page is about searching using mutation queries. I wouldn't say this implies much about what Loculus supports for submission. Suppose this (query docs) said that the nucleotide could be any IUPAC code. My expectation then would be that a search for 123R would search for 123G and for 123A. I.e. support here wouldn't really say anything about support at submission. So IMO this docs page is fairly OK.
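To make the "expansion" interpretation concrete, here is a minimal Python sketch of how a query such as 123R could be expanded into its concrete bases using the standard IUPAC nucleotide ambiguity codes. The function name `expand_query` is hypothetical and is not part of Loculus or LAPIS; this only illustrates the interpretation described above, not actual behaviour.

```python
# Standard IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC_EXPANSION = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand_query(position: int, code: str) -> list[str]:
    """Hypothetical helper: expand e.g. (123, "R") into concrete-base queries."""
    bases = IUPAC_EXPANSION[code.upper()]  # KeyError for non-IUPAC symbols
    return [f"{position}{base}" for base in bases]


print(expand_query(123, "R"))  # ['123A', '123G']
```

Under this reading, a search for 123R would behave like "123A or 123G" and would say nothing about whether an R can be stored in a submitted sequence.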

But regardless, I'm not against us saying in some submission docs that we support IUPAC ambiguity bases, which I think we do support (but yeah, should re-check).

@chaoran-chen
Member

chaoran-chen commented Jan 21, 2025

As far as I can see, the docs are not accurate here, but both LAPIS and Loculus already support ambiguous nucleotides. It is possible to search for 123R: https://pathoplexus.org/ebola-zaire/search?mutation=123R. In comparison, if you search for 123Z, LAPIS will return an error: https://pathoplexus.org/ebola-zaire/search?mutation=123Z.
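The accept/reject behaviour described here (123R is fine, 123Z is an error) can be sketched as a small validator. This is only an illustration under stated assumptions: the regular expression, the `parse_mutation` name, and the exact allowed alphabet (IUPAC nucleotide codes plus `-`) are hypothetical and are not taken from the LAPIS source.

```python
import re

# Assumption: the allowed symbols are the IUPAC nucleotide codes plus "-" for deletion.
ALLOWED_SYMBOLS = set("ACGTRYSWKMBDHVN") | {"-"}

# Optional reference base, then a position, then the queried base (e.g. "A23T" or "23T").
MUTATION_RE = re.compile(r"^([A-Za-z-]?)(\d+)([A-Za-z-])$")


def parse_mutation(query: str):
    """Hypothetical parser: return (ref_or_None, position, base) or raise ValueError."""
    match = MUTATION_RE.match(query.strip())
    if match is None:
        raise ValueError(f"not a nucleotide mutation: {query!r}")
    ref, position, base = match.groups()
    ref, base = ref.upper(), base.upper()
    for symbol in (ref, base):
        if symbol and symbol not in ALLOWED_SYMBOLS:
            raise ValueError(f"invalid nucleotide symbol {symbol!r} in {query!r}")
    return (ref or None, int(position), base)


print(parse_mutation("123R"))  # (None, 123, 'R')
print(parse_mutation("A23T"))  # ('A', 23, 'T')
try:
    parse_mutation("123Z")
except ValueError as err:
    print("rejected:", err)
```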

> Suppose this (query docs) said that the nucleotide could be any IUPAC code. My expectation then would be that a search for 123R would search for 123G and for 123A.

In this case, LAPIS would only return sequences that have an R in the sequence at 123. Your logic would also make sense, and I considered it back in the day, but I chose the current logic. I think it's quite useful, as it lets you find sequences that really have an R or, probably more useful, an N at a certain position. If you want to search for 123G or 123A, on CoV-Spectrum, you can use the advanced variant queries, a feature that we are planning to extend to all LAPIS instances eventually (GenSpectrum/LAPIS#1045).
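The difference between the two semantics can be shown on toy data. The sketch below is not LAPIS code: the sequences, the function names, and the choice that an expanded search matches only concrete bases are all assumptions made for illustration.

```python
# Toy sequences; position 3 (1-based) differs between them.
SEQUENCES = {
    "seq1": "ACR",  # ambiguous R stored literally
    "seq2": "ACG",
    "seq3": "ACA",
    "seq4": "ACN",  # unknown base
}

# Only the codes needed for this example; R stands for A or G.
EXPANSION = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}


def literal_match(sequences, position, symbol):
    """Literal semantics as described above: the stored character must equal the symbol."""
    return [name for name, seq in sequences.items() if seq[position - 1] == symbol]


def expanded_match(sequences, position, symbol):
    """Alternative semantics: the stored character must be one of the expanded bases."""
    bases = EXPANSION[symbol]
    return [name for name, seq in sequences.items() if seq[position - 1] in bases]


print(literal_match(SEQUENCES, 3, "R"))   # ['seq1']
print(expanded_match(SEQUENCES, 3, "R"))  # ['seq2', 'seq3']
print(literal_match(SEQUENCES, 3, "N"))   # ['seq4']
```

The literal variant is what makes it possible to find sequences that are uncertain at a position, which the expanded variant cannot express.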

@theosanderson
Member

theosanderson commented Jan 21, 2025

Thanks for explaining the current LAPIS behaviour. I still don't think this has implications for what people can submit. If we do amend this bit, I think we should try to do so in a relatively careful way that (A) maintains overall clarity about the simple ACGT case and (B) avoids implying the interpretation I gave above.
