Add sequence-level annotations #113

Closed
3 tasks
ahwagner opened this issue Jun 23, 2023 · 3 comments

Comments

@ahwagner
Member

ahwagner commented Jun 23, 2023

It would be useful for supporting downstream methods (e.g. circular sequence support #70) to store some basic characteristics about a sequence at the sequence level. This MAY be accomplished by adding these annotations to the FASTA key fields.

I think we would minimally like to have:

  • Sequence Alphabet (amino acid vs. nucleic acid)

and in the event it is nucleic acid:

  • circular / linear sequence
  • single-stranded / double-stranded sequence

To accomplish this @ccaitlingo and I discussed extending the store and fetch methods of FastaDir to add these annotations to FASTA keys, in the following format:

>{digest}|{aa / na}|{linear / circular}|{single / double}

or a compressed version of the above (i.e. bitflags). Making this issue for discussion and progress.
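As a rough illustration of the two options above, the sketch below renders and parses the verbose key and packs the same three properties into bitflags. Everything here (`SeqAnnotation`, `SeqFlags`, `format_key`, `parse_key`, `to_flags`) is hypothetical and is not part of FastaDir or seqrepo; the bit assignments are an arbitrary assumption.

```python
from enum import IntFlag
from typing import NamedTuple


class SeqFlags(IntFlag):
    """Hypothetical bit layout; an unset bit means aa / linear / single."""
    NUCLEIC = 1
    CIRCULAR = 2
    DOUBLE = 4


class SeqAnnotation(NamedTuple):
    digest: str
    alphabet: str   # "aa" or "na"
    topology: str   # "linear" or "circular"
    strands: str    # "single" or "double"


def format_key(ann: SeqAnnotation) -> str:
    """Render the verbose key: >{digest}|{alphabet}|{topology}|{strands}."""
    return ">" + "|".join(ann)


def parse_key(key: str) -> SeqAnnotation:
    """Parse a verbose key, rejecting unknown field values."""
    digest, alphabet, topology, strands = key.lstrip(">").split("|")
    if alphabet not in ("aa", "na"):
        raise ValueError(f"unknown alphabet: {alphabet!r}")
    if topology not in ("linear", "circular"):
        raise ValueError(f"unknown topology: {topology!r}")
    if strands not in ("single", "double"):
        raise ValueError(f"unknown strandedness: {strands!r}")
    return SeqAnnotation(digest, alphabet, topology, strands)


def to_flags(ann: SeqAnnotation) -> SeqFlags:
    """Pack the three properties into the compressed bitflag form."""
    flags = SeqFlags(0)
    if ann.alphabet == "na":
        flags |= SeqFlags.NUCLEIC
    if ann.topology == "circular":
        flags |= SeqFlags.CIRCULAR
    if ann.strands == "double":
        flags |= SeqFlags.DOUBLE
    return flags
```

With this layout, a double-stranded circular nucleic acid sequence would carry all three bits set, and a protein sequence would carry none.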

@ahwagner
Member Author

Related issue for refget is still open (samtools/hts-specs#626), but conversation with @andrewyatz confirmed that this will not be addressed in upcoming RefGet v2 release, and it is not clear if there are plans for a RefGet v3 in the near term.

@andrewyatz

Question for me is if seqcol would solve the issue for you or not. If not then we need to consider a next step.

@reece
Member

reece commented Jul 17, 2023

Based on a discussion with @andreasprlic and @ahwagner, we have decided to shelve this project. The rationale follows.

A core assumption of seqrepo is that sequences are referenced by computed identifiers and nothing else. It is impossible to preserve this feature while also making sequence identifiers aware of other properties like sequence type, topology/circularity, taxonomy, or anything else. Sequences need to remain as verbatim strings.
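To make the point concrete, here is a minimal sketch of a computed identifier, assuming a GA4GH-style truncated digest (SHA-512, first 24 bytes, base64url-encoded). The function names are illustrative and are not seqrepo's API.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(digest).decode("ascii")


def sequence_id(sequence: str) -> str:
    """Identifier computed from the verbatim sequence string and nothing else."""
    return sha512t24u(sequence.encode("ascii"))
```

Because the identifier is a function of the string alone, the string `ACGT` gets the same identifier whether a caller means DNA or the peptide Ala-Cys-Gly-Thr. Any property folded into the identifier would have to be supplied by the caller, and the identifier would no longer be computable from the sequence itself.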

In principle, properties could be added to the sequence alias records. For example, the alias record could track the sequence type, circularity, strandedness, or anything else. This raises a slew of challenges:

  • Adding new schema elements would bump the schema version, which would make prior releases incompatible (or we'd have to build in backward compatibility).
  • Where do the properties come from? We currently load from fastq files which don't have this info, so we'd need to identify sources for this info and figure out the logic for missing data. We'd also have to backfill existing records. This led to a suggestion to infer properties from the accession (by heuristic or lookup), but if we're going to do that, we should just do it outside of seqrepo.
  • Even if we got through the above issues, reverse lookups (sequence identifier → alias) would be broken because we'd now have the possibility that a single sequence identifier would map to two alias records that differed by some property.

For all of these reasons, we will not be adding sequence properties to seqrepo. Instead, if consumers need to know the sequence type, circularity, or strandedness, they will have to find another source for that info.
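For consumers who go the "outside of seqrepo" route, one possible shape is sketched below: a heuristic on RefSeq accession prefixes for the sequence type, plus a caller-supplied lookup for topology, since circularity cannot be read off a prefix. The function names and the lookup table are assumptions for illustration; the heuristic returns `None` for anything it does not recognize.

```python
from typing import Mapping, Optional

# RefSeq accession prefixes; other namespaces are deliberately not guessed.
PROTEIN_PREFIXES = {"AP_", "NP_", "YP_", "XP_", "WP_"}
NUCLEIC_PREFIXES = {
    "AC_", "NC_", "NG_", "NT_", "NW_", "NZ_",  # genomic
    "NM_", "NR_", "XM_", "XR_",                # transcripts
}


def infer_alphabet(accession: str) -> Optional[str]:
    """Return "aa", "na", or None when the prefix is not recognized."""
    prefix = accession[:3].upper()
    if prefix in PROTEIN_PREFIXES:
        return "aa"
    if prefix in NUCLEIC_PREFIXES:
        return "na"
    return None


def lookup_topology(accession: str, known: Mapping[str, str]) -> Optional[str]:
    """Return the curated topology for an accession, or None if unknown.

    `known` is a hypothetical caller-maintained table, e.g.
    {"NC_012920.1": "circular"} for the human mitochondrial genome.
    """
    return known.get(accession)
```

Returning `None` rather than a default keeps missing data visible to the caller instead of silently labeling unknown sequences as linear or nucleic.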
