Add sequence-level annotations #113

ahwagner · 2023-06-23T19:56:07Z

It would be useful for supporting downstream methods (e.g. circular sequence support #70) to store some basic characteristics about a sequence at the sequence level. This MAY be accomplished by adding these annotations to the FASTA key fields.

I think we would minimally like to have:

Sequence Alphabet (amino acid vs. nucleic acid)

and in the event it is nucleic acid:

circular / linear sequence
single-stranded / double-stranded sequence

To accomplish this, @ccaitlingo and I discussed extending the store and fetch methods of FastaDir to add these annotations to FASTA keys, in the following format:

>{digest}|{aa / na}|{linear / circular}|{single / double}

or a compressed version of the above (i.e. bitflags). Making this issue for discussion and progress.
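As a rough sketch of what both forms might look like (not an existing FastaDir API): the `SeqFlags` enum, `format_key`, and `parse_key` below are hypothetical names, and the sketch assumes that amino acid keys simply omit the topology and strandedness fields.

```python
from enum import IntFlag


class SeqFlags(IntFlag):
    """Hypothetical bitflags for the compressed form of the annotations."""
    NUCLEIC_ACID = 1     # unset: amino acid
    CIRCULAR = 2         # unset: linear
    DOUBLE_STRANDED = 4  # unset: single-stranded


def format_key(digest: str, flags: SeqFlags) -> str:
    """Render the verbose FASTA key; aa keys carry no topology/strand fields (assumption)."""
    if not flags & SeqFlags.NUCLEIC_ACID:
        return f">{digest}|aa"
    topology = "circular" if flags & SeqFlags.CIRCULAR else "linear"
    strands = "double" if flags & SeqFlags.DOUBLE_STRANDED else "single"
    return f">{digest}|na|{topology}|{strands}"


def parse_key(line: str) -> tuple[str, SeqFlags]:
    """Split a key produced by format_key back into (digest, flags)."""
    digest, alphabet, *rest = line.strip().lstrip(">").split("|")
    flags = SeqFlags(0)
    if alphabet == "na":
        flags |= SeqFlags.NUCLEIC_ACID
        topology, strands = rest
        if topology == "circular":
            flags |= SeqFlags.CIRCULAR
        if strands == "double":
            flags |= SeqFlags.DOUBLE_STRANDED
    return digest, flags


# Example: a circular, single-stranded nucleic acid sequence with a placeholder digest.
flags = SeqFlags.NUCLEIC_ACID | SeqFlags.CIRCULAR
key = format_key("abc", flags)
```

With the bitflag variant, the three annotations fit in a single small integer (`int(flags)` here), which could replace the three text fields in the key.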


ahwagner · 2023-07-13T15:03:37Z

Related issue for refget is still open (samtools/hts-specs#626), but conversation with @andrewyatz confirmed that this will not be addressed in upcoming RefGet v2 release, and it is not clear if there are plans for a RefGet v3 in the near term.

andrewyatz · 2023-07-14T08:51:43Z

Question for me is if seqcol would solve the issue for you or not. If not then we need to consider a next step.

reece · 2023-07-17T18:42:28Z

Based on a discussion with @andreasprlic and @ahwagner, we have decided to shelve this project. The rationale follows.

A core assumption of seqrepo is that sequences are referenced by computed identifiers and nothing else. It is impossible to preserve this feature while also making sequence identifiers aware of other properties like sequence type, topology/circularity, taxonomy, or anything else. Sequences need to remain as verbatim strings.
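To illustrate the point, here is a minimal sketch that assumes a GA4GH-style truncated SHA-512 digest (`sha512t24u`); the exact identifier computation inside seqrepo may differ, and the function name here is only illustrative. The identifier is a function of the sequence string alone, so there is no place for topology or strandedness to enter.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Truncated digest: first 24 bytes of SHA-512, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# The same residues yield the same identifier whether the molecule they came
# from was linear or circular: only the verbatim string is hashed.
seq_from_linear_record = "ACGTACGT"
seq_from_circular_record = "ACGTACGT"

id_linear = sha512t24u(seq_from_linear_record.encode("ascii"))
id_circular = sha512t24u(seq_from_circular_record.encode("ascii"))
same_identifier = id_linear == id_circular
```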

In principle, properties could be added to the sequence alias records. For example, the alias record could track the sequence type, circularity, strandedness, or anything else. This raises a slew of challenges:

Adding new schema elements would bump the schema version, which would make prior releases incompatible (or we'd have to build in backward compatibility).
Where do the properties come from? We currently load from FASTA files, which don't have this info, so we'd need to identify sources for this info and figure out the logic for missing data. We'd also have to backfill existing records. This led to a suggestion to infer properties from the accession (by heuristic or lookup), but if we're going to do that, we should just do it outside of seqrepo.
Even if we got through the above issues, reverse lookups (sequence identifier → alias) would be broken because we'd now have the possibility that a single sequence identifier would map to two alias records that differed by some property.
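One reading of the reverse-lookup concern can be sketched with a toy in-memory alias table. The `AliasRecord` class, its `topology` field, and all values below are hypothetical and do not reflect the actual seqrepo schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AliasRecord:
    """Hypothetical alias record extended with a property column."""
    seq_id: str
    namespace: str
    alias: str
    topology: str


# Two accessions share identical residues, so they share one sequence
# identifier, but their sources disagree about topology.
records = [
    AliasRecord("SQ.example", "hypothetical_ns", "ACC1", topology="linear"),
    AliasRecord("SQ.example", "hypothetical_ns", "ACC2", topology="circular"),
]


def reverse_lookup(seq_id: str) -> list[AliasRecord]:
    """Return every alias record attached to a sequence identifier."""
    return [r for r in records if r.seq_id == seq_id]


# The lookup no longer gives a single answer for the property.
topologies = {r.topology for r in reverse_lookup("SQ.example")}
```

Here `topologies` contains both values, so a consumer starting from the sequence identifier cannot tell which property applies.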

For all of these reasons, we will not be adding sequence properties to seqrepo. Instead, if consumers need to know the sequence type, circularity, or strandedness, they will have to find another source for that info.

korikuzma mentioned this issue Jul 4, 2023

Sequence type annotations biocommons/hackathon-2023#10

Closed

ahwagner mentioned this issue Jul 13, 2023

Sequence characteristic metadata ga4gh/vrs#431

Closed

reece closed this as completed Jul 17, 2023

reece mentioned this issue Jul 17, 2023

Support circular sequences #70

Open
