Add support for derived sequences #92

reece · 2020-11-26T21:57:26Z

Problem

It's often necessary to derive sequences from others, especially genomic sequences used for alignment (see Heng Li's blog post).

As a specific example, GRCh38 official sequences use lower case to indicate masked regions and use IUPAC ambiguity characters. A derived sequence (e.g., uppercased) has the same length as the official one, but the content is different. Nonetheless, people want to refer to the derived forms using names like "GRCh38:1" or perhaps even "NC_000001.11".

Example transformations are 1) uppercase, 2) replace ambiguity with N, 3) reverse complement, and combinations of these.
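To make the three transformations concrete, here is a minimal sketch in plain Python. The function names (`uppercase`, `mask_ambiguity`, `reverse_complement`, `compose`) are hypothetical and not part of SeqRepo; the reverse complement assumes ambiguity codes have already been replaced with N.

```python
import re

# Complement table for A/C/G/T/N in both cases. Other IUPAC codes are not
# handled here; this sketch assumes they were replaced with N beforehand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# Any character that is not A, C, G, T, or N (either case) is treated as ambiguous.
_AMBIGUOUS = re.compile(r"[^ACGTNacgtn]")


def uppercase(seq: str) -> str:
    """Drop soft-masking by uppercasing the whole sequence."""
    return seq.upper()


def mask_ambiguity(seq: str) -> str:
    """Replace ambiguity characters with N, preserving the original case."""
    return _AMBIGUOUS.sub(lambda m: "N" if m.group().isupper() else "n", seq)


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def compose(*funcs):
    """Combine transformations; they are applied left to right."""
    def apply(seq: str) -> str:
        for func in funcs:
            seq = func(seq)
        return seq
    return apply


# Example combination: uppercase, then mask, then reverse complement.
normalize_rc = compose(uppercase, mask_ambiguity, reverse_complement)
```

Each function is length-preserving, which matches the observation above that derived sequences keep the same length as their source.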

Possible solutions

#. Precompute all transformed sequences and treat like all other sequences. This will be very expensive in space.

#. Enable transformations on read, configured at connection time (e.g., SeqRepo(root=..., uppercase=True)). This works well for uppercasing and ambiguity replacement (which are likely constant for the session), but is impractical for reverse complement. By changing Python list semantics, we could co-opt negative coordinates for this (e.g., sr["NM_01234.5"][-1000:-900] would provide the reverse complement of that range).

#. Create namespaces that imply certain transformations. For example, GRCh38uc (or GRCh38/uc) might indicate an uppercase transform of GRCh38.
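As a rough illustration of the third option, the sketch below wraps an arbitrary key-value store and parses transformation suffixes out of the namespace. Everything here is an assumption for discussion: the `DerivedView` class, the `uc` and `rc` suffix names, and the "namespace:alias" key format are hypothetical, and a plain dict stands in for the underlying store rather than the real SeqRepo API.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def _reverse_complement(seq: str) -> str:
    # Assumes only A/C/G/T/N are present (either case).
    return seq.translate(_COMPLEMENT)[::-1]


class DerivedView:
    """Read-only key-value view that applies transforms named in the namespace.

    Hypothetical sketch: keys look like "GRCh38/uc:1", where "GRCh38" is the
    stored namespace and each "/suffix" names a transformation to apply.
    """

    # Suffix names are illustrative, not an established convention.
    TRANSFORMS = {
        "uc": str.upper,
        "rc": _reverse_complement,
    }

    def __init__(self, store):
        self._store = store  # any mapping of "namespace:alias" -> sequence

    def __getitem__(self, key: str) -> str:
        namespace, _, alias = key.partition(":")
        base_namespace, *suffixes = namespace.split("/")
        seq = self._store[f"{base_namespace}:{alias}"]
        for suffix in suffixes:  # applied left to right
            seq = self.TRANSFORMS[suffix](seq)
        return seq


# A dict stands in for the sequence store in this sketch.
store = {"GRCh38:1": "acgtNNac"}
view = DerivedView(store)
```

With this shape, `view["GRCh38:1"]` returns the stored sequence unchanged, while `view["GRCh38/uc:1"]` returns the uppercased form. The transformation is selected entirely through the key, so no extra flags are needed in the lookup syntax; the cost is that "GRCh38" alone still means the authoritative, soft-masked presentation.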

Notes / Challenges / Constraints

The most ineluctable challenge is what to call derived sequences. For example, despite the GRCh38 authoritative sequence presentation, most users will prefer the uppercase form (at least), but will want to call it GRCh38.
How should transformed sequences be named? Is the transformation part of the namespace, the identifier, or neither (i.e., the API just provides support for it)?
It's important to maintain the key-value interface of SeqRepo. Flags cannot be easily passed using array syntax.

reece · 2022-02-22T05:12:59Z

See #94

reece closed this as completed Feb 22, 2022
