Add support for derived sequences #92

Closed
reece opened this issue Nov 26, 2020 · 1 comment

Comments

@reece
Member

reece commented Nov 26, 2020

Problem

It's often necessary to derive sequences from others, especially genomic sequences used for alignment (see Heng Li's blog post).

As a specific example, the official GRCh38 sequences use lowercase to indicate masked regions and use IUPAC ambiguity characters. A derived sequence has the same length as the official one, but different content. Nonetheless, people want to refer to derived sequences using names like "GRCh38:1" or perhaps even "NC_000001.11".

Example transformations are 1) uppercase, 2) replace ambiguity with N, 3) reverse complement, and combinations of these.
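
To make the three transformations concrete, here is a minimal sketch in plain Python. The function names (`uppercase`, `mask_ambiguity`, `reverse_complement`, `compose`) are hypothetical and not part of SeqRepo, and the complement table is deliberately simplified: it handles only A, C, G, T, and N, so `reverse_complement` is best applied after ambiguity codes have been replaced.

```python
import re

# Simplified complement table (assumption): only A/C/G/T/N in either case.
# Other IUPAC codes pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def uppercase(seq: str) -> str:
    """Remove soft-masking by uppercasing every base."""
    return seq.upper()


def mask_ambiguity(seq: str) -> str:
    """Replace every character other than A, C, G, T, N with N.

    Case is preserved so that soft-masking survives this step.
    """
    return re.sub(
        r"[^ACGTNacgtn]",
        lambda m: "N" if m.group().isupper() else "n",
        seq,
    )


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def compose(*funcs):
    """Return a function that applies funcs left to right."""
    def apply(seq: str) -> str:
        for func in funcs:
            seq = func(seq)
        return seq
    return apply
```

All three functions preserve sequence length, which is why coordinates on the original sequence stay meaningful for the first two transformations; reverse complement additionally flips the coordinate direction.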

Possible solutions

#. Precompute all transformed sequences and treat like all other sequences. This will be very expensive in space.

#. Enable transformations on read during connection (e.g., SeqRepo(root=..., uppercase=True)). This works well for uppercasing and ambiguity replacement (which are likely constant for the session), but is impractical for reverse complement. By changing Python list semantics, we could co-opt negative coordinates for this (e.g., sr["NM_01234.5"][-1000:-900] would provide the reverse complement of that range).

#. Create namespaces that imply certain transformations. For example, GRCh38uc (or GRCh38/uc) might indicate an uppercase transform of GRCh38.
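
The second option could look roughly like the sketch below. Everything here is hypothetical: `TransformingStore` and `StrandAwareSequence` are illustrative names, a plain dict stands in for the real sequence store, and the negative-coordinate convention (`[-b:-a]` means the reverse complement of `[a:b]`) is just one possible reading of the proposal.

```python
from typing import Mapping

# Simplified complement table (assumption): only A/C/G/T/N in either case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


class StrandAwareSequence:
    """Hypothetical sequence view where negative slices mean the reverse strand."""

    def __init__(self, seq: str):
        self._seq = seq

    def __getitem__(self, key: slice) -> str:
        if not isinstance(key, slice) or key.step is not None:
            raise TypeError("only simple slices are supported in this sketch")
        start, stop = key.start, key.stop
        if start is not None and stop is not None and start < 0 and stop < 0:
            # Assumed convention: [-b:-a] is the reverse complement of [a:b].
            forward = self._seq[-stop:-start]
            return forward.translate(_COMPLEMENT)[::-1]
        return self._seq[start:stop]


class TransformingStore:
    """Hypothetical wrapper that applies session-level transforms on read."""

    def __init__(self, backing: Mapping[str, str], uppercase: bool = False):
        self._backing = backing
        self._uppercase = uppercase

    def __getitem__(self, alias: str) -> StrandAwareSequence:
        seq = self._backing[alias]
        if self._uppercase:
            seq = seq.upper()
        return StrandAwareSequence(seq)


store = TransformingStore({"x": "aaccGGTTnn"}, uppercase=True)
forward = store["x"][1:4]      # bases 1..3 of the uppercased sequence
reverse = store["x"][-4:-1]    # reverse complement of the same range
```

One wrinkle with this convention is that a range starting at position 0 cannot be expressed on the reverse strand, because `-0` is `0` in Python. The sketch also loads the whole sequence before slicing; a real implementation would presumably fetch only the requested range.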

Notes / Challenges / Constraints

  • The most ineluctable challenge is what to call derived sequences. For example, despite the GRCh38 authoritative sequence presentation, most users will prefer the uppercase form (at least), but will want to call it GRCh38.

  • How should transformed sequences be named? Is the transformation part of the namespace, part of the identifier, or neither (i.e., the API just provides support for it)?

  • It's important to maintain the key-value interface of SeqRepo. Flags cannot be easily passed using array syntax.
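
One way to keep the key-value interface while still selecting a transformation is to encode it in the namespace, as in the GRCh38/uc idea above. The following is a sketch under stated assumptions: the `/`-separated suffix syntax, the codes `uc` and `rc`, and the names `parse_key` and `DerivedNamespaceStore` are all hypothetical, and a plain dict stands in for the underlying store.

```python
from typing import Callable, Dict, List, Mapping, Optional, Tuple

# Simplified complement table (assumption): only A/C/G/T/N in either case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# Hypothetical transform codes that may follow the namespace, e.g. "GRCh38/uc".
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "uc": str.upper,
    "rc": lambda seq: seq.translate(_COMPLEMENT)[::-1],
}


def parse_key(key: str) -> Tuple[Optional[str], List[str], str]:
    """Split 'namespace/code/...:alias' into (namespace, codes, alias).

    A key without ':' is treated as a bare alias with no namespace.
    """
    if ":" not in key:
        return None, [], key
    namespace, _, alias = key.partition(":")
    base, *codes = namespace.split("/")
    return base, codes, alias


class DerivedNamespaceStore:
    """Hypothetical key-value view that derives sequences from namespace suffixes."""

    def __init__(self, backing: Mapping[str, str]):
        self._backing = backing

    def __getitem__(self, key: str) -> str:
        base, codes, alias = parse_key(key)
        seq = self._backing[alias if base is None else f"{base}:{alias}"]
        for code in codes:  # apply transforms left to right
            seq = TRANSFORMS[code](seq)
        return seq


store = DerivedNamespaceStore({"GRCh38:1": "acgtNN"})
original = store["GRCh38:1"]
upper = store["GRCh38/uc:1"]
```

This leaves the naming question open: the sketch makes the transformation part of the namespace, so "GRCh38:1" still returns the authoritative presentation, which may not be what most users expect.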

@reece
Member Author

reece commented Feb 22, 2022

See #94

@reece reece closed this as completed Feb 22, 2022