Problem
It's often necessary to derive sequences from others, especially genomic sequences used for alignment (see Heng Li's blog post).
As a specific example, the official GRCh38 sequences use lower case to indicate masked regions and use IUPAC ambiguity characters. A derived sequence (e.g., the uppercased form) has the same length as the official one, but different content. Nonetheless, people want to refer to derived sequences using names like "GRCh38:1" or perhaps even "NC_000001.11".
Example transformations are 1) uppercase, 2) replace ambiguity with N, 3) reverse complement, and combinations of these.
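The three transformations can be sketched as plain string functions. This is only an illustration: the function names are hypothetical and not part of SeqRepo, and the complement table assumes the standard IUPAC nucleotide codes.

```python
# Complement table for IUPAC nucleotide codes, both cases.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)

# Every ambiguity code other than N maps to N (case is preserved).
_AMBIGUITY_TO_N = str.maketrans(
    "RYSWKMBDHVryswkmbdhv",
    "N" * 10 + "n" * 10,
)


def uppercase(seq: str) -> str:
    """Drop the soft-masking information carried by lower case."""
    return seq.upper()


def replace_ambiguity(seq: str) -> str:
    """Replace IUPAC ambiguity characters with N."""
    return seq.translate(_AMBIGUITY_TO_N)


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Transformations compose; each one preserves sequence length.
    print(reverse_complement(replace_ambiguity(uppercase("aacgtr"))))
```

Because each transformation preserves length, coordinates on the original sequence remain meaningful for the uppercase and ambiguity-replaced forms; for the reverse complement they refer to the opposite strand.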
Possible solutions
#. Precompute all transformed sequences and treat like all other sequences. This will be very expensive in space.
#. Enable transformations on read, configured at connection time (e.g., SeqRepo(root=..., uppercase=True)). This works well for uppercasing and ambiguity replacement (which are likely constant for the session), but is impractical for reverse complement. By changing Python list semantics, we could co-opt negative coordinates for this (e.g., sr["NM_01234.5"][-1000:-900] would provide the reverse complement of that range).
#. Create namespaces that imply certain transformations. For example, GRCh38uc (or GRCh38/uc) might indicate an uppercase transform of GRCh38.
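As a rough sketch of the third option, a namespace suffix could be parsed into a list of transformations applied at lookup time. Everything here is an assumption made for illustration: the "/" separator, the "uc" and "rc" codes, the "namespace:alias" key format, and the plain dict standing in for the sequence store are not SeqRepo conventions.

```python
from typing import Callable, Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Hypothetical suffix codes and the transformation each one implies.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "uc": str.upper,
    "rc": lambda s: s.translate(_COMPLEMENT)[::-1],
}


def parse_key(key: str) -> Tuple[str, List[str], str]:
    """Split 'GRCh38/uc:1' into ('GRCh38', ['uc'], '1')."""
    namespace, sep, alias = key.partition(":")
    if not sep:
        raise KeyError(f"expected 'namespace:alias', got {key!r}")
    base, *codes = namespace.split("/")
    unknown = [c for c in codes if c not in TRANSFORMS]
    if unknown:
        raise KeyError(f"unknown transformation code(s): {unknown}")
    return base, codes, alias


def fetch(store: Dict[str, str], key: str) -> str:
    """Look up the untransformed sequence, then apply the implied transforms in order."""
    base, codes, alias = parse_key(key)
    seq = store[f"{base}:{alias}"]
    for code in codes:
        seq = TRANSFORMS[code](seq)
    return seq


if __name__ == "__main__":
    store = {"GRCh38:1": "acgtNNac"}  # toy stand-in for a stored sequence
    print(fetch(store, "GRCh38:1"))
    print(fetch(store, "GRCh38/uc:1"))
```

Only the untransformed sequence is stored, so this avoids the space cost of the first option, but the transformation becomes visible in the name, which is exactly the naming question raised in the notes.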
Notes / Challenges / Constraints
The most ineluctable challenge is what to call derived sequences. For example, despite the GRCh38 authoritative sequence presentation, most users will prefer the uppercase form (at least), but will want to call it GRCh38.
How should transformed sequences be named? Is the transformation part of the namespace, the identifier, or neither (i.e., the API just provides support for it)?
It's important to maintain the key-value interface of SeqRepo. Flags cannot be easily passed using array syntax.
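To illustrate the constraint, the sketch below keeps the `repo[key][start:end]` form and moves the flags to construction time, since `__getitem__` receives only the key. The class name, its parameters, and the dict-backed store are hypothetical stand-ins and do not describe SeqRepo's actual implementation.

```python
from typing import Mapping

# Ambiguity codes other than N, mapped to N (case preserved).
_AMBIGUITY_TO_N = str.maketrans(
    "RYSWKMBDHVryswkmbdhv",
    "N" * 10 + "n" * 10,
)


class TransformingRepo:
    """Hypothetical read-only wrapper that applies session-wide transforms on read."""

    def __init__(
        self,
        store: Mapping[str, str],
        uppercase: bool = False,
        replace_ambiguity: bool = False,
    ) -> None:
        self._store = store
        self._uppercase = uppercase
        self._replace_ambiguity = replace_ambiguity

    def __getitem__(self, key: str) -> str:
        # Only the key arrives here, so per-lookup flags have nowhere to go.
        seq = self._store[key]
        if self._uppercase:
            seq = seq.upper()
        if self._replace_ambiguity:
            seq = seq.translate(_AMBIGUITY_TO_N)
        return seq


if __name__ == "__main__":
    repo = TransformingRepo(
        {"GRCh38:1": "acgtRYnn"}, uppercase=True, replace_ambiguity=True
    )
    print(repo["GRCh38:1"][2:6])
```

For simplicity the sketch transforms the whole sequence before slicing. Both transforms act position by position, so a real implementation could slice first and transform only the requested range.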