Add strand support to AnnDataset and fix deterministic_shift #48
Now that we always use augmented sequences, the stranded logic is easier to handle and more robust, so all downstream functionality should work fine.
Tested, looks good. Can be merged, thanks for the fixes!
Adds support for stranded datasets (AnnData objects whose region indices have the form chr:start-end:strand, as in gene data) while preserving support for non-stranded datasets (such as ATAC data). I checked that data loading works with stranded and unstranded data, both in-memory and not in-memory, and with and without always_reverse_complement.
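To illustrate the two index formats described above, here is a minimal sketch of parsing a region string with an optional strand. The names `Region`, `parse_region`, and `reverse_complement` are hypothetical helpers for illustration, not the functions used in this PR, and the sketch assumes chromosome names contain no colon.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: Optional[str]  # "+", "-", or None for unstranded regions


def parse_region(region: str) -> Region:
    """Parse 'chr:start-end' or 'chr:start-end:strand' (hypothetical helper)."""
    parts = region.split(":")
    if len(parts) == 2:
        # Unstranded index, e.g. ATAC peaks.
        chrom, span = parts
        strand = None
    elif len(parts) == 3:
        # Stranded index, e.g. gene data.
        chrom, span, strand = parts
        if strand not in ("+", "-"):
            raise ValueError(f"Unknown strand {strand!r} in region {region!r}")
    else:
        raise ValueError(f"Cannot parse region {region!r}")
    start_str, end_str = span.split("-")
    return Region(chrom, int(start_str), int(end_str), strand)


# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence, e.g. for regions on the minus strand."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Under this sketch, a region on the `-` strand would be fetched from the genome as usual and then passed through `reverse_complement`, while unstranded regions (strand `None`) are left as they are.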
To do still:
Also has two small bugfixes:
- `get_embeddings()` was trying to save per-gene embeddings in `anndata.obsm` (which is for per-pseudobulk observations). Changed this to `.varm` and tested that it works.
- The `Crested.enhancer_design_*` functions were deriving start and end from the var columns, but that's not robust: the anndataset object always reads the full region string to get start and end, and indeed in my code (and I believe also if you use the expand region width function?) the regions and the start/end columns might be out of sync. Anyway, this was only used to get the model input size, so we can just use `self.model.input_shape` and not have to worry about it.

Edit: and a bigger bugfix: it fixes 'deterministic_shift', which was broken before and didn't actually shift the sequences.
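As a rough illustration of what a working deterministic shift has to do, the sketch below cuts a fixed-size window out of a sequence that was fetched with extra flanking bases, offset by a given shift. The function `shifted_window` and its parameters are hypothetical and do not reflect the actual implementation in this PR; it assumes the flanks are symmetric.

```python
def shifted_window(extended_seq: str, window: int, shift: int) -> str:
    """Return a `window`-long slice of `extended_seq`, offset by `shift` bases.

    Hypothetical helper: assumes `extended_seq` is the target region plus
    equal flanks on both sides, so shift=0 returns the centered region.
    """
    flank = (len(extended_seq) - window) // 2
    if abs(shift) > flank:
        raise ValueError(f"shift {shift} exceeds available flank of {flank} bases")
    start = flank + shift
    return extended_seq[start:start + window]
```

A simple regression check for the kind of bug described above is to assert that different shift values return different windows for a non-repetitive sequence, rather than the same unshifted one each time.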