Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add strand support to AnnDataset and fix deterministic_shift #48

Merged
merged 14 commits into from
Nov 13, 2024

Conversation

casblaauw
Copy link
Collaborator

@casblaauw casblaauw commented Nov 3, 2024

Adds support for stranded datasets (so anndatas with indices (chr:start-end:strand), like gene data) while preserving support for non-stranded datasets (like ATAC data). I checked whether loading data works with stranded and unstranded data in-memory and not in-memory and with/without always_reverse_complement.

To do still:

  • Check whether the downstream functions (crested object functions, like preds and explanations) still work. I looked at the code and nothing should need changes, but I haven't actually tested them yet - I can do that or someone else can quickly rerun an analysis notebook with this branch to see if everything works.

Also has two small bugfixes:

  • get_embeddings() was trying to save per-gene embeddings in anndata.obsm (which is for per-pseudobulk observations). Changed this to .varm and tested that it works.
  • the Crested.enhancer_design_* functions were deriving start and end from the var columns, but that's not robust: the the anndataset object always reads the full region string to get start and end, and indeed in my code (and I believe also if you use the expand region width function?) the regions and the start/end might be out of sync. Anyway, this was to get the model input size, so we can just use self.model.input_shape and not have to worry about it.

Edit: and a bigger bugfix: it fixes 'deterministic_shift', since it was broken before and didn't actually shift the sequences.

@casblaauw casblaauw marked this pull request as ready for review November 4, 2024 12:38
@casblaauw
Copy link
Collaborator Author

Now that we've swapped to always using augmented sequences, it makes handling the stranded logic easier and more robust, meaning all downstream functioning should work fine.
Since it passes all checks (which do also try the downstream options), I think this is ready for review and merging.

@casblaauw casblaauw changed the title Add strand support to AnnDataset Add strand support to AnnDataset and fix deterministic_shift Nov 4, 2024
@LukasMahieu LukasMahieu self-requested a review November 13, 2024 15:11
Copy link
Collaborator

@LukasMahieu LukasMahieu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested, looks good. Can be merged, thanks for the fixes!

@casblaauw casblaauw merged commit 904fa13 into main Nov 13, 2024
10 checks passed
@casblaauw casblaauw deleted the stranded_dataloader branch November 13, 2024 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants