Add strand support to AnnDataset and fix deterministic_shift #48

casblaauw · 2024-11-03T16:05:17Z

Adds support for stranded datasets (AnnData objects whose region indices have the form chr:start-end:strand, such as gene data) while preserving support for unstranded datasets (such as ATAC data). I checked that data loading works for both stranded and unstranded data, both in memory and not in memory, and with and without always_reverse_complement.
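For reviewers, here is a minimal sketch of what accepting both index formats could look like. It is not the code in this branch: `Region` and `parse_region` are hypothetical names used only for illustration, and the only thing taken from the PR is the region format (chr:start-end with an optional :strand that should be + or -).

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: Optional[str]  # "+" or "-", or None for unstranded regions


def parse_region(region: str) -> Region:
    """Parse 'chr:start-end' or 'chr:start-end:strand' into a Region."""
    parts = region.split(":")
    if len(parts) == 2:
        # Unstranded index, e.g. ATAC peaks.
        chrom, span = parts
        strand = None
    elif len(parts) == 3:
        # Stranded index, e.g. gene data.
        chrom, span, strand = parts
        if strand not in ("+", "-"):
            raise ValueError(f"strand should be '+' or '-', got {strand!r}")
    else:
        raise ValueError(f"unrecognised region string: {region!r}")
    start, end = span.split("-")
    return Region(chrom, int(start), int(end), strand)


unstranded = parse_region("chr1:100-200")
stranded = parse_region("chr1:100-200:-")
```

Returning `None` for the strand of unstranded regions keeps the two cases distinguishable for any caller that needs to treat them differently.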

To do still:

Check whether the downstream functions (Crested object functions such as predictions and explanations) still work. From reading the code, nothing should need changes, but I haven't tested them yet. I can do that, or someone else can quickly rerun an analysis notebook on this branch to see whether everything works.

Also has two small bugfixes:

get_embeddings() was trying to save per-gene embeddings in anndata.obsm (which is for per-pseudobulk observations). Changed this to .varm and tested that it works.
The Crested.enhancer_design_* functions were deriving start and end from the var columns, but that's not robust: the AnnDataset object always reads the full region string to get start and end, and in my code (and I believe also if you use the expand region width function?) the region strings and the start/end columns can be out of sync. Since this was only used to get the model input size, we can use self.model.input_shape instead and not have to worry about it.
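To illustrate the second fix, here is a small sketch of why the model's input shape is the safer source for the sequence length. The helper names and the example values are made up for illustration, and it assumes the model takes one-hot input of shape (batch, seq_len, 4), which is an assumption rather than something stated in this PR.

```python
def seq_len_from_input_shape(input_shape):
    """Return the sequence length from a (batch, seq_len, 4) input shape."""
    if len(input_shape) != 3:
        raise ValueError(f"expected (batch, seq_len, 4), got {input_shape!r}")
    return input_shape[1]


def width_from_region(region):
    """Return end - start from a 'chr:start-end[:strand]' region string."""
    span = region.split(":")[1]
    start, end = span.split("-")
    return int(end) - int(start)


# Hypothetical situation: the region string was resized, but the
# start/end columns in .var still hold the old coordinates.
region_name = "chr1:1000-3114"
stale_start, stale_end = 1500, 2000
input_shape = (None, 2114, 4)  # stand-in for self.model.input_shape

width_from_columns = stale_end - stale_start        # out of sync with the region
width_from_name = width_from_region(region_name)    # what the dataset would read
seq_len = seq_len_from_input_shape(input_shape)     # what the model expects
```

Here the region string and the model agree, while the stale columns give a different width, which is the kind of mismatch the change avoids.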

Edit: and a bigger bugfix: it fixes 'deterministic_shift', since it was broken before and didn't actually shift the sequences.
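For context, a rough sketch of what a deterministic shift is expected to produce: each region is expanded into a fixed set of shifted copies, and the sequence then has to be fetched from the shifted coordinates rather than the original ones. The function name, the number of shifts, and the stride below are placeholder assumptions, not values taken from this branch.

```python
def deterministic_shifts(region, n_shift=2, stride=50):
    """Return shifted copies of a 'chr:start-end[:strand]' region.

    n_shift and stride are illustrative defaults (assumptions).
    """
    parts = region.split(":")
    chrom, span = parts[0], parts[1]
    suffix = parts[2:]  # holds the strand if the region is stranded
    start, end = (int(x) for x in span.split("-"))
    shifted = []
    for step in range(-n_shift, n_shift + 1):
        offset = step * stride
        coords = f"{chrom}:{start + offset}-{end + offset}"
        # Re-attach the strand (if any) so stranded regions stay stranded.
        shifted.append(":".join([coords, *suffix]))
    return shifted


augmented = deterministic_shifts("chr1:1000-2000")
augmented_stranded = deterministic_shifts("chr1:1000-2000:-", n_shift=1)
```

The shifted strings only matter if the lookup uses them; fetching the sequence with the original region name after building the shifted index would return identical sequences for every copy.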

This reverts commit a150bbc.

This reverts commit bd27269.

casblaauw · 2024-11-04T12:38:59Z

Now that we've switched to always using augmented sequences, handling the stranded logic is easier and more robust, so all downstream functionality should work fine.
Since it passes all checks (which do also try the downstream options), I think this is ready for review and merging.
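As a sketch of why always carrying a strand simplifies things: if every entry in the augmented index has an explicit strand, reverse complementing becomes "flip the strand", and sequence fetching only needs one code path. The helper names below are hypothetical, and the plain string standing in for a genome is purely for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def fetch(chrom_seq, start, end, strand="+"):
    """Slice a chromosome string, reverse complementing on the - strand."""
    seq = chrom_seq[start:end]
    return reverse_complement(seq) if strand == "-" else seq


def with_reverse_complement(region):
    """Return a 'chr:start-end:strand' region on both strands."""
    chrom, span, strand = region.split(":")
    flipped = "-" if strand == "+" else "+"
    return [region, f"{chrom}:{span}:{flipped}"]


chrom_seq = "AACGTT"  # toy stand-in for a chromosome
both = with_reverse_complement("chr1:0-4:+")
forward = fetch(chrom_seq, 0, 4, "+")
reverse = fetch(chrom_seq, 0, 4, "-")
```

Under this scheme an unstranded region would be treated as + before augmentation, so downstream code never has to branch on whether a strand is present.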

LukasMahieu

Tested, looks good. Can be merged, thanks for the fixes!

cblaauw and others added 14 commits November 3, 2024 15:49

Add support for stranded data in AnnDataset

09007e2

Save get_embeddings data in correct slot (.varm)

308b9de

Make enhancer design seq_len use model instead of var start/end columns

1187608

Tell user if reverse complementing stranded data

0bf9f1e

Clarify that strand should be - or +

0d289da

Fix test: missing argument name in get_sequence call

c0b94f7

Fix not actually using augmented index when getting sequence

dba8e7d

Remove deterministic shift

a150bbc

Remove leftover deterministic shift function

bd27269

Simplify stranded handling now __getitem__ always gives stranded

d9250f4

Revert "Remove deterministic shift"

2f8a4c6

This reverts commit a150bbc.

Revert "Remove leftover deterministic shift function"

57a910e

This reverts commit bd27269.

Fix deterministic shift

87ae027

Slight clean up _load_sequences_into_memory

60c7169

casblaauw marked this pull request as ready for review November 4, 2024 12:38

casblaauw changed the title ~~Add strand support to AnnDataset~~ Add strand support to AnnDataset and fix deterministic_shift Nov 4, 2024

LukasMahieu self-requested a review November 13, 2024 15:11

LukasMahieu approved these changes Nov 13, 2024

casblaauw merged commit 904fa13 into main Nov 13, 2024
10 checks passed

casblaauw deleted the stranded_dataloader branch November 13, 2024 16:51
