AnnData Backed Mode Support #131

kennypavan · 2024-08-27T16:52:09Z

Hello,

I'm attempting to train a large model from an AnnData object; however, memory issues persist when opening the file on our HPC with 512 GB of RAM. Naturally, I've attempted to open the file with the AnnData "backed" parameter instead and received the error:

> train.py:Line 341 
> flag = indata.sum(axis = 0) == 0
> AttributeError: 'Dataset' object has no attribute 'sum'

This error seems reasonable, as many of the aggregating functions wouldn't have access to the entire AnnData object in backed mode. Increasing memory beyond 512 GB for this task is not an option given our resource limits. Before attempting to mitigate this by extending the train function to support backed mode, I'm wondering if there's an existing solution for processing large-scale, atlas-level datasets with >4 million cells?

Thank you,
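For reference, the reduction that fails in the traceback (`indata.sum(axis = 0) == 0`) can in principle be computed without holding the whole matrix in memory, by accumulating column sums over blocks of rows. The following is only a minimal sketch of that idea, not CellTypist code: the function name `all_zero_gene_flag` and the `chunk_size` parameter are hypothetical, and it assumes `X` exposes `.shape` and row slicing (`X[start:stop]`), as a NumPy array or a dense on-disk dataset does.

```python
import numpy as np


def all_zero_gene_flag(X, chunk_size=10_000):
    """Return a boolean vector marking columns (genes) whose sum is zero.

    Hypothetical helper: X only needs `.shape` and row slicing, so each
    block of rows is read and reduced separately.
    """
    n_cells, n_genes = X.shape
    col_sums = np.zeros(n_genes, dtype=np.float64)
    for start in range(0, n_cells, chunk_size):
        block = X[start:start + chunk_size]  # only this block is materialized
        # np.asarray(...).ravel() also flattens the matrix returned by sparse blocks
        col_sums += np.asarray(block.sum(axis=0), dtype=np.float64).ravel()
    return col_sums == 0


# Small in-memory demonstration with one gene forced to zero everywhere.
rng = np.random.default_rng(0)
demo = rng.poisson(1.0, size=(25, 6))
demo[:, 2] = 0
flag = all_zero_gene_flag(demo, chunk_size=10)
```

This covers only the single line from the traceback; the rest of the training code would need similar treatment before backed mode could work end to end.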


ChuanXu1 · 2024-09-04T20:54:43Z

@kennypavan, CellTypist does not support backed mode for the time being. You could, for example, load your raw count data, normalize and log1p-transform it, subset it to HVGs (highly variable genes), write the result out as a new AnnData file, and load that file for training. Note that you need to use check_expression = False and feature_selection = False for this data during training. In addition, you can also subset cells.

kennypavan · 2024-09-05T19:49:31Z

@ChuanXu1 Thank you for the suggestions. I'll explore whether preprocessing and removing non-HVGs will work for our use case. Much appreciated!
