
initial support for Dask DataFrames in obsm/varm #1880

Open · ilia-kats wants to merge 1 commit into main from dask_dataframe
Conversation

@ilia-kats (Contributor) commented Feb 26, 2025

This PR adds support for Dask DataFrames in .obsm/.varm.

  • Closes #
  • Tests added
  • Release note added (or unnecessary)
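
A minimal sketch of what this change would allow, assuming the branch is installed; the data, the `X_embedding` key, and the sizes are made up for illustration and are not taken from the PR's tests:

```python
import anndata as ad
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Toy AnnData with 100 observations and 50 variables.
adata = ad.AnnData(X=np.random.rand(100, 50))

# A lazy, partitioned DataFrame aligned to the observations.
embedding = pd.DataFrame(
    np.random.rand(100, 2),
    columns=["dim1", "dim2"],
    index=adata.obs_names,
)
ddf = dd.from_pandas(embedding, npartitions=4)

# With this PR, the Dask DataFrame can be stored directly in .obsm
# instead of being materialized to a pandas DataFrame first.
adata.obsm["X_embedding"] = ddf
```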

@ilia-kats ilia-kats force-pushed the dask_dataframe branch 3 times, most recently from 8a4761f to 2c3b39e Compare February 26, 2025 16:44
@ilia-kats ilia-kats marked this pull request as draft February 26, 2025 16:54
@ilia-kats (Contributor, Author) commented:

The minimum_versions test is failing due to an incompatibility between the old Dask version and Python 3.11.9 specifically; I'm not sure what to do here.


codecov bot commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 81.81818% with 4 lines in your changes missing coverage. Please review.

Project coverage is 83.60%. Comparing base (b2c7a21) to head (982f882).

Files with missing lines         Patch %   Lines
src/anndata/compat/__init__.py   20.00%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1880      +/-   ##
==========================================
- Coverage   86.11%   83.60%   -2.51%     
==========================================
  Files          40       40              
  Lines        6242     6258      +16     
==========================================
- Hits         5375     5232     -143     
- Misses        867     1026     +159     
Files with missing lines                Coverage Δ
src/anndata/_core/aligned_mapping.py    93.41% <100.00%> (ø)
src/anndata/_core/file_backing.py       88.79% <100.00%> (+0.09%) ⬆️
src/anndata/_core/storage.py            100.00% <100.00%> (ø)
src/anndata/_io/specs/methods.py        88.50% <100.00%> (-0.27%) ⬇️
src/anndata/tests/helpers.py            81.44% <100.00%> (-11.12%) ⬇️
src/anndata/utils.py                    83.62% <100.00%> (-0.68%) ⬇️
src/anndata/compat/__init__.py          77.39% <20.00%> (-5.72%) ⬇️

... and 4 files with indirect coverage changes

@ilia-kats ilia-kats marked this pull request as ready for review February 26, 2025 17:39
@ilan-gold (Contributor) commented Feb 27, 2025

@ilia-kats What is the use case here that #1247 would not address? I see this is about just sticking the object onto an AnnData, not reading from disk.

Furthermore, we have reports from SpatialData that Dask DataFrames are rather difficult to use, which matches our own experience and motivated the above PR to use xarray instead.

I had previously tried to do this, but there were several issues that made things quite unusable.

@ilan-gold (Contributor) commented Feb 27, 2025

I like that the PR is basically write-only, but want to understand more.

@ilia-kats (Contributor, Author) commented:

So my use case is that I have several AnnData objects that I need to concatenate, and I want to do that as lazily as possible, without allocating memory. What I'm doing is converting everything in the AnnData objects to Dask equivalents and then running ad.concat. I can't use AnnCollection because a) I need an outer join on vars, and b) AnnCollectionView fully materializes X upon access, while I typically don't need the full matrix; I just need to run some dimensionality-reducing calculations on it (e.g. mean or var along an axis). The only thing preventing me from sticking Dask DataFrames into obsm/varm is that one needs to call compute on the shape; everything else (storage etc.) I added for completeness' sake. Admittedly, I have not yet tested actually concatenating AnnDatas with Dask DataFrames in obsm/varm.
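
A rough sketch of that workflow, not taken from the PR itself: it assumes two illustrative files (`a.h5ad`, `b.h5ad`), a dense `X`, and `.obsm` entries that are pandas DataFrames, and the `to_dask` helper is hypothetical.

```python
import anndata as ad
import dask.array as da
import dask.dataframe as dd

def to_dask(adata: ad.AnnData, npartitions: int = 8) -> ad.AnnData:
    """Replace in-memory containers with lazy Dask equivalents (illustrative)."""
    # Assumes a dense numpy X; sparse matrices would need different handling.
    adata.X = da.from_array(adata.X, chunks="auto")
    for key, df in list(adata.obsm.items()):
        # Requires this PR: .obsm currently rejects Dask DataFrames because
        # validating their shape needs a compute.
        adata.obsm[key] = dd.from_pandas(df, npartitions=npartitions)
    return adata

adatas = [to_dask(ad.read_h5ad(p)) for p in ["a.h5ad", "b.h5ad"]]

# Outer join on vars, which AnnCollection does not offer; X stays lazy,
# so axis-wise reductions can run without materializing the full matrix.
combined = ad.concat(adatas, join="outer")
mean_per_gene = combined.X.mean(axis=0).compute()
```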

@ilan-gold (Contributor) commented Feb 28, 2025

> call compute on the shape

This was one of our stumbling blocks. It required a full pass over the data the last time we checked, which kind of defeats the purpose. The above PR I linked to has full lazy concatenation features.
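
For context, a small illustration of why the shape check is expensive; this is standard Dask DataFrame behaviour, not anything specific to this PR:

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=2)

# The row count is not known lazily: shape[0] is a Delayed object,
# and computing it generally means walking every partition.
print(ddf.shape)               # e.g. (Delayed('int-...'), 1)
print(ddf.shape[0].compute())  # 10 -- requires a pass over the data
```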

@ilia-kats (Contributor, Author) commented:

That is good to know; I'll wait for that PR to be merged then, I suppose.
