
Multi-GPU support with dask #179

Open · wants to merge 100 commits into main
Conversation

@Intron7 Intron7 (Member) commented Apr 25, 2024

This PR adds dask support; a rough usage sketch follows the function list below.

Functions to add:

  • calculate_qc_metrics
  • normalize_total
  • log1p
  • highly_variable_genes with seurat and cell_ranger
  • scale
  • PCA
  • neighbors
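
A rough end-to-end sketch of the intended usage (a sketch only, assuming adata.X already holds a dask array with CuPy chunks, dense or CSR, and the usual rsc.pp entry points):

import rapids_singlecell as rsc

# adata is assumed to be an AnnData whose X is a dask array
# with CuPy chunks (dense or CSR)
rsc.pp.calculate_qc_metrics(adata)
rsc.pp.normalize_total(adata, target_sum=1e4)
rsc.pp.log1p(adata)
rsc.pp.highly_variable_genes(adata, flavor="seurat")
rsc.pp.scale(adata)
rsc.pp.pca(adata, n_comps=50)
rsc.pp.neighbors(adata)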

@Intron7 Intron7 marked this pull request as draft April 25, 2024 13:08
@Intron7 Intron7 changed the title add first functions Multi-GPU support with dask Apr 30, 2024
@Intron7 Intron7 added the run-gpu-ci runs GPU CI label May 3, 2024
@Intron7 Intron7 marked this pull request as ready for review May 13, 2024 14:27
@Intron7 Intron7 added invalid This doesn't seem right run-gpu-ci runs GPU CI labels Nov 12, 2024
@github-actions github-actions bot removed the run-gpu-ci runs GPU CI label Nov 12, 2024
@ilan-gold ilan-gold left a comment


From what I can tell, CSC is also just not really mentioned? It's not supported, but maybe we should throw an error or something?

@@ -37,6 +37,7 @@ doc = [
    "scanpydoc[typehints,theme]>=0.9.4",
    "readthedocs-sphinx-ext",
    "sphinx_copybutton",
+   "dask",


These should be added to the dependencies.

Comment on lines 432 to 436
adata_subset = adata[adata.obs[batch_key] == batch].copy()

calculate_qc_metrics(adata_subset, layer=layer)
filt = adata_subset.var["n_cells_by_counts"].to_numpy() > 0
-adata_subset = adata_subset[:, filt]
+adata_subset = adata_subset[:, filt].copy()


So this is unrelated to this PR?

Comment on lines 432 to 436 (same hunk as above)


Can we add a test for the "weird stuff"?

"""


def _sparse_qc_csr_dask_cells(dtype):


I 100% agree with this. If it is a separate PR, that's fine by me. But we also need to be able to maintain this library if a fix needs to be made. This sort of thing goes a long way towards making things easier for the next person.


def _normalize_total(X: ArrayTypesDask, target_sum: int):


Are we just completely punting on CSC matrices?

@Intron7 Intron7 (Member Author) commented


For CSC we convert to CSR in normalize, and that's always done.
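
For reference, a minimal sketch of what that block-wise conversion could look like (the helper name _to_csr is illustrative, not the PR's actual code):

import cupy as cp
import dask.array as da
from cupyx.scipy import sparse as cpsparse

def _to_csr(X: da.Array) -> da.Array:
    # convert every CSC chunk to CSR so the row-major kernels apply
    return X.map_blocks(
        lambda blk: blk.tocsr(),
        meta=cpsparse.csr_matrix(cp.empty((0, 0), dtype=X.dtype)),
    )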

from ._kernels._norm_kernel import _mul_csr

mul_kernel = _mul_csr(X.dtype)
mul_kernel.compile()


I agree with Phil here. Calling compile should be applied universally on first access, as needed, instead of having to remember to do it manually. We should make sure that if someone adds a new feature to RSC and forgets to call compile, nothing can break, even if you're out of commission.

Comment on lines +162 to +171
def _get_target_sum_csr(X: sparse.csr_matrix) -> int:
    from ._kernels._norm_kernel import _get_sparse_sum_major

    counts_per_cell = cp.zeros(X.shape[0], dtype=X.dtype)
    sum_kernel = _get_sparse_sum_major(X.dtype)
    sum_kernel(
        (X.shape[0],),
        (64,),
        (X.indptr, X.data, counts_per_cell, X.shape[0]),
    )


Please refactor these lines into their own function that can be applied to each block of the dask array, then do the masking + median at the end, after either map_blocks or _get_target_sum_csr. We use this "recursive" map_blocks pattern in scanpy a lot; it works very well and makes things easy to reason about. These functions are basically identical. Especially if compiling really costs nothing extra, as you say, this should be doable. Maybe I'm missing something, though. This seems like a good point in favor of "add compile to some always-called function", as Phil was saying.
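
Roughly the shape I mean (a sketch only, with a plain per-block .sum standing in for the custom kernel; the names are illustrative):

import cupy as cp
import dask.array as da

def _sum_major_block(block):
    # row sums of one CuPy chunk (CSR or dense), as a 1D array
    return cp.asarray(block.sum(axis=1)).ravel()

def _get_target_sum(X: da.Array) -> float:
    counts_per_cell = X.map_blocks(
        _sum_major_block,
        chunks=(X.chunks[0],),
        drop_axis=1,
        dtype=X.dtype,
        meta=cp.array([]),
    ).compute()
    counts_per_cell = counts_per_cell[counts_per_cell > 0]  # mask empty cells
    return float(cp.median(counts_per_cell))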

@Intron7 Intron7 (Member Author) commented


I see where you are coming from with this one. However, I feel the compile might make this challenging. I have an idea on how to maybe fix this, but it needs some testing and investigation: I could wrap the CUDA kernel factory to call compile on the returned kernel. But I don't know whether that then executes on the worker or the host.
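
One possible shape for that wrapper (a sketch; _MUL_CSR_SRC stands in for the actual kernel source string). If the factory is invoked inside the function passed to map_blocks, compilation should happen on whichever process first calls it, i.e. the worker, and CuPy caches compiled kernels per process:

from functools import cache

import cupy as cp

@cache
def _mul_csr(dtype_name: str) -> cp.RawKernel:
    # build, compile, and cache one kernel per dtype so callers
    # never have to remember to call .compile() themselves
    kernel = cp.RawKernel(_MUL_CSR_SRC.format(dtype=dtype_name), "mul_csr")
    kernel.compile()
    return kernel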

    chunks=(X.chunksize[0],),
    drop_axis=1,
)
counts_per_cell = target_sum_chunk_matrices.compute()


Why? Too much for this PR? Make an issue maybe?

svd_solver = "jacobi"
pca_func = PCA(n_components=n_comps, svd_solver=svd_solver, whiten=False)
X_pca = pca_func.fit_transform(X)
X_pca = X_pca.compute_chunk_sizes()


Why do we need to compute chunk sizes here?


Please add a comment

@Intron7 Intron7 (Member Author) commented


I added a comment for PCA's X_pca = X_pca.compute_chunk_sizes().

For the copying in HVG: since I use calculate_qc to find where to subset, I wanted to be sure the original data doesn't get overwritten. In addition, dask GPU views are sometimes a bit weird, so copying makes this more solid in general. That applies most to slicing along the minor axis.
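
For context, a dask-only illustration of the chunk-size issue: operations whose output shape isn't known ahead of time leave NaN chunk sizes, and downstream steps that need .shape or .chunks fail until the sizes are materialized:

import dask.array as da

x = da.ones((100, 10), chunks=(25, 10))
y = x[x[:, 0] > 0]           # boolean indexing -> unknown (nan) chunk sizes
print(y.chunks)              # ((nan, nan, nan, nan), (10,))
y = y.compute_chunk_sizes()  # walks the graph once to resolve the sizes
print(y.chunks)              # ((25, 25, 25, 25), (10,))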

@Intron7 Intron7 (Member Author) commented


I also don't want dask to be a dependency, because cuml handles that; I don't want to get into problems there.

src/rapids_singlecell/preprocessing/_scale.py (outdated review thread, resolved)
@Intron7 Intron7 removed the invalid This doesn't seem right label Nov 13, 2024
@Intron7 Intron7 added the run-gpu-ci runs GPU CI label Nov 13, 2024
@Intron7 Intron7 (Member Author) commented Nov 13, 2024

I renamed the QC functions and some of the variables, so it's a bit clearer what's happening.

@github-actions github-actions bot removed the run-gpu-ci runs GPU CI label Nov 13, 2024
@ilan-gold ilan-gold left a comment


https://github.com/scverse/rapids_singlecell/pull/179/files#r1838498091 is not done, and from what I can tell #179 (review) has not been addressed. What happens if you pass a CSC dask array to pca?

@Intron7 Intron7 (Member Author) commented Nov 14, 2024

> https://github.com/scverse/rapids_singlecell/pull/179/files#r1838498091 is not done, and from what I can tell #179 (review) has not been addressed. What happens if you pass a CSC dask array to pca?

That will just error and tell the user to provide dense or CSR as the meta. I updated _check_gpu_X to reflect that (rough sketch of the check below).

I'll test the median today.
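
Roughly what that check amounts to (a sketch, not the exact implementation):

import dask.array as da
from cupyx.scipy import sparse as cpsparse

def _check_gpu_X(X):
    # reject dask arrays whose meta is a CSC matrix
    if isinstance(X, da.Array) and isinstance(X._meta, cpsparse.csc_matrix):
        raise TypeError(
            "CSC-backed dask arrays are not supported; "
            "please provide dense or CSR chunks as the meta."
        )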

@ilan-gold ilan-gold commented

We should look into the cost of allocating ahead of time for all operations that are currently in-place (one quick way to measure this is sketched below).
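
A sketch using CuPy's benchmark helper (the array size is arbitrary):

import cupy as cp
from cupyx.profiler import benchmark

X = cp.random.rand(50_000, 2_000, dtype=cp.float32)

def in_place(X):      # mutates the existing buffer
    X *= 2.0

def allocating(X):    # allocates a same-sized output first
    return X * 2.0

print(benchmark(in_place, (X,), n_repeat=20))
print(benchmark(allocating, (X,), n_repeat=20))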

@Intron7 Intron7 (Member Author) commented Nov 21, 2024

Median out of core is a bad choice: it uses way more memory and is slower. Lose-lose.
