Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster logmap labels #209

Merged
merged 3 commits into from
Dec 9, 2024
Merged

faster logmap labels #209

merged 3 commits into from
Dec 9, 2024

Conversation

mihem
Copy link
Contributor

@mihem mihem commented May 27, 2024

Issue initially reported here:
satijalab/seurat#7879

Integration preprocessing took long (~10min in my case with 120 000 cells and 61 layers). Time was nearly completely spent on Seurat:::CreateIntegrationGroup , more specifically on:

    as.data.frame(x = labels(
      object = cmap,
      values = Cells(x = object[[assay]], layer = scale.layer)

Slow computing here was caused by the sapply function in seurat-object

seurat-object/R/logmap.R

Lines 238 to 247 in 58bf437

obs <- sapply(
X = values,
FUN = function(x) {
vals <- colnames(x = object)[which(x = object[x, , drop = TRUE])]
p()
return(vals)
},
simplify = FALSE,
USE.NAMES = TRUE
)

I rewrote this (also thanks to ChatGPT) using logical indexing.
This speeds up computation > 1000x in my use case from 10 min to less than 1 s. So "real" integration steps start nearly instantaneously.

For a reproducible example use:

library(Seurat)
library(SeuratData)
options(future.globals.maxSize = 1e9)

remotes::install_github("mihem/seurat-object@faster_logmap")

InstallData("pbmcsca")
obj <- LoadData("pbmcsca")

obj[["RNA"]] <- split(obj[["RNA"]], f = obj$Method)

obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)

layers <- Layers(object = obj, assay = "RNA", search = "data")
scale.layer <- Layers(object = obj, search = "scale.data")

system.time(
    groups <- Seurat:::CreateIntegrationGroups(obj[["RNA"]], layers = layers, scale.layer = scale.layer)
)

@mihem
Copy link
Contributor Author

mihem commented May 28, 2024

@Gesmira or @dcollins15
Would be nice if you could review that,. simple changes but enormous speedup.

@mihem
Copy link
Contributor Author

mihem commented Jul 18, 2024

An news here? already some weeks old @Gesmira @igrabski?

@mihem
Copy link
Contributor Author

mihem commented Dec 4, 2024

@dcollins15 @mojaveazure @rsatija anybody still working on seurat or seurat objects? There are so many great PRs, e.g. satijalab/seurat#8197 or satijalab/seurat#8271 or this one, which wait for review and which would really make Seurat faster, safer and better. I find this quite sad that we don't see any reaction from anyone of the teams for more than 6 months. Seurat is such a great package (and I've had great support e.g. by Gesmira), but Seurat would really require some more care since a year or so.

@dcollins15
Copy link
Collaborator

Hi @mihem,

Thanks for the input.

As I’m sure you can imagine, trying to balance development, maintenance, and user support for such widely used tools can be challenging for such a small group of developers. We appreciate your enthusiasm and your patience.

Your example is helpful as a simple integration test and for showing off the speedup. Since I wasn’t involved in the development of the LogMap datatype, it’d be nice to add a more focussed test to sanity check that label.LogMap is working as expected—I’ve taken the liberty:

# Tests for the LogMap class

test_that("`labels` generic works as expected for `LogMap` instances", {
  # Instantiate and populate a LogMap instance for testing.
  map <- LogMap(paste0("value_", 1:6))
  map[["label_a"]] <- c(1, 3)
  map[["label_b"]] <- c(2, 4)
  map[["label_c"]] <- c(2, 4, 6)
  map[["label_d"]] <- c(2, 4, 6)
  map[["label_e"]] <- 2
  map[["label_f"]] <- 4

  # Labels can be fetched for specified values.
  values <- c("value_1", "value_3")
  result_key <- c(value_1 = "label_a", value_3 = "label_a")
  expect_identical(result_key, labels(map, values = values))

  # For values with multiple labels, the first label is returned by default.
  values <- c("value_2", "value_4")
  result_key <- c(value_2 = "label_b", value_4 = "label_b")
  result <- labels(map, values = values)
  expect_identical(result_key, result)

  # The last value for each label can also be fetched.
  values <- c("value_1", "value_2", "value_4")
  result_key <- c(value_1 = "label_a", value_2 = "label_e", value_4 = "label_f")
  result <- labels(map, values = values, select = "last")
  expect_identical(result_key, result)

  # It is also possible to fetch the label that is shared by the most values
  # in the requested set. If multiple labels are equally common, the first
  # is returned.
  values <- c("value_2", "value_4", "value_6")
  result_key <- c(value_2 = "label_c", value_4 = "label_c", value_6 = "label_c")
  result <- labels(map, values = values, select = "common")
  expect_identical(result_key, result)

  # Label resolution is based on the column order of the underlying matrix
  label_order <- c(
    "label_a",
    "label_e",
    "label_b",
    "label_c",
    "label_e",
    "label_f"
  )
  map <- map[, label_order, drop = FALSE]
  values <- c("value_2", "value_4")
  result_key <- c(value_2 = "label_e", value_4 = "label_b")
  result <- labels(map, values = values)
  expect_identical(result_key, result)

  # The output order is taken from the `values` parameter.
  values <- c("value_3", "value_1")
  result_key <- c(value_3 = "label_a", value_1 = "label_a")
  result <- labels(map, values = values)
  expect_identical(result_key, result)

  # Values without labels are excluded in the return.
  values <- c("value_1", "value_5", "value_6")
  result_key <- c(value_1 = "label_a", value_6 = "label_c")
  result <- labels(map, values = values)
  expect_identical(result_key, result)

  # If no labels are found, an error is raised.
  expect_error(labels(map, "value_5"))
})

Before this PR is ready to merge, please:

  1. Update the base branch for this PR to main and then rebase accordingly.
  2. Add tests/testthat/test_logmap.R with the test described above.
  3. Bump the package version in the DESCRIPTION file (e.g. 5.0.99.9002 -> 5.0.99.9003)

Once those updates are complete I’ll be more than happy to merge this in—should not be too long before the next CRAN release 🚀

@mihem mihem force-pushed the faster_logmap branch 3 times, most recently from 58bc6d5 to 9f8b465 Compare December 7, 2024 17:28
@mihem
Copy link
Contributor Author

mihem commented Dec 7, 2024

@dcollins15 Thanks, this is great.

I absolutely understand that this is a large project with lots of user requests that require a lot of resources. And I still think that Seurat is the best single cell analysis tool out there by far. I was just worried because there hasn't been much response from the Seurat team in the last year (whereas in the three years or so before that, it worked pretty well) And @igrabski wrote in June that the team will have a look soon: satijalab/seurat#7879 (comment) . So I guess you can also imagine that at least for most PR (like this one), users also put some time and resources and would like to see this merged so that their work pays off and Seurat improves.

But now: great that you had a look and even provided tests, perfect!
I had some issues with rebasing because the default branch was master when i forked this repo, now you use main. But i hope that this is all good now. Otherwise happy to provide further input.

And I know that there is a lot of work and that you cannot do everything on your own and that I repeat myself. But there are so many great PRs out there and reviewing them should be much less work than implementing the feature. E.g. :
1)DietSeurat does not work, PR is here satijalab/seurat#8197 and this has been mentioned several times, e.g. here satijalab/seurat#9464
2) Increased compatibility with BPcells for RunPCA satijalab/seurat#8271.

@mihem mihem changed the base branch from develop to main December 7, 2024 18:21
@dcollins15
Copy link
Collaborator

Closing and re-opening to trigger CI checks.

@dcollins15 dcollins15 closed this Dec 9, 2024
@dcollins15 dcollins15 reopened this Dec 9, 2024
@dcollins15 dcollins15 merged commit e840bab into satijalab:main Dec 9, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants