Theta doesn't converge #235
I'm not sure what's going on with theta in your example. I'll try to get back to you on that, but in the meantime here's some example code that will run MCMC across different subsets of genes and combine the results - it should be a good deal faster:

library("BASiCS")
library("BiocParallel")
## register the number of cores here as normal with biocparallel
## you cant use multicoreparam currently with BASiCS
register(SnowParam(workers = 2))
## make some mock data
sce <- BASiCS_MockSCE(NGenes = 4000, NCells = 100)
## fit a model normally
fit_joint <- BASiCS_MCMC(sce,
## N,Thin,Burn here are far too low, as this is just for a demonstration
N = 2000, Thin = 10, Burn = 1000, Regression = TRUE, WithSpikes = TRUE
)
## fit a model across 4 cores, each with a subset of the genes
fit_split <- BASiCS_MCMC(sce,
## N,Thin,Burn here are far too low, as this is just for a demonstration
N = 2000, Thin = 10, Burn = 1000, Regression = TRUE, WithSpikes = TRUE,
## split the data into batches with 1000 genes each
SubsetBy = "gene", NSubsets = 4
)
## compare
BASiCS_ShowFit(fit_joint)
BASiCS_ShowFit(fit_split) |
Hi @binyaminZ. You mention that you have 20-30K cells per sample. Is there strong sub-structure in the data (i.e. do you have very distinct sub-types within those cells)? BASiCS is designed for supervised analysis in which you have sorted distinct populations a priori, so strong substructure will break some of the model assumptions. If that's the case, perhaps you can run BASiCS separately for each distinct cell type? That will also speed things up a lot, as you can run them in parallel.
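As a rough sketch of that per-cell-type idea (not from the thread itself; the cell_type column and the mock annotation are assumptions for illustration), one could split the SingleCellExperiment by cell type and run an independent chain per subset via BiocParallel::bplapply:

library("BASiCS")
library("BiocParallel")

## mock data plus a mock cell-type label, purely for illustration
sce <- BASiCS_MockSCE(NGenes = 1000, NCells = 300)
sce$cell_type <- sample(c("TypeA", "TypeB", "TypeC"), ncol(sce), replace = TRUE)

## one SingleCellExperiment per cell type
sce_list <- lapply(unique(sce$cell_type), function(ct) sce[, sce$cell_type == ct])
names(sce_list) <- unique(sce$cell_type)

## one independent chain per cell type, run on separate workers
## (N, Thin, Burn are far too low here; for illustration only)
fits <- bplapply(
  sce_list, BASiCS_MCMC,
  N = 2000, Thin = 10, Burn = 1000, Regression = TRUE, WithSpikes = TRUE,
  BPPARAM = SnowParam(workers = length(sce_list))
)

Each element of fits is then a separate BASiCS_Chain for one cell type, and the chains run concurrently rather than sequentially.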
Hi @catavallejos,
Dear @alanocallaghan, Thanks!
That's right, and no, it probably won't provide any benefit over just doing 2 subsets. That's because you'll still have just two cores running sequentially through basically the same amount of work.
I don't know how well Threads will work with the divide and conquer approach above; honestly, to be safe I'd just set Threads = 1 and use the workers for parallelisation.
The main issue I saw when prototyping and testing this was that estimating the mean vs overdispersion trend becomes very unstable with a small number of genes. The code tries to divide into batches semi-intelligently, but even so, if you have only a few hundred genes the trend is very noisy. Based on what I saw in testing, 500 genes per subset is about the minimum, but if you can run with 1000 then that should be a bit more reliable.
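To make that concrete, here is a minimal sketch (my own illustration, reusing the same mock setup as the example above) that derives NSubsets from the number of genes so that each subset keeps roughly 1000 genes, and sets the Threads argument mentioned above to 1 so that all parallelism comes from the registered workers:

library("BASiCS")
library("BiocParallel")

register(SnowParam(workers = 4))

sce <- BASiCS_MockSCE(NGenes = 4000, NCells = 100)

## aim for roughly 1000 genes per subset (never fewer than one subset)
n_subsets <- max(1, floor(nrow(sce) / 1000))

fit_split <- BASiCS_MCMC(sce,
  ## N, Thin, Burn are far too low here; for illustration only
  N = 2000, Thin = 10, Burn = 1000, Regression = TRUE, WithSpikes = TRUE,
  SubsetBy = "gene", NSubsets = n_subsets,
  ## keep within-chain threading off; the workers provide the parallelism
  Threads = 1
)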
Thanks!! I understand it better now. (I have no experience with parallel processing in R, sorry for the confusion.)
That's interesting! I don't know for sure what's going on there, as different BiocParallel backends can be a bit odd. Can you post the rest of your code (just the stuff setting up BiocParallel and BASiCS)?
Sorry for the delay getting back to this - for future reference, it's easier to debug stuff like this if you give a reproducible example (i.e. something I can run locally without modifying, not something that references files you have on your computer that I don't have access to). I'll patch out the code that causes this error in the next Bioc version. As a quick fix, you can leave out the

library("BASiCS")
library("Seurat")
library("data.table")
library("BiocParallel")
library("parallelly")
ncors <- availableCores()
register(SnowParam(workers = ncors))
data <- BASiCS_MockSCE()
samp <- "Run1"
basics_results <- BASiCS_MCMC(
  data, N = 200, Thin = 20, Burn = 100, WithSpikes = FALSE,
  Regression = FALSE, PrintProgress = FALSE,
  SubsetBy = "gene", NSubsets = ncors
)
saveRDS(basics_results, paste0(samp, ".rds"))
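And, as a small hedged sketch for the convergence question in the title (continuing from the basics_results object above; plot() and displayChainBASiCS() are the usual BASiCS_Chain accessors, and Batch = 1 is just an example batch), the theta draws can be inspected directly:

## trace and density of theta for the first batch
plot(basics_results, Param = "theta", Batch = 1)

## raw posterior draws of theta, for custom summaries
theta_draws <- displayChainBASiCS(basics_results, Param = "theta")
summary(theta_draws)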
Hi!
I'm running BASiCS with a large dataset, with 20-30K cells per sample. For noise quantification, I down-sampled it to ~5K cells. This data has no spike-ins; if I understand correctly, batch information is somehow used instead of spike-ins. The problem is that in my MCMC chains all parameters converge nicely (with N = 20000, Thin = 20, Burn = 10000) except for Theta, which obviously doesn't and is approaching zero. I don't care so much about this; it seems like my replicates are very similar and there is basically no batch contribution to mean or noise. Might this have a negative impact on the estimates of the other parameters? Or can I ignore it?
Finally, I would like to run BASiCS on the complete dataset. The problem is that the down-sampled dataset already runs for ~70 hours, so doing the same for the complete dataset would take far too long. What might help? Decreasing the number of genes quantified? Splitting the two batches into smaller groups (probably not optimal for many genes with sparse expression)?
Thanks!