Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long runtime for simulating spatial data #22

Open
yollct opened this issue Dec 12, 2023 · 14 comments
Open

Long runtime for simulating spatial data #22

yollct opened this issue Dec 12, 2023 · 14 comments

Comments

@yollct
Copy link

yollct commented Dec 12, 2023

Hi scDesign3 developers,

Thank you for maintaining the amazing tool!
When I was simulating the data from https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-2-sagittal-anterior-1-standard-1-1-0, scDesign3 seems to stuck in 'Marginal Fitting' for a long time. Doesn't matter if I just simulate for 10 genes, it stucks at Marginal Fitting. I will very much appreciate your help!! Thanks :)

Best,
Chit Tong

Here is the code I used:

print("read reference data")
brain <- Load10X_Spatial(data.dir="/anterior_data/",
                        filename="filtered_feature_bc_matrix.h5",
                        assay="Spatial",
                        slice="slice1"
)
brain <- SCTransform(brain, assay = "Spatial", verbose = FALSE)
brain <- RunPCA(brain, assay = "SCT", verbose = FALSE)
brain <- FindNeighbors(brain, reduction = "pca", dims = 1:30)
brain <- FindClusters(brain, verbose = FALSE)
brain <- RunUMAP(brain, reduction = "pca", dims = 1:30)
SpatialFeaturePlot(brain, features = "nCount_Spatial") + theme(legend.position = "right")

brain[['spatial1']] <- GetTissueCoordinates(brain)$imagerow 
brain[['spatial2']] <- GetTissueCoordinates(brain)$imagecol
brain.sce <- as.SingleCellExperiment(brain)

brain.sce <- brain.sce[1:10]

print("simulation")
set.seed(123)

example_simu <- scdesign3(
    sce = brain.sce,
    assay_use = "counts",
    celltype = "ident", ##ident or cell_type
    pseudotime = NULL,
    spatial = c("spatial1", "spatial2"),
    other_covariates = NULL,
    mu_formula = "s(spatial1, spatial2, bs = 'gp', k= 400)",
    sigma_formula = "1",
    family_use = "nb",
    n_cores = 30,
    usebam = FALSE,
    corr_formula = "1",
    copula = "gaussian",
    DT = TRUE,
    pseudo_obs = FALSE,
    return_model = FALSE,
    nonzerovar = FALSE,
    parallelization = "pbmcapply"
  )
@yollct
Copy link
Author

yollct commented Dec 12, 2023

Update, when I run for only 10 genes, it works. It is only taking longer than I expected. For the full dataset, it was running for four days and stopped by our server due to the time limit.

@SONGDONGYUAN1994
Copy link
Owner

Hi,
Sorry for the delay in response. Unfortunately the estimation of spatial data can be time consuming. Two solutions you can try:

  1. Set k = 100 instead of 400. The time increases almost quadratically with k. Smaller k may result in slight under-fit but usually 100 is large enough.
  2. Set usebam = TRUE. It will use a faster estimation.

Please let me know if it works for you!

Best,
Dongyuan

@lucas02061
Copy link

Hi,

When I try to run the tutorial code with the changes you mentioned I get the following warning?

Warning in estimate.theta(theta, family, y, mu, scale = scale1, wt = G$w, :
step failure in theta estimation

Is this a problem?

Best,
Lucas

@SONGDONGYUAN1994
Copy link
Owner

@lucas02061 Hi Lucas,
This is an issue with bam: https://stats.stackexchange.com/questions/401278/step-failure-in-theta-estimation-using-bam-in-mgcv
Based on my experience, the function does not converge, but usually, it will not be too bad. I think you are probably OK with the result.

Best,
Dongyuan

@lucas02061
Copy link

Hi @SONGDONGYUAN1994,

Are there any plans of adding a progress bar to scDesign3? The Marginal Fitting step is very slow on spatial transcriptomics data even with the settings you mentioned earlier. It is hard to know if it is unfeasible to do a run on a given dataset or it is almost done running without a progress bar.

Best,
Lucas

@SONGDONGYUAN1994
Copy link
Owner

@lucas02061 Thanks for raising this problem. Are you still using Windows right now? On Linux, you can use pbmcapply, but on Windows, I am not sure (there should be one); @qw130 Qingyang, do you have any idea?

Unfortunately, I am traveling in the next two weeks but I will begin on improving the speed after that. Could you first subsample ~ 100 genes and tell me the running time? Thanks!

Best,
Dongyuan

@lucas02061
Copy link

For only a 100 genes using the following settings it takes me approximatly15 minutes to run. Once I start going higher on genes though it starts taking many hours.

example_simu <- scdesign3(
sce = SCD_Input,
assay_use = "counts",
celltype = "cell_type",
pseudotime = NULL,
spatial = c("X", "Y"),
other_covariates = NULL,
mu_formula = "s(X, Y, bs = 'gp', k= 100)", # should be 100-400
sigma_formula = "1",
family_use = "nb",
n_cores = 2,
usebam = TRUE,
corr_formula = "1",
copula = "gaussian",
DT = TRUE,
pseudo_obs = FALSE,
return_model = FALSE,
nonzerovar = FALSE,
parallelization = "bpmapply"
)

Thanks for your help.

Best,
Lucas

@lucas02061
Copy link

Hi @SONGDONGYUAN1994,

I tried running it on our server (which is quite heavy) for 24 hours and it is still stuck on marginal fitting. It still seems incredibly slow with the settings provided in this thread.

I used the following dataset:
class: SingleCellExperiment
dim: 18078 684
metadata(0):
assays(1): counts
rownames(18078): Xkr4 Sox17 ... CAAA01118383.1 CAAA01147332.1
rowData names(1): gene_id
colnames(684): AAAGACCCAAGTCGCG-1 AAAGGGATGTAGCAAG-1 ... TTGTGGCCCTGACAGT-1 TTGTTCAGTGTGCTAC-1
colData names(3): X Y cell_type
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
class: SingleCellExperiment
dim: 18078 684
metadata(0):
assays(1): counts
rownames(18078): Xkr4 Sox17 ... CAAA01118383.1 CAAA01147332.1
rowData names(1): gene_id
colnames(684): AAAGACCCAAGTCGCG-1 AAAGGGATGTAGCAAG-1 ... TTGTGGCCCTGACAGT-1 TTGTTCAGTGTGCTAC-1
colData names(3): X Y cell_type
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

And I used the following settings.
example_simu <- scdesign3(
sce = SCD_Input,
assay_use = "counts",
celltype = "cell_type",
pseudotime = NULL,
spatial = c("X", "Y"),
other_covariates = NULL,
mu_formula = "s(X, Y, bs = 'gp', k= 100)", # should be 100-400
sigma_formula = "1",
family_use = "nb",
n_cores = 2,
usebam = TRUE,
corr_formula = "1",
copula = "gaussian",
DT = TRUE,
pseudo_obs = FALSE,
return_model = FALSE,
nonzerovar = FALSE,
parallelization = "pbmcapply",
BPPARAM=NULL
)

Am I doing something wrong?

Best regards,
Lucas

@SONGDONGYUAN1994
Copy link
Owner

Hi Lucas,
@lucas02061 I feel that it is probably caused by some very sparse genes; models on those genes might be hard to converge. That is not your fault - could you check if the running time is related to this? I can fix it later. @qw130 Hi Qingyang, could you help Lucas on this? Sorry that I am traveling next week.

@fbao-fudan
Copy link

Dear Dongyuan,

I also have the same issue in fitting spatial data. Can I feed in 10 genes each time, and repeat the fitting multiple times in independent runs to get the result?

Best,
Feng

@SONGDONGYUAN1994
Copy link
Owner

Hi Feng,
If you ignore the gene-gene correlation structure (or you believe that all variations should be explained by spatial coordinates), this is totally OK. It would be better that you can first divide genes into gene modules/groups then do this fitting rather than randomly split genes into 10-gene groups.

@fbao-fudan
Copy link

Thank you. That is a great point. I will use gene module groups instead.

@TotallyRealHolm
Copy link

Hi developers, I want to add scDesign3 to a benchmarking paper that my research group is working on, but I also have trouble with running the tool for spatial data on a decently beefy linux server, where the server gets stuck on marginal fitting. It took approximately one hour and 24 minutes to run on the dataset from the tutorial for spatial data, is it expected for the tool to take a long time running on the this dataset?

I make use of the following parameters with multiple cores to speed it up.
example_simu <- scdesign3( sce = SCD_Input, assay_use = "counts", celltype = "cell_type", pseudotime = NULL, spatial = c("X", "Y"), other_covariates = NULL, mu_formula = "s(X, Y, bs = 'gp', k= 400)", sigma_formula = "1", family_use = "nb", n_cores = 16, usebam = FALSE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE, parallelization = "pbmcapply" ))

Best Regards Nikolaj.

@SONGDONGYUAN1994
Copy link
Owner

@TotallyRealHolm
Dear Nikolaj,
Thank you for your interest in our work. Unfortunately, the running time for spatial data is much longer since the spatial pattern is like a 2-D surface, which is much more complicated. We do have some ideas for boosting the simulation.

  1. Reduce the gene number. First, very sparse genes are very difficult to fit and you may set a threshold for gene expression. Second, genes with clear patterns are easier to converge. Therefore, if you would like to have DE/not DE setting, you probably only need to fit genes with high spatial variability as the true DE genes.
  2. Reduce the parameter k. Now k = 400, which is large. The fitting time is almost quadratic to k. You can try a smaller k, which will give you simpler spatial patterns - but the pattern is still there. Based on this idea, we also add a new parameter in function fit_marginal(edf_flexible = TRUE), which will automatically select the k.

Please let me know if it works for you. Thanks!

Best,
Dongyuan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants