Long runtime for simulating spatial data #22

yollct · 2023-12-12T11:56:57Z

Hi scDesign3 developers,

Thank you for maintaining the amazing tool!
When I was simulating the data from https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-2-sagittal-anterior-1-standard-1-1-0, scDesign3 seems to stuck in 'Marginal Fitting' for a long time. Doesn't matter if I just simulate for 10 genes, it stucks at Marginal Fitting. I will very much appreciate your help!! Thanks :)

Best,
Chit Tong

Here is the code I used:

print("read reference data")
brain <- Load10X_Spatial(data.dir="/anterior_data/",
                        filename="filtered_feature_bc_matrix.h5",
                        assay="Spatial",
                        slice="slice1"
)
brain <- SCTransform(brain, assay = "Spatial", verbose = FALSE)
brain <- RunPCA(brain, assay = "SCT", verbose = FALSE)
brain <- FindNeighbors(brain, reduction = "pca", dims = 1:30)
brain <- FindClusters(brain, verbose = FALSE)
brain <- RunUMAP(brain, reduction = "pca", dims = 1:30)
SpatialFeaturePlot(brain, features = "nCount_Spatial") + theme(legend.position = "right")

brain[['spatial1']] <- GetTissueCoordinates(brain)$imagerow 
brain[['spatial2']] <- GetTissueCoordinates(brain)$imagecol
brain.sce <- as.SingleCellExperiment(brain)

brain.sce <- brain.sce[1:10]

print("simulation")
set.seed(123)

example_simu <- scdesign3(
    sce = brain.sce,
    assay_use = "counts",
    celltype = "ident", ##ident or cell_type
    pseudotime = NULL,
    spatial = c("spatial1", "spatial2"),
    other_covariates = NULL,
    mu_formula = "s(spatial1, spatial2, bs = 'gp', k= 400)",
    sigma_formula = "1",
    family_use = "nb",
    n_cores = 30,
    usebam = FALSE,
    corr_formula = "1",
    copula = "gaussian",
    DT = TRUE,
    pseudo_obs = FALSE,
    return_model = FALSE,
    nonzerovar = FALSE,
    parallelization = "pbmcapply"
  )

The text was updated successfully, but these errors were encountered:

yollct · 2023-12-12T14:13:41Z

Update, when I run for only 10 genes, it works. It is only taking longer than I expected. For the full dataset, it was running for four days and stopped by our server due to the time limit.

SONGDONGYUAN1994 · 2023-12-14T16:56:41Z

Hi,
Sorry for the delay in response. Unfortunately the estimation of spatial data can be time consuming. Two solutions you can try:

Set k = 100 instead of 400. The time increases almost quadratically with k. Smaller k may result in slight under-fit but usually 100 is large enough.
Set usebam = TRUE. It will use a faster estimation.

Please let me know if it works for you!

Best,
Dongyuan

lucas02061 · 2024-02-12T08:43:55Z

Hi,

When I try to run the tutorial code with the changes you mentioned I get the following warning?

Warning in estimate.theta(theta, family, y, mu, scale = scale1, wt = G$w, :
step failure in theta estimation

Is this a problem?

Best,
Lucas

SONGDONGYUAN1994 · 2024-02-12T15:20:06Z

@lucas02061 Hi Lucas,
This is an issue with bam: https://stats.stackexchange.com/questions/401278/step-failure-in-theta-estimation-using-bam-in-mgcv
Based on my experience, the function does not converge, but usually, it will not be too bad. I think you are probably OK with the result.

Best,
Dongyuan

lucas02061 · 2024-02-19T15:52:29Z

Hi @SONGDONGYUAN1994,

Are there any plans of adding a progress bar to scDesign3? The Marginal Fitting step is very slow on spatial transcriptomics data even with the settings you mentioned earlier. It is hard to know if it is unfeasible to do a run on a given dataset or it is almost done running without a progress bar.

Best,
Lucas

SONGDONGYUAN1994 · 2024-02-19T16:00:16Z

@lucas02061 Thanks for raising this problem. Are you still using Windows right now? On Linux, you can use pbmcapply, but on Windows, I am not sure (there should be one); @qw130 Qingyang, do you have any idea?

Unfortunately, I am traveling in the next two weeks but I will begin on improving the speed after that. Could you first subsample ~ 100 genes and tell me the running time? Thanks!

Best,
Dongyuan

lucas02061 · 2024-02-19T19:46:41Z

For only a 100 genes using the following settings it takes me approximatly15 minutes to run. Once I start going higher on genes though it starts taking many hours.

example_simu <- scdesign3(
sce = SCD_Input,
assay_use = "counts",
celltype = "cell_type",
pseudotime = NULL,
spatial = c("X", "Y"),
other_covariates = NULL,
mu_formula = "s(X, Y, bs = 'gp', k= 100)", # should be 100-400
sigma_formula = "1",
family_use = "nb",
n_cores = 2,
usebam = TRUE,
corr_formula = "1",
copula = "gaussian",
DT = TRUE,
pseudo_obs = FALSE,
return_model = FALSE,
nonzerovar = FALSE,
parallelization = "bpmapply"
)

Thanks for your help.

Best,
Lucas

lucas02061 · 2024-03-20T07:45:05Z

Hi @SONGDONGYUAN1994,

I tried running it on our server (which is quite heavy) for 24 hours and it is still stuck on marginal fitting. It still seems incredibly slow with the settings provided in this thread.

I used the following dataset:
class: SingleCellExperiment
dim: 18078 684
metadata(0):
assays(1): counts
rownames(18078): Xkr4 Sox17 ... CAAA01118383.1 CAAA01147332.1
rowData names(1): gene_id
colnames(684): AAAGACCCAAGTCGCG-1 AAAGGGATGTAGCAAG-1 ... TTGTGGCCCTGACAGT-1 TTGTTCAGTGTGCTAC-1
colData names(3): X Y cell_type
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
class: SingleCellExperiment
dim: 18078 684
metadata(0):
assays(1): counts
rownames(18078): Xkr4 Sox17 ... CAAA01118383.1 CAAA01147332.1
rowData names(1): gene_id
colnames(684): AAAGACCCAAGTCGCG-1 AAAGGGATGTAGCAAG-1 ... TTGTGGCCCTGACAGT-1 TTGTTCAGTGTGCTAC-1
colData names(3): X Y cell_type
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

And I used the following settings.
example_simu <- scdesign3(
sce = SCD_Input,
assay_use = "counts",
celltype = "cell_type",
pseudotime = NULL,
spatial = c("X", "Y"),
other_covariates = NULL,
mu_formula = "s(X, Y, bs = 'gp', k= 100)", # should be 100-400
sigma_formula = "1",
family_use = "nb",
n_cores = 2,
usebam = TRUE,
corr_formula = "1",
copula = "gaussian",
DT = TRUE,
pseudo_obs = FALSE,
return_model = FALSE,
nonzerovar = FALSE,
parallelization = "pbmcapply",
BPPARAM=NULL
)

Am I doing something wrong?

Best regards,
Lucas

SONGDONGYUAN1994 · 2024-03-20T19:45:04Z

Hi Lucas,
@lucas02061 I feel that it is probably caused by some very sparse genes; models on those genes might be hard to converge. That is not your fault - could you check if the running time is related to this? I can fix it later. @qw130 Hi Qingyang, could you help Lucas on this? Sorry that I am traveling next week.

fbao-fudan · 2024-07-22T16:50:45Z

Dear Dongyuan,

I also have the same issue in fitting spatial data. Can I feed in 10 genes each time, and repeat the fitting multiple times in independent runs to get the result?

Best,
Feng

SONGDONGYUAN1994 · 2024-07-23T01:55:34Z

Hi Feng,
If you ignore the gene-gene correlation structure (or you believe that all variations should be explained by spatial coordinates), this is totally OK. It would be better that you can first divide genes into gene modules/groups then do this fitting rather than randomly split genes into 10-gene groups.

fbao-fudan · 2024-07-23T04:34:43Z

Thank you. That is a great point. I will use gene module groups instead.

TotallyRealHolm · 2024-09-10T11:36:12Z

Hi developers, I want to add scDesign3 to a benchmarking paper that my research group is working on, but I also have trouble with running the tool for spatial data on a decently beefy linux server, where the server gets stuck on marginal fitting. It took approximately one hour and 24 minutes to run on the dataset from the tutorial for spatial data, is it expected for the tool to take a long time running on the this dataset?

I make use of the following parameters with multiple cores to speed it up.
example_simu <- scdesign3( sce = SCD_Input, assay_use = "counts", celltype = "cell_type", pseudotime = NULL, spatial = c("X", "Y"), other_covariates = NULL, mu_formula = "s(X, Y, bs = 'gp', k= 400)", sigma_formula = "1", family_use = "nb", n_cores = 16, usebam = FALSE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE, parallelization = "pbmcapply" ))

Best Regards Nikolaj.

SONGDONGYUAN1994 · 2024-09-12T19:10:24Z

@TotallyRealHolm
Dear Nikolaj,
Thank you for your interest in our work. Unfortunately, the running time for spatial data is much longer since the spatial pattern is like a 2-D surface, which is much more complicated. We do have some ideas for boosting the simulation.

Reduce the gene number. First, very sparse genes are very difficult to fit and you may set a threshold for gene expression. Second, genes with clear patterns are easier to converge. Therefore, if you would like to have DE/not DE setting, you probably only need to fit genes with high spatial variability as the true DE genes.
Reduce the parameter k. Now k = 400, which is large. The fitting time is almost quadratic to k. You can try a smaller k, which will give you simpler spatial patterns - but the pattern is still there. Based on this idea, we also add a new parameter in function fit_marginal(edf_flexible = TRUE), which will automatically select the k.

Please let me know if it works for you. Thanks!

Best,
Dongyuan

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Long runtime for simulating spatial data #22

Long runtime for simulating spatial data #22

yollct commented Dec 12, 2023

yollct commented Dec 12, 2023

Uh oh!

SONGDONGYUAN1994 commented Dec 14, 2023

Uh oh!

lucas02061 commented Feb 12, 2024

Uh oh!

SONGDONGYUAN1994 commented Feb 12, 2024

Uh oh!

lucas02061 commented Feb 19, 2024

Uh oh!

SONGDONGYUAN1994 commented Feb 19, 2024

Uh oh!

lucas02061 commented Feb 19, 2024

Uh oh!

lucas02061 commented Mar 20, 2024

Uh oh!

SONGDONGYUAN1994 commented Mar 20, 2024

Uh oh!

fbao-fudan commented Jul 22, 2024

Uh oh!

SONGDONGYUAN1994 commented Jul 23, 2024

Uh oh!

fbao-fudan commented Jul 23, 2024

Uh oh!

TotallyRealHolm commented Sep 10, 2024

Uh oh!

SONGDONGYUAN1994 commented Sep 12, 2024

Uh oh!

Long runtime for simulating spatial data #22

Long runtime for simulating spatial data #22

Comments

yollct commented Dec 12, 2023

yollct commented Dec 12, 2023

Uh oh!

SONGDONGYUAN1994 commented Dec 14, 2023

Uh oh!

lucas02061 commented Feb 12, 2024

Uh oh!

SONGDONGYUAN1994 commented Feb 12, 2024

Uh oh!

lucas02061 commented Feb 19, 2024

Uh oh!

SONGDONGYUAN1994 commented Feb 19, 2024

Uh oh!

lucas02061 commented Feb 19, 2024

Uh oh!

lucas02061 commented Mar 20, 2024

Uh oh!

SONGDONGYUAN1994 commented Mar 20, 2024

Uh oh!

fbao-fudan commented Jul 22, 2024

Uh oh!

SONGDONGYUAN1994 commented Jul 23, 2024

Uh oh!

fbao-fudan commented Jul 23, 2024

Uh oh!

TotallyRealHolm commented Sep 10, 2024

Uh oh!

SONGDONGYUAN1994 commented Sep 12, 2024

Uh oh!