Copula model underfitting #6

Open
RoseYuan opened this issue Mar 24, 2023 · 1 comment

Comments


RoseYuan commented Mar 24, 2023

Hello there,

I'm trying to use scDesign3 to simulate single-cell ATAC-seq data, and I like the results shown in Figures S6 and S7 of the paper. So far, though, I cannot reproduce them on another dataset, because the copula model always underfits. I have attached my code and some results below. Do you have any idea why the model is not working?

My code:
    simu <- scdesign3(
      sce = sce,
      assay_use = "counts",
      celltype = "cell_type",
      pseudotime = NULL,
      spatial = NULL,
      other_covariates = NULL,
      mu_formula = "cell_type",
      sigma_formula = "1",
      family_use = "zip",
      n_cores = 2,
      usebam = FALSE,
      corr_formula = "cell_type",
      copula = "gaussian",
      DT = TRUE,
      pseudo_obs = FALSE,
      return_model = TRUE,
      nonzerovar = FALSE
    )

My results:

  1. The BIC and AIC for the copula model are Inf
  2. The marginal BIC and AIC are also very large (aic.marginal = 4431524, bic.marginal = 4819040)
  3. The similarity of peak-peak correlation matrices between the real (training) data and the synthetic data is low (see below attachment).

selected.pdf
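One way to put a number on the similarity in point 3 is sketched below. This is only an illustration, not necessarily how the attached figure was produced: the helper name `cor_matrix_similarity` is hypothetical, and the choice of Pearson correlation between the upper triangles of the two peak-peak correlation matrices is an assumption.

```r
# Hypothetical helper: agreement between two feature-feature correlation
# matrices, measured on their upper triangles.
cor_matrix_similarity <- function(real_counts, synth_counts) {
  # Inputs are features x cells matrices; cor() correlates columns, so transpose.
  real_cor  <- cor(t(as.matrix(real_counts)))
  synth_cor <- cor(t(as.matrix(synth_counts)))
  upper <- upper.tri(real_cor)
  # Features with zero variance produce NA correlations; drop those pairs.
  cor(real_cor[upper], synth_cor[upper], use = "complete.obs")
}

# Toy check with made-up Poisson counts (not the dataset from this issue).
set.seed(1)
real <- matrix(rpois(20 * 200, lambda = 2), nrow = 20)
self_similarity <- cor_matrix_similarity(real, real)  # identical inputs give 1
```

For the run above, the two inputs would presumably be the `counts` assay of `sce` and the simulated count matrix returned in `simu` (assumed here to be `simu$new_count`).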

My training data:
They are two cell groups from a public sci-ATAC-seq atlas, from which I selected about 7000 peaks and 1000 cells.

Could you let me know how many peaks you usually use, or would recommend using, when simulating ATAC-seq data?

@JSB-UCLA

Hi Siyuan,
Thank you for your interest in our work! The correlation does look unsatisfying. In our study, we used 1133 peaks for ATAC and 3836 peaks for SCIATAC (see Table S2). Ideally, the number of features should be smaller than the number of cells (because of the curse of dimensionality in correlation estimation), but we cannot always guarantee this.
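The dimensionality problem can be seen in a small base-R sketch. It uses made-up Gaussian data with arbitrary sizes (50 cells, 100 features) and only illustrates the linear-algebra fact that a sample correlation matrix estimated from fewer cells than features is rank-deficient; it does not reproduce scDesign3's internals.

```r
set.seed(1)
n_cells <- 50      # arbitrary toy size
n_features <- 100  # more features than cells, as in the issue

# cells x features matrix of independent noise
x <- matrix(rnorm(n_cells * n_features), nrow = n_cells)

# 100 x 100 sample correlation matrix across features
r <- cor(x)

# After centering, at most n_cells - 1 directions carry variance, so the
# remaining eigenvalues are zero up to floating-point error.
eig <- eigen(r, symmetric = TRUE, only.values = TRUE)$values
n_nonzero <- sum(eig > 1e-8)
smallest <- min(eig)
```

Here `n_nonzero` cannot exceed `n_cells - 1`, so the estimated matrix is singular and a Gaussian log-likelihood that needs its inverse or log-determinant is not well defined.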

For your question:

  1. Since you use the Gaussian copula and your number of features is larger than your number of cells, AIC/BIC will always be Inf. Using the vine copula can give you finite AIC/BIC, but it will also be very slow for > 1000 features.
  2. The marginal AIC/BIC seems OK to me.
  3. Yes, it looks weird. Some things you may want to try: (a) Start from a smaller set of peaks and check the result (e.g., < 1000 peaks); (b) Try NB instead of ZIP.
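Suggestions (a) and (b) could be combined roughly as follows. This is a sketch under stated assumptions: `sce` is the object from the original post, the cutoff of 500 peaks is arbitrary, and ranking peaks by the fraction of cells in which they are open is just one possible selection rule, not a recommendation from the paper. All other arguments are copied from the call in the question.

```r
library(scDesign3)
library(SummarizedExperiment)

# Assumption: keep the 500 peaks that are open in the largest fraction of cells.
n_keep <- 500
counts_mat <- assay(sce, "counts")
open_frac <- Matrix::rowMeans(counts_mat > 0)
keep <- order(open_frac, decreasing = TRUE)[seq_len(min(n_keep, nrow(sce)))]
sce_small <- sce[keep, ]

# Same call as in the question, but on fewer peaks and with NB marginals.
simu_small <- scdesign3(
  sce = sce_small,
  assay_use = "counts",
  celltype = "cell_type",
  pseudotime = NULL,
  spatial = NULL,
  other_covariates = NULL,
  mu_formula = "cell_type",
  sigma_formula = "1",
  family_use = "nb",       # changed from "zip"
  n_cores = 2,
  usebam = FALSE,
  corr_formula = "cell_type",
  copula = "gaussian",
  DT = TRUE,
  pseudo_obs = FALSE,
  return_model = TRUE,
  nonzerovar = FALSE
)
```

With about 1000 cells split across two cell groups and `corr_formula = "cell_type"`, each group's correlation matrix is estimated from only that group's cells, so the peak count may need to stay below the size of the smaller group rather than below the total.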

Would you mind sharing your sce data with me (I guess it is OK since it comes from public data)? I can do a very quick check. My email is [email protected]

Best,
Dongyuan
