
Question about simulated data #1

Open
yhtgrace opened this issue Oct 29, 2020 · 2 comments

Comments

@yhtgrace commented Oct 29, 2020

Hi! Thanks for the cool software. I downloaded the example simulated data from processed_data/sim_MOB_pattern2_fc3_tau35_count_power1.csv, and when I ran SpatialDE and SPARK on it I got 0 (SpatialDE) and 112 (SPARK) significant genes at qval < 0.05. Is this (a single replicate of) the same data used for figure 1c, and is this the expected result? If I understand the plot correctly, more genes should be recovered at FDR = 0.05.

Code for SpatialDE

import numpy as np
import pandas as pd
import NaiveDE
import SpatialDE

# Load simulated counts (spots x genes) and keep genes detected in at least 3 spots
counts = pd.read_csv("../SPARK-Analysis/processed_data/sim_MOB_pattern2_fc3_tau35_count_power1.csv", index_col = 0)
counts = counts.T[(counts > 0).sum(0) >= 3].T

# Spot coordinates are encoded in the row names, separated by 'x'
x, y = zip(*[pos.split('x') for pos in counts.index])
sample_info = pd.DataFrame({
    'x': np.array(x).astype(float), 
    'y': np.array(y).astype(float), 
    'total_counts': counts.sum(1)
})

# Variance-stabilize the counts and regress out library size
norm_expr = NaiveDE.stabilize(counts.T).T
resid_expr = NaiveDE.regress_out(sample_info, norm_expr.T, 'np.log(total_counts)').T

# Run SpatialDE on the residual expression with the spot coordinates
X = sample_info[['x', 'y']]
results = SpatialDE.run(X, resid_expr)

(results.qval < 0.05).sum() # returns 0

Code for SPARK

library(SPARK)

# Load simulated counts (spots x genes); row names encode the spot coordinates
countdata <- read.csv("../SPARK-Analysis/processed_data/sim_MOB_pattern2_fc3_tau35_count_power1.csv", row.names = 1)

# Parse spot coordinates from the row names (separated by 'x')
rn <- row.names(countdata)
info <- cbind.data.frame(x = as.numeric(sapply(strsplit(rn, split = "x"), "[", 1)), 
                         y = as.numeric(sapply(strsplit(rn, split = "x"), "[", 2)))
rownames(info) <- row.names(countdata)

# Build the SPARK object (genes x spots) and use total counts per spot as library size
spark <- CreateSPARKObject(counts = t(countdata), location = info[,1:2], 
    percentage = 0.1, min_total_counts = 10)
spark@lib_size <- apply(spark@counts, 2, sum)

# Fit the count models and run the score tests
spark <- spark.vc(spark, covariates = NULL, lib_size = spark@lib_size, 
    num_core = 1, verbose = T, fit.maxiter = 500)
spark <- spark.test(spark, check_positive = T, verbose = T)

sum(spark@res_mtest$adjusted_pvalue < 0.05) # returns 102

@jakezhusph (Contributor)


Hi, this is not how the power is calculated in the simulation. In the simulation we know which genes are the true signal genes, so we compute the power directly by counting the numbers of true and false signals among the top-ranked genes. The power is essentially the number of true signals detected for a given number of false signals detected. For example, say we simulate 1,000 genes with 100 signal genes and 900 non-signal genes. You apply both methods to the data, order the genes by p-value, and then count how many of the top genes are signals and how many are non-signals. The power is how many true signals you recover given, say, one detected false signal.
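A minimal R sketch of this ranking-based power calculation (illustration only, not from the thread; the vectors pvals and is_signal are hypothetical placeholders for the per-gene p-values and the known simulated signal labels):

# Assumptions (hypothetical, for illustration only):
#   pvals     - numeric vector of p-values, one per simulated gene
#   is_signal - logical vector, TRUE for genes simulated as true signals,
#               in the same order as pvals
ranked_signal <- is_signal[order(pvals)]   # order genes from most to least significant

n_true  <- cumsum(ranked_signal)           # true signals accumulated down the ranking
n_false <- cumsum(!ranked_signal)          # false signals accumulated down the ranking

# Power given f allowed false signals: the number of true signals in the largest
# top-ranked list that still contains at most f false signals.
power_given_false <- function(f) max(c(0L, n_true[n_false <= f]))

power_given_false(1)                       # true signals recovered at one false signal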

The q-values or adjusted p-values are mainly for real data, where we don't know the ground truth.

Also, the example data is just a toy example; for the detailed simulation settings, you can check the supplementary material of our paper.

Let me know if you have any further questions.

@yhtgrace (Author)

Thanks for the clarification!
