You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I was unable to recover any perturbations with significant effects using the e-distance and corresponding test in the Norman et al. dataset following paper methods and suggested baselines. Following from the preprint manuscript:
As a baseline for significant perturbations, as defined by the E-test, we suggest at least 300 cells per perturbation (Fig 5B) and 1000 average UMI counts per cell (Fig 5D) as an experimental guideline for distinguishable perturbations.
And additional details from methods section:
Cells were kept if they had a minimum of 1000 UMI counts, and genes with a minimum of 50 cells. 2000 highly variable genes were selected using scanpy.pp.find_variable_genes with flavor ‘seurat_v3’. We normalized the count matrix using scanpy.pp.normalize_total and log-transformed the data using scanpy.pp.log1p; We did not z-scale the data. Next, we computed PCA based on the highly variable genes. The E-distances were computed in that PCA space using 50 components and Euclidean distance. To avoid problems due to different numbers of cells per perturbation, we subsampled each dataset such that all perturbations had the same number of cells. We removed all perturbations with fewer than 50 cells and then subsampled to the number of cells in the smallest perturbation left after filtering.
With min_cells=300 (cell per pert), subsampling/filtering the perturbation conditions resulted in 153 conditions (including control) with 300 total cells. The control cells condition is also subsampled. Upon usage of the e-distance test, 0 passed adjusted significance. I also tried this with min_cells=50 (cell per pert), with the same result of 0 passing significance. This is in contrast to Supp Table 4.
Can you please clarify this discrepancy? Pointing to the code would help. Thanks.
Here is the code I used -
import scanpy as sc
from scperturb import *
def filter_and_subsample(adata, perturbation_key, min_cells=300):
# Remove perturbations with fewer than n cells
perturbation_counts = adata.obs[perturbation_key].value_counts()
valid_perturbations = perturbation_counts[perturbation_counts >= min_cells].index
adata = adata[adata.obs[perturbation_key].isin(valid_perturbations)]
# Subsample each perturbation to the number of cells in the smallest perturbation left
min_cells_in_perturbation = min(adata.obs[perturbation_key].value_counts())
subsampled_indices = []
for pert in valid_perturbations:
pert_indices = adata.obs[adata.obs[perturbation_key] == pert].sample(min_cells_in_perturbation, random_state=42).index
subsampled_indices.extend(pert_indices)
adata_subsampled = adata[subsampled_indices]
return adata_subsampled
adata = sc.read_h5ad("data/NormanWeissman2019_filtered.h5ad")
adata = adata[adata.X.sum(axis=1) >= 1000]
sc.pp.filter_genes(adata, min_cells=50)
# or use min_cells=50
adata = filter_and_subsample(adata, 'perturbation', min_cells=300)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat_v3')
adata = adata[:, adata.var['highly_variable']]
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.tl.pca(adata, n_comps=50)
estats = edist_to_control(adata, obs_key='perturbation', control='control',
obsm_key='X_pca', dist='sqeuclidean', )
df = etest(adata, obs_key='perturbation', obsm_key='X_pca', dist='sqeuclidean', control='control', alpha=0.05, runs=1000, n_jobs=-1)
df['significant_adj'].sum()
The text was updated successfully, but these errors were encountered:
There were some changes made when switching to the final Nature Methods version of the paper, most notably the introduction of a sample correction. Here's an access link for the final version of the paper/figure.
Let me know if this material isn't enough to help and I'll take a closer look at what you posted.
I was unable to recover any perturbations with significant effects using the e-distance and corresponding test in the Norman et al. dataset following paper methods and suggested baselines. Following from the preprint manuscript:
And additional details from methods section:
With min_cells=300 (cell per pert), subsampling/filtering the perturbation conditions resulted in 153 conditions (including control) with 300 total cells. The control cells condition is also subsampled. Upon usage of the e-distance test, 0 passed adjusted significance. I also tried this with min_cells=50 (cell per pert), with the same result of 0 passing significance. This is in contrast to Supp Table 4.
Can you please clarify this discrepancy? Pointing to the code would help. Thanks.
Here is the code I used -
The text was updated successfully, but these errors were encountered: