Single cell module detection for ATAC-seq, multiomics, and peak-gene linking
You may need to create a CONDA environment as specified in epiclust.yml.
After dependencies are met, this sequence of commands installs the library.
git clone [email protected]:KellisLab/epiclust.git
cd epiclust && pip install .
Make sure your peaks are in an AnnData (H5AD) file.
Using a Gene accessiblity matrix (e.g. as estimated by epiclust_gene_estimation.py) helps aggregate nearby peaks better than by alone. This helps even with multiomic ATAC+GEX: use pseudo triple-omic (Peaks, Gene Accessiblity, Gene Expression) So, for combining these:
from sklearn.feature_extraction.text import TfidfTransformer
import anndata
import scanpy as sc
### Use TF-IDF (non-logTF) on peak matrix
peaks.X = TfidfTransformer().fit_transform(peaks.X)
### Normalize peak matrix to 10000 counts per cell to facilitate integration with gene accessibility
sc.pp.normalize_total(peaks, target_sum=10000)
### Normalize gene accessibility to 10000 counts per cell
sc.pp.normalize_total(gacc, target_sum=10000)
### Normalize gene expression (if multi-omic GEX+ATAC) to 10000 counts per cell
sc.pp.normalize_total(gexp, target_sum=10000)
### Concatenate modalities
adata = anndata.concat({
"Peaks": peaks,
"Gene Expression": gexp, ### if multi-omic. else omit
"Gene Accessibility": gacc
}, axis=1, label="feature_types", merge="same")
### Rename accessibility genes if expression genes are already present
adata.var_names_make_unique("-GAcc-")
### Log-scale
sc.pp.log1p(adata)
### Calculate QC metrics
sc.pp.calculate_qc_metrics(adata, inplace=True, percent_top=[])
### PCA
sc.pp.pca(adata, n_comps=100)
Note that the n_neighbors parameter can be changed, as n_neighbors is the per-batch number of neighbors
import epiclust as ec
import numpy as np
import seaborn as sns
for power in np.linspace(0, 1, 5):
print("Calculating power:", power)
ec.fit(adata, power=power, margin="log1p_total_counts", batch="feature_types")
ec.neighbors(adata, n_neighbors=5)
graphs = ["pow_%.2f" % power for power in np.linspace(0, 1, 5)]
ec.filter_var(adata, graphs, min_cells=3)
ec.leiden(adata, graphs, resolution=1., max_comm_size=2500, min_comm_size=3)
ec.infomap(adata, graphs, min_comm_size=3)
ec.umap(adata, graphs)
ax = sns.scatterplot(x=adata.varm["X_umap"][:, 0], y=adata.varm["X_umap"][:, 1], hue=adata.var["leiden"])
gtf = ec.gtf.load_gtf("gencode.gtf.gz")
dw = ec.gene_distance.distance_weight_all(ec.gene_distance.peak_names_to_var(adata.var.index.values), gtf)
dw = dw.loc[dw["gene"].isin(adata.var.index.values), :]
links = {}
for power in np.linspace(0, 1, 5):
print("Calculating power:", power)
ec.fit(adata, power=power, margin="log1p_total_counts", batch="feature_types",
squared_correlation=True,
covariates=["batch", "pmi", "log1p_total_counts"]) ### technical covariates in .obs
links[power] = ec.linking(adata, dw["gene"].values, dw["peak"].values)
### TODO combine table
import epiclust as ec
for power in np.linspace(0, 1, 5):
print("Calculating power:", power)
ec.fit(adata, power=power, margin="log1p_total_counts")
ec.neighbors(adata, key_added="pow_%.2f" % power)
ec.filter_var(adata, ["pow_%.2f" % power for power in np.linspace(0, 1, 5)], min_cells=3)
ec.leiden(adata, ["pow_%.2f" % power for power in np.linspace(0, 1, 5)], resolution=1., max_comm_size=2500, min_comm_size=3)
ec.infomap(adata, ["pow_%.2f" % power for power in np.linspace(0, 1, 5)], min_comm_size=3)
- typing
- documentation
- translate to muon (multi-omic) and anndata (single-omic) framework
- .umap() without filtering giant matrix
- cleanup function that removes varm, uns ???
- batched gene_estimation for lower memory footprint
- iterative expansion to rest of features
- centrality-based module naming
- pygreat integration?