Skip to content

Latest commit

 

History

History
100 lines (93 loc) · 4.13 KB

README.org

File metadata and controls

100 lines (93 loc) · 4.13 KB

EpiClust

Single cell module detection for ATAC-seq, multiomics, and peak-gene linking

To install

You may need to create a CONDA environment as specified in epiclust.yml.

After dependencies are met, this sequence of commands installs the library.

git clone [email protected]:KellisLab/epiclust.git
cd epiclust && pip install .

Preprocessing

Make sure your peaks are in an AnnData (H5AD) file.

Using a Gene accessiblity matrix (e.g. as estimated by epiclust_gene_estimation.py) helps aggregate nearby peaks better than by alone. This helps even with multiomic ATAC+GEX: use pseudo triple-omic (Peaks, Gene Accessiblity, Gene Expression) So, for combining these:

from sklearn.feature_extraction.text import TfidfTransformer
import anndata
import scanpy as sc
### Use TF-IDF (non-logTF) on peak matrix
peaks.X = TfidfTransformer().fit_transform(peaks.X)
### Normalize peak matrix to 10000 counts per cell to facilitate integration with gene accessibility
sc.pp.normalize_total(peaks, target_sum=10000)
### Normalize gene accessibility to 10000 counts per cell
sc.pp.normalize_total(gacc, target_sum=10000)
### Normalize gene expression (if multi-omic GEX+ATAC) to 10000 counts per cell
sc.pp.normalize_total(gexp, target_sum=10000)
### Concatenate modalities
adata = anndata.concat({
  "Peaks": peaks,
  "Gene Expression": gexp, ### if multi-omic. else omit
  "Gene Accessibility": gacc
}, axis=1, label="feature_types", merge="same")
### Rename accessibility genes if expression genes are already present
adata.var_names_make_unique("-GAcc-")
### Log-scale
sc.pp.log1p(adata)
### Calculate QC metrics
sc.pp.calculate_qc_metrics(adata, inplace=True, percent_top=[])
### PCA
sc.pp.pca(adata, n_comps=100)

multi-omic workflow (RECOMMENDED)

Note that the n_neighbors parameter can be changed, as n_neighbors is the per-batch number of neighbors

import epiclust as ec
import numpy as np
import seaborn as sns
for power in np.linspace(0, 1, 5):
	print("Calculating power:", power)
	ec.fit(adata, power=power, margin="log1p_total_counts", batch="feature_types")
	ec.neighbors(adata, n_neighbors=5)

graphs = ["pow_%.2f" % power for power in np.linspace(0, 1, 5)]
ec.filter_var(adata, graphs, min_cells=3)
ec.leiden(adata, graphs, resolution=1., max_comm_size=2500, min_comm_size=3)
ec.infomap(adata, graphs, min_comm_size=3)
ec.umap(adata, graphs)
ax = sns.scatterplot(x=adata.varm["X_umap"][:, 0], y=adata.varm["X_umap"][:, 1], hue=adata.var["leiden"])

Peak-Gene / Peak-Peak / Gene-Gene linking sample workflow

gtf = ec.gtf.load_gtf("gencode.gtf.gz")
dw = ec.gene_distance.distance_weight_all(ec.gene_distance.peak_names_to_var(adata.var.index.values), gtf)
dw = dw.loc[dw["gene"].isin(adata.var.index.values), :]
links = {}
for power in np.linspace(0, 1, 5):
	print("Calculating power:", power)
	ec.fit(adata, power=power, margin="log1p_total_counts", batch="feature_types",
	       squared_correlation=True,
               covariates=["batch", "pmi", "log1p_total_counts"]) ### technical covariates in .obs
	links[power] = ec.linking(adata, dw["gene"].values, dw["peak"].values)

### TODO combine table

single-omic workflow (DEPRECATED except for RNA-seq)

import epiclust as ec
for power in np.linspace(0, 1, 5):
	print("Calculating power:", power)
	ec.fit(adata, power=power, margin="log1p_total_counts")
	ec.neighbors(adata, key_added="pow_%.2f" % power)

ec.filter_var(adata, ["pow_%.2f" % power for power in np.linspace(0, 1, 5)], min_cells=3)
ec.leiden(adata, ["pow_%.2f" % power for power in np.linspace(0, 1, 5)], resolution=1., max_comm_size=2500, min_comm_size=3)
ec.infomap(adata, ["pow_%.2f" % power for power in np.linspace(0, 1, 5)], min_comm_size=3)

wishlist

  1. typing
  2. documentation
  3. translate to muon (multi-omic) and anndata (single-omic) framework
  4. .umap() without filtering giant matrix
  5. cleanup function that removes varm, uns ???
  6. batched gene_estimation for lower memory footprint
  7. iterative expansion to rest of features
  8. centrality-based module naming
  9. pygreat integration?