Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics (cell2location model)


If you use cell2location please cite our paper:

Kleshchevnikov, V., Shmatko, A., Dann, E. et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-021-01139-4 https://www.nature.com/articles/s41587-021-01139-4

Please note that cell2location requires 2 user-provided hyperparameters (N_cells_per_location and detection_alpha) - for detailed guidance on setting these hyperparameters and their impact see the flow diagram and the note. Many real datasets (especially human) show within-slide variability in RNA detection sensitivity, so you should try both recommended settings of the detection_alpha parameter: detection_alpha=200 for low within-slide technical variability and detection_alpha=20 for high within-slide technical variability.
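As a minimal sketch of trying both settings, the helper below builds one set of keyword arguments per recommended detection_alpha value. The names detection_alpha_grid and build_models are hypothetical, adata_vis and inf_aver stand for the spatial AnnData and the reference signatures as named in the tutorial, and the value 30 for N_cells_per_location is only a placeholder that has to be chosen per tissue. It assumes that setup_anndata has already been called on adata_vis.

```python
def detection_alpha_grid(n_cells_per_location):
    """Return one keyword-argument dict per recommended detection_alpha value."""
    return [
        {"N_cells_per_location": n_cells_per_location, "detection_alpha": alpha}
        for alpha in (200, 20)  # low vs. high within-slide technical variability
    ]


def build_models(adata_vis, inf_aver, n_cells_per_location):
    """Construct one Cell2location model per setting (requires cell2location)."""
    import cell2location  # imported lazily so the helper above runs without it

    return [
        cell2location.models.Cell2location(adata_vis, cell_state_df=inf_aver, **kwargs)
        for kwargs in detection_alpha_grid(n_cells_per_location)
    ]


# N_cells_per_location=30 is a placeholder value, not a recommendation
for kwargs in detection_alpha_grid(n_cells_per_location=30):
    print(kwargs)
```

Each model can then be trained and compared in the usual way; only the detection_alpha value differs between them.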

Cell2location is a principled Bayesian model that can resolve fine-grained cell types in spatial transcriptomic data and create comprehensive cellular maps of diverse tissues. Cell2location accounts for technical sources of variation and borrows statistical strength across locations, thereby enabling the integration of single cell and spatial transcriptomics with higher sensitivity and resolution than existing tools. This is achieved by estimating which combination of cell types in which cell abundance could have given the mRNA counts in the spatial data, while modelling technical effects (platform/technology effect, contaminating RNA, unexplained variance).
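The idea of explaining the mRNA counts at a location as a combination of cell type signatures can be illustrated with a toy example. This is NOT the cell2location model: it is a plain Poisson maximum-likelihood fit with multiplicative updates, without priors or technical effects, and all numbers and names (signatures, true_abundance, estimate_abundance) are made up for illustration.

```python
# Toy reference signatures: expected mRNA counts per cell for 3 genes (made-up numbers)
signatures = {
    "type_A": [10.0, 1.0, 2.0],
    "type_B": [1.0, 8.0, 2.0],
}
true_abundance = {"type_A": 3.0, "type_B": 5.0}


def expected_counts(abundance, signatures):
    """Expected counts per gene: sum over cell types of abundance * signature."""
    n_genes = len(next(iter(signatures.values())))
    return [
        sum(abundance[ct] * signatures[ct][g] for ct in signatures)
        for g in range(n_genes)
    ]


def estimate_abundance(counts, signatures, n_iter=200):
    """Poisson maximum-likelihood abundances via multiplicative (EM) updates."""
    abundance = {ct: 1.0 for ct in signatures}
    for _ in range(n_iter):
        mu = expected_counts(abundance, signatures)
        abundance = {
            ct: abundance[ct]
            * sum(sig[g] * counts[g] / mu[g] for g in range(len(counts)))
            / sum(sig)
            for ct, sig in signatures.items()
        }
    return abundance


# a noiseless "location" containing 3 cells of type_A and 5 cells of type_B
observed = expected_counts(true_abundance, signatures)  # [35.0, 43.0, 16.0]
estimated = estimate_abundance(observed, signatures)
print({ct: round(v, 3) for ct, v in estimated.items()})
```

In this noiseless setting the estimate recovers the abundances used to generate the counts; the actual model additionally handles noise, platform effects and contaminating RNA as described above.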

Overview of the spatial mapping approach and the workflow enabled by cell2location. From left to right: Single-cell RNA-seq and spatial transcriptomics profiles are generated from the same tissue (1). Cell2location takes scRNA-seq derived cell type reference signatures and spatial transcriptomics data as input (2, 3). The model then decomposes spatially resolved multi-cell RNA counts matrices into the reference signatures, thereby establishing a spatial mapping of cell types (4).

Usage and Tutorials

The tutorial covering the estimation of expression signatures of reference cell types, spatial mapping with cell2location and the downstream analysis can be found here and tried on Google Colab: https://cell2location.readthedocs.io/en/latest/

Please report bugs via https://github.com/BayraktarLab/cell2location/issues and ask any usage questions about cell2location, scvi-tools or Visium data on the scverse community Discourse.

The cell2location package is implemented in a general way (using https://pyro.ai/ and https://scvi-tools.org/) to support multiple related models for spatial mapping, estimation of reference cell type signatures and downstream analysis.

Installation

We suggest using a separate conda environment for installing cell2location.

Create conda environment and install cell2location package

conda create -y -n cell2loc_env python=3.10

conda activate cell2loc_env
pip install cell2location[tutorials]

Finally, to use this environment in jupyter notebook, add jupyter kernel for this environment:

conda activate cell2loc_env
python -m ipykernel install --user --name=cell2loc_env --display-name='Environment (cell2loc_env)'

If you do not have conda please install Miniconda first:

cd /path/to/software
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# use prefix /path/to/software/miniconda3

Before installing cell2location and its dependencies, you may need to make sure that the conda environment is fully isolated by telling Python NOT to use the user site when installing packages. Run this line before creating the conda environment and every time before activating it in a new terminal session:

export PYTHONNOUSERSITE="literallyanyletters"

Documentation and API details

User documentation is available at https://cell2location.readthedocs.io/en/latest/.

The cell2location architecture is designed to simplify building extended versions of the model that account for additional technical and biological information. We plan to provide a tutorial showing how to add new model classes, but please get in touch if you would like to contribute or build on top of our package.

Acknowledgements

We thank all paper authors for their contributions: Vitalii Kleshchevnikov, Artem Shmatko, Emma Dann, Alexander Aivazidis, Hamish W King, Tong Li, Artem Lomakin, Veronika Kedlian, Mika Sarkin Jain, Jun Sung Park, Lauma Ramona, Liz Tuck, Anna Arutyunyan, Roser Vento-Tormo, Moritz Gerstung, Louisa James, Oliver Stegle, Omer Ali Bayraktar

We also thank Pyro developers (Fritz Obermeyer, Martin Jankowiak), Krzysztof Polanski, Luz Garcia Alonso, Carlos Talavera-Lopez, Ni Huang for feedback on the package, Martin Prete for dockerising cell2location and other software support.

FAQ

See https://github.com/BayraktarLab/cell2location/discussions

Future development and experimental features

Future developments of cell2location are focused on 1) scalability to 100k-mln+ locations using amortised inference of cell abundance (same ideas as used in VAE), 2) extending cell2location to related spatial analysis tasks that require modification of the model (such as using cell type hierarchy information), and 3) incorporating features presented by more recently proposed methods (such as CAR spatial proximity modelling). We are also experimenting with Numpyro and JAX (https://github.com/vitkl/cell2location_numpyro).

Tips

Conda environment for A100 GPUs

export PYTHONNOUSERSITE="True"
conda create -y -n cell2location_cuda118_torch22 python=3.10
conda activate cell2location_cuda118_torch22

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip3 install scvi-tools==1.1.2

pip install git+https://github.com/BayraktarLab/cell2location.git#egg=cell2location[tutorials,dev]
python -m ipykernel install --user --name=cell2location_cuda118_torch22 --display-name='Environment (cell2location_cuda118_torch22)'

Issues with package version mismatches often originate from python user site rather than conda environment being used to install a subset of packages

Before installing cell2location and its dependencies, you may need to make sure that the conda environment is fully isolated by telling Python NOT to use the user site when installing packages. Run this line before creating the conda environment and every time before activating it in a new terminal session:

export PYTHONNOUSERSITE="True"
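To check whether the user site is still visible to the interpreter of the activated environment, something like the following sketch can be run (the function name user_site_report is hypothetical; it only uses the standard library site and sys modules):

```python
import site
import sys


def user_site_report():
    """Summarise whether the per-user site-packages directory is in use."""
    user_site = site.getusersitepackages()
    return {
        "user_site_dir": user_site,
        # False (or None) when the user site is disabled, e.g. via PYTHONNOUSERSITE
        "enabled": bool(site.ENABLE_USER_SITE),
        # True when packages from the user site can shadow the environment
        "on_sys_path": user_site in sys.path,
    }


report = user_site_report()
print(report)
if report["enabled"] or report["on_sys_path"]:
    print("User site is active: packages installed there may override the conda environment.")
```

If the user site is reported as active, set PYTHONNOUSERSITE as shown above and start a new Python session.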

Useful code for reading and combining multiple Visium sections

Keeping info on distinct sections in a csv file (Google Sheet).

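import os
from glob import glob
from re import sub

import pandas as pd

# sp_data_folder is the user-defined path to the folder containing the Visium outputs
sample_annot = pd.read_csv('./sample_annot.csv')

# match each Sample_ID to its output folder by stripping the folder name prefix and suffix
folders = glob(f'{sp_data_folder}*')
sample_annot['path'] = pd.Series(
    folders,
    index=[sub('^.+WTSI_', '', sub('_GRCh38-2020-A$', '', i)) for i in folders]
)[sample_annot['Sample_ID']].values
sample_annot['file'] = [os.path.basename(i) for i in sample_annot['path']]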

sample_annot['Sample_ID'].unique()

Reading and concatenating samples.

import numpy as np
import scanpy as sc

def read_and_qc(sample_name, file, path=sp_data_folder,
                count_file='filtered_feature_bc_matrix.h5'):
    """
    Read one Visium file and add minimum metadata and QC metrics to adata.obs
    NOTE: var_names is ENSEMBL ID as it should be, you can always plot with sc.pl.scatter(gene_symbols='SYMBOL')
    """
    
    adata = sc.read_visium(path + str(file) +'/',
                           count_file=count_file,
                           load_images=True)
    adata.obs['sample'] = sample_name
    adata.var['SYMBOL'] = adata.var_names
    adata.var.rename(columns={'gene_ids': 'ENSEMBL'}, inplace=True)
    adata.var_names = adata.var['ENSEMBL']
    adata.var.drop(columns='ENSEMBL', inplace=True)
    
    # just in case there are non-unique ENSEMBL IDs
    adata.var_names_make_unique()

    # Calculate QC metrics
    sc.pp.calculate_qc_metrics(adata, inplace=True)
    # mitochondrial gene symbols start with 'MT-' in human and 'mt-' in mouse
    adata.var['mt'] = [gene.upper().startswith('MT-') for gene in adata.var['SYMBOL']]
    adata.obs['mt_frac'] = np.asarray(adata[:, adata.var['mt'].tolist()].X.sum(1)).squeeze()/adata.obs['total_counts']
    
    # add sample name to obs names
    adata.obs["sample"] = [str(i) for i in adata.obs['sample']]
    adata.obs_names = 's' + adata.obs["sample"] \
                          + '_' + adata.obs_names
    adata.obs.index.name = 'spot_id'
    
    # store the image data under the sample name rather than the library ID
    file = list(adata.uns['spatial'].keys())[0]
    adata.uns['spatial'][sample_name] = adata.uns['spatial'][file].copy()
    del adata.uns['spatial'][file]
    print(adata.uns['spatial'].keys())
    
    return adata

def read_all_and_qc(
    sample_annot, Sample_ID_col, file_col, sp_data_folder, 
    count_file='filtered_feature_bc_matrix.h5',
):
    """
    Read and concatenate all Visium files.
    """
    # work on a copy so that the caller's table is not modified between calls
    sample_annot = sample_annot.copy()
    sample_ids = sample_annot[Sample_ID_col].tolist()
    files = sample_annot[file_col].tolist()

    # read first sample
    adata = read_and_qc(
        sample_ids[0], files[0], 
        path=sp_data_folder, count_file=count_file
    ) 

    # read the remaining samples, pairing each sample ID with its own file
    slides = {}
    for s, f in zip(sample_ids[1:], files[1:]):
        slides[str(s)] = read_and_qc(s, f, path=sp_data_folder, count_file=count_file)

    # combine individual samples
    adata = adata.concatenate(
        list(slides.values()),
        batch_key="sample",
        uns_merge="unique",
        batch_categories=sample_ids, 
        index_unique=None
    )

    # add the per-sample annotation columns to every spot
    sample_annot.index = sample_annot[Sample_ID_col]
    for c in sample_annot.columns:
        sample_annot[c] = sample_annot[c].astype(str)
    adata.obs[sample_annot.columns] = sample_annot.reindex(index=adata.obs['sample']).values
    
    return adata
    
adata = read_all_and_qc(
    sample_annot=sample_annot, 
    Sample_ID_col='Sample_ID', 
    file_col='file', 
    sp_data_folder=sp_data_folder, 
    count_file='filtered_feature_bc_matrix.h5',
)

adata_incl_nontissue = read_all_and_qc(
    sample_annot=sample_annot, 
    Sample_ID_col='Sample_ID', 
    file_col='file', 
    sp_data_folder=sp_data_folder, 
    count_file='raw_feature_bc_matrix.h5',
)

Since anndata 0.9.0 (released on 2023-04-11), AnnData.concatenate() has been deprecated in favour of anndata.concat(), as per the official release notes. Here is the updated code snippet of read_all_and_qc:

from anndata import concat

def read_all_and_qc(
    sample_annot, Sample_ID_col, file_col, sp_data_folder, 
    count_file='filtered_feature_bc_matrix.h5',
):
    """
    Read and concatenate all Visium files.
    """

    # read all samples and store them in a list
    adatas = []
    for i, s in enumerate(sample_annot[Sample_ID_col]):
        adata_i = read_and_qc(s, sample_annot[file_col][i], path=sp_data_folder) 
        adatas.append(adata_i)
    # combine individual samples
    adata = concat(
        adatas,
        merge="unique",
        uns_merge="unique",
        label="batch",
        keys=sample_annot[Sample_ID_col].tolist(), 
        index_unique=None
    )

    sample_annot.index = sample_annot[Sample_ID_col]
    for c in sample_annot.columns:
        sample_annot.loc[:, c] = sample_annot[c].astype(str)
    adata.obs[sample_annot.columns] = sample_annot.reindex(index=adata.obs['sample']).values

    return adata

adata = read_all_and_qc(
    sample_annot=sample_annot, 
    Sample_ID_col='Sample_ID', 
    file_col='file', 
    sp_data_folder=sp_data_folder, 
    count_file='filtered_feature_bc_matrix.h5',
)

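# adata_vis refers to the concatenated Visium object (named adata above);
# "batch" is the obs column created by concat(label="batch")
cell2location.models.Cell2location.setup_anndata(
    adata=adata_vis,
    batch_key="batch")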