ViVAE (vee-vay) is a toolkit for single-cell data denoising and dimensionality reduction.
It is published together with ViScore, a framework for fair and scalable evaluation of dimensionality reduction. Check out the associated paper: Interpretable models for scRNA-seq data embedding with multi-scale structure preservation, where we describe and validate our methods in-depth.
- ViVAE achieves state-of-the-art multi-scale structure preservation.
- This is especially, but not exclusively, suitable for data with trajectories, outlier populations or suspected batch effects.
- Our embedding model implements encoder indicatrices: a tool to measure local distortions of latent space.
- We integrate ViVAE with FlowSOM for visualisation.
- The ViVAE model is parametric, enabling transfer learning and embedding of new points.
- ViVAE can take advantage of modern GPU architectures, especially for training on large datasets.
For most datasets, ViVAE can be run on a consumer laptop. Availability of a GPU is a significant boost.
To try out ViVAE without installing it locally, follow the tutorial in tutorials/example_scrnaseq.ipynb
to use ViVAE in Google Colab.
Python installation
ViVAE is a Python package based on PyTorch. We recommend creating a new Anaconda environment for it.
On Linux or macOS, use the command line for installation. On Windows, use Anaconda Prompt.
conda create --name ViVAE python=3.11.7 \
numpy==1.26.3 numba==0.59.0 pandas==2.2.0 matplotlib==3.8.2 scipy==1.12.0 pynndescent==0.5.11 scikit-learn==1.4.0 scanpy==1.9.8 pytorch==2.1.2
conda activate ViVAE
pip install git+https://github.com/saeyslab/FlowSOM_Python.git@80529c6b7a1747e8e71042102ac8762c3bfbaa1b
pip install --upgrade git+https://github.com/saeyslab/ViVAE.git
GPU acceleration is recommended if available. To verify whether PyTorch can use CUDA, activate your ViVAE environment and type:
python -c "import torch; print(torch.cuda.is_available())"
Alternatively, to verify whether PyTorch can use Metal (on AMD/Apple Silicon Macs):
python -c "import torch; print(torch.backends.mps.is_available())"
This will print either True
or False
.
R installation
We are working on an R implementation of ViVAE that is independent of PyTorch.
In the meantime, to install and run ViVAE in R using reticulate, use our R vignette (tutorials/example_r.Rmd
) (an RMarkdown file you can open in RStudio).
Our tutorials will help you start using ViVAE quickly, be it with scRNA-seq or cytometry data. The tutorials include data pre-processing, discuss the most important hyperparameters of ViVAE and touch on evaluation of embeddings using ViScore.
Using ViVAE with scRNA-seq data
ViVAE was primarily designed for, and tested with, single-cell transcriptomic datasets.
To get you started, we provide an example workflow for analysis of bone marrow single-cell transcriptomic data with ViVAE. We evaluate the separation of distincts immune cell lineages and general structure preservation by ViVAE, t-SNE and UMAP.
Additionally, we compute embedding errors by population and demonstrate the use of neighbourhood composition plots for explaining sources of error.
Advantages and potential pitfalls of smooth embeddings are described and discussed.
The tutorial is provided as a Jupyter notebook (tutorials/example_scrnaseq.ipynb
).
Using ViVAE with cytometry data
ViVAE, while intended mainly for scRNA-seq data, is straightforward to use with flow and mass cytometry data as well.
Its structure-preserving properties are especially advantageous if global structures are of interest. Additionally, ViVAE integrates with FlowSOM to provide a graph-based view of cytometry datasets.
We provide a Jupyter notebook tutorial (tutorials/example_cytometry.ipynb
) that covers importing and pre-processing of data, denoising, dimensionality reduction and some evaluation of the resulting embedding.
Our R installation vignette (tutorials/example_r.Rmd
) shows how to use ViVAE denoising and dimensionality reduction from R.
We also showcase some experimental modifications of the model that will mostly be interesting for developers of dimensionality reduction algorithms, below.
Interesting modifications of the ViVAE model
Some additional examples of modifications to the ViVAE model are provided:
-
PCA initialisation or general approximation of other DR models using imitation loss:
tutorials/imitation.ipynb
. -
Using stochastic-MDS loss with cosine distances in input space:
tutorials/cosine.ipynb
.
The associated manuscript presents case studies on various single-cell datasets.
These case studies are replicated using Jupyter notebooks in the case_studies
directory.
Breast immune cells transcriptome study (Reed)
case_study_reed.ipynb
provides code to reproduce the Reed dataset case study from our paper.
This dataset comes from the Human Breast Cell Atlas.
The authors provide labels for various leukocyte populations.
We compare ViVAE with t-SNE and UMAP and describe embedding errors per cell population using the Extended Neighbourhood-Proportion-Error (xNPE) and neighbourhood composition plots.
Developing zebrafish embryos transcriptome study (Farrell)
case_study_farrell.ipynb
provides code to reproduce the Farrell dataset case study from our paper.
This dataset contains cells from multiple stages of zebrafish embryo development.
The authors provide labels of distinct cell lineages.
We compare t-SNE, UMAP, a vanilla VAE, default ViVAE and ViVAE-EncoderOnly (a decoder-less model that implements parametric stochastic MDS with GPU acceleration). The analysis in our paper focuses on the differences between neighbour-embedding algorithms (which tend to form separate clusters) and multi-dimensional scaling algorithms (which produce more continuous represerntations). We use encoder indicatrices to describe different manners of latent space distortion by the three VAE-based models.
Mouse bone marrow CyTOF dataset study (Samusik)
case_study_samusik.ipynb
provides code to reproduce the Samusik dataset case study from our paper.
This is a popular reference dataset for showcasing dimensionality reduction and clustering tools.
The authors provide labels for various immune cell populations.
We use ViVAE to create a nice embedding of the data, then use FlowSOM for clustering (independent of the dimension reduction) and show a plot of the embedding with the FlowSOM minimum spanning tree (MST) superimposed.
To explore more options for evaluating cytometry data embeddings and integrating FlowSOM for informative visualisation, we refer you to the cytometry analysis tutorial in tutorials/example_cytometry.ipynb
.
In our paper, we compare ViVAE and other DR methods in terms of local and global structure preservation using ViScore. The ViScore repository contains our documented benchmarking set-up, which can be extended to other datasets and DR methods. This set-up includes full documentation to guide the user through the process of benchmarking or hyperparameter tuning on an HPC cluster from start to finish.