API
This page describes the interface between different sections of the benchmarking pipeline. For more detail on the overall structure see the Pipeline page.
Each dataset script should produce a .h5ad file containing an AnnData object with the following structure:
- adata.X should contain raw counts
- adata.obs should include a column containing batch labels for each cell (with any name)
- adata.obs should include a column containing annotation labels for each cell (with any name)
Some basic quality control is performed by the preparation script but any proper filtering of cells should be performed here if the source provides unfiltered cells.
The saved .h5ad file can contain any other information (embeddings, other .obs columns etc.) but this will be removed during the preparation step.
The dataset preparation step produces two .h5ad files, one containing the reference subset and one containing the query subset (split according to the provided query batches).
These files contain ONLY the following:
- adata.X a sparse matrix containing raw counts
- adata.obs["Batch"] containing batch labels for each cell
- adata.obs["Label"] containing annotation labels for each cell
- adata.obs["Unseen"] containing unseen population labels for each cell
- adata.uns["Species"] containing the species of the dataset
This script also performs minimal filtering of the dataset, removing cells with fewer than 100 counts or fewer than 100 expressed features, and features with zero counts (in the reference). Labels with fewer than 20 cells are also removed from both the reference and the query. The query cannot contain labels not present in the reference unless they are explicitly marked as unseen populations. The output of this step is the input to both the feature selection methods and the integration steps.
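The filtering thresholds above can be sketched on a single dense count matrix as follows. This is an illustration only, not the pipeline's preparation script: the function name, the order of the steps (cells first, then labels, then features) and the use of one matrix instead of separate reference and query subsets are all assumptions.

```python
from collections import Counter

import numpy as np


def filter_cells_and_features(counts, labels, min_counts=100,
                              min_features=100, min_label_cells=20):
    """Return boolean masks (keep_cells, keep_features) for a dense
    cells x features count matrix. Hypothetical helper for illustration."""
    counts = np.asarray(counts)
    labels = np.asarray(labels)

    # Keep cells with at least min_counts total counts and at least
    # min_features expressed (non-zero) features.
    total_counts = counts.sum(axis=1)
    n_expressed = (counts > 0).sum(axis=1)
    keep_cells = (total_counts >= min_counts) & (n_expressed >= min_features)

    # Drop labels that have fewer than min_label_cells remaining cells.
    label_sizes = Counter(labels[keep_cells].tolist())
    keep_cells &= np.array(
        [label_sizes[label] >= min_label_cells for label in labels.tolist()]
    )

    # Drop features with zero counts across the retained cells.
    keep_features = counts[keep_cells].sum(axis=0) > 0
    return keep_cells, keep_features
```

With the default arguments the thresholds match the values stated above (100 counts, 100 expressed features, 20 cells per label); smaller values are convenient for trying the function on toy data.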
The method scripts take the reference AnnData from the preparation step and produce a TSV file with a column named "Feature" containing the names of the selected features.
Other columns containing information from the method are allowed but will not be used by later steps.
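A feature selection output of this shape can be written and read back with the standard library alone. In the sketch below, the helper names, the extra "Score" column and the gene names are hypothetical; only the "Feature" column comes from the description above.

```python
import csv
import io


def write_selected_features(features, handle, scores=None):
    """Write selected feature names as TSV with a "Feature" column and,
    optionally, an extra method-specific "Score" column."""
    fieldnames = ["Feature"] + (["Score"] if scores is not None else [])
    writer = csv.DictWriter(handle, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for i, feature in enumerate(features):
        row = {"Feature": feature}
        if scores is not None:
            row["Score"] = scores[i]
        writer.writerow(row)


def read_selected_features(handle):
    """Read back only the "Feature" column, ignoring any other columns."""
    return [row["Feature"] for row in csv.DictReader(handle, delimiter="\t")]


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_selected_features(["geneA", "geneC"], buffer, scores=[0.9, 0.4])
buffer.seek(0)
selected = read_selected_features(buffer)
```

Reading by column name means later steps are unaffected by whatever additional columns a method chooses to write.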
The scVI integration step takes the reference dataset from the preparation step and produces a directory containing the scVI model and a .h5ad file with the following structure:
- adata.obs["Batch"] containing batch labels for each cell
- adata.obs["Label"] containing annotation labels for each cell
- adata.obs["Unseen"] containing unseen population labels for each cell
- adata.obsm["X_emb"] containing the integrated embedding
Note that the integration output does not contain any expression data. This is to save disk space by not duplicating data. If a metric or another stage requires expression data it needs to accept both the integration output and the prepared dataset as input. PNG files showing plots of the unintegrated and integrated UMAPs coloured by batch and label are also produced.
The scANVI integration step takes the integrated scVI model and produces a directory containing the scANVI model and a .h5ad file with the following fields IN ADDITION to those from scVI:
- adata.obs["ReferenceLabel"] containing the labels used in training the scANVI model
The query mapping steps take the reference model from scVI or scANVI and produce a directory containing the corresponding query model and a .h5ad file with the same structure as the integration steps. NOTE that the embeddings here only contain the query data, NOT the reference.
Similar plots to the integration steps are also produced, with additional panels showing the dataset (reference/query) and unseen population label.
The label prediction step takes the output of the integration step, trains a classifier on the integrated embedding and predicts labels for the mapped dataset. The output is a TSV file with the following columns for each query cell:
- ID containing a unique cell ID
- Label containing the ground truth cell label
- Unseen containing unseen population labels for each cell
- PredLabel containing the predicted cell label
- MaxProb containing the probability for the predicted label
- Prob_{label} columns containing the probability for each label in the reference dataset
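To make the column layout concrete, the sketch below builds one output row from a classifier's per-label probabilities. The function name, the example labels and probabilities, and the use of a boolean for the Unseen value are assumptions made for illustration.

```python
def prediction_row(cell_id, true_label, unseen, probabilities):
    """Build one output row from a dict mapping each reference label
    to its predicted probability. Hypothetical helper for illustration."""
    # The predicted label is the one with the highest probability.
    pred_label = max(probabilities, key=probabilities.get)
    row = {
        "ID": cell_id,
        "Label": true_label,
        "Unseen": unseen,
        "PredLabel": pred_label,
        "MaxProb": probabilities[pred_label],
    }
    # One Prob_{label} column per label in the reference dataset.
    for label, prob in probabilities.items():
        row[f"Prob_{label}"] = prob
    return row


row = prediction_row(
    "cell1", "T cell", False,
    {"T cell": 0.7, "B cell": 0.2, "NK cell": 0.1},
)
```

Rows built this way share the same keys for a given reference, so they can be passed to csv.DictWriter with a tab delimiter to produce the TSV file.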
The metrics scripts take the .h5ad file produced by either the integration, mapping or label prediction steps (depending on the type of metric) and produce a TSV file with a SINGLE ROW and the following columns:
- Dataset containing the name of the dataset that was evaluated
- Method containing the name of the feature selection method that was evaluated
- Integration containing the name of the integration that was evaluated ("scVI" or "scANVI")
- Type containing the type of the metric (e.g. "Integration", "Classification" etc.)
- Metric containing the name of the metric
- Value containing the calculated metric score. If necessary, scores should be adjusted so that 1 is the best possible score and 0 is the worst possible score.
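The single-row output and the score adjustment can be sketched as below. The helper names, the linear rescaling and the example dataset, method and metric names are assumptions; a metric whose range is not bounded would need a different adjustment.

```python
import csv
import io

COLUMNS = ["Dataset", "Method", "Integration", "Type", "Metric", "Value"]


def rescale_score(raw, worst, best):
    """Linearly map a raw score so that `worst` becomes 0 and `best` becomes 1.
    Also works for metrics where lower is better (worst > best)."""
    return (raw - worst) / (best - worst)


def write_metric(handle, dataset, method, integration, metric_type, metric, value):
    """Write a header line followed by exactly one data row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerow([dataset, method, integration, metric_type, metric, value])


# Hypothetical metric with raw range [-1, 1], where 1 is best.
value = rescale_score(0.5, worst=-1.0, best=1.0)
buffer = io.StringIO()
write_metric(buffer, "exampleDataset", "exampleMethod", "scVI",
             "Integration", "exampleMetric", value)
```

Keeping one row per file means results from many dataset, method and metric combinations can be combined later by simple concatenation.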