Label projection

Automated cell type annotation from rich, labeled reference data

Repository: openproblems-bio/task_label_projection

Description

A major challenge in integrating single-cell datasets is assigning matching cell type annotations to each cell. One of the most common annotation strategies is “cluster-then-annotate”: cells are grouped into clusters based on feature similarity, and each cluster is then characterized manually using differential gene expression or previously identified marker genes. More recent methods build on this strategy and annotate cells using known marker genes. However, both approaches make it difficult to integrate atlas-scale datasets, because the resulting annotations may not match across datasets.

To ensure that the cell type labels in newly generated datasets match existing reference datasets, some methods align cells to a previously annotated reference dataset and then project labels from the reference to the new dataset.

Here, we compare methods for annotation based on a reference dataset. Each dataset consists of two or more samples of single-cell profiles that have been manually annotated with matching labels. The dataset is split into training and test batches, and the task of each method is to train a cell type classifier on the training set and project those labels onto the test set.

Authors & contributors

name	roles
Nikolay Markov	author, maintainer
Scott Gigante	author
Robrecht Cannoodt	author

API

flowchart LR
  file_common_dataset("Common Dataset")
  comp_process_dataset[/"Data processor"/]
  file_solution("Solution")
  file_test("Test data")
  file_train("Training data")
  comp_control_method[/"Control method"/]
  comp_metric[/"Metric"/]
  comp_method[/"Method"/]
  file_prediction("Prediction")
  file_score("Score")
  file_common_dataset---comp_process_dataset
  comp_process_dataset-->file_solution
  comp_process_dataset-->file_test
  comp_process_dataset-->file_train
  file_solution---comp_control_method
  file_solution---comp_metric
  file_test---comp_control_method
  file_test---comp_method
  file_train---comp_control_method
  file_train---comp_method
  comp_control_method-->file_prediction
  comp_metric-->file_score
  comp_method-->file_prediction
  file_prediction---comp_metric

File format: Common Dataset

A subset of the common dataset.

Example file: resources_test/common/pancreas/dataset.h5ad

Format:

AnnData object
 obs: 'cell_type', 'batch'
 var: 'hvg', 'hvg_score'
 obsm: 'X_pca'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id'

Data structure:

Slot	Type	Description
`obs["cell_type"]`	`string`	Cell type information.
`obs["batch"]`	`string`	Batch information.
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A ranking of the features by hvg.
`obsm["X_pca"]`	`double`	The resulting PCA embedding.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) Bibtex reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.
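
The slot listing above can be read as a small schema. The following is a minimal sketch of checking a file against it, assuming the slot names have already been collected into a plain dictionary. The names `REQUIRED_SLOTS`, `OPTIONAL_SLOTS` and `missing_slots` are hypothetical and are not part of the task repository.

```python
from typing import Dict, Iterable, List

# Slots listed in the table above. Entries marked "(Optional)" there are kept
# separately so that their absence is not reported as an error.
REQUIRED_SLOTS = {
    "obs": {"cell_type", "batch"},
    "var": {"hvg", "hvg_score"},
    "obsm": {"X_pca"},
    "layers": {"counts", "normalized"},
    "uns": {
        "dataset_id",
        "dataset_name",
        "dataset_summary",
        "dataset_description",
        "normalization_id",
    },
}
OPTIONAL_SLOTS = {"uns": {"dataset_url", "dataset_reference", "dataset_organism"}}


def missing_slots(present: Dict[str, Iterable[str]]) -> List[str]:
    """Return the required slots that are absent, formatted as in the table."""
    missing = []
    for group, names in REQUIRED_SLOTS.items():
        found = set(present.get(group, ()))
        missing.extend(f'{group}["{name}"]' for name in sorted(names - found))
    return missing
```

For a file loaded with the `anndata` package, the `present` dictionary could be filled from `adata.obs.columns`, `adata.var.columns`, `adata.obsm.keys()`, `adata.layers.keys()` and `adata.uns.keys()`.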

Component type: Data processor

A label projection dataset processor.

Arguments:

Name	Type	Description
`--input`	`file`	A subset of the common dataset.
`--output_train`	`file`	(Output) The training data.
`--output_test`	`file`	(Output) The test data (without labels).
`--output_solution`	`file`	(Output) The solution for the test data.
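
To illustrate the relationship between the three outputs, here is a minimal sketch of a batch-based split on plain Python lists. It is not the repository's implementation, which works on AnnData files. The function name `split_by_batch`, the random choice of held-out batches, and the `cell` index field are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, List]


def split_by_batch(
    batches: Sequence[str],
    labels: Sequence[str],
    n_test_batches: int = 1,
    seed: int = 0,
) -> Tuple[Record, Record, Record]:
    """Hold out whole batches as test data and keep their labels as the solution."""
    unique_batches = sorted(set(batches))
    if not 0 < n_test_batches < len(unique_batches):
        raise ValueError("need at least one training batch and one test batch")
    # Assumption: held-out batches are picked at random with a fixed seed.
    test_batches = set(random.Random(seed).sample(unique_batches, n_test_batches))

    train: Record = {"cell": [], "batch": [], "label": []}
    test: Record = {"cell": [], "batch": []}  # labels are withheld on purpose
    solution: Record = {"cell": [], "batch": [], "label": []}
    for cell, (batch, label) in enumerate(zip(batches, labels)):
        if batch in test_batches:
            test["cell"].append(cell)
            test["batch"].append(batch)
            solution["cell"].append(cell)
            solution["batch"].append(batch)
            solution["label"].append(label)
        else:
            train["cell"].append(cell)
            train["batch"].append(batch)
            train["label"].append(label)
    return train, test, solution
```

The test and solution records describe the same cells in the same order. Only the solution carries the labels, which is what lets a metric compare predictions against ground truth later.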

File format: Solution

The solution for the test data.

Example file: resources_test/task_label_projection/pancreas/solution.h5ad

Format:

AnnData object
 obs: 'label', 'batch'
 var: 'hvg', 'hvg_score'
 obsm: 'X_pca'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id'

Data structure:

Slot	Type	Description
`obs["label"]`	`string`	Ground truth cell type labels.
`obs["batch"]`	`string`	Batch information.
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A ranking of the features by hvg.
`obsm["X_pca"]`	`double`	The resulting PCA embedding.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized counts.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) Bibtex reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.

File format: Test data

The test data (without labels).

Example file: resources_test/task_label_projection/pancreas/test.h5ad

Format:

AnnData object
 obs: 'batch'
 var: 'hvg', 'hvg_score'
 obsm: 'X_pca'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'normalization_id'

Data structure:

Slot	Type	Description
`obs["batch"]`	`string`	Batch information.
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A ranking of the features by hvg.
`obsm["X_pca"]`	`double`	The resulting PCA embedding.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized counts.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.

File format: Training data

The training data.

Example file: resources_test/task_label_projection/pancreas/train.h5ad

Format:

AnnData object
 obs: 'label', 'batch'
 var: 'hvg', 'hvg_score'
 obsm: 'X_pca'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'normalization_id'

Data structure:

Slot	Type	Description
`obs["label"]`	`string`	Ground truth cell type labels.
`obs["batch"]`	`string`	Batch information.
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A ranking of the features by hvg.
`obsm["X_pca"]`	`double`	The resulting PCA embedding.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized counts.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.

Component type: Control method

Quality control methods for verifying the pipeline.

Arguments:

Name	Type	Description
`--input_train`	`file`	The training data.
`--input_test`	`file`	The test data (without labels).
`--input_solution`	`file`	The solution for the test data.
`--output`	`file`	(Output) The prediction file.
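
Unlike a regular method, a control method also receives the solution. One plausible use, sketched below on plain lists, is to produce a trivial baseline and a perfect prediction so that metric scores can be sanity-checked. The function names `majority_vote` and `true_labels` are illustrative assumptions, not a description of the controls shipped in the repository.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(train_labels: Sequence[str], n_test: int) -> List[str]:
    """Negative control: predict the most frequent training label for every test cell."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_test


def true_labels(solution_labels: Sequence[str]) -> List[str]:
    """Positive control: return the ground-truth labels unchanged."""
    return list(solution_labels)
```

A metric would be expected to give its best possible score to the output of `true_labels` and a low score to the output of `majority_vote` on a dataset with several balanced cell types.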

Component type: Metric

A label projection metric.

Arguments:

Name	Type	Description
`--input_solution`	`file`	The solution for the test data.
`--input_prediction`	`file`	The prediction file.
`--output`	`file`	(Output) Metric score file.
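
As an illustration of what a metric component computes from the solution and the prediction, the sketch below implements accuracy and macro-averaged F1 on plain label lists. The choice of these two metrics is an assumption made for the example. The source only states that a metric compares a prediction with the solution.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of test cells whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores over all observed classes."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Every class occurs in y_true or y_pred, so the denominator is positive.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)
```

Macro averaging weights rare and common cell types equally, whereas accuracy is dominated by the most abundant types.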

Component type: Method

A label projection method.

Arguments:

Name	Type	Description
`--input_train`	`file`	The training data.
`--input_test`	`file`	The test data (without labels).
`--output`	`file`	(Output) The prediction file.
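
To make the method contract concrete, here is a minimal sketch of one possible classifier: a nearest-centroid rule applied to embedding coordinates such as those stored in `obsm["X_pca"]`. It uses plain Python sequences instead of AnnData, and the functions `fit_centroids` and `predict` are hypothetical names. It is an example of a method that fits the interface, not one of the methods in the repository.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(
    embeddings: Sequence[Vector], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Average the training embeddings of each label into one centroid."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total] for label, total in sums.items()
    }


def predict(
    centroids: Dict[str, List[float]], embeddings: Sequence[Vector]
) -> List[str]:
    """Assign each test cell the label of its closest centroid (Euclidean distance)."""
    return [
        min(centroids, key=lambda label: math.dist(centroids[label], vec))
        for vec in embeddings
    ]
```

The list returned by `predict` has one entry per test cell, in the order of the test data, which is the shape expected for `obs["label_pred"]` in the prediction file.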

File format: Prediction

The prediction file.

Example file: resources_test/task_label_projection/pancreas/prediction.h5ad

Format:

AnnData object
 obs: 'label_pred'
 uns: 'dataset_id', 'normalization_id', 'method_id'

Data structure:

Slot	Type	Description
`obs["label_pred"]`	`string`	Predicted labels for the test cells.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.
`uns["method_id"]`	`string`	A unique identifier for the method.

File format: Score

Metric score file.

Example file: resources_test/task_label_projection/pancreas/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'normalization_id', 'method_id', 'metric_ids', 'metric_values'

Data structure:

Slot	Type	Description
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.
`uns["method_id"]`	`string`	A unique identifier for the method.
`uns["metric_ids"]`	`string`	One or more unique metric identifiers.
`uns["metric_values"]`	`double`	The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’.
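
The requirement that `metric_values` has the same length as `metric_ids` can be enforced when the score record is assembled. The sketch below uses a plain dictionary in place of the `uns` slot. The helper names `build_score_record` and `check_score_record`, as well as the example identifiers in the usage note, are assumptions.

```python
from typing import Dict, List, Union

ScoreRecord = Dict[str, Union[str, List[str], List[float]]]


def build_score_record(
    dataset_id: str,
    normalization_id: str,
    method_id: str,
    scores: Dict[str, float],
) -> ScoreRecord:
    """Collect metric ids and values into two parallel lists of equal length."""
    metric_ids = list(scores)
    metric_values = [float(scores[metric_id]) for metric_id in metric_ids]
    return {
        "dataset_id": dataset_id,
        "normalization_id": normalization_id,
        "method_id": method_id,
        "metric_ids": metric_ids,
        "metric_values": metric_values,
    }


def check_score_record(record: ScoreRecord) -> None:
    """Raise ValueError if the two metric lists are not aligned."""
    if len(record["metric_ids"]) != len(record["metric_values"]):
        raise ValueError("metric_values must be of the same length as metric_ids")
```

For example, `build_score_record("pancreas", "example_normalization", "example_method", {"accuracy": 0.75})` yields a record with one metric id and one matching value.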
