Predict Modality

Predicting the profiles of one modality (e.g. protein abundance) from another (e.g. mRNA expression).

Repository: openproblems-bio/task_predict_modality

Description

Experimental techniques that measure multiple modalities within the same single cell are becoming increasingly available. The demand for these measurements is driven by the promise of deeper insight into the state of a cell. Yet the modalities are also intrinsically linked. We know that DNA must be accessible (ATAC data) to produce mRNA (expression data), and mRNA in turn is used as a template to produce protein (protein abundance). These processes are often regulated by the same molecules that they produce: for example, a protein may bind DNA to prevent the production of more mRNA. Understanding these regulatory processes would be transformative for synthetic biology and drug target discovery. Any method that can predict one modality from another must have accounted for these regulatory processes, but the demand for multi-modal data shows that this is not trivial.

Authors & contributors

name	roles
Alejandro Granados	author
Alex Tong	author
Bastian Rieck	author
Daniel Burkhardt	author
Kai Waldrant	contributor
Kaiwen Deng	contributor
Louise Deconinck	author
Robrecht Cannoodt	author, maintainer
Xueer Chen	contributor
Jiwei Liu	contributor

API

flowchart LR
  file_common_dataset_mod1("Raw dataset RNA")
  comp_process_datasets[/"Process Dataset"/]
  file_test_mod1("Test mod1")
  file_test_mod2("Test mod2")
  file_train_mod1("Train mod1")
  file_train_mod2("Train mod2")
  comp_control_method[/"Control method"/]
  comp_method_predict[/"Predict"/]
  comp_method_train[/"Train"/]
  comp_method[/"Method"/]
  comp_metric[/"Metric"/]
  file_prediction("Prediction")
  file_pretrained_model("Pretrained model")
  file_score("Score")
  file_common_dataset_mod2("Raw dataset mod2")
  file_common_dataset_mod1---comp_process_datasets
  comp_process_datasets-->file_test_mod1
  comp_process_datasets-->file_test_mod2
  comp_process_datasets-->file_train_mod1
  comp_process_datasets-->file_train_mod2
  file_test_mod1---comp_control_method
  file_test_mod1---comp_method_predict
  file_test_mod1---comp_method_train
  file_test_mod1---comp_method
  file_test_mod2---comp_control_method
  file_test_mod2---comp_metric
  file_train_mod1---comp_control_method
  file_train_mod1---comp_method_predict
  file_train_mod1---comp_method_train
  file_train_mod1---comp_method
  file_train_mod2---comp_control_method
  file_train_mod2---comp_method_predict
  file_train_mod2---comp_method_train
  file_train_mod2---comp_method
  comp_control_method-->file_prediction
  comp_method_predict-->file_prediction
  comp_method_train-->file_pretrained_model
  comp_method-->file_prediction
  comp_metric-->file_score
  file_prediction---comp_metric
  file_pretrained_model---comp_method_predict
  file_common_dataset_mod2---comp_process_datasets
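
The flowchart can also be read as a dependency graph. The following is a minimal sketch, not part of the repository, that encodes the single-step "Method" path and derives one valid execution order with the standard-library `graphlib` module (Python 3.9+). The node names are shortened versions of the flowchart identifiers and are chosen here for illustration only.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its set (edges taken from the flowchart).
# Only the single-step "Method" path is included to keep the sketch short.
DEPENDS_ON = {
    "process_datasets": {"raw_dataset_mod1", "raw_dataset_mod2"},
    "train_mod1": {"process_datasets"},
    "train_mod2": {"process_datasets"},
    "test_mod1": {"process_datasets"},
    "test_mod2": {"process_datasets"},
    "method": {"train_mod1", "train_mod2", "test_mod1"},
    "prediction": {"method"},
    "metric": {"prediction", "test_mod2"},
    "score": {"metric"},
}


def execution_order(depends_on):
    """Return one valid order in which the nodes can be produced."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(order)
```

Note that `test_mod2` is an input of the metric but not of the method, which matches the flowchart: only control methods and metrics see the held-out mod2 values.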

Loading

File format: Raw dataset RNA

The RNA modality of the raw dataset.

Example file: resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod1.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot	Type	Description
`obs["batch"]`	`string`	Batch information.
`obs["size_factors"]`	`double`	(Optional) The size factors of the cells prior to normalization.
`var["feature_id"]`	`string`	Unique identifier for the feature, usually an ENSEMBL gene ID.
`var["feature_name"]`	`string`	(Optional) A human-readable name for the feature, usually a gene symbol.
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A score for the feature indicating how highly variable it is.
`obsm["gene_activity"]`	`double`	(Optional) ATAC gene activity.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) Bibtex reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	The unique identifier of the normalization method used.
`uns["gene_activity_var_names"]`	`string`	(Optional) Names of the gene activity matrix.
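
A file can be checked against the table above by testing for every slot that is not marked optional. The sketch below is an illustration rather than a repository utility: `REQUIRED_SLOTS` and `missing_slots` are hypothetical names, and the stand-in object only mimics the attribute layout of an AnnData object so that the example runs without extra dependencies.

```python
from types import SimpleNamespace

# Required (non-optional) slots from the "Raw dataset RNA" table.
REQUIRED_SLOTS = {
    "obs": ["batch"],
    "var": ["feature_id", "hvg", "hvg_score"],
    "layers": ["counts", "normalized"],
    "uns": ["dataset_id", "dataset_name", "dataset_summary",
            "dataset_description", "normalization_id"],
}


def missing_slots(adata, required=REQUIRED_SLOTS):
    """List required slots absent from an AnnData-like object."""
    missing = []
    for slot, keys in required.items():
        container = getattr(adata, slot)
        missing.extend(f'{slot}["{key}"]' for key in keys if key not in container)
    return missing


# Stand-in for illustration. With anndata installed, the object returned by
# anndata.read_h5ad(path) could be passed instead (assumption: its obs, var,
# layers and uns attributes all support the `in` operator on key names).
fake = SimpleNamespace(
    obs={"batch": []},
    var={"feature_id": [], "hvg": [], "hvg_score": []},
    layers={"counts": [], "normalized": []},
    uns={"dataset_id": "x", "dataset_name": "X"},
)
print(missing_slots(fake))
```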

Component type: Process Dataset

A predict modality dataset processor.

Arguments:

Name	Type	Description
`--input_mod1`	`file`	The RNA modality of the raw dataset.
`--input_mod2`	`file`	The second modality of the raw dataset. Must be an ADT or an ATAC dataset.
`--output_train_mod1`	`file`	(Output) The mod1 expression values of the train cells.
`--output_train_mod2`	`file`	(Output) The mod2 expression values of the train cells.
`--output_test_mod1`	`file`	(Output) The mod1 expression values of the test cells.
`--output_test_mod2`	`file`	(Output) The mod2 expression values of the test cells.
`--seed`	`integer`	(Optional) Random seed. Default: `1`.
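
This README does not describe how the processor assigns cells to the train and test sets. As an illustration of why a `--seed` argument is useful, the sketch below assumes a simple random split; `split_cells` and `test_fraction` are hypothetical and may not reflect what the actual component does.

```python
import random


def split_cells(cell_ids, test_fraction=0.2, seed=1):
    """Shuffle cell IDs with a fixed seed and split them into (train, test)."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


cells = [f"cell_{i}" for i in range(10)]
train, test = split_cells(cells, seed=1)
print(len(train), len(test))
```

Calling the function again with the same seed reproduces the same split, which is what makes downstream results comparable between runs.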

File format: Test mod1

The mod1 expression values of the test cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod1.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'gene_ids', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot	Type	Description
`obs["batch"]`	`string`	Batch information.
`obs["size_factors"]`	`double`	(Optional) The size factors of the cells prior to normalization.
`var["gene_ids"]`	`string`	(Optional) The gene identifiers (if available).
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A score for the feature indicating how highly variable it is.
`obsm["gene_activity"]`	`double`	(Optional) ATAC gene activity.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["common_dataset_id"]`	`string`	(Optional) A common identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) Bibtex reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	The unique identifier of the normalization method used.
`uns["gene_activity_var_names"]`	`string`	(Optional) Names of the gene activity matrix.

File format: Test mod2

The mod2 expression values of the test cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod2.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'gene_ids', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'gene_activity_var_names'

Data structure:

Slot	Type	Description
`obs["batch"]`	`string`	Batch information.
`obs["size_factors"]`	`double`	(Optional) The size factors of the cells prior to normalization.
`var["gene_ids"]`	`string`	(Optional) The gene identifiers (if available).
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A score for the feature indicating how highly variable it is.
`obsm["gene_activity"]`	`double`	(Optional) ATAC gene activity.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["common_dataset_id"]`	`string`	(Optional) A common identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) Bibtex reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["gene_activity_var_names"]`	`string`	(Optional) Names of the gene activity matrix.

File format: Train mod1

The mod1 expression values of the train cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod1.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'gene_ids', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot	Type	Description
`obs["batch"]`	`string`	Batch information.
`obs["size_factors"]`	`double`	(Optional) The size factors of the cells prior to normalization.
`var["gene_ids"]`	`string`	(Optional) The gene identifiers (if available).
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A score for the feature indicating how highly variable it is.
`obsm["gene_activity"]`	`double`	(Optional) ATAC gene activity.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["common_dataset_id"]`	`string`	(Optional) A common identifier for the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	The unique identifier of the normalization method used.
`uns["gene_activity_var_names"]`	`string`	(Optional) Names of the gene activity matrix.

File format: Train mod2

The mod2 expression values of the train cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod2.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'gene_ids', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot	Type	Description
`obs["batch"]`	`string`	Batch information.
`obs["size_factors"]`	`double`	(Optional) The size factors of the cells prior to normalization.
`var["gene_ids"]`	`string`	(Optional) The gene identifiers (if available).
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A score for the feature indicating how highly variable it is.
`obsm["gene_activity"]`	`double`	(Optional) ATAC gene activity.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["common_dataset_id"]`	`string`	(Optional) A common identifier for the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	The unique identifier of the normalization method used.
`uns["gene_activity_var_names"]`	`string`	(Optional) Names of the gene activity matrix.
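
The four processed files share most of their structure but differ in the `uns` metadata they carry. The sketch below copies the `uns` keys from the format listings above and computes the differences; the dictionary and helper names are chosen here for illustration.

```python
# uns keys copied from the "Format" listing of each processed file.
DESCRIPTIVE = {"dataset_name", "dataset_url", "dataset_reference",
               "dataset_summary", "dataset_description"}
SHARED = {"dataset_id", "common_dataset_id", "dataset_organism",
          "gene_activity_var_names"}

UNS_KEYS = {
    "test_mod1": SHARED | DESCRIPTIVE | {"normalization_id"},
    "test_mod2": SHARED | DESCRIPTIVE,
    "train_mod1": SHARED | {"normalization_id"},
    "train_mod2": SHARED | {"normalization_id"},
}


def only_in(a, b):
    """Return the uns keys listed for file `a` but not for file `b`, sorted."""
    return sorted(UNS_KEYS[a] - UNS_KEYS[b])


print(only_in("test_mod1", "train_mod1"))
print(only_in("test_mod1", "test_mod2"))
```

According to the listings, the train files omit the descriptive dataset metadata, and Test mod2 is the only processed file without `normalization_id`.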

Component type: Control method

Quality control methods for verifying the pipeline.

Arguments:

Name	Type	Description
`--input_train_mod1`	`file`	The mod1 expression values of the train cells.
`--input_train_mod2`	`file`	The mod2 expression values of the train cells.
`--input_test_mod1`	`file`	The mod1 expression values of the test cells.
`--input_test_mod2`	`file`	The mod2 expression values of the test cells.
`--output`	`file`	(Output) A prediction of the mod2 expression values of the test cells.
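
Unlike regular methods, a control method also receives `--input_test_mod2`, so it can produce reference predictions that bracket the expected score range. The README does not list the actual control methods; the two functions below are assumptions about what such controls commonly look like, written on plain nested lists (cells by features) to stay dependency-free.

```python
def positive_control(test_mod2):
    """Return a copy of the ground truth as the 'prediction' (best possible score)."""
    return [row[:] for row in test_mod2]


def mean_per_feature_control(train_mod2, n_test_cells):
    """Predict each feature's training mean for every test cell (naive baseline)."""
    n_train = len(train_mod2)
    means = [sum(col) / n_train for col in zip(*train_mod2)]
    return [means[:] for _ in range(n_test_cells)]


train_mod2 = [[1.0, 4.0], [3.0, 8.0]]
test_mod2 = [[2.0, 5.0]]
print(positive_control(test_mod2))
print(mean_per_feature_control(train_mod2, len(test_mod2)))
```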

Component type: Predict

Make predictions using a trained model.

Arguments:

Name	Type	Description
`--input_train_mod1`	`file`	(Optional) The mod1 expression values of the train cells.
`--input_train_mod2`	`file`	(Optional) The mod2 expression values of the train cells.
`--input_test_mod1`	`file`	The mod1 expression values of the test cells.
`--input_model`	`file`	A pretrained model for predicting the expression of one modality from another.
`--output`	`file`	(Output) A prediction of the mod2 expression values of the test cells.

Component type: Train

Train a model to predict the expression of one modality from another.

Arguments:

Name	Type	Description
`--input_train_mod1`	`file`	The mod1 expression values of the train cells.
`--input_train_mod2`	`file`	The mod2 expression values of the train cells.
`--input_test_mod1`	`file`	(Optional) The mod1 expression values of the test cells.
`--output`	`file`	(Output) A pretrained model for predicting the expression of one modality from another.
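
The Train and Predict components split a method into two steps that communicate through the pretrained model file. The sketch below shows one way such a pair could fit together. The model (one least-squares line per mod2 feature, fitted against each cell's total mod1 signal) and the use of `pickle` for serialization are assumptions made for illustration; the README does not prescribe a model type or file format.

```python
import pickle


def train(train_mod1, train_mod2):
    """Fit y = a + b * x per mod2 feature, where x is a cell's total mod1 signal."""
    x = [sum(row) for row in train_mod1]
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    coefs = []
    for col in zip(*train_mod2):
        y_mean = sum(col) / n
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, col))
        slope = sxy / sxx if sxx else 0.0
        coefs.append((y_mean - slope * x_mean, slope))
    # Bytes that a Train component could write to its --output file.
    return pickle.dumps({"coefs": coefs})


def predict(model_bytes, test_mod1):
    """Load the serialized model and predict mod2 values for each test cell."""
    coefs = pickle.loads(model_bytes)["coefs"]
    return [[a + b * sum(row) for a, b in coefs] for row in test_mod1]


train_mod1 = [[1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]  # per-cell totals: 1, 2, 3
train_mod2 = [[2.0], [4.0], [6.0]]                 # exactly 2 * total
model = train(train_mod1, train_mod2)
prediction = predict(model, [[2.0, 2.0]])          # total: 4
print(prediction)
```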

Component type: Method

A regression method.

Arguments:

Name	Type	Description
`--input_train_mod1`	`file`	The mod1 expression values of the train cells.
`--input_train_mod2`	`file`	The mod2 expression values of the train cells.
`--input_test_mod1`	`file`	The mod1 expression values of the test cells.
`--output`	`file`	(Output) A prediction of the mod2 expression values of the test cells.
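
A Method component performs training and prediction in a single step. As a minimal sketch of such a regression method, the example below uses k-nearest-neighbour regression on plain nested lists; this technique is chosen here for brevity and is not claimed to be one of the methods in the repository.

```python
import math


def knn_predict(train_mod1, train_mod2, test_mod1, k=2):
    """Predict mod2 for each test cell as the mean mod2 profile of its k nearest train cells."""
    n_features = len(train_mod2[0])
    predictions = []
    for query in test_mod1:
        # Rank training cells by Euclidean distance in mod1 space.
        ranked = sorted(range(len(train_mod1)),
                        key=lambda i: math.dist(query, train_mod1[i]))
        nearest = ranked[:k]
        predictions.append([
            sum(train_mod2[i][j] for i in nearest) / len(nearest)
            for j in range(n_features)
        ])
    return predictions


train_mod1 = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
train_mod2 = [[1.0], [3.0], [100.0]]
print(knn_predict(train_mod1, train_mod2, [[0.0, 0.4]], k=2))
```

The function signature mirrors the component's arguments: it sees both training modalities and the test mod1 values, but never the test mod2 values.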

Component type: Metric

A predict modality metric.

Arguments:

Name	Type	Description
`--input_prediction`	`file`	A prediction of the mod2 expression values of the test cells.
`--input_test_mod2`	`file`	The mod2 expression values of the test cells.
`--output`	`file`	(Output) Metric score file.
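
A metric compares the prediction with the held-out mod2 values and reduces the comparison to one or more numbers. The README does not name the metrics used by the task; the sketch below assumes two common regression errors, root mean squared error and mean absolute error, computed over all cells and features.

```python
import math


def rmse(prediction, truth):
    """Root mean squared error over all cells and features."""
    squared = [(p - t) ** 2
               for p_row, t_row in zip(prediction, truth)
               for p, t in zip(p_row, t_row)]
    return math.sqrt(sum(squared) / len(squared))


def mae(prediction, truth):
    """Mean absolute error over all cells and features."""
    absolute = [abs(p - t)
                for p_row, t_row in zip(prediction, truth)
                for p, t in zip(p_row, t_row)]
    return sum(absolute) / len(absolute)


prediction = [[1.0, 2.0], [3.0, 4.0]]
truth = [[1.0, 4.0], [3.0, 0.0]]
print(rmse(prediction, truth), mae(prediction, truth))
```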

File format: Prediction

A prediction of the mod2 expression values of the test cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/prediction.h5ad

Format:

AnnData object
 layers: 'normalized'
 uns: 'dataset_id', 'method_id'

Data structure:

Slot	Type	Description
`layers["normalized"]`	`double`	Predicted normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["method_id"]`	`string`	A unique identifier for the method.

File format: Pretrained model

A pretrained model for predicting the expression of one modality from another.

File format: Score

Metric score file.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'

Data structure:

Slot	Type	Description
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["method_id"]`	`string`	A unique identifier for the method.
`uns["metric_ids"]`	`string`	One or more unique metric identifiers.
`uns["metric_values"]`	`double`	The metric values obtained for the given prediction. Must be the same length as `metric_ids`.
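
The length constraint between `metric_ids` and `metric_values` is easiest to satisfy by building both lists from a single mapping. The helper below is a hypothetical sketch of that idea; the dataset, method and metric identifiers in the usage lines are placeholders.

```python
def build_score_uns(dataset_id, method_id, metrics):
    """Turn a {metric_id: value} mapping into the four `uns` fields of a score file."""
    metric_ids = list(metrics)
    metric_values = [float(metrics[m]) for m in metric_ids]
    # Both lists come from the same mapping, so their lengths always match.
    return {
        "dataset_id": dataset_id,
        "method_id": method_id,
        "metric_ids": metric_ids,
        "metric_values": metric_values,
    }


# Placeholder identifiers, for illustration only.
score_uns = build_score_uns("example_dataset", "example_method",
                            {"rmse": 0.5, "mae": 0.25})
print(score_uns)
```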

File format: Raw dataset mod2

The second modality of the raw dataset. Must be an ADT or an ATAC dataset.

Example file: resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod2.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot	Type	Description
`obs["batch"]`	`string`	Batch information.
`obs["size_factors"]`	`double`	(Optional) The size factors of the cells prior to normalization.
`var["feature_id"]`	`string`	Unique identifier for the feature, usually an ENSEMBL gene ID.
`var["feature_name"]`	`string`	(Optional) A human-readable name for the feature, usually a gene symbol.
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A score for the feature indicating how highly variable it is.
`obsm["gene_activity"]`	`double`	(Optional) ATAC gene activity.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) Bibtex reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	The unique identifier of the normalization method used.
`uns["gene_activity_var_names"]`	`string`	(Optional) Names of the gene activity matrix.
