Predict Modality

Predicting the profiles of one modality (e.g. protein abundance) from another (e.g. mRNA expression).

Repository: openproblems-bio/task_predict_modality

Description

Experimental techniques that measure multiple modalities within the same single cell are becoming increasingly available. The demand for these measurements is driven by the promise of deeper insight into the state of a cell. Yet the modalities are also intrinsically linked. We know that DNA must be accessible (ATAC data) to produce mRNA (expression data), and mRNA in turn is used as a template to produce protein (protein abundance). These processes are often regulated by the same molecules that they produce: for example, a protein may bind DNA to prevent the production of more mRNA. Understanding these regulatory processes would be transformative for synthetic biology and drug target discovery. Any method that can predict one modality from another must have accounted for these regulatory processes, but the demand for multi-modal data shows that this is not trivial.

Authors & contributors

name roles
Alejandro Granados author
Alex Tong author
Bastian Rieck author
Daniel Burkhardt author
Kai Waldrant contributor
Kaiwen Deng contributor
Louise Deconinck author
Robrecht Cannoodt author, maintainer
Xueer Chen contributor
Jiwei Liu contributor

API

flowchart LR
  file_common_dataset_mod1("Raw dataset RNA")
  comp_process_datasets[/"Process Dataset"/]
  file_test_mod1("Test mod1")
  file_test_mod2("Test mod2")
  file_train_mod1("Train mod1")
  file_train_mod2("Train mod2")
  comp_control_method[/"Control method"/]
  comp_method_predict[/"Predict"/]
  comp_method_train[/"Train"/]
  comp_method[/"Method"/]
  comp_metric[/"Metric"/]
  file_prediction("Prediction")
  file_pretrained_model("Pretrained model")
  file_score("Score")
  file_common_dataset_mod2("Raw dataset mod2")
  file_common_dataset_mod1---comp_process_datasets
  comp_process_datasets-->file_test_mod1
  comp_process_datasets-->file_test_mod2
  comp_process_datasets-->file_train_mod1
  comp_process_datasets-->file_train_mod2
  file_test_mod1---comp_control_method
  file_test_mod1---comp_method_predict
  file_test_mod1---comp_method_train
  file_test_mod1---comp_method
  file_test_mod2---comp_control_method
  file_test_mod2---comp_metric
  file_train_mod1---comp_control_method
  file_train_mod1---comp_method_predict
  file_train_mod1---comp_method_train
  file_train_mod1---comp_method
  file_train_mod2---comp_control_method
  file_train_mod2---comp_method_predict
  file_train_mod2---comp_method_train
  file_train_mod2---comp_method
  comp_control_method-->file_prediction
  comp_method_predict-->file_prediction
  comp_method_train-->file_pretrained_model
  comp_method-->file_prediction
  comp_metric-->file_score
  file_prediction---comp_metric
  file_pretrained_model---comp_method_predict
  file_common_dataset_mod2---comp_process_datasets

File format: Raw dataset RNA

The RNA modality of the raw dataset.

Example file: resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod1.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot Type Description
obs["batch"] string Batch information.
obs["size_factors"] double (Optional) The size factors of the cells prior to normalization.
var["feature_id"] string Unique identifier for the feature, usually an ENSEMBL gene id.
var["feature_name"] string (Optional) A human-readable name for the feature, usually a gene symbol.
var["hvg"] boolean Whether or not the feature is considered to be a ‘highly variable gene’.
var["hvg_score"] double A score for the feature indicating how highly variable it is.
obsm["gene_activity"] double (Optional) ATAC gene activity.
layers["counts"] integer Raw counts.
layers["normalized"] double Normalized expression values.
uns["dataset_id"] string A unique identifier for the dataset.
uns["dataset_name"] string Nicely formatted name.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["normalization_id"] string The unique identifier of the normalization method used.
uns["gene_activity_var_names"] string (Optional) Names of the gene activity matrix.
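
The slot table above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the repository: `REQUIRED_SLOTS` and `missing_slots` are hypothetical names, and the demo object is a stand-in built from plain dictionaries. The check only relies on the `in` operator, so it should also work on a real AnnData object, where `obs` and `var` are data frames and `layers` and `uns` are mappings.

```python
from types import SimpleNamespace

# Non-optional slots of the raw dataset, copied from the table above.
REQUIRED_SLOTS = {
    "obs": ["batch"],
    "var": ["feature_id", "hvg", "hvg_score"],
    "layers": ["counts", "normalized"],
    "uns": [
        "dataset_id",
        "dataset_name",
        "dataset_summary",
        "dataset_description",
        "normalization_id",
    ],
}


def missing_slots(adata, required=REQUIRED_SLOTS):
    """List every required slot that is absent, formatted like 'uns["dataset_id"]'."""
    missing = []
    for container, keys in required.items():
        present = getattr(adata, container)
        for key in keys:
            if key not in present:
                missing.append(f'{container}["{key}"]')
    return missing


# Stand-in object with dict containers (hypothetical values, not real data).
demo = SimpleNamespace(
    obs={"batch": ["b1", "b2"]},
    var={"feature_id": ["f1"], "hvg": [True], "hvg_score": [0.5]},
    layers={"counts": [[1], [0]], "normalized": [[0.7], [0.0]]},
    uns={"dataset_id": "demo", "dataset_name": "Demo"},
)
print(missing_slots(demo))
```

Slots marked (Optional) in the table are deliberately left out of the check.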

Component type: Process Dataset

A predict modality dataset processor.

Arguments:

Name Type Description
--input_mod1 file The RNA modality of the raw dataset.
--input_mod2 file The second modality of the raw dataset. Must be an ADT or an ATAC dataset.
--output_train_mod1 file (Output) The mod1 expression values of the train cells.
--output_train_mod2 file (Output) The mod2 expression values of the train cells.
--output_test_mod1 file (Output) The mod1 expression values of the test cells.
--output_test_mod2 file (Output) The mod2 expression values of the test cells.
--seed integer (Optional) NA. Default: 1.
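
The argument table does not say how cells are assigned to the train and test sets, so the following sketch only illustrates one property the outputs imply: the same cells must end up in the train (or test) file of both modalities. It assumes a random per-cell split, which may differ from what the actual component does. `split_cells` and `test_fraction` are hypothetical; only the default seed of 1 is taken from the table.

```python
import random


def split_cells(n_cells, test_fraction=0.2, seed=1):
    """Return (train_idx, test_idx): sorted, disjoint indices covering range(n_cells)."""
    rng = random.Random(seed)  # fixed seed gives the same split on every run
    shuffled = list(range(n_cells))
    rng.shuffle(shuffled)
    n_test = max(1, round(n_cells * test_fraction))
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


train_idx, test_idx = split_cells(10)

# The same indices select rows from both modalities, so each cell keeps its pairing.
mod1 = [f"rna_cell{i}" for i in range(10)]
mod2 = [f"adt_cell{i}" for i in range(10)]
train_mod1 = [mod1[i] for i in train_idx]
train_mod2 = [mod2[i] for i in train_idx]
test_mod1 = [mod1[i] for i in test_idx]
test_mod2 = [mod2[i] for i in test_idx]
```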

File format: Test mod1

The mod1 expression values of the test cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod1.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'gene_ids', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot Type Description
obs["batch"] string Batch information.
obs["size_factors"] double (Optional) The size factors of the cells prior to normalization.
var["gene_ids"] string (Optional) The gene identifiers (if available).
var["hvg"] boolean Whether or not the feature is considered to be a ‘highly variable gene’.
var["hvg_score"] double A score for the feature indicating how highly variable it is.
obsm["gene_activity"] double (Optional) ATAC gene activity.
layers["counts"] integer Raw counts.
layers["normalized"] double Normalized expression values.
uns["dataset_id"] string A unique identifier for the dataset.
uns["common_dataset_id"] string (Optional) A common identifier for the dataset.
uns["dataset_name"] string Nicely formatted name.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["normalization_id"] string The unique identifier of the normalization method used.
uns["gene_activity_var_names"] string (Optional) Names of the gene activity matrix.

File format: Test mod2

The mod2 expression values of the test cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod2.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'gene_ids', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'gene_activity_var_names'

Data structure:

Slot Type Description
obs["batch"] string Batch information.
obs["size_factors"] double (Optional) The size factors of the cells prior to normalization.
var["gene_ids"] string (Optional) The gene identifiers (if available).
var["hvg"] boolean Whether or not the feature is considered to be a ‘highly variable gene’.
var["hvg_score"] double A score for the feature indicating how highly variable it is.
obsm["gene_activity"] double (Optional) ATAC gene activity.
layers["counts"] integer Raw counts.
layers["normalized"] double Normalized expression values.
uns["dataset_id"] string A unique identifier for the dataset.
uns["common_dataset_id"] string (Optional) A common identifier for the dataset.
uns["dataset_name"] string Nicely formatted name.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["gene_activity_var_names"] string (Optional) Names of the gene activity matrix.

File format: Train mod1

The mod1 expression values of the train cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod1.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'gene_ids', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot Type Description
obs["batch"] string Batch information.
obs["size_factors"] double (Optional) The size factors of the cells prior to normalization.
var["gene_ids"] string (Optional) The gene identifiers (if available).
var["hvg"] boolean Whether or not the feature is considered to be a ‘highly variable gene’.
var["hvg_score"] double A score for the feature indicating how highly variable it is.
obsm["gene_activity"] double (Optional) ATAC gene activity.
layers["counts"] integer Raw counts.
layers["normalized"] double Normalized expression values.
uns["dataset_id"] string A unique identifier for the dataset.
uns["common_dataset_id"] string (Optional) A common identifier for the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["normalization_id"] string The unique identifier of the normalization method used.
uns["gene_activity_var_names"] string (Optional) Names of the gene activity matrix.

File format: Train mod2

The mod2 expression values of the train cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod2.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'gene_ids', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot Type Description
obs["batch"] string Batch information.
obs["size_factors"] double (Optional) The size factors of the cells prior to normalization.
var["gene_ids"] string (Optional) The gene identifiers (if available).
var["hvg"] boolean Whether or not the feature is considered to be a ‘highly variable gene’.
var["hvg_score"] double A score for the feature indicating how highly variable it is.
obsm["gene_activity"] double (Optional) ATAC gene activity.
layers["counts"] integer Raw counts.
layers["normalized"] double Normalized expression values.
uns["dataset_id"] string A unique identifier for the dataset.
uns["common_dataset_id"] string (Optional) A common identifier for the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["normalization_id"] string The unique identifier of the normalization method used.
uns["gene_activity_var_names"] string (Optional) Names of the gene activity matrix.

Component type: Control method

Quality control methods for verifying the pipeline.

Arguments:

Name Type Description
--input_train_mod1 file The mod1 expression values of the train cells.
--input_train_mod2 file The mod2 expression values of the train cells.
--input_test_mod1 file The mod1 expression values of the test cells.
--input_test_mod2 file The mod2 expression values of the test cells.
--output file (Output) A prediction of the mod2 expression values of the test cells.
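
Unlike a regular method, a control method also receives the test mod2 values, which makes it possible to build reference predictions with a known quality. The sketch below shows two illustrative controls on plain NumPy arrays; the function names are hypothetical and the sketch does not claim to match the controls shipped in the repository.

```python
import numpy as np


def positive_control(test_mod2):
    """Return a copy of the ground truth: any metric should give this its best score."""
    return np.array(test_mod2, dtype=float)


def mean_baseline(train_mod2, n_test_cells):
    """Predict the per-feature training mean for every test cell."""
    feature_means = np.asarray(train_mod2, dtype=float).mean(axis=0)
    return np.tile(feature_means, (n_test_cells, 1))


train_mod2 = [[0.0, 2.0], [2.0, 4.0]]
test_mod2 = [[1.0, 1.0], [3.0, 5.0], [0.0, 0.0]]

perfect = positive_control(test_mod2)
baseline = mean_baseline(train_mod2, n_test_cells=len(test_mod2))
```

A method that scores worse than the mean baseline, or a metric that does not rank the positive control first, points to a problem somewhere in the pipeline.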

Component type: Predict

Make predictions using a trained model.

Arguments:

Name Type Description
--input_train_mod1 file (Optional) The mod1 expression values of the train cells.
--input_train_mod2 file (Optional) The mod2 expression values of the train cells.
--input_test_mod1 file The mod1 expression values of the test cells.
--input_model file A pretrained model for predicting the expression of one modality from another.
--output file (Output) A prediction of the mod2 expression values of the test cells.

Component type: Train

Train a model to predict the expression of one modality from another.

Arguments:

Name Type Description
--input_train_mod1 file The mod1 expression values of the train cells.
--input_train_mod2 file The mod2 expression values of the train cells.
--input_test_mod1 file (Optional) The mod1 expression values of the test cells.
--output file (Output) A pretrained model for predicting the expression of one modality from another.
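
The Predict and Train components split a method into two stages that communicate only through the pretrained model file. A minimal sketch of that hand-off, assuming a least-squares model stored with `pickle` (the model type, the file format, and the function names are all assumptions made for illustration):

```python
import os
import pickle
import tempfile

import numpy as np


def add_intercept(X):
    """Append a column of ones so the model can learn an offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def train(train_mod1, train_mod2, model_path):
    """Fit least-squares weights on the train cells and write them to model_path."""
    X = add_intercept(np.asarray(train_mod1, dtype=float))
    Y = np.asarray(train_mod2, dtype=float)
    weights, *_ = np.linalg.lstsq(X, Y, rcond=None)
    with open(model_path, "wb") as fh:
        pickle.dump({"weights": weights}, fh)


def predict(test_mod1, model_path):
    """Load the stored weights and predict mod2 for the test cells."""
    with open(model_path, "rb") as fh:
        weights = pickle.load(fh)["weights"]
    return add_intercept(np.asarray(test_mod1, dtype=float)) @ weights


# Toy data following mod2 = 2 * x0 + 3 * x1 + 1 (made up for the demo).
train_x = [[0, 0], [1, 0], [0, 1], [1, 1]]
train_y = [[1], [3], [4], [6]]

with tempfile.TemporaryDirectory() as tmp:
    model_file = os.path.join(tmp, "model.pkl")
    train(train_x, train_y, model_file)
    prediction = predict([[2, 2]], model_file)
```

Because `predict` reads everything it needs from the model file, the train inputs can stay optional at prediction time, as in the Predict argument table.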

Component type: Method

A regression method.

Arguments:

Name Type Description
--input_train_mod1 file The mod1 expression values of the train cells.
--input_train_mod2 file The mod2 expression values of the train cells.
--input_test_mod1 file The mod1 expression values of the test cells.
--output file (Output) A prediction of the mod2 expression values of the test cells.
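
As an illustration of the interface above, the following sketch implements a method as ridge regression on dense NumPy arrays. It is one possible regression method under stated assumptions, not a method from the repository: `ridge_method` and `alpha` are hypothetical names, and reading and writing the h5ad files is left out.

```python
import numpy as np


def ridge_method(train_mod1, train_mod2, test_mod1, alpha=1.0):
    """Predict test mod2 from test mod1 with ridge regression fitted on the train cells."""
    X = np.asarray(train_mod1, dtype=float)
    Y = np.asarray(train_mod2, dtype=float)
    X_test = np.asarray(test_mod1, dtype=float)

    # Center both modalities so that no intercept has to be penalised.
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean

    # Closed-form ridge solution: (Xc^T Xc + alpha * I)^-1 Xc^T Yc
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, Xc.T @ Yc)

    return (X_test - x_mean) @ weights + y_mean


# Synthetic example: 20 train cells, 4 test cells, 5 mod1 features, 3 mod2 features.
rng = np.random.default_rng(0)
true_weights = rng.normal(size=(5, 3))
train_x = rng.normal(size=(20, 5))
test_x = rng.normal(size=(4, 5))
train_y = train_x @ true_weights + 2.0

prediction = ridge_method(train_x, train_y, test_x, alpha=1e-8)
```

The result has one row per test cell and one column per mod2 feature, which is the shape expected for the prediction's `normalized` layer.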

Component type: Metric

A predict modality metric.

Arguments:

Name Type Description
--input_prediction file A prediction of the mod2 expression values of the test cells.
--input_test_mod2 file The mod2 expression values of the test cells.
--output file (Output) Metric score file.
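
A metric compares the predicted mod2 values with the held-out test mod2 values and reduces them to one or more numbers. The sketch below computes two common error measures on NumPy arrays; it is illustrative only, and the choice of RMSE and MAE is an assumption rather than a list of the metrics used by the task.

```python
import numpy as np


def _as_matching_arrays(prediction, test_mod2):
    """Convert both inputs to float arrays and make sure their shapes agree."""
    pred = np.asarray(prediction, dtype=float)
    truth = np.asarray(test_mod2, dtype=float)
    if pred.shape != truth.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {truth.shape}")
    return pred, truth


def rmse(prediction, test_mod2):
    """Root mean squared error over all cells and features."""
    pred, truth = _as_matching_arrays(prediction, test_mod2)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


def mae(prediction, test_mod2):
    """Mean absolute error over all cells and features."""
    pred, truth = _as_matching_arrays(prediction, test_mod2)
    return float(np.mean(np.abs(pred - truth)))


prediction = [[1.0, 2.0], [3.0, 4.0]]
test_mod2 = [[1.0, 2.0], [3.0, 6.0]]
print(rmse(prediction, test_mod2), mae(prediction, test_mod2))  # 1.0 0.5
```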

File format: Prediction

A prediction of the mod2 expression values of the test cells.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/prediction.h5ad

Format:

AnnData object
 layers: 'normalized'
 uns: 'dataset_id', 'method_id'

Data structure:

Slot Type Description
layers["normalized"] double Predicted normalized expression values.
uns["dataset_id"] string A unique identifier for the dataset.
uns["method_id"] string A unique identifier for the method.

File format: Pretrained model

A pretrained model for predicting the expression of one modality from another.

File format: Score

Metric score file.

Example file: resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'

Data structure:

Slot Type Description
uns["dataset_id"] string A unique identifier for the dataset.
uns["method_id"] string A unique identifier for the method.
uns["metric_ids"] string One or more unique metric identifiers.
uns["metric_values"] double The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’.
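
The score file stores its content entirely in `uns`, with `metric_ids` and `metric_values` kept as parallel lists. A minimal sketch of assembling and checking that content as a plain dictionary (`make_score_uns` and `check_score_uns` are hypothetical helpers, and the identifiers in the demo are made up):

```python
def make_score_uns(dataset_id, method_id, scores):
    """Build the uns content of a score file from a {metric_id: value} mapping."""
    metric_ids = list(scores)
    return {
        "dataset_id": dataset_id,
        "method_id": method_id,
        "metric_ids": metric_ids,
        "metric_values": [float(scores[m]) for m in metric_ids],
    }


def check_score_uns(uns):
    """Raise ValueError if metric_ids and metric_values differ in length."""
    n_ids, n_values = len(uns["metric_ids"]), len(uns["metric_values"])
    if n_ids != n_values:
        raise ValueError(f"{n_ids} metric ids but {n_values} metric values")


score_uns = make_score_uns("demo_dataset", "demo_method", {"rmse": 1, "mae": 0.5})
check_score_uns(score_uns)
```

Building both lists from a single mapping keeps them the same length by construction; the explicit check is useful for score files produced elsewhere.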

File format: Raw dataset mod2

The second modality of the raw dataset. Must be an ADT or an ATAC dataset.

Example file: resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod2.h5ad

Format:

AnnData object
 obs: 'batch', 'size_factors'
 var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
 obsm: 'gene_activity'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'

Data structure:

Slot Type Description
obs["batch"] string Batch information.
obs["size_factors"] double (Optional) The size factors of the cells prior to normalization.
var["feature_id"] string Unique identifier for the feature, usually an ENSEMBL gene id.
var["feature_name"] string (Optional) A human-readable name for the feature, usually a gene symbol.
var["hvg"] boolean Whether or not the feature is considered to be a ‘highly variable gene’.
var["hvg_score"] double A score for the feature indicating how highly variable it is.
obsm["gene_activity"] double (Optional) ATAC gene activity.
layers["counts"] integer Raw counts.
layers["normalized"] double Normalized expression values.
uns["dataset_id"] string A unique identifier for the dataset.
uns["dataset_name"] string Nicely formatted name.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["normalization_id"] string The unique identifier of the normalization method used.
uns["gene_activity_var_names"] string (Optional) Names of the gene activity matrix.