Predicting the profiles of one modality (e.g. protein abundance) from another (e.g. mRNA expression).
Repository: openproblems-bio/task_predict_modality
Experimental techniques that measure multiple modalities within the same single cell are becoming increasingly available. The demand for these measurements is driven by the promise of deeper insight into the state of a cell. Yet the modalities are also intrinsically linked. We know that DNA must be accessible (ATAC data) to produce mRNA (expression data), and mRNA in turn is used as a template to produce protein (protein abundance). These processes are often regulated by the same molecules that they produce: for example, a protein may bind DNA to prevent the production of more mRNA. Understanding these regulatory processes would be transformative for synthetic biology and drug target discovery. Any method that can predict one modality from another must have accounted for these regulatory processes, but the demand for multi-modal data shows that this is not trivial.
name | roles |
---|---|
Alejandro Granados | author |
Alex Tong | author |
Bastian Rieck | author |
Daniel Burkhardt | author |
Kai Waldrant | contributor |
Kaiwen Deng | contributor |
Louise Deconinck | author |
Robrecht Cannoodt | author, maintainer |
Xueer Chen | contributor |
Jiwei Liu | contributor |
flowchart LR
file_common_dataset_mod1("Raw dataset RNA")
comp_process_datasets[/"Process Dataset"/]
file_test_mod1("Test mod1")
file_test_mod2("Test mod2")
file_train_mod1("Train mod1")
file_train_mod2("Train mod2")
comp_control_method[/"Control method"/]
comp_method_predict[/"Predict"/]
comp_method_train[/"Train"/]
comp_method[/"Method"/]
comp_metric[/"Metric"/]
file_prediction("Prediction")
file_pretrained_model("Pretrained model")
file_score("Score")
file_common_dataset_mod2("Raw dataset mod2")
file_common_dataset_mod1---comp_process_datasets
comp_process_datasets-->file_test_mod1
comp_process_datasets-->file_test_mod2
comp_process_datasets-->file_train_mod1
comp_process_datasets-->file_train_mod2
file_test_mod1---comp_control_method
file_test_mod1---comp_method_predict
file_test_mod1---comp_method_train
file_test_mod1---comp_method
file_test_mod2---comp_control_method
file_test_mod2---comp_metric
file_train_mod1---comp_control_method
file_train_mod1---comp_method_predict
file_train_mod1---comp_method_train
file_train_mod1---comp_method
file_train_mod2---comp_control_method
file_train_mod2---comp_method_predict
file_train_mod2---comp_method_train
file_train_mod2---comp_method
comp_control_method-->file_prediction
comp_method_predict-->file_prediction
comp_method_train-->file_pretrained_model
comp_method-->file_prediction
comp_metric-->file_score
file_prediction---comp_metric
file_pretrained_model---comp_method_predict
file_common_dataset_mod2---comp_process_datasets
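The flowchart can also be read as a dependency graph. The sketch below is only an illustration: the shortened node names are made up for this example, and only the single-component "Method" route is encoded (the Train/Predict pair is the two-step alternative). It uses the standard-library `graphlib` module to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Each key is a node from the flowchart; its value is the set of nodes it needs.
# Node names are shortened, hypothetical labels, not identifiers from the repository.
DEPENDENCIES = {
    "process_datasets": {"raw_mod1", "raw_mod2"},
    "train_mod1": {"process_datasets"},
    "train_mod2": {"process_datasets"},
    "test_mod1": {"process_datasets"},
    "test_mod2": {"process_datasets"},
    "method": {"train_mod1", "train_mod2", "test_mod1"},
    "prediction": {"method"},
    "metric": {"prediction", "test_mod2"},
    "score": {"metric"},
}


def execution_order(dependencies):
    """Return one ordering in which every node comes after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

Note that the method never receives `test_mod2`; only the metric (and the control methods) see the held-out values.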
The RNA modality of the raw dataset.
Example file:
resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod1.h5ad
Format:
AnnData object
obs: 'batch', 'size_factors'
var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
obsm: 'gene_activity'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
Data structure:
Slot | Type | Description |
---|---|---|
obs["batch"] | string | Batch information. |
obs["size_factors"] | double | (Optional) The size factors of the cells prior to normalization. |
var["feature_id"] | string | Unique identifier for the feature, usually an ENSEMBL gene id. |
var["feature_name"] | string | (Optional) A human-readable name for the feature, usually a gene symbol. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A score for the feature indicating how highly variable it is. |
obsm["gene_activity"] | double | (Optional) ATAC gene activity. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["normalization_id"] | string | The unique identifier of the normalization method used. |
uns["gene_activity_var_names"] | string | (Optional) Names of the gene activity matrix. |
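As a rough illustration of how the table above can be used, the following sketch checks a file's slots against the entries not marked "(Optional)". It works on a plain dictionary of slot names (as one might collect from an AnnData object) rather than on the file itself; the function name and the dictionary layout are assumptions made for this example.

```python
# Required slots, transcribed from the table above (entries without "(Optional)").
REQUIRED = {
    "obs": {"batch"},
    "var": {"feature_id", "hvg", "hvg_score"},
    "layers": {"counts", "normalized"},
    "uns": {"dataset_id", "dataset_name", "dataset_summary",
            "dataset_description", "normalization_id"},
}


def missing_slots(present, required=REQUIRED):
    """Return sorted 'group["key"]' strings for required slots that are absent.

    `present` maps a group name ("obs", "var", ...) to the keys found there.
    """
    missing = []
    for group, keys in required.items():
        found = set(present.get(group, ()))
        missing.extend(f'{group}["{key}"]' for key in keys - found)
    return sorted(missing)


# Hypothetical slot listing that lacks var["hvg_score"].
example = {
    "obs": ["batch", "size_factors"],
    "var": ["feature_id", "feature_name", "hvg"],
    "layers": ["counts", "normalized"],
    "uns": ["dataset_id", "dataset_name", "dataset_summary",
            "dataset_description", "normalization_id"],
}
print(missing_slots(example))  # ['var["hvg_score"]']
```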
A predict modality dataset processor.
Arguments:
Name | Type | Description |
---|---|---|
--input_mod1 | file | The RNA modality of the raw dataset. |
--input_mod2 | file | The second modality of the raw dataset. Must be an ADT or an ATAC dataset. |
--output_train_mod1 | file | (Output) The mod1 expression values of the train cells. |
--output_train_mod2 | file | (Output) The mod2 expression values of the train cells. |
--output_test_mod1 | file | (Output) The mod1 expression values of the test cells. |
--output_test_mod2 | file | (Output) The mod2 expression values of the test cells. |
--seed | integer | (Optional) NA. Default: 1. |
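The source does not say how cells are assigned to the train and test sets, nor what `--seed` controls. Purely as an assumption, the sketch below shows how a fixed seed could make a random per-cell split reproducible; `split_cells` and `test_fraction` are hypothetical names, and the actual component may split differently (for example by batch).

```python
import random


def split_cells(cell_ids, test_fraction=0.2, seed=1):
    """Shuffle cell identifiers with a fixed seed and cut them into train and test sets."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


cells = [f"cell_{i}" for i in range(10)]
train_cells, test_cells = split_cells(cells, seed=1)
print(len(train_cells), len(test_cells))
```

Calling `split_cells` again with the same seed returns the same partition, which is the property a seed argument is usually there to provide.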
The mod1 expression values of the test cells.
Example file:
resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod1.h5ad
Format:
AnnData object
obs: 'batch', 'size_factors'
var: 'gene_ids', 'hvg', 'hvg_score'
obsm: 'gene_activity'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
Data structure:
Slot | Type | Description |
---|---|---|
obs["batch"] | string | Batch information. |
obs["size_factors"] | double | (Optional) The size factors of the cells prior to normalization. |
var["gene_ids"] | string | (Optional) The gene identifiers (if available). |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A score for the feature indicating how highly variable it is. |
obsm["gene_activity"] | double | (Optional) ATAC gene activity. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["common_dataset_id"] | string | (Optional) A common identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["normalization_id"] | string | The unique identifier of the normalization method used. |
uns["gene_activity_var_names"] | string | (Optional) Names of the gene activity matrix. |
The mod2 expression values of the test cells.
Example file:
resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod2.h5ad
Format:
AnnData object
obs: 'batch', 'size_factors'
var: 'gene_ids', 'hvg', 'hvg_score'
obsm: 'gene_activity'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'gene_activity_var_names'
Data structure:
Slot | Type | Description |
---|---|---|
obs["batch"] | string | Batch information. |
obs["size_factors"] | double | (Optional) The size factors of the cells prior to normalization. |
var["gene_ids"] | string | (Optional) The gene identifiers (if available). |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A score for the feature indicating how highly variable it is. |
obsm["gene_activity"] | double | (Optional) ATAC gene activity. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["common_dataset_id"] | string | (Optional) A common identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["gene_activity_var_names"] | string | (Optional) Names of the gene activity matrix. |
The mod1 expression values of the train cells.
Example file:
resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod1.h5ad
Format:
AnnData object
obs: 'batch', 'size_factors'
var: 'gene_ids', 'hvg', 'hvg_score'
obsm: 'gene_activity'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
Data structure:
Slot | Type | Description |
---|---|---|
obs["batch"] | string | Batch information. |
obs["size_factors"] | double | (Optional) The size factors of the cells prior to normalization. |
var["gene_ids"] | string | (Optional) The gene identifiers (if available). |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A score for the feature indicating how highly variable it is. |
obsm["gene_activity"] | double | (Optional) ATAC gene activity. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["common_dataset_id"] | string | (Optional) A common identifier for the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["normalization_id"] | string | The unique identifier of the normalization method used. |
uns["gene_activity_var_names"] | string | (Optional) Names of the gene activity matrix. |
The mod2 expression values of the train cells.
Example file:
resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod2.h5ad
Format:
AnnData object
obs: 'batch', 'size_factors'
var: 'gene_ids', 'hvg', 'hvg_score'
obsm: 'gene_activity'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
Data structure:
Slot | Type | Description |
---|---|---|
obs["batch"] | string | Batch information. |
obs["size_factors"] | double | (Optional) The size factors of the cells prior to normalization. |
var["gene_ids"] | string | (Optional) The gene identifiers (if available). |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A score for the feature indicating how highly variable it is. |
obsm["gene_activity"] | double | (Optional) ATAC gene activity. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["common_dataset_id"] | string | (Optional) A common identifier for the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["normalization_id"] | string | The unique identifier of the normalization method used. |
uns["gene_activity_var_names"] | string | (Optional) Names of the gene activity matrix. |
Quality control methods for verifying the pipeline.
Arguments:
Name | Type | Description |
---|---|---|
--input_train_mod1 | file | The mod1 expression values of the train cells. |
--input_train_mod2 | file | The mod2 expression values of the train cells. |
--input_test_mod1 | file | The mod1 expression values of the test cells. |
--input_test_mod2 | file | The mod2 expression values of the test cells. |
--output | file | (Output) A prediction of the mod2 expression values of the test cells. |
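A control method gives a trivial reference point against which real methods can be compared. The sketch below is a hypothetical baseline, not code from the repository: it predicts every test cell as the per-feature mean of the training mod2 values, using plain lists of lists in place of AnnData matrices.

```python
def mean_per_feature(train_mod2, n_test_cells):
    """Predict every test cell as the per-feature mean of the training mod2 matrix."""
    n_train = len(train_mod2)
    n_features = len(train_mod2[0])
    means = [sum(row[j] for row in train_mod2) / n_train for j in range(n_features)]
    # One independent copy of the mean vector per test cell.
    return [list(means) for _ in range(n_test_cells)]


# Two training cells, two mod2 features (toy values).
train_mod2 = [[1.0, 0.0], [3.0, 2.0]]
prediction = mean_per_feature(train_mod2, n_test_cells=3)
print(prediction)
```

A method that scores no better than such a baseline has learned little from mod1.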
Make predictions using a trained model.
Arguments:
Name | Type | Description |
---|---|---|
--input_train_mod1 | file | (Optional) The mod1 expression values of the train cells. |
--input_train_mod2 | file | (Optional) The mod2 expression values of the train cells. |
--input_test_mod1 | file | The mod1 expression values of the test cells. |
--input_model | file | A pretrained model for predicting the expression of one modality from another. |
--output | file | (Output) A prediction of the mod2 expression values of the test cells. |
Train a model to predict the expression of one modality from another.
Arguments:
Name | Type | Description |
---|---|---|
--input_train_mod1 | file | The mod1 expression values of the train cells. |
--input_train_mod2 | file | The mod2 expression values of the train cells. |
--input_test_mod1 | file | (Optional) The mod1 expression values of the test cells. |
--output | file | (Output) A pretrained model for predicting the expression of one modality from another. |
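The Predict and Train components described above split one method into two steps joined by a model file. The following sketch shows that handoff under strong simplifications: the "model" is a toy one (one least-squares slope through the origin per mod2 feature, using each cell's total mod1 signal as the only predictor), it is stored as JSON, and the function names and file layout are assumptions for illustration only.

```python
import json
import os
import tempfile


def train(train_mod1, train_mod2, model_path):
    """Fit one slope per mod2 feature and write the slopes to `model_path`."""
    totals = [sum(row) for row in train_mod1]
    denom = sum(t * t for t in totals)  # assumes at least one non-zero training cell
    n_features = len(train_mod2[0])
    slopes = [
        sum(t * row[j] for t, row in zip(totals, train_mod2)) / denom
        for j in range(n_features)
    ]
    with open(model_path, "w") as handle:
        json.dump({"slopes": slopes}, handle)


def predict(test_mod1, model_path):
    """Load the slopes and predict mod2 values for each test cell."""
    with open(model_path) as handle:
        slopes = json.load(handle)["slopes"]
    return [[sum(row) * slope for slope in slopes] for row in test_mod1]


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.json")
    train([[1.0, 1.0], [2.0, 2.0]], [[4.0], [8.0]], model_path)
    prediction = predict([[3.0, 0.0]], model_path)
print(prediction)  # [[6.0]]
```

Keeping the two steps separate means the prediction step can be rerun on new test cells without refitting.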
A regression method.
Arguments:
Name | Type | Description |
---|---|---|
--input_train_mod1 | file | The mod1 expression values of the train cells. |
--input_train_mod2 | file | The mod2 expression values of the train cells. |
--input_test_mod1 | file | The mod1 expression values of the test cells. |
--output | file | (Output) A prediction of the mod2 expression values of the test cells. |
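To make the interface concrete, here is a minimal sketch of one possible regression method: k-nearest-neighbour regression, which predicts a test cell's mod2 profile as the average over the training cells closest to it in mod1 space. It is an illustrative choice, not necessarily a method shipped with the task, and it uses plain lists of lists instead of AnnData objects.

```python
import math


def knn_predict(train_mod1, train_mod2, test_mod1, k=2):
    """Average the mod2 profiles of the k training cells nearest to each test cell."""
    n_features = len(train_mod2[0])
    predictions = []
    for cell in test_mod1:
        # Rank training cells by Euclidean distance in mod1 space.
        ranked = sorted(
            range(len(train_mod1)),
            key=lambda i: math.dist(cell, train_mod1[i]),
        )
        nearest = ranked[:k]
        predictions.append(
            [sum(train_mod2[i][j] for i in nearest) / len(nearest)
             for j in range(n_features)]
        )
    return predictions


train_mod1 = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
train_mod2 = [[1.0], [3.0], [100.0]]
prediction = knn_predict(train_mod1, train_mod2, [[0.5, 0.0]], k=2)
print(prediction)  # [[2.0]]
```

The three inputs and single output mirror the four arguments in the table above.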
A predict modality metric.
Arguments:
Name | Type | Description |
---|---|---|
--input_prediction | file | A prediction of the mod2 expression values of the test cells. |
--input_test_mod2 | file | The mod2 expression values of the test cells. |
--output | file | (Output) Metric score file. |
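A metric compares the prediction with the held-out mod2 values. The source does not list which metrics the task uses, so the sketch below picks two common error measures as examples (root mean squared error and mean absolute error) and computes them over all cell-by-feature entries of two equally shaped matrices.

```python
import math


def _paired_values(prediction, truth):
    """Yield (predicted, true) pairs, checking that both matrices have the same shape."""
    if len(prediction) != len(truth):
        raise ValueError("prediction and truth have different numbers of cells")
    for p_row, t_row in zip(prediction, truth):
        if len(p_row) != len(t_row):
            raise ValueError("prediction and truth have different numbers of features")
        yield from zip(p_row, t_row)


def rmse(prediction, truth):
    """Root mean squared error over all entries."""
    squared = [(p - t) ** 2 for p, t in _paired_values(prediction, truth)]
    return math.sqrt(sum(squared) / len(squared))


def mae(prediction, truth):
    """Mean absolute error over all entries."""
    absolute = [abs(p - t) for p, t in _paired_values(prediction, truth)]
    return sum(absolute) / len(absolute)


prediction = [[1.0, 2.0], [3.0, 4.0]]
truth = [[1.0, 4.0], [3.0, 2.0]]
print(rmse(prediction, truth), mae(prediction, truth))
```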
A prediction of the mod2 expression values of the test cells.
Example file:
resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/prediction.h5ad
Format:
AnnData object
layers: 'normalized'
uns: 'dataset_id', 'method_id'
Data structure:
Slot | Type | Description |
---|---|---|
layers["normalized"] | double | Predicted normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["method_id"] | string | A unique identifier for the method. |
A pretrained model for predicting the expression of one modality from another.
Metric score file.
Example file:
resources_test/task_predict_modality/openproblems_neurips2021/bmmc_cite/swap/score.h5ad
Format:
AnnData object
uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'
Data structure:
Slot | Type | Description |
---|---|---|
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["method_id"] | string | A unique identifier for the method. |
uns["metric_ids"] | string | One or more unique metric identifiers. |
uns["metric_values"] | double | The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’. |
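The length constraint between `metric_ids` and `metric_values` is easy to enforce when the score entries are assembled. The sketch below builds the four `uns` entries as a plain dictionary and checks the constraint; the helper names and the example identifiers are hypothetical.

```python
def build_score(dataset_id, method_id, metrics):
    """Turn a {metric_id: value} mapping into the four `uns` entries of a score file."""
    metric_ids = list(metrics)
    return {
        "dataset_id": dataset_id,
        "method_id": method_id,
        "metric_ids": metric_ids,
        "metric_values": [float(metrics[m]) for m in metric_ids],
    }


def check_score(uns):
    """Raise ValueError if the ids and values cannot be paired one to one."""
    if len(uns["metric_ids"]) != len(uns["metric_values"]):
        raise ValueError("metric_ids and metric_values must have the same length")
    return True


# Placeholder identifiers and values, for illustration only.
score = build_score("hypothetical_dataset", "hypothetical_method",
                    {"rmse": 0.5, "mae": 0.25})
print(score["metric_ids"], score["metric_values"])
```

Building both lists from one mapping keeps each value aligned with its identifier by construction.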
The second modality of the raw dataset. Must be an ADT or an ATAC dataset.
Example file:
resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod2.h5ad
Format:
AnnData object
obs: 'batch', 'size_factors'
var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
obsm: 'gene_activity'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
Data structure:
Slot | Type | Description |
---|---|---|
obs["batch"] | string | Batch information. |
obs["size_factors"] | double | (Optional) The size factors of the cells prior to normalization. |
var["feature_id"] | string | Unique identifier for the feature, usually an ENSEMBL gene id. |
var["feature_name"] | string | (Optional) A human-readable name for the feature, usually a gene symbol. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A score for the feature indicating how highly variable it is. |
obsm["gene_activity"] | double | (Optional) ATAC gene activity. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["normalization_id"] | string | The unique identifier of the normalization method used. |
uns["gene_activity_var_names"] | string | (Optional) Names of the gene activity matrix. |