SPmarker is a machine learning based approach for identification of marker genes and classification of cells in plant tissues
Current release: 10/17/21 v0.2
To dissect the biological functions of individual cells, an essential step in the analysis of single-cell RNA sequencing (scRNA-seq) data is to classify specific cell types with marker genes. In this study, we have developed a machine learning pipeline called Single cell Predictive markers (SPmarker) to assign cell types and to identify novel cell-type marker genes in the Arabidopsis root. Our method can (1) assign cell types based on cells that were labeled using published methods, (2) project cell types identified by trajectory analysis from one dataset to other datasets, and (3) assign cell types based on internal GFP markers. Using SPmarker, we have identified hundreds of new marker genes, the majority of which had not been identified before. Compared with known marker genes, these new marker genes have more orthologous genes in the corresponding rice single-cell clusters. We have also found 172 new marker genes for Trichoblast in five non-Arabidopsis species, which expands the number of marker genes for this cell type by 35-154%. Our results represent a new approach to identifying cell-type marker genes from scRNA-seq data and pave the way for cross-species mapping of scRNA-seq data in plants.
SPmarker is developed in Python with modules and external tools.
Before running this pipeline, check that every dependency is correctly installed.
For information about installing the dependencies, please see below. The version numbers listed below are the versions this pipeline was developed with; using the newest version is recommended.
Python (v3.0 or later; Developed and tested with version 3.7.1)
pandas (python package; Developed and tested with version 1.1.1)
sklearn (python package; Developed and tested with version 0.24.2)
shap (python package; Developed and tested with version 0.39.0)
keras (python package; Developed and tested with version 2.4.3)
conda create -n py37 python=3.7
conda activate py37
conda install pandas
conda install scikit-learn
conda install -c conda-forge shap
conda install keras
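The dependency check mentioned above can be done by hand, or with a small script such as the following. This is only a sketch and is not part of SPmarker; the `REQUIRED_MODULES` list and the `check_dependencies` helper are names made up for this example. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util
import sys

# Import names of the packages listed above (hypothetical helper list).
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED_MODULES = ["pandas", "sklearn", "shap", "keras"]


def check_dependencies(modules):
    """Return {module_name: bool}, True when the module can be located."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in modules
    }


# Report the interpreter version and the status of each package.
print("Python", sys.version.split()[0])
for name, found in check_dependencies(REQUIRED_MODULES).items():
    print(f"{name}: {'OK' if found else 'MISSING'}")
```

Any package reported as MISSING can be installed with the conda commands listed above.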
- Download SPmarker
- Use the scripts wrapped in SPmarker to run the pipeline
1. (mandatory) Gene expression matrix file (.csv)
The matrix row names are features/genes and column names are cell names
| | cell1 | cell2 | cell3 |
|---|---|---|---|
| gene1 | 2 | 1 | 2 |
| gene2 | 3 | 5 | 3 |
| gene3 | 3 | 5 | 0 |
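To make the expected layout concrete, here is a minimal sketch that parses such a matrix and checks its shape before it is passed to SPmarker. It assumes a comma-separated file whose first header field (above the gene names) is empty; the `read_expression_matrix` helper is hypothetical and not part of SPmarker.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a gene-by-cell CSV into (cell_names, {gene: [values...]})."""
    reader = csv.reader(handle)
    header = next(reader)
    cells = header[1:]  # first header field sits above the gene names
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(cells):
            raise ValueError(
                f"{gene}: expected {len(cells)} values, got {len(values)}"
            )
        if gene in matrix:
            raise ValueError(f"duplicate gene name: {gene}")
        matrix[gene] = [float(v) for v in values]
    return cells, matrix


# The example table above, written as CSV text.
example = io.StringIO(
    ",cell1,cell2,cell3\n"
    "gene1,2,1,2\n"
    "gene2,3,5,3\n"
    "gene3,3,5,0\n"
)
cells, matrix = read_expression_matrix(example)
print(cells)
print(matrix["gene3"])
```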
2. (mandatory) Provide cell meta file OR marker gene list file OR both
Option a. cell meta file (.csv)
Note: This file contains three columns: the cell name, the cell type, and prob. prob is the probability that the cell is assigned to the cell type, as obtained from the ICI method. If the cell types do not come from the ICI method or another method that reports a probability, use 1 for prob.
| | cell_type | prob |
|---|---|---|
| cell1 | celltype1 | 1 |
| cell2 | celltype1 | 0.8 |
| cell3 | celltype2 | 0.6 |
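A meta file in this layout can be produced with a few lines of Python. The sketch below assumes a comma-separated file with an empty first header field, and fills in 1 for prob whenever no probability is available, as described in the note. The `write_meta` helper is hypothetical and not part of SPmarker.

```python
import csv
import io


def write_meta(handle, cell_types, probs=None):
    """Write one (cell, cell_type, prob) row per cell; prob defaults to 1."""
    probs = probs or {}
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["", "cell_type", "prob"])
    for cell, cell_type in cell_types.items():
        writer.writerow([cell, cell_type, probs.get(cell, 1)])


# Reproduce the example table: cell1 has no probability, so it gets 1.
buffer = io.StringIO()
write_meta(
    buffer,
    {"cell1": "celltype1", "cell2": "celltype1", "cell3": "celltype2"},
    probs={"cell2": 0.8, "cell3": 0.6},
)
meta_text = buffer.getvalue()
print(meta_text)
```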
Option b. marker gene list file (.txt)
Note: SPmarker will use a correlation-based method to predict cell identities, which are then used to generate a meta file.
gene | cell_type |
---|---|
gene1 | celltype1 |
gene1 | celltype1 |
gene2 | celltype2 |
gene3 | celltype3 |
Option c. cell meta file (.csv) and marker gene list file (.txt)
Note: If users provide both files, the final output will contain only novel markers, that is, markers not present in the marker gene list file.
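The exact correlation-based method used for option b is not described here. As an illustration of the general idea only, the sketch below correlates each cell's expression vector with a 0/1 indicator vector of each cell type's marker genes and assigns the best-scoring cell type. This is an assumption made for the example and should not be read as SPmarker's actual implementation; the `pearson` and `assign_cell_types` helpers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def assign_cell_types(genes, expression, markers):
    """expression: {cell: [value per gene]}; markers: [(gene, cell_type)]."""
    marker_set = set(markers)
    cell_types = sorted({cell_type for _, cell_type in markers})
    # One 0/1 indicator vector per cell type, in the same gene order.
    indicators = {
        ct: [1.0 if (g, ct) in marker_set else 0.0 for g in genes]
        for ct in cell_types
    }
    # Ties are resolved in favour of the alphabetically first cell type.
    return {
        cell: max(cell_types, key=lambda ct: pearson(values, indicators[ct]))
        for cell, values in expression.items()
    }


genes = ["gene1", "gene2", "gene3"]
markers = [("gene1", "celltype1"), ("gene2", "celltype2"), ("gene3", "celltype3")]
expression = {"cellA": [9.0, 1.0, 0.0], "cellB": [0.0, 8.0, 1.0]}
assigned = assign_cell_types(genes, expression, markers)
print(assigned)
```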
3. (optional) unknown cell matrix (.csv)
UN means unknown
| | UNcell1 | UNcell2 | UNcell3 |
|---|---|---|---|
| gene1 | 2 | 1 | 2 |
| gene2 | 3 | 5 | 3 |
| gene3 | 3 | 5 | 0 |
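According to the description of '-ukn_mtx' in the usage section, SPmarker reorders the genes of the unknown matrix to match the training matrix and assigns 0 to genes that are missing from the unknown matrix. The following sketch shows that alignment on plain dictionaries. The `align_to_training` helper is hypothetical, and dropping genes that appear only in the unknown matrix is an assumption of this example.

```python
def align_to_training(training_genes, unknown_matrix, n_cells):
    """Return rows in training order; genes absent from the unknown
    matrix are filled with zeros for every cell."""
    return {
        gene: list(unknown_matrix.get(gene, [0.0] * n_cells))
        for gene in training_genes
    }


training_genes = ["gene1", "gene2", "gene3"]
# Unknown matrix: different gene order, gene2 missing, one extra gene.
unknown = {
    "gene3": [3.0, 5.0, 0.0],
    "gene1": [2.0, 1.0, 2.0],
    "gene9": [7.0, 7.0, 7.0],
}
aligned = align_to_training(training_genes, unknown, n_cells=3)
print(list(aligned))
print(aligned["gene2"])
```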
1. Gene expression matrix file
(ipt_test1_gene_cell_mtx.csv)
2. cell meta file
(ipt_test1_meta.csv)
mkdir output_dir working_dir
python SPmarker.py \
-d working_dir/ -o output_dir/ \
-mtx ipt_test1_gene_cell_mtx.csv \
-meta ipt_test1_meta.csv
1. Gene expression matrix file
(ipt_test1_gene_cell_mtx.csv)
2. marker gene list file
(ipt_test1_marker_list.txt)
mkdir output_dir working_dir
python SPmarker.py \
-d working_dir/ -o output_dir/ \
-mtx ipt_test1_gene_cell_mtx.csv \
-mlist ipt_test1_marker_list.txt
1. Gene expression matrix file
(ipt_test1_gene_cell_mtx.csv)
2. cell meta file
(ipt_test1_meta.csv)
3. selected_features.csv
(ipt_test1_selected_features.csv)
mkdir output_dir working_dir
python SPmarker.py \
-d working_dir/ -o output_dir/ \
-mtx ipt_test1_gene_cell_mtx.csv \
-meta ipt_test1_meta.csv \
-feat_fl ipt_test1_selected_features.csv
1. Gene expression matrix file
(ipt_test1_gene_cell_mtx.csv)
2. cell meta file
(ipt_test1_meta.csv)
3. unknown cell matrix
(ipt_test1_unknown_exp.csv)
mkdir output_dir working_dir
python SPmarker.py \
-d working_dir/ -o output_dir/ \
-mtx ipt_test1_gene_cell_mtx.csv \
-meta ipt_test1_meta.csv \
-ukn_mtx ipt_test1_unknown_exp.csv
1. Gene expression matrix file
(ipt_test2_gene_cell_mtx.csv)
2. GFP marker gene name
(e.g., GFP_marker; the name 'GFP_marker' should be one of the feature names in ipt_test2_gene_cell_mtx.csv)
mkdir output_dir working_dir
python SPmarker.py \
-d working_dir/ -o output_dir/ \
-mtx ipt_test2_gene_cell_mtx.csv \
-m GFP_marker
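When SPmarker is called from another script, the commands above can be assembled programmatically. The sketch below only uses flags that appear in this README and enforces the rule that '-m' and '-meta' cannot be combined. The `build_command` helper and its parameter names are hypothetical, and the check that at least one of '-meta', '-mlist', or '-m' is given is an assumption based on the input description.

```python
import shlex


def build_command(mtx, working_dir="./", output_dir="./", meta=None,
                  marker_list=None, gfp_marker=None, unknown_mtx=None,
                  feature_file=None):
    """Return the SPmarker.py call as an argument list for subprocess.run."""
    if gfp_marker and meta:
        raise ValueError("'-m' and '-meta' cannot be provided together")
    if not (gfp_marker or meta or marker_list):
        raise ValueError("provide at least one of -meta, -mlist, or -m")
    cmd = ["python", "SPmarker.py",
           "-d", working_dir, "-o", output_dir, "-mtx", mtx]
    optional = [("-meta", meta), ("-mlist", marker_list),
                ("-m", gfp_marker), ("-ukn_mtx", unknown_mtx),
                ("-feat_fl", feature_file)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = build_command(
    "ipt_test1_gene_cell_mtx.csv",
    working_dir="working_dir/",
    output_dir="output_dir/",
    meta="ipt_test1_meta.csv",
    unknown_mtx="ipt_test1_unknown_exp.csv",
)
# Print a shell-ready version of the command.
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`.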
1. opt_SHAP_markers_dir and opt_SVM_markers_dir
a. opt_top_20_novel_known_marker.txt
b. opt_top_20_novel_marker.txt
c. opt_top_20_summary_marker_composition.txt
d. opt_all_novel_known_marker.txt
e. opt_all_novel_marker.txt
f. opt_all_summary_marker_composition.txt
Note:
'top_20' means that the top 20 markers, ranked by feature importance value, were selected (the number is set by the user).
'novel' means the predicted markers are novel markers. If users do not provide a known marker file using '-kmar_fl', all the markers will be labeled as novel.
'novel_known' means the output contains both the novel and the known markers.
'marker_composition' reports how many of the 20 markers are novel and how many are known.
'all' means the results report all markers instead of only the top markers.
2. opt_prediction_dir
a. opt_prediction_RFindetest.txt
b. opt_prediction_SVMindetest.txt
Note:
'RF' means the cell types of the unknown cells are predicted with a Random Forest model.
'SVM' means the cell types of the unknown cells are predicted with a Support Vector Machine model.
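The relationship between the 'top_20', 'novel', 'novel_known', and 'marker_composition' outputs can be sketched as follows. This is an illustration under assumptions, not SPmarker's code: it assumes that the novel markers are the subset of the top-ranked markers that are absent from the known marker list, and the `top_markers` helper is hypothetical.

```python
def top_markers(importances, known=(), n=20):
    """importances: {gene: feature importance} for one cell type.

    Returns (top_novel_known, top_novel, composition)."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    top = ranked[:n]  # with fewer than n features, all of them are kept
    known = set(known)
    novel = [gene for gene in top if gene not in known]
    composition = {"novel": len(novel), "known": len(top) - len(novel)}
    return top, novel, composition


importances = {"gene1": 0.5, "gene2": 0.3, "gene3": 0.1, "gene4": 0.05}
top, novel, composition = top_markers(importances, known={"gene2"}, n=3)
print(top)
print(novel)
print(composition)
```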
usage:
**SPmarker**
SPmarker.py [-h] required: [-d working_dir][-o output_dir]
[-mtx expression_matrix_file]
[-ukn_mtx unknown_expression_matrix_file]
([-m marker_name]|[-meta meta_file])
optional: [-bns no][-bns_ratio 1:1][-cv_num 5]
[-indep_ratio 0.1][-eval_score MCC]
[-mar_num 20][-kmar_fl known_marker_file]
[-SVM yes][-feat_fl feature_file]
arguments:
-h, --help Show this help message and exit.
-d Working directory to store intermediate files of each step.
Default: ./ .
-o Output directory to store the output files.
Default: ./ .
-mtx Training expression matrix file.
Rowname is gene, and column name is cell barcode.
Please make sure the gene names do not contain spaces. Otherwise, the spaces in a gene name will be replaced with "_".
-ukn_mtx An expression matrix file with cells that need to be annotated.
Please keep same format as 'Expression matrix file'.
If the genes are in a different order than in the training matrix, SPmarker will automatically reorder the features of the unknown matrix to match the training matrix.
If some genes in the training matrix are missing from the unknown matrix, this tool will assign '0' across all cells for these genes.
-m Provide a marker name, such as the name of an internal GFP marker.
This marker will assign the GFP-related cell identity to every cell where reads can map to this GFP marker.
If users provide '-m', '-meta' cannot be provided.
-bns Balance the matrix of cell identities. This option works only when '-m' is provided.
-bns_ratio Provide a ratio of cell numbers between the different identities.
For example, if users set the ratio to 1:1, they must provide '-m' to assign a GFP-related cell identity to the cells (e.g., 500) where reads can map to this GFP marker.
If 2000 cells are left unassigned, SPmarker will sample 500 cells from these 2000 cells to make the ratio 1:1.
If users set the ratio to 1:2, SPmarker will sample 1000 cells.
Default: 1:1.
-meta Provide a meta file that contains the known cell identity for all cells in the training matrix.
If '-meta' is provided, '-m' should not be provided.
-cv_num Number of folds for cross-validation.
Default: 5.
-indep_ratio Provide the ratio of cells in the independent dataset to all cells.
For example, if users provide 1000 cells in '-mtx', SPmarker will sample 100 cells as the independent dataset (default).
Default: 0.1.
-eval_score Provide a type of evaluation score to decide the best model that will be used for marker identification.
Default: MCC.
-mar_num Provide the number of top candidate markers to extract as output markers for each cell type.
Default: 20.
If a cell type has fewer than 20 features, all of its features are extracted.
-kmar_fl Provide the known marker gene list file.
Once users provide this file, they will obtain a file that only contains novel marker genes.
-SVM Whether to generate the SVM markers.
Default: -SVM yes.
-feat_fl Provide the features that will be kept in the expression file that is used for the training.
If users do not provide the argument, we will use all the features from the training matrix (-mtx).
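The arithmetic behind '-m', '-bns_ratio', and '-indep_ratio' can be illustrated with a short sketch. It is not SPmarker's implementation: it assumes that a cell counts as GFP-related when it has at least one read on the marker, that the ratio is read as GFP-related cells to unassigned cells, and that sampling is uniform at random. All three helpers are hypothetical.

```python
import random


def label_by_marker(cells, marker_counts):
    """Split cells into those with reads on the marker and the rest."""
    positive = [c for c, n in zip(cells, marker_counts) if n > 0]
    negative = [c for c, n in zip(cells, marker_counts) if n <= 0]
    return positive, negative


def balance(positive, negative, ratio="1:1", seed=0):
    """Sample unassigned cells so that positive:negative matches ratio."""
    pos_part, neg_part = (int(part) for part in ratio.split(":"))
    target = len(positive) * neg_part // pos_part
    rng = random.Random(seed)
    return rng.sample(negative, min(target, len(negative)))


def split_independent(cells, indep_ratio=0.1, seed=0):
    """Hold out a fraction of the cells as an independent dataset."""
    rng = random.Random(seed)
    n_indep = round(len(cells) * indep_ratio)
    held_out = set(rng.sample(cells, n_indep))
    training = [c for c in cells if c not in held_out]
    return training, sorted(held_out)


# 500 cells with reads on the GFP marker, 2000 cells without.
cells = [f"cell{i}" for i in range(2500)]
counts = [3] * 500 + [0] * 2000
positive, negative = label_by_marker(cells, counts)
sampled_1_1 = balance(positive, negative, "1:1")
sampled_1_2 = balance(positive, negative, "1:2")
print(len(sampled_1_1), len(sampled_1_2))

# With 1000 cells and the default ratio of 0.1, 100 cells are held out.
training, independent = split_independent(cells[:1000], indep_ratio=0.1)
print(len(training), len(independent))
```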