From 4f62859829e43fd553c89c3593cd87d2aecb815c Mon Sep 17 00:00:00 2001
From: giuliaelgarcia <147185635+giuliaelgarcia@users.noreply.github.com>
Date: Tue, 12 Mar 2024 17:24:42 +0000
Subject: [PATCH 1/4] mouse cellcycle gene list upload
The cell cycle gene list for mouse analysis is now available in panpipes/resources.
---
panpipes/resources/mouse_cell_cycle_genes.tsv | 98 +++++++++++++++++++
1 file changed, 98 insertions(+)
create mode 100644 panpipes/resources/mouse_cell_cycle_genes.tsv
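Note (not part of the patch): a minimal sketch of how the new file could be consumed. It assumes the file is tab-separated, with an unnamed index column followed by `gene_name` and `cc_phase`, as the hunk in this patch suggests. The helper name `split_cell_cycle_genes` is hypothetical and uses only the Python standard library. The two lists it returns are the kind of input that `scanpy.tl.score_genes_cell_cycle` accepts as `s_genes` and `g2m_genes`.

```python
import csv
import io


def split_cell_cycle_genes(tsv_text):
    """Return (s_genes, g2m_genes) parsed from tab-separated text.

    Hypothetical helper: assumes columns named gene_name and cc_phase,
    with cc_phase values "s" or "g2m".
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    s_genes, g2m_genes = [], []
    for row in reader:
        phase = row["cc_phase"].strip().lower()
        gene = row["gene_name"].strip()
        if phase == "s":
            s_genes.append(gene)
        elif phase == "g2m":
            g2m_genes.append(gene)
    return s_genes, g2m_genes


# Tiny inline sample in the assumed layout (index, gene_name, cc_phase).
sample = "\tgene_name\tcc_phase\n0\tMcm5\ts\n1\tPcna\ts\n43\tHmgb2\tg2m\n"
s_genes, g2m_genes = split_cell_cycle_genes(sample)
print(s_genes, g2m_genes)
```

For the real file, the same function could be given the contents of `panpipes/resources/mouse_cell_cycle_genes.tsv` read as text.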
diff --git a/panpipes/resources/mouse_cell_cycle_genes.tsv b/panpipes/resources/mouse_cell_cycle_genes.tsv
new file mode 100644
index 00000000..8b7c9bab
--- /dev/null
+++ b/panpipes/resources/mouse_cell_cycle_genes.tsv
@@ -0,0 +1,98 @@
+ gene_name cc_phase
+0 Mcm5 s
+1 Pcna s
+2 Tyms s
+3 Fen1 s
+4 Mcm2 s
+5 Mcm4 s
+6 Rrm1 s
+7 Ung s
+8 Gins2 s
+9 Mcm6 s
+10 Cdca7 s
+11 Dtl s
+12 Prim1 s
+13 Uhrf1 s
+14 Mlf1ip s
+15 Hells s
+16 Rfc2 s
+17 Rpa2 s
+18 Nasp s
+19 Rad51ap1 s
+20 Gmnn s
+21 Wdr76 s
+22 Slbp s
+23 Ccne2 s
+24 Ubr7 s
+25 Pold3 s
+26 Msh2 s
+27 Atad2 s
+28 Rad51 s
+29 Rrm2 s
+30 Cdc45 s
+31 Cdc6 s
+32 Exo1 s
+33 Tipin s
+34 Dscc1 s
+35 Blm s
+36 Casp8ap2 s
+37 Usp1 s
+38 Clspn s
+39 Pola1 s
+40 Chaf1b s
+41 Brip1 s
+42 E2f8 s
+43 Hmgb2 g2m
+44 Cdk1 g2m
+45 Nusap1 g2m
+46 Ube2c g2m
+47 Birc5 g2m
+48 Tpx2 g2m
+49 Top2a g2m
+50 Ndc80 g2m
+51 Cks2 g2m
+52 Nuf2 g2m
+53 Cks1b g2m
+54 Mki67 g2m
+55 Tmpo g2m
+56 Cenpf g2m
+57 Tacc3 g2m
+58 Fam64a g2m
+59 Smc4 g2m
+60 Ccnb2 g2m
+61 Ckap2l g2m
+62 Ckap2 g2m
+63 Aurkb g2m
+64 Bub1 g2m
+65 Kif11 g2m
+66 Anp32e g2m
+67 Tubb4b g2m
+68 Gtse1 g2m
+69 Kif20b g2m
+70 Hjurp g2m
+71 Cdca3 g2m
+72 Hn1 g2m
+73 Cdc20 g2m
+74 Ttk g2m
+75 Cdc25c g2m
+76 Kif2c g2m
+77 Rangap1 g2m
+78 Ncapd2 g2m
+79 Dlgap5 g2m
+80 Cdca2 g2m
+81 Cdca8 g2m
+82 Ect2 g2m
+83 Kif23 g2m
+84 Hmmr g2m
+85 Aurka g2m
+86 Psrc1 g2m
+87 Anln g2m
+88 Lbr g2m
+89 Ckap5 g2m
+90 Cenpe g2m
+91 Ctcf g2m
+92 Nek2 g2m
+93 G2e3 g2m
+94 Gas2l3 g2m
+95 Cbx5 g2m
+96 Cenpa g2m
From ef27fe5097bcd34d80600979461adc1e44e14c70 Mon Sep 17 00:00:00 2001
From: giuliaelgarcia <147185635+giuliaelgarcia@users.noreply.github.com>
Date: Tue, 12 Mar 2024 17:27:46 +0000
Subject: [PATCH 2/4] Added the mouse cellcycle genes to gene_list_format.md
Clarified that the default cell cycle gene list is human-only and added the path to the mouse cell cycle gene list for users who need it.
---
docs/usage/gene_list_format.md | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/docs/usage/gene_list_format.md b/docs/usage/gene_list_format.md
index e1df3016..9c6a02ae 100644
--- a/docs/usage/gene_list_format.md
+++ b/docs/usage/gene_list_format.md
@@ -40,9 +40,11 @@ For a typical usecase, we provide example lists on our [github page](https://git
### Cell cycle genes
-The cellcycle genes used in [scanpy.score_genes_cell_cycle](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes_cell_cycle.html)
+The human-only cellcycle genes used in [scanpy.score_genes_cell_cycle](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes_cell_cycle.html)
are stored in [resources/cell_cycle_genes.csv](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/cell_cycle_genes.tsv)
+However, if the data is mouse only then the cellcycle gene list can be found in [resources/mouse_cell_cycle_genes.tsv](https://github.com/DendrouLab/panpipes/blob/mouse_cell_cycle/panpipes/resources/mouse_cell_cycle_genes.tsv)
+
Differently from the other custom gene file, the cell cycle file should be a **tab separated file with two columns**:
- **gene_name**: the name of the gene
@@ -67,7 +69,7 @@ If left blank, these actions will not be performed (i.e. no calculation of % of
### Supplying custom gene lists to calculate QC metrics
-The custom genelist file can be supplied by the user in two workflows to perform the three main actions:
+The human custom genelist file can be supplied by the user in two workflows to perform the three main actions:
1. **Ingest workflow**
@@ -87,6 +89,22 @@ The custom genelist file can be supplied by the user in two workflows to perform
*Note that we have formatted an example file containing all genes to use in both workflows, and therefore supply the same file to both workflows but users can have independent files for each of them.*
+However, if the input is mouse data, the mouse custom genelist file can be supplied by the user in the same two workflows to perform the three main actions:
+
+1. **Ingest workflow**
+
+ pipeline_ingest config file: (pipeline.yml)
+
+ ```yaml
+ custom_genes_file: resources/qc_gene_list_mouse.csv
+ ```
+
+2. **Preprocess workflow**
+
+ pipeline_preprocess config file: (pipeline.yml)
+
+ ```yaml
+ exclude_file: resources/qc_gene_list_mouse.csv
### Explaining custom gene lists actions
1. **Ingest workflow** (pipeline_ingest.py)
From 31eb8c6e427ba5394707da186ab56494b0b29f98 Mon Sep 17 00:00:00 2001
From: giuliaelgarcia <147185635+giuliaelgarcia@users.noreply.github.com>
Date: Tue, 12 Mar 2024 17:35:41 +0000
Subject: [PATCH 3/4] Added mouse gene list information to
pipeline_ingestion_yml.md
Changed the description of the current default gene list to specify that it is human, and added instructions on what to do when the data is from mouse and where to find the mouse gene list.
---
docs/yaml_docs/pipeline_ingestion_yml.md | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
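Note (not part of the patch): a minimal sketch of how a custom genes file with the `mod`, `feature`, and `group` columns described in this document could be checked before it is passed as `custom_genes_file`. The helper name `features_by_group` and the sample rows are illustrative assumptions, not taken from the shipped gene lists, and this is not how panpipes itself reads the file.

```python
import csv
import io
from collections import defaultdict


def features_by_group(csv_text):
    """Map each group label to its features (hypothetical helper).

    Raises ValueError if any of the mod/feature/group columns is missing.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"mod", "feature", "group"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    groups = defaultdict(list)
    for row in reader:
        groups[row["group"]].append(row["feature"])
    return dict(groups)


# Illustrative rows only; the real gene list contents may differ.
example = "mod,feature,group\nRNA,mt-Nd1,mt\nRNA,mt-Co1,mt\nRNA,Rpl3,rp\n"
groups = features_by_group(example)
print(groups["mt"])
```

A group label such as `mt` is what an action like `calc_proportion: mt` refers to, so listing the features per group is a quick way to confirm that a mouse gene list contains the groups the config expects.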
diff --git a/docs/yaml_docs/pipeline_ingestion_yml.md b/docs/yaml_docs/pipeline_ingestion_yml.md
index edb7c0fc..20dfb5de 100644
--- a/docs/yaml_docs/pipeline_ingestion_yml.md
+++ b/docs/yaml_docs/pipeline_ingestion_yml.md
@@ -205,8 +205,13 @@ To calculate RNA QC metrics, we need to define a gene list providing additional
Additionally, we can specify what actions we want to apply to the genes, such as what metrics to calculate.
custom_genes_file`String`, Default: resources/qc_genelist_1.0.csv
- Path to the file containing the entire gene list. Panpipes provides such a file with standard genes, and the path to this file is set as default.
-
+ Path to the file containing the entire human gene list. Panpipes provides such a file with standard genes, and the path to this file is set as default.
+
+However, if the input is from mouse data then the user must provide the mouse gene list as shown here:
+
+ custom_genes_file`String`, Default: qc_gene_list_mouse.csv
+
+This mouse gene list can be found in the panpipes [resources](https://github.com/DendrouLab/panpipes/blob/mouse_gene_list_upload/panpipes/resources/qc_gene_list_mouse.csv)
Usually, it's convenient to rely on known gene lists, as this simplifies various downstream tasks, such as evaluating the percentage of mitochondrial genes in the data, identify ribosomal genes, or excluding IGG genes from HVG selection.
For the ingestion workflow, we retrieved the cell cycle genes used in `scanpy.score_genes_cell_cycle` [Satija et al. (2015), Nature Biotechnology](https://www.nature.com/articles/nbt.3192) and stored them in a file: panpipes/resources/cell_cicle_genes.tsv.
@@ -432,4 +437,4 @@ This can help to determine any inconsistencies in staining per channel and other
The maximum value will be set at the value of the 99.5% quantile, applied per feature.
Note that this feature is in the default muon `mu.pp.dsb` code, but manually implemented here.
-
\ No newline at end of file
+
From 4261378bdb542c66080a86dbe165e9cf28f8f855 Mon Sep 17 00:00:00 2001
From: bio-la
Date: Wed, 13 Mar 2024 09:54:49 +0100
Subject: [PATCH 4/4] small fixes
---
docs/usage/gene_list_format.md | 3 ++-
docs/yaml_docs/pipeline_ingestion_yml.md | 20 +++++++++++---------
2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/docs/usage/gene_list_format.md b/docs/usage/gene_list_format.md
index 9c6a02ae..1c5b6aaf 100644
--- a/docs/usage/gene_list_format.md
+++ b/docs/usage/gene_list_format.md
@@ -43,7 +43,7 @@ For a typical usecase, we provide example lists on our [github page](https://git
The human-only cellcycle genes used in [scanpy.score_genes_cell_cycle](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes_cell_cycle.html)
are stored in [resources/cell_cycle_genes.csv](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/cell_cycle_genes.tsv)
-However, if the data is mouse only then the cellcycle gene list can be found in [resources/mouse_cell_cycle_genes.tsv](https://github.com/DendrouLab/panpipes/blob/mouse_cell_cycle/panpipes/resources/mouse_cell_cycle_genes.tsv)
+However, if you are working with mouse data, we supply an alternative cellcycle gene list with murine genes, which can be found in [resources/mouse_cell_cycle_genes.tsv](https://github.com/DendrouLab/panpipes/blob/mouse_cell_cycle/panpipes/resources/mouse_cell_cycle_genes.tsv)
Differently from the other custom gene file, the cell cycle file should be a **tab separated file with two columns**:
@@ -105,6 +105,7 @@ However, if the input is from mouse data then, the custom genelist file can be s
```yaml
exclude_file: resources/qc_gene_list_mouse.csv
+ ```
### Explaining custom gene lists actions
1. **Ingest workflow** (pipeline_ingest.py)
diff --git a/docs/yaml_docs/pipeline_ingestion_yml.md b/docs/yaml_docs/pipeline_ingestion_yml.md
index 20dfb5de..5fe5d41e 100644
--- a/docs/yaml_docs/pipeline_ingestion_yml.md
+++ b/docs/yaml_docs/pipeline_ingestion_yml.md
@@ -201,21 +201,23 @@ In the ingestion workflow we compute cell and genes QC metrics (such as % of mit
Feel free to leave options blank to run with default parameters.
#### Providing a gene list
-To calculate RNA QC metrics, we need to define a gene list providing additional information on the genes in the data.
+To calculate RNA QC metrics based on custom gene annotations, we need to use a gene list providing additional information on the genes expressed in the data.
Additionally, we can specify what actions we want to apply to the genes, such as what metrics to calculate.
-custom_genes_file`String`, Default: resources/qc_genelist_1.0.csv
+Please visit our documentation section on [creating and using custom genes lists](../usage/gene_list_format.md) to perform quality control and visualization.
+custom_genes_file`String`, Mandatory parameter, Default: resources/qc_genelist_1.0.csv
Path to the file containing the entire human gene list. Panpipes provides such a file with standard genes, and the path to this file is set as default.
-However, if the input is from mouse data then the user must provide the mouse gene list as shown here:
+##### Working with species other than human
+*If working with a different species, the user must provide the appropriate gene list. For example, we offer a precompiled version of the QC gene list for mouse; the user can supply it by specifying the path to the file as shown here:*
- custom_genes_file`String`, Default: qc_gene_list_mouse.csv
+ `custom_genes_file: qc_gene_list_mouse.csv`
-This mouse gene list can be found in the panpipes [resources](https://github.com/DendrouLab/panpipes/blob/mouse_gene_list_upload/panpipes/resources/qc_gene_list_mouse.csv)
+*Find the mouse gene list in our [resources](https://github.com/DendrouLab/panpipes/blob/mouse_gene_list_upload/panpipes/resources/qc_gene_list_mouse.csv)*
-Usually, it's convenient to rely on known gene lists, as this simplifies various downstream tasks, such as evaluating the percentage of mitochondrial genes in the data, identify ribosomal genes, or excluding IGG genes from HVG selection.
-For the ingestion workflow, we retrieved the cell cycle genes used in `scanpy.score_genes_cell_cycle` [Satija et al. (2015), Nature Biotechnology](https://www.nature.com/articles/nbt.3192) and stored them in a file: panpipes/resources/cell_cicle_genes.tsv.
-Additionally, we also provide an example for an entire gene list: panpipes/resources/qc_genelist_1.0.csv
+
+It's convenient to rely on known gene lists, as this simplifies various downstream tasks, such as evaluating the percentage of mitochondrial genes in the data, identifying ribosomal genes, or excluding IGG genes from HVG selection.
+For the ingestion workflow, we retrieved the cell cycle genes used in `scanpy.score_genes_cell_cycle` [Satija et al. (2015), Nature Biotechnology](https://www.nature.com/articles/nbt.3192).
| mod | feature | group |
|-----|---------|--------|
@@ -228,7 +230,7 @@ Additionally, we also provide an example for an entire gene list: panpipes/resou
Next, we define "actions" on the genes as follows:
In the group column, specify what actions you want to apply to that specific gene.
-For instance: calc_proportion: mt will calculate proportion of reads mapping to the genes whose group is "mt".
+For instance: `calc_proportion: mt` will calculate the proportion of reads mapping to the genes whose group is "mt" in the custom genes file.
(for pipeline_ingest.py)
calc_proportions: calculate proportion of reads mapping to X genes over total number of reads, per cell