Merge branch 'main' into sarah_check_logs

DendrouLab · Apr 27, 2024 · 4f140b0
2 parents 87ead52 + 7cdafa6
commit 4f140b0

Showing 22 changed files with 1,304 additions and 121 deletions.
diff --git a/docs/img/anndata_schema.svg b/docs/img/anndata_schema.svg
diff --git a/docs/img/mudata_paper.svg b/docs/img/mudata_paper.svg
diff --git a/docs/index.rst b/docs/index.rst
@@ -15,6 +15,10 @@ What is Panpipes?
 Panpipes is a collection of cgat-core/ruffus pipelines to streamline the analysis of multi-modal single cell data.
 Panpipes supports any combination of the following single-cell modalities: scRNAseq, CITEseq, scV(D)Jseq, and scATACseq
 
+.. image:: img/panpipes_cropped_gif.gif
+  :width: 650 
+  :alt: how does panpipes work
+
 Check out the :doc:`installation<install>` and :doc:`usage guidelines<usage/index>` page for further information.
 
 .. image:: img/Panpipes_Figure1_v21024_1.png

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
diff --git a/docs/usage/gene_list_format.md b/docs/usage/gene_list_format.md
@@ -4,14 +4,20 @@ Using custom genes annotations: gene list formats
 It's often practical to rely on known gene lists, for a series of tasks, like evaluating % of mitochondrial genes or
 ribosomal genes, or excluding genes from HVG selection such as those constituting the IG chains. 
 
-### Custom gene lists
+We pre-compile these lists for qc and for cell cycle.
+
+### Custom gene list
 
 We provide an example of a preformatted gene lists file in [resources/qc_genelist_1.0.csv](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/qc_genelist_1.0.csv).
 
-All Custom Gene Lists files provided to the pipeline should be in a 3 columns format, where the column headers are "mod" (modality: "rna", "prot", or "atac"), feature and group. The group column is used to distinguish different gene groups.
+All Custom Gene Lists files provided to the pipeline should be in a 3 columns format, where the column headers are "mod" (modality: "rna", "prot", or "atac"), feature and group. 
+The group column is used to label and distinguish different gene groups, in the same way as you would submit them as a single vector. The same gene can belong to multiple groups by
+duplicating the row for that gene and changing the group label. 
+Gene ids can come in a variety of formats, including upper or lower case, letters, numbers or a combination of both. 
+Users can provide any gene format in the custom gene files, depending on the reference they used to produce their count matrices. 
 
 - **mod**: the modality for the feature in use. Modalities are always specified in lowercase.
-- **feature**: feature name, i.e. a gene or a protein id. 
+- **feature**: feature name, i.e. a gene id 
 - **group**: the group the gene belongs to. Group can be upper or lowercase or a mix of both and will be interpreted as a string.
 
 | mod | feature | group   |
@@ -22,8 +28,8 @@ All Custom Gene Lists files provided to the pipeline should be in a 3 columns fo
 | rna | gene_1  | markerX |
 | ... | ...     | ...     |
 
-Users can provide any gene format in the custom gene files, depending on the reference they used to produce their count matrices. 
-Gene ids can come in a variety of formats, including upper or lower cases, letter,numbers or a combination of both. 
+
+For example, if your count matrix uses Ensembl ids, you would specify `ENSG00000163914` in the feature column and `photoreceptor` in the group column.
 
 ```
 GeneCards Symbol: RHO
@@ -32,12 +38,11 @@ NCBI Gene: 6010
 Ensembl: ENSG00000163914
 ```
 
-Therefore, the gene lists are not pre-determined within panpipes in order to maximise flexibility and users should provide their own lists.
+**The gene lists are not pre-determined within panpipes in order to maximise flexibility and users can provide their own lists.**
 
 For a typical usecase, we provide example lists on our [github page](https://github.com/DendrouLab/panpipes/tree/main/panpipes/resources) which are also used by default as specified in the [next sections](#explaining-custom-gene-lists-actions).
 
 
-
 ### Cell cycle genes
 
 The human-only cellcycle genes used in [scanpy.score_genes_cell_cycle](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes_cell_cycle.html) 
@@ -66,30 +71,31 @@ We encode three main actions to use gene lists to describe qualities of the cell
 Specify the "group" name of the genes you want to use to apply a specific action to calculate the cell QC metric
 
 If left blank, these actions will not be performed (i.e. no calculation of % of mt genes per cell will be included in the ingestion of the data)
+If the group name is not spelled correctly, the action will fail. See the section on [custom gene list actions](#explaining-custom-gene-lists-actions).
 
 ### Supplying custom gene lists to calculate QC metrics
 
-The human custom genelist file can be supplied by the user in two workflows to perform the three main actions:
+The human custom genelist file can be supplied by the user in the workflows configuration files:
 
 1. **Ingest workflow**
-
     pipeline_ingest config file: (pipeline.yml)
 
+    Supply the gene list by customizing the following parameter
+
     ```yaml
     custom_genes_file: resources/qc_genelist_1.0.csv
     ```
 
-2. **Preprocess workflow**
-
-    pipeline_preprocess config file: (pipeline.yml)
-
+2. **Preprocess workflow** pipeline_preprocess config file: (pipeline.yml)
+    
+    Supply the gene list by customizing the following parameter
     ```yaml
     exclude_file: resources/qc_genelist_1.0.csv
     ```
 
 *Note that we have formatted an example file containing all genes to use in both workflows, and therefore supply the same file to both workflows but users can have independent files for each of them.*
 
-However, if the input is from mouse data then, the custom genelist file can be supplied by the user in two workflows to perform the three main actions:
+If the input is from mouse data:
 
 1. **Ingest workflow**
 
@@ -110,6 +116,19 @@ However, if the input is from mouse data then, the custom genelist file can be s
 
 1. **Ingest workflow** (pipeline_ingest.py)
 
+In the qc section of the `ingest` workflow, we specify the main actions harnessing the gene list provided in the `custom_genes_file` param:
+
+```yaml
+    custom_genes_file: resources/qc_genelist_1.0.csv
+
+    calc_proportions: hb,mt,rp
+    score_genes: mt
+
+    ccgenes: default
+
+```
+
+
 - **calc_proportions:** calculate proportion of reads mapping to X genes over total number of reads, per cell, using [scanpy.pp.calculate_qc_metrics](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html#scanpy.pp.calculate_qc_metrics).
 
     For example, for the rna modality, including a list of mitochondrial
@@ -130,21 +149,35 @@ However, if the input is from mouse data then, the custom genelist file can be s
 
 2. **Preprocess workflow** (pipeline_preprocess.py)
 
+In the preprocess workflow, we specify what action to perform on the genes in the custom gene list using the `exclude` parameter.
+
+```yaml
+  exclude_file: resources/qc_genelist_1.0.csv
+  exclude: default
+ ``` 
+
 - **exclude:** exclude these genes from the HVG selection, if they are deemed Highly Variable.
 
-    For the exclude action, if set to `default` the workflow will look for genes whose group is set to `exclude` in the supplied qc_genelist file. Alternatively, if you are specifying your custom gene list and you want to exclude another set of genes, for example a group you call `TCR_genes`, specify this group (i.e. `exclude: TCR_genes`)
+    For the exclude action, if set to `default` the workflow will look for genes whose group is set to `exclude` in the supplied qc_genelist file. Alternatively, if you are specifying your custom gene list and you want to exclude another set of genes, for example a group you call `TCR_genes`, specify this group (i.e. `exclude: TCR_genes`). If left blank, no genes will be excluded from the HVG selection.
 
 ### Cell cycle actions
 
 As described before, we also rely on a user-supplied list of genes to calculate the cell cycle phase of a cell. We believe that this choice offers the maximum flexibility to use a trusted gene-set for the calculation of this metric.
 The cell cycle scoring happens in the `ingest` workflow using the `ccgenes` parameter. The cell cycle action is performed using [`scanpy.tl.score_genes_cell_cycle`](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes_cell_cycle.html)
 
+```yaml
+# cell cycle action
+ccgenes: default
+
+```
+
+
 **ccgenes:**  
-Setting the `ccgenes` param to `default` in the ingest workflow will calculate the phase of the cell cycle in which the cell is by using `scanpy.tl.score_genes_cell_cycle` using the file provided in panpipes/resources/cell_cicle_genes.tsv. Using this file, this action will produce at least 3 columns in the `mdata["rna"].obs` assay, namely 'S_score', 'G2M_score', 'phase'.
+Setting the `ccgenes` param to `default` in the ingest workflow will calculate the cell cycle phase of each cell with `scanpy.tl.score_genes_cell_cycle`, using the file provided in `panpipes/resources/cell_cicle_genes.tsv`. Using this file, this action will produce at least 3 columns in `mdata["rna"].obs`, namely 'S_score', 'G2M_score', 'phase'.
 
 Users can create their own list, and need to specify the path to this new file in the `ccgenes` param to score the cells with their custom list.
+If left blank, the cellcycle score won't be calculated.
 
-If left blank, the cellcycle score will not be calculated.
 
 Using Custom Gene lists to plot: the Visualization workflow
 ---------------
@@ -170,6 +203,33 @@ minimal:
 Generally in the visualization pipeline all gene groups in the input are plotted. In heatmaps and dotplots, one dotplot per group is plotted. For UMAPs, one plot per gene is
 plotted, and a new file is saved per group.
 
+
+## Plot Markers in the Visualization workflow 
+
+The custom marker csv files for full and minimal must each contain three columns and follow this structure: 
+  | mod  | feature  | group        |
+  |------|----------|--------------|
+  | prot | prot_CD8 | Tcellmarkers |
+  | rna  | CD8A     | Tcellmarkers |
+
+The full list will be plotted in dot plots and matrix plots, with one plot per group. 
+
+The shorter list will be plotted on umaps as well as other plot types, with one plot per group. 
+
+ | feature_1 | feature_2 | colour         |
+ |-----------|-----------|----------------|
+ | CD8A      | prot_CD8  |                |
+ | CD4       | CD8A      | doublet_scores |
+
+
+
+## Plot metadata variables 
+The scatter_features.csv file should have the following format:
+
+ | feature_1 | feature_2 | colour         |
+ |-----------|-----------|----------------|
+ | rna:total_counts | prot:total_counts | doublet_scores |
+
 ## Final notes
 
 Be deliberate and informative with the choice of group names for any gene set use, since the `.obs` column generated as output will be named based on the group of the gene list input file.

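The three-column format and the group-name matching described in `gene_list_format.md` can be checked before launching a workflow. The following is a minimal sketch using only the Python standard library. It is not part of panpipes: the helper names `read_gene_list` and `unknown_groups` are hypothetical, the example rows are illustrative, and exact, case-sensitive matching of group names is an assumption based on the statement that groups are interpreted as strings.

```python
import csv
import io

# Modalities named in the docs; always lowercase.
ALLOWED_MODS = {"rna", "prot", "atac"}
REQUIRED_COLUMNS = ["mod", "feature", "group"]


def read_gene_list(handle):
    """Return {group: set of (mod, feature)} from a 3-column gene list CSV."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    groups = {}
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        mod = row["mod"].strip()
        if mod not in ALLOWED_MODS:
            raise ValueError(f"line {line_no}: unknown modality {mod!r}")
        group = row["group"].strip()
        groups.setdefault(group, set()).add((mod, row["feature"].strip()))
    return groups


def unknown_groups(requested, groups):
    """List group names from a comma-separated string that are absent from the file."""
    names = [name.strip() for name in requested.split(",") if name.strip()]
    return [name for name in names if name not in groups]


# Illustrative content; a real file would be opened with open(path, newline="").
example = io.StringIO(
    "mod,feature,group\n"
    "rna,MT-CO1,mt\n"
    "rna,HBB,hb\n"
    "rna,RPL3,rp\n"
    "rna,RPL3,exclude\n"  # same gene duplicated under a second group
)
groups = read_gene_list(example)
print(sorted(groups))
print(unknown_groups("hb,mt,rp", groups))  # value of calc_proportions in the docs
print(unknown_groups("hb,Mt", groups))     # a misspelled group name is reported
```

Running a check like this against the value of `calc_proportions`, `score_genes` or `exclude` surfaces a misspelled group before the workflow fails on it.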
diff --git a/docs/usage/index.md b/docs/usage/index.md
@@ -6,6 +6,7 @@ See [installation](../install) for installation instructions
 ```{toctree}
 :maxdepth: 1
 
+panpipes_scverse
 general_principles
 setup_for_qc_mm
 gene_list_format
@@ -14,4 +15,5 @@ normalization_methods
 different_entry_points
 setup_for_spatial_workflows
 integration_methods
+troubleshooting
 ```
diff --git a/docs/usage/integration_methods.md b/docs/usage/integration_methods.md
@@ -9,14 +9,18 @@
 </style>
 # Integration methods implemented in panpipes
 
-The panpipes integration pipeline implements a variety of tools to batch correct individual modalities and/or integrate across modalities to produce a reduced dimension representation of the dataset.<br>
-There are different tools available for each modality such as RNA (also referred to as GEX), PROT (can be referred to as ADT) and ATAC which can be run as required before running `panpipes integration make merge_batch_correction`
-to create the final object with the reduced dimension represented.<br> 
+The panpipes integration workflow implements a variety of tools to batch correct individual modalities and/or integrate across modalities to produce a reduced dimension representation of the dataset.<br>
+Different tools are available for each modality: RNA (also referred to as GEX), PROT (also referred to as ADT) and ATAC. They can be combined as preferred by customising the integration workflow configuration file and running `panpipes integration make full`. After the results of the integration are inspected, the final object is created with `panpipes integration make merge_integration`.<br> 
 
 The ideal way to run `panpipes integration` is to use the output `MuData` file from `panpipes preprocess` since it will already be in the required format. 
 However, if using independent MuData the object should contain normalised data in the X slot of each modality, a ‘raw_counts’ layer in each modality, and a sample_id column in each slot of the obs and the outer obs. 
 
-The following table describes the different methods of batch correction available and their specificities: 
+Users can choose which integration method they want to apply based on their experiment, their experience with the tools or available benchmarks: we link all the relevant resources below. 
+
+
+We don't believe in "one method fits all"; instead we offer `panpipes` as a framework to run multiple tools efficiently, ensuring reproducibility of results. 
+We believe this will empower users to choose the method that best fits their biological question, keeping a record of the hyperparameters in the configuration files, so you can safely re-run your analysis and share it with collaborators. We will continue to update the integration methods offered in `panpipes` and we invite you to [contribute yours](https://panpipes-pipelines.readthedocs.io/en/latest/contribute_guidelines.html)!
+The following table describes the different methods currently supported and their specificities: 
 
 | Method    | type of integration         | modalities      | code                                                                              | references                                                                                           | benchmarks paper                                                                                           |
 |-----------|-----------------------------|-----------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
@@ -28,3 +32,5 @@ The following table describes the different methods of batch correction availabl
 | totalVI   | multimodal                  | prot, rna       | [totalVI](https://github.com/scverse/scvi-tools)                                  | [Gayoso  et al. 2021](https://pubmed.ncbi.nlm.nih.gov/33589839/)                                     | [Makrodimitris et al 2024](https://academic.oup.com/bib/article/25/1/bbad416/7450271)                      |
 | MOFA      | multimodal                  | rna, atac, prot | [MOFA](https://github.com/bioFAM/mofapy2)                                        | [Argelaguet et al 2020](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02015-1) | [Lee, Kaestner,  and Li 2023](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03073-x) |
 | WNN       | multimodal                  | rna, atac, prot | [WNN](https://muon.readthedocs.io/en/latest/api/generated/muon.pp.neighbors.html) | [Hao et al 2021](https://pubmed.ncbi.nlm.nih.gov/34062119/)                                          | [Lee, Kaestner,  and Li 2023](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03073-x) |
+
+
diff --git a/docs/usage/panpipes_scverse.md b/docs/usage/panpipes_scverse.md
@@ -0,0 +1,40 @@
+# Panpipes and the scverse
+
+
+Panpipes is an analytical pipeline implemented in `python`, which orchestrates single cell analyses in an automated fashion. Panpipes relies on the `scverse`, an ecosystem of tools for single cell 'omics analysis in python.
+Check out the [scverse](https://scverse.org/) documentation for the packages included and how to contribute your package so we can support it in `panpipes` too! 
+
+
+
+Panpipes has at its core [`AnnData`](https://anndata.readthedocs.io/en/latest/) and [`MuData`](https://mudata.readthedocs.io/en/latest/), for handling annotated data matrices in memory and on disk, with the best of the pandas and xarray functionalities.
+
+<figure>
+    <img src="https://github.com/DendrouLab/panpipes/blob/main/docs/img/anndata_schema.svg?raw=true" alt="img1" width="40%">
+    <figcaption>`AnnData` is a container for handling annotated data matrices.</figcaption>
+</figure>
+
+
+<figure>
+    <img src="https://github.com/DendrouLab/panpipes/blob/main/docs/img/mudata_paper.svg?raw=true" alt="img2" width="40%">
+    <figcaption>`MuData` is a dictionary of `AnnData` objects.</figcaption>
+</figure>
+
+The workhorses for panpipes are `scanpy`, `muon` and `squidpy`, frameworks for analyzing single-cell gene expression, multimodal data and spatial transcriptomics.
+
+For deep-learning based methods for uni or multimodal integration, we leverage the functionalities of `scvi-tools`, a library developing probabilistic models for single-cell omics data in PyTorch.
+
+To get help with `panpipes`, you can open an issue on our [Github page](). 
+Please use the [scverse discourse](https://discourse.scverse.org/) to document issues with `scverse` packages and get the help of other scverse users! 
+
+
+Please use these links to familiarize yourself with these data structures and frameworks:
+
+- [AnnData](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html)
+- [MuData Quickstart](https://muon.readthedocs.io/en/latest/notebooks/quickstart_mudata.html)
+- [Single cell analysis with scanpy](https://scanpy.readthedocs.io/en/latest/) 
+- [Multimodal analyses with muon](https://muon-tutorials.readthedocs.io/en/latest/)
+- [Spatial analyses with squidpy](https://squidpy.readthedocs.io/en/stable/)
+- [Scvi-tools](https://scvi-tools.org/)
+- [Interoperability](https://scverse-tutorials.readthedocs.io/en/latest/notebooks/scverse_data_interoperability.html)
+- [Best practices for single cell multimodal analyses](https://www.sc-best-practices.org/preamble.html)
+