From 04ed1dbb46914553cb19337de0a13d0da2dbc297 Mon Sep 17 00:00:00 2001
From: Giulia Garcia <147185635+giuliaelgarcia@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:42:24 +0100
Subject: [PATCH 1/6] Update index.rst

---
 docs/yaml_docs/index.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/docs/yaml_docs/index.rst b/docs/yaml_docs/index.rst
index 3fd86bc8..dd945cdf 100644
--- a/docs/yaml_docs/index.rst
+++ b/docs/yaml_docs/index.rst
@@ -10,4 +10,5 @@ Workflows configuration files
     pipeline_clustering_yml
     spatial_qc
     spatial_preprocess
-    spatial_deconvolution
\ No newline at end of file
+    spatial_deconvolution
+    pipeline_clustering_yml.md

From 9870c0fc72954d95e451a01e4a23ba2c7bc8be80 Mon Sep 17 00:00:00 2001
From: bio-la <fabiola.curion@gmail.com>
Date: Wed, 24 Apr 2024 16:52:26 +0200
Subject: [PATCH 2/6] fixed wrong params

---
 docs/yaml_docs/pipeline_clustering_yml.md | 27 ++++++++++++++---------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/docs/yaml_docs/pipeline_clustering_yml.md b/docs/yaml_docs/pipeline_clustering_yml.md
index bc5a22dd..2f783d77 100644
--- a/docs/yaml_docs/pipeline_clustering_yml.md
+++ b/docs/yaml_docs/pipeline_clustering_yml.md
@@ -62,16 +62,21 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
   Specify the full object if your scaled_obj contains only HVG.  If your scaled_obj contains all the genes then leave full_obj blank. 
   panpipes will use the full object to do marker genes analysis (rank_gene_groups) and for plotting those genes. 
 - <span class="parameter">modalities</span><br>
-  - <span class="parameter">rna</span> `Boolean`, Default: True<br>
+ Which modalities to run clustering on. 
+  - <span class="parameter">rna</span> `Boolean`, Default: True<br> If set to `True`, the workflow will stop if it doesn't find a modality named 'rna'
   - <span class="parameter">prot</span> `Boolean`, Default: True<br>
+  If set to `True`, the workflow will stop if it doesn't find a modality named 'prot'
   - <span class="parameter">atac</span> `Boolean`, Default: False<br>
+   If set to `True`, the workflow will stop if it doesn't find a modality named 'atac'
+  
   - <span class="parameter">spatial</span> `Boolean`, Default: False<br>
-  Run clustering on each individual modality.
+  If set to `True`, the workflow will stop if it doesn't find a modality named 'spatial'
+  
 
 - <span class="parameter">multimodal</span><br>
-  - <span class="parameter">rna_clustering</span> `Boolean`, Default: True<br>
-  - <span class="parameter">integration_method</span> `String`, Default: WNN<br>
-  Options here include WNN, mofa, and totalVI, and it tells us where to look for.
+  - <span class="parameter">rna_clustering</span> `Boolean`, Default: False<br> If set to True, runs clustering on multimodal embedding
+  - <span class="parameter">integration_method</span> `String`, Default: None<br>
+  Specify the name of the multimodal embedding. Options here include WNN, mofa, totalvi, multivi. In case you have run WNN, the neigbhours calculation will be skipped since WNN provides its own.
 
 ## Parameters for finding neighbours 
 
@@ -79,7 +84,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
  Sets the number of neighbors to use when calculating the graph for clustering and umap.
   - <span class="parameter">rna:</span> 
 
-     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
+     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
      - <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
        Defines which representation in .obsm to use for nearest neighbors
      - <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -94,7 +99,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
      
   - <span class="parameter">prot:</span> 
 
-     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
+     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
      - <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
        Defines which representation in .obsm to use for nearest neighbors
      - <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -109,7 +114,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
 
   - <span class="parameter">atac:</span> 
 
-     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
+     - <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
      - <span class="parameter">dim_red </span> `String`, Default: X_lsi<br>
        Defines which representation in .obsm to use for nearest neighbors
      - <span class="parameter">n_dim_red</span> `Integer`, Default: 1<br>
@@ -125,7 +130,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
 
   - <span class="parameter">spatial:</span> 
 
-     - <span class="parameter">use_existing </span> `Boolean`, Default: False<br>
+     - <span class="parameter">use_existing </span> `Boolean`, Default: False<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
      - <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
        Defines which representation in .obsm to use for nearest neighbors
      - <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -142,7 +147,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
 
   - <span class="parameter">umap:</span> 
 
-     - <span class="parameter">run </span> `Boolean`, Default: True<br>
+     - <span class="parameter">run </span> `Boolean`, Default: True<br> Set to `True` runs the umap calculation and plotting.
      - <span class="parameter">rna:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
            Can specify an array: 0.25,0.5
@@ -265,7 +270,7 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
    - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
        This parameter is mandatory if pseudo_seurat is set to True 
 ## Plot specifications
-Used to define which metadata columns are used in the visualizations 
+Used to define layers are used in the markers visualizations 
  - <span class="parameter">plotspecs:</span><br>
    - <span class="parameter">layers: </span><br>
      - <span class="parameter">rna </span> `String`, Default: logged_counts<br>

From 03b2c2892c40fc7c542f7ea2f17b4ba78a349db8 Mon Sep 17 00:00:00 2001
From: bio-la <fabiola.curion@gmail.com>
Date: Wed, 24 Apr 2024 16:53:27 +0200
Subject: [PATCH 3/6] typo

---
 docs/yaml_docs/pipeline_clustering_yml.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/yaml_docs/pipeline_clustering_yml.md b/docs/yaml_docs/pipeline_clustering_yml.md
index 2f783d77..cdd1ccd6 100644
--- a/docs/yaml_docs/pipeline_clustering_yml.md
+++ b/docs/yaml_docs/pipeline_clustering_yml.md
@@ -270,7 +270,7 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
    - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
        This parameter is mandatory if pseudo_seurat is set to True 
 ## Plot specifications
-Used to define layers are used in the markers visualizations 
+Define which layers are used in the markers visualization 
  - <span class="parameter">plotspecs:</span><br>
    - <span class="parameter">layers: </span><br>
      - <span class="parameter">rna </span> `String`, Default: logged_counts<br>

From 529613000641cf142987bd8d3d249f145dd425f8 Mon Sep 17 00:00:00 2001
From: bio-la <fabiola.curion@gmail.com>
Date: Wed, 24 Apr 2024 17:12:26 +0200
Subject: [PATCH 4/6] fixes

---
 panpipes/panpipes/pipeline_clustering/pipeline.yml | 12 ++++++++----
 panpipes/python_scripts/run_umap.py                |  2 +-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/panpipes/panpipes/pipeline_clustering/pipeline.yml b/panpipes/panpipes/pipeline_clustering/pipeline.yml
index 7bf2db11..dc34e725 100644
--- a/panpipes/panpipes/pipeline_clustering/pipeline.yml
+++ b/panpipes/panpipes/pipeline_clustering/pipeline.yml
@@ -38,10 +38,10 @@ modalities:
   atac: False
   spatial: False
 
-# if True, will look for WNN, or totalVI output
+# if True, will look for WNN, mofa, multivi, totalVI embeddings
 multimodal:
-  run_clustering: True
-  #WNN, mofa, totalVI # this will tell us where to look for 
+  run_clustering: False
+  #WNN, mofa, multivi, totalVI embeddings
   integration_method: 
 
 
@@ -50,9 +50,10 @@ multimodal:
 # ---------------------------------------
 # 
 # -----------------------------
-# number of neighbors to use when calculating the graph for clustering and umap.
+# number of neighbors to use when calculating the knn graph for clustering and umap.
 neighbors:
   rna:
+    #use the knn calculated in the integration workflow. If False it will recalculate
     use_existing: True
     # which representation in .obsm to use for nearest neighbors
     # if dim_red=X_pca and X_pca not in .obsm, will be computed with default parameters
@@ -66,6 +67,7 @@ neighbors:
     # scanpy | hnsw (from scvelo)
     method: scanpy
   prot:
+    #use the knn calculated in the integration workflow. If False it will recalculate
     use_existing: True
     # which representation in .obsm to use for nearest neighbors
     # if dim_red=X_pca and X_pca not in .obsm, will be computed with default parameters
@@ -79,6 +81,7 @@ neighbors:
     # scanpy | hnsw (from scvelo)
     method: scanpy
   atac:
+    #use the knn calculated in the integration workflow. If False it will recalculate
     use_existing: True
     # which representation in .obsm to use for nearest neighbors
     # if dim_red=X_lsi/X_pca and X_lsi/X_pca not in .obsm, will be computed with default parameters
@@ -94,6 +97,7 @@ neighbors:
     # scanpy | hnsw (from scvelo)
     method: scanpy
   spatial:
+    #use the knn calculated in the integration workflow. If False it will recalculate
     use_existing: False
     # which representation in .obsm to use for nearest neighbors
     # if dim_red=X_pca and X_pca not in .obsm, will be computed with default parameters
diff --git a/panpipes/python_scripts/run_umap.py b/panpipes/python_scripts/run_umap.py
index 6a5b957b..e4fe42b0 100644
--- a/panpipes/python_scripts/run_umap.py
+++ b/panpipes/python_scripts/run_umap.py
@@ -33,7 +33,7 @@
                     default=0.1, 
                     help="no. neighbours parameters for sc.pp.neighbors()")
 parser.add_argument("--neighbors_key", 
-                    default="neighbors", help="algortihm choice from louvain and leiden")
+                    default="neighbors", help="name of the saved knn neighbors")
 
 args, opt = parser.parse_known_args()
 L.info(args)

From 9771db6a566550ea569523aca537f3d1b272c8c7 Mon Sep 17 00:00:00 2001
From: bio-la <fabiola.curion@gmail.com>
Date: Fri, 26 Apr 2024 11:25:00 +0200
Subject: [PATCH 5/6] small changes

---
 docs/yaml_docs/pipeline_clustering_yml.md | 8 ++++++--
 panpipes/panpipes/pipeline_clustering.py  | 3 ++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/docs/yaml_docs/pipeline_clustering_yml.md b/docs/yaml_docs/pipeline_clustering_yml.md
index cdd1ccd6..e190c55d 100644
--- a/docs/yaml_docs/pipeline_clustering_yml.md
+++ b/docs/yaml_docs/pipeline_clustering_yml.md
@@ -14,7 +14,10 @@ In this documentation, the parameters of the `clustering` configuration yaml fil
 This file is generated running `panpipes clustering config`. <br>
 The individual steps run by the pipeline are described in [clustering workflow](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/clustering.html)
 
-When running the clustering workflow, panpipes provides a basic `pipeline.yml` file.
+The `clustering` workflow works with outputs generated by the `integration` workflow, and expects a `MuData` object with 
+`neighbors` saved in the `.uns` of the global layer to run clustering on the multimodal embedding. If `neighbors` are calculated on each modality layer, these can be reused or recalculated on the fly.
+
+When running the clustering workflow, panpipes provides a basic `pipeline.yml` file to customize with parameters.
 To run the workflow on your own data, you need to specify the parameters described below in the `pipeline.yml` file to meet the requirements of your data.
 
 However, we do provide pre-filled versions of the `pipeline.yml` file for individual [tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html).
@@ -76,7 +79,8 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
 - <span class="parameter">multimodal</span><br>
   - <span class="parameter">rna_clustering</span> `Boolean`, Default: False<br> If set to True, runs clustering on multimodal embedding
   - <span class="parameter">integration_method</span> `String`, Default: None<br>
-  Specify the name of the multimodal embedding. Options here include WNN, mofa, totalvi, multivi. In case you have run WNN, the neigbhours calculation will be skipped since WNN provides its own.
+  If you have run WNN and want to run clustering on the WNN embedding, specify "WNN" here. Only WNN saves its neighbours under a different `--neighbors_key`; for every other method (totalvi, multivi, mofa), leave this parameter blank.
+
 
 ## Parameters for finding neighbours 
 
diff --git a/panpipes/panpipes/pipeline_clustering.py b/panpipes/panpipes/pipeline_clustering.py
index 99837875..a3caad38 100644
--- a/panpipes/panpipes/pipeline_clustering.py
+++ b/panpipes/panpipes/pipeline_clustering.py
@@ -43,9 +43,10 @@ def set_up_dirs(log_file):
 ## Single modality scripts
 ## ------------------------------------
 
-# -----------------------------------=
+# --------------------------------------
 # neighbors
 # --------------------------------------
+# TODO: create task to re-run neighbours on multimodal outer representations (this script can only read in each mod layer)
 @follows(set_up_dirs)
 @originate(PARAMS['mudata_with_knn'])
 def run_neighbors(outfile):

From 34e5dd924ceef57f8b6eee5a50b4c8eddc958bc4 Mon Sep 17 00:00:00 2001
From: bio-la <fabiola.curion@gmail.com>
Date: Fri, 26 Apr 2024 11:37:33 +0200
Subject: [PATCH 6/6] floats and arrays

---
 docs/yaml_docs/pipeline_clustering_yml.md | 33 ++++++++++++++---------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/docs/yaml_docs/pipeline_clustering_yml.md b/docs/yaml_docs/pipeline_clustering_yml.md
index e190c55d..7f476833 100644
--- a/docs/yaml_docs/pipeline_clustering_yml.md
+++ b/docs/yaml_docs/pipeline_clustering_yml.md
@@ -154,48 +154,48 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
      - <span class="parameter">run </span> `Boolean`, Default: True<br> Set to `True` runs the umap calculation and plotting.
      - <span class="parameter">rna:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-           Can specify an array: 0.25,0.5
+           Can specify a single float or an array: 0.25,0.5
       - <span class="parameter">prot:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-           Can specify an array: 0.25,0.5,0.8
+           Can specify a single float or an array: 0.25,0.5,0.8
       - <span class="parameter">atac:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-           Can specify an array: 0.25,0.5,0.8
+           Can specify a single float or an array: 0.25,0.5,0.8
       - <span class="parameter">multimodal:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-           Can specify an array: 0.25,0.5,0.8
+           Can specify a single float or an array: 0.25,0.5,0.8
       - <span class="parameter">rna:</span>
          - <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
-            Can specify an array: 0.25,0.5,0.8
+            Can specify a single float or an array: 0.25,0.5,0.8
 
 ## Parameters for clustering 
 
   - <span class="parameter">clusterspecs:</span>
       - <span class="parameter">rna:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
-           Can specify an array: 0.2,0.6,1
+           Can specify a single float or an array: 0.2,0.6,1
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden. 
       - <span class="parameter">prot:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
-           Can specify an array: 0.2,0.6,1
+           Can specify a single float or an array: 0.2,0.6,1
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden.
 
       - <span class="parameter">atac:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
-           Can specify an array to compute in parallel: 0.2,0.6,1
+           Can specify a single float or an array to compute in parallel: 0.2,0.6,1
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden. 
       - <span class="parameter">multimmodal:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.5, 0.7<br>
-           Can specify an array to compute in parallel: 0.2,0.6,1 
+           Can specify a single float or an array to compute in parallel: 0.2,0.6,1 
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden.
 
       - <span class="parameter">spatial:</span>
           - <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
-           Can specify an array to compute in parallel: 0.2,0.6,1 
+           Can specify a single float or an array to compute in parallel: 0.2,0.6,1 
           - <span class="parameter">algorithm</span> `String`, Default: leiden<br>
             Options include louvain or leiden. 
 
@@ -216,8 +216,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
        Marker analysis is run for clusters >= mincells. If a cluster ncells < mincells , then the cluster is excluded from marker analysis
        - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
        - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
+       Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. 
        This parameter is mandatory if pseudo_seurat is set to True 
        - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
+       Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells.
        This parameter is mandatory if pseudo_seurat is set to True 
  - <span class="parameter">prot:</span><br>
    - <span class="parameter">run </span> `Boolean`, Default: True<br>
@@ -228,8 +230,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
    - <span class="parameter">method </span> `String`, Default: wilcoxon<br>
    - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
    - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
+       Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. 
        This parameter is mandatory if pseudo_seurat is set to True 
    - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
+    Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. 
        This parameter is mandatory if pseudo_seurat is set to True 
 
  - <span class="parameter">atac:</span><br>
@@ -243,8 +247,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
         Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
     - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
     - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
+       Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. 
        This parameter is mandatory if pseudo_seurat is set to True 
     - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
+      Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells.
        This parameter is mandatory if pseudo_seurat is set to True 
 
 
@@ -255,9 +261,9 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
         Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
     - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
     - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
-       This parameter is mandatory if pseudo_seurat is set to True 
+       Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True 
     - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
-       This parameter is mandatory if pseudo_seurat is set to True
+       Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True
 
 
  - <span class="parameter">spatial:</span><br>
@@ -270,8 +276,9 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
        Marker analysis is run for clusters >= mincells. If a cluster ncells < mincells , then the cluster is excluded from marker analysis
    - <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
    - <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
-      This parameter is mandatory if pseudo_seurat is set to True 
+      Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True 
    - <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
+       Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. 
        This parameter is mandatory if pseudo_seurat is set to True 
 ## Plot specifications
 Define which layers are used in the markers visualization