diff --git a/docs/yaml_docs/pipeline_preprocess_yml.md b/docs/yaml_docs/pipeline_preprocess_yml.md
index 9f697d72..ffaa581f 100644
--- a/docs/yaml_docs/pipeline_preprocess_yml.md
+++ b/docs/yaml_docs/pipeline_preprocess_yml.md
@@ -271,7 +271,7 @@ Options for the detection of highly variable genes (HVGs) in the RNA modality.
This variable defines the group name tagging the genes to be excluded in file specified in the previous parameter.
Leave empty if you don't want to exclude genes from HVG detection.
- - filter `Booleab`, Default: False
+ - filter `Boolean`, Default: False
Set to True if you want to filter the object to retain only Highly Variable Genes.
regress_variables `String`
@@ -279,7 +279,99 @@ Options for the detection of highly variable genes (HVGs) in the RNA modality.
Leave blank if you don't want to regress out anything.
We recommend not regressing out anything unless you have good reason to.
-## Scaling
+### Scaling
+Scaling has the effect that all genes are weighted equally for downstream analysis.
+Whether to apply scaling is still a matter of debate, as stated in the [Luecken et al. Best Practices paper](https://doi.org/10.15252/msb.20188746):
+> "There is currently no consensus on whether or not to perform normalization over genes.
+ While the popular Seurat tutorials (Butler et al, 2018) generally apply gene scaling,
+ the authors of the Slingshot method opt against scaling over genes in their tutorial (Street et al, 2018).
+ The preference between the two choices revolves around whether all genes should be weighted equally for downstream analysis,
+ or whether the magnitude of expression of a gene is an informative proxy for the importance of the gene."
+
+
+run_scale `Boolean`, Default: True
+ Set to False if you do not want to scale the data.
+
+scale_max_value `Float`
+ Clip to this value after scaling.
+ If left blank, scaling is run with default parameters, as described in the [scanpy API](https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.scale.html).
+
+### RNA Dimensionality Reduction
+pca
+ Parameters for PCA dimensionality reduction.
+
+ - n_pcs `Integer`, Default: 50
+ Number of principal components to compute.
+
+ - solver `String`, Default: default
+ Setting this parameter to "default" will use the `arpack` solver.
+ If you want to use a different solver, you can specify it as described in the [scanpy API](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.pca.html).
+
+ - color_by `String`, Default: sample_id
+ The variable to color the PCA plot by. Should be a column in the obs of the adata.
+
+
+## Protein (PROT) preprocessing steps
+prot
+ Parameters for the preprocessing of the protein modality.
+
+ - normalisation_methods `String` (comma-separated), Default: clr,dsb
+ Comma separated string of normalisation options.
+ Available options are: dsb, clr.
+ For more details, please refer to the [muon documentation](https://muon.readthedocs.io/en/latest/omics/citeseq.html).
+ Muon also provides separate information on [dsb normalisation](https://muon.readthedocs.io/en/latest/api/generated/muon.prot.pp.dsb.html)
+ and [clr normalisation](https://muon.readthedocs.io/en/latest/api/generated/muon.prot.pp.clr.html) methods.
+ The normalised count matrices are stored in layers called 'clr' and 'dsb', along with a layer called 'raw_counts'.
+ If you choose to run both (dsb and clr), then 'dsb' is stored in X as default.
+ For downstream visualisation, you can either specify the layer, or take the default stored in X.
+
+ - clr_margin `Integer` (0 or 1), Default: 0
+ Parameter for CLR normalisation.
+ The CLR margin determines whether you normalise per cell (as you would normalise RNA data), or by feature (recommended, due to the variable nature of protein assays).
+ Hence, CLR margin 0 is recommended for informative qc plots in this pipeline.
+ - 0 = normalise column-wise (per feature)
+ - 1 = normalise row-wise (per cell)
+
+ - background_obj `String` (Path)
+ Parameter for DSB normalisation.
+ You must specify the path to the background `MuData` (h5mu) object created in the ingest pipeline in order to run dsb normalisation.
+
+ - quantile_clipping `Boolean`, Default: True
+ Parameter for DSB normalisation.
+ Whether to perform quantile clipping on the normalised data.
+ Even after normalisation, some cells retain extreme outlier values, which can be clipped as discussed [here](https://github.com/niaid/dsb).
+ The maximum value will be set at the 99.5% quantile value, applied per feature.
+ Please note that this feature exists in the default muon `mu.pp.dsb` code, but is implemented manually in this pipeline.
+
+ - store_as_X `String`
+ If you choose to run more than one normalisation method, specify which normalisation method should be stored in the X slot.
+ If left blank, 'dsb' is the default that will be stored in X.
+
+ - save_norm_prot_mtx `Boolean`, Default: False
+ Set to True to additionally save the normalised protein assay as a txt file.
+
+ - pca `Boolean`, Default: False
+ Specify if you want to run PCA on the normalised protein data. This might be useful when you have more than 50 features in your protein assay.
+
+ - n_pcs `Integer`, Default: 50
+ Number of principal components to compute. Must satisfy n_pcs <= number of features - 1.
+
+ - solver `String`, Default: default
+ Which solver to use for PCA. If set to "default", the 'arpack' solver is used.
+
+ - color_by `String`, Default: sample_id
+ Column to be fetched from the protein layer .obs to color the PCA plot by.
+
+## ATAC preprocessing steps
+atac
+ Parameters for the preprocessing of the ATAC modality.
+
+
+
+
+
+
+
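Note on the `clr_margin` and `quantile_clipping` options documented in the hunk above: the following is a minimal sketch, in plain NumPy, of what "column-wise (per feature)" versus "row-wise (per cell)" centring means and of a per-feature 99.5% quantile cap. It is a simplified illustration only, not the muon or panpipes implementation; the function names `clr` and `clip_per_feature` and the toy Poisson counts are assumptions made for this example.

```python
import numpy as np


def clr(counts: np.ndarray, margin: int = 0) -> np.ndarray:
    """Simplified centred log-ratio transform of a cells x features matrix.

    margin=0 centres each column (feature) across cells.
    margin=1 centres each row (cell) across features.
    Hypothetical helper for illustration, not the pipeline's code.
    """
    if margin not in (0, 1):
        raise ValueError("margin must be 0 or 1")
    log_counts = np.log1p(counts)
    # axis=0 collapses cells, giving one mean per feature (column-wise);
    # axis=1 collapses features, giving one mean per cell (row-wise).
    centre = log_counts.mean(axis=margin, keepdims=True)
    return log_counts - centre


def clip_per_feature(values: np.ndarray, upper_quantile: float = 0.995) -> np.ndarray:
    """Cap each feature (column) at its own upper quantile."""
    caps = np.quantile(values, upper_quantile, axis=0, keepdims=True)
    return np.minimum(values, caps)


# Toy data: 200 cells, 3 proteins with very different abundance levels.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=[5.0, 50.0, 500.0], size=(200, 3)).astype(float)

per_feature = clr(counts, margin=0)  # analogous to clr_margin: 0
per_cell = clr(counts, margin=1)     # analogous to clr_margin: 1
clipped = clip_per_feature(per_feature)
```

With `margin=0` every feature ends up centred on zero regardless of its abundance, which is why per-feature normalisation tends to give more comparable QC plots across proteins; with `margin=1` each cell is centred instead, so highly abundant proteins still dominate.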
diff --git a/panpipes/panpipes/pipeline_preprocess/pipeline.yml b/panpipes/panpipes/pipeline_preprocess/pipeline.yml
index 3c7ca29c..0bac851a 100644
--- a/panpipes/panpipes/pipeline_preprocess/pipeline.yml
+++ b/panpipes/panpipes/pipeline_preprocess/pipeline.yml
@@ -138,83 +138,46 @@ regress_variables:
#---------
# Scaling
-#---------
-# This scaling has the effect that all genes are weighted equally for downstream analysis.
-# discussion from Leucken et al Best Practices paper: https://doi.org/10.15252/msb.20188746
-# "There is currently no consensus on whether or not to perform normalization over genes.
-# While the popular Seurat tutorials (Butler et al, 2018) generally apply gene scaling,
-# the authors of the Slingshot method opt against scaling over genes in their tutorial (Street et al, 2018).
-# The preference between the two choices revolves around whether all genes should be weighted equally for downstream analysis,
-# or whether the magnitude of expression of a gene is an informative proxy for the importance of the gene."
run_scale: True
-# if blank defaults scale, clip values as per: https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.scale.html
scale_max_value:
-#----------------------------
-# RNA Dimensionality reduction
-#----------------------------
+
+#-----------------------------
+# RNA Dimensionality Reduction
pca:
- # number of components to compute
n_pcs: 50
- # set "default" will use 'arpack'.
- # Otherwise specify a different solver, see https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.pca.html
- solver: default
- # color to use when plotting the PCA
+ solver: default
color_by: sample_id
-# --------------------------------------------------------------------------------------------------------
-# Protein (PROT) steps
-# --------------------------------------------------------------------------------------------------------
+
+# ----------------------------------
+# Protein (PROT) preprocessing steps
+# ----------------------------------
prot:
- # comma separated string of normalisation options
- # options: dsb,clr
- # more details in this vignette https://muon.readthedocs.io/en/latest/omics/citeseq.html
- # dsb https://muon.readthedocs.io/en/latest/api/generated/muon.prot.pp.dsb.html
- # clr https://muon.readthedocs.io/en/latest/api/generated/muon.prot.pp.clr.html
normalisation_methods: clr,dsb
- # the normalised matrices are stored in layers called 'clr' and 'dsb', along with a layer called 'raw_counts'
- # if you choose to run both then 'dsb' is stored in X as default.
- # In downstream visualisation, you can either specify the layer, or take the default.
# CLR parameters:
- # margin determines whether you normalise per cell (as you would RNA norm),
- # or by feature (recommended, due to the variable nature of prot assays).
- # CLR margin 0 is recommended for informative qc plots in this pipeline
- # 0 = normalise colwise (per feature)
- # 1 = normalise rowwise (per cell)
+ # 0 = normalise column-wise (per feature, recommended)
+ # 1 = normalise row-wise (per cell)
clr_margin: 0
# DSB parameters:
- # you must specify the path to the background h5mu created in pipeline ingest in order to run dsb.
background_obj:
- # quantile clipping,
- # even with normalisation, some cells get extreme outliers which can be clipped as discussed https://github.com/niaid/dsb
- # maximum value will be set at the value of the 99.5% quantile, applied per feature
- # note that this feature is in the default muon mu.pp.dsb code, but manually implemented in this code.
quantile_clipping: True
-
- # which normalisation method to be stored in the X slot. If you choose to run more than one normalisation method,
- # which one to you want to store in the X slot, if not specified 'dsb' is the default when run.
- store_as_X:
- # do you want to save the prot normalised assay additionally as a txt file:
+ store_as_X:
save_norm_prot_mtx: False
+
#----------------------------
- # Prot Dimensionality reduction
- #----------------------------
- # it may be useful to run PCA on the protein assay, when you have more than 50 features.
- # Set to False by default
+ # Protein Dimensionality reduction
pca: False
- # number of components. Specify at least n_pcs <= number of features -1
n_pcs: 50
- # which solver to use, set "default" will use 'arpack'.
solver: default
- # column to be fetched from the protein layer .obs
color_by: sample_id
-# --------------------------------------------------------------------------------------------------------
-# ATAC steps
-# --------------------------------------------------------------------------------------------------------
+# ------------------------
+# ATAC preprocessing steps
+# ------------------------
atac:
binarize: False
normalize: TFIDF # "log1p" or "TFIDF"