diff --git a/docs/yaml_docs/index.rst b/docs/yaml_docs/index.rst
index ad028376..94ab5126 100644
--- a/docs/yaml_docs/index.rst
+++ b/docs/yaml_docs/index.rst
@@ -6,6 +6,7 @@ Workflows configuration files
useful_info_on_yml
pipeline_ingestion_yml
+ pipeline_preprocess_yml
pipeline_integration_yml
spatial_qc
spatial_preprocess
diff --git a/docs/yaml_docs/pipeline_ingestion_yml.md b/docs/yaml_docs/pipeline_ingestion_yml.md
index 9980ec79..a2df04bc 100644
--- a/docs/yaml_docs/pipeline_ingestion_yml.md
+++ b/docs/yaml_docs/pipeline_ingestion_yml.md
@@ -20,7 +20,7 @@ However, we do provide pre-filled versions of the `pipeline.yml` file for indivi
For more information on functionalities implemented in `panpipes` to read the configuration files, such as reading blocks of parameters and reusing blocks with `&anchors` and `*scalars`, please check [our documentation](./useful_info_on_yml.md)
-You can download the different ingestion pipeline.yml files here:
+You can download the different ingestion `pipeline.yml` files here:
- Basic `pipeline.yml` file (not prefilled) that is generated when calling `panpipes ingest config: [Download here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/panpipes/pipeline_ingest/pipeline.yml)
- `pipeline.yml` file for [Ingesting data Tutorial](https://panpipes-tutorials.readthedocs.io/en/latest/ingesting_data/Ingesting_data_with_panpipes.html): [Download here](https://github.com/DendrouLab/panpipes-tutorials/blob/main/docs/ingesting_data/pipeline.yml)
- `pipeline.yml` file for [Ingesting Mouse data Tutorial](https://panpipes-tutorials.readthedocs.io/en/latest/ingesting_mouse/Ingesting_mouse_data_with_panpipes.html): [Download here](https://github.com/DendrouLab/panpipes-tutorials/blob/main/docs/ingesting_mouse/pipeline.yml)
diff --git a/docs/yaml_docs/pipeline_preprocess_yml.md b/docs/yaml_docs/pipeline_preprocess_yml.md
new file mode 100644
index 00000000..cd9cab75
--- /dev/null
+++ b/docs/yaml_docs/pipeline_preprocess_yml.md
@@ -0,0 +1,438 @@
+
+
+# Preprocess YAML
+
+This documentation explains the parameters of the `preprocess` configuration yaml file.
+This file is generated by running `panpipes preprocess config`.
+The individual steps run by the pipeline are described in the [preprocess workflow](../workflows/preprocess.md).
+
+When running the preprocess workflow, panpipes provides a basic `pipeline.yml` file.
+To run the workflow on your own data, you need to specify the parameters described below in the `pipeline.yml` file to meet the requirements of your data.
+However, we do provide pre-filled versions of the `pipeline.yml` file for individual [tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html).
+
+For more information on functionalities implemented in `panpipes` to read the configuration files, such as reading blocks of parameters and reusing blocks with `&anchors` and `*aliases`, please check [our documentation](./useful_info_on_yml.md).
+
+
+You can download the different preprocess `pipeline.yml` files here:
+- Basic `pipeline.yml` file (not prefilled) that is generated when calling `panpipes preprocess config`: [Download here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/panpipes/pipeline_preprocess/pipeline.yml).
+- Prefilled `pipeline.yml` file for the [preprocess tutorial](https://panpipes-tutorials.readthedocs.io/en/latest/filtering_data/filtering_data_with_panpipes.html): [Download here](https://github.com/DendrouLab/panpipes-tutorials/blob/main/docs/filtering_data/pipeline.yml).
+
+## Compute resources options
+
+resources
+Computing resources to use, specifically the number of threads used for parallel jobs.
+Specified by the following three parameters:
+ - threads_high `Integer`, Default: 2
+ Number of threads used for high intensity computing tasks.
+ For each thread, there must be enough memory to load your MuData object and do computationally intensive tasks.
+
+ - threads_medium `Integer`, Default: 2
+ Number of threads used for medium intensity computing tasks.
+ For each thread, there must be enough memory to load your mudata and do computationally light tasks.
+
+ - threads_low `Integer`, Default: 1
+ Number of threads used for low intensity computing tasks.
+ For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two.
+
+condaenv `String` (Path)
+ Path to conda environment that should be used to run panpipes.
+ Leave blank if running native or your cluster automatically inherits the login node environment.
+ For more information on this, please refer to the detailed explanation [here](https://panpipes-pipelines.readthedocs.io/en/latest/install.html#specifying-conda-environments-to-run-panpipes).
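+
+As a rough illustration of the block described above, a filled-in compute section could look as follows. The thread counts are the documented defaults; the conda environment path is a hypothetical placeholder and should be replaced with your own (or left blank).
+
+```yaml
+resources:
+  threads_high: 2    # high intensity tasks
+  threads_medium: 2  # medium intensity tasks
+  threads_low: 1     # low intensity tasks, e.g. plotting
+
+# hypothetical path, shown only for illustration
+condaenv: /path/to/envs/panpipes
+```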
+
+## General project specifications
+
+sample_prefix `String`
+ Prefix for sample names.
+
+unfiltered_obj `String`
+ If running this on prefiltered data, complete the following steps:
+ 1. Leave `unfiltered_obj` (this parameter) blank
+ 2. Rename your filtered file so that it matches the format PARAMS['sample_prefix'] + '.h5mu'
+ 3. Put the renamed file in the same folder as the `pipeline.yml`
+ 4. Set `filtering run` to `False` below
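+
+ A minimal sketch of the prefiltered setup described in the steps above, assuming a hypothetical prefix `mysample`, so that the filtered file would be named `mysample.h5mu` and placed next to the `pipeline.yml`:
+
+```yaml
+sample_prefix: mysample  # hypothetical prefix; the file must then be named mysample.h5mu
+unfiltered_obj:          # left blank because the data is already filtered
+
+filtering:
+  run: False             # skip filtering for the prefiltered object
+```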
+
+modalities
+ Specify which modalities are included in the data by setting the respective modality to True.
+ Leave empty (None) or False to signal this modality is not part of the experiment.
+ The modalities are processed in the order of the following list:
+ - rna `Boolean`, Default: True
+
+ - prot `Boolean`, Default: False
+
+ - rep `Boolean`, Default: False
+
+ - atac `Boolean`, Default: False
+
+## Filtering Cells and Features
+Filtering in panpipes is done sequentially for all modalities, filtering first cells and then features.
+For each modality, the pipeline.yml file contains a dictionary with the following structure:
+
+```yaml
+MODALITY:
+  obs:
+    min:
+    max:
+    bool:
+  var:
+    min:
+    max:
+    bool:
+
+This format can be applied to any modality by editing the filtering dictionary.
+You are not restricted to the columns given as default.
+
+This is fully customizable to any columns in the mudata.obs or var object.
+When specifying a column name, make sure it exactly matches the column name in the h5mu object.
+
+Example:
+```yaml
+rna:
+  obs:
+    min:  # Any column for which you want to run a minimum filter
+      n_genes_by_counts: 500  # i.e. will filter out cells with a value less than 500 in the n_genes_by_counts column
+    max:  # Any column for which you want to run a maximum filter
+      pct_counts_mt: 20  # i.e. each cell may have a maximum of 20 in the pct_counts_mt column
+                         # be careful with any columns named after gene sets.
+                         # The column will be named based on the gene list input file,
+                         # so if the mitochondrial genes are in group "mt"
+                         # as in the example given in the resource file,
+                         # then the column will be named "pct_counts_mt".
+    bool:
+      is_doublet: False  # if you have any boolean columns you want to filter on,
+                         # then use this section of the modality dictionary
+                         # in this case any obs['is_doublet'] that are False will be retained in the dataset.
+
+filtering
+ - run `Boolean`, Default: True
+ If set to False, no filtering is applied to the `MuData` object.
+
+ - keep_barcodes `String` (Path)
+ Path to a file containing specific cell barcodes you want to keep; leave blank if not applicable.
+
+### RNA-specific filtering (rna)
+obs
+ Parameters for obs, i.e. cell level filtering:
+
+ - min
+ Filtering cells based on a minimum value in a column. Leave parameters blank if you do not want to filter by them.
+
+ - n_genes_by_counts `Integer`
+ Minimum number of genes by counts per cell.
+ For instance, setting the parameter to 500 will filter out cells with a value less than 500 in the n_genes_by_counts column.
+
+ - max
+ Filtering cells based on a maximum value in a column. Leave parameters blank if you do not want to filter by them.
+
+ - total_counts `Integer`
+ Cells with a total count greater than this value will be filtered out.
+
+ - n_genes_by_counts `Integer`
+ Maximum number of genes by counts per cell.
+
+ - pct_counts_mt `Integer` (in Percent)
+ Percent of counts that are mitochondrial genes. Cells with a value greater than this will be filtered out.
+ Should be a value between 0 and 100 (%).
+
+ - pct_counts_rp `Integer` (in Percent)
+ Percent of counts that are ribosomal genes. Cells with a value greater than this will be filtered out.
+ Should be a value between 0 and 100 (%).
+
+ - doublet_scores `Float` or `String` (Path)
+ If you want to apply a custom scrublet threshold per input sample, you can specify it here.
+ Provide either one score for all samples (e.g. 0.25), or a csv file with two columns: sample_id and cut off.
+
+ - bool
+ You can add a new column to the mudata['rna'].obs with boolean (True/False) values, and then list
+ that column under this bool section. This can be done for any modality.
+
+var
+ Parameters for var, i.e. gene (feature) level filtering:
+
+ - min
+
+ - n_cells_by_counts `Integer`
+
+ - max
+
+ - total_counts `Integer`
+
+ - n_cells_by_counts `Integer`
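+
+To complement the cell level example earlier in this section, the sketch below combines cell and gene level thresholds for the RNA modality. All values are illustrative assumptions rather than recommendations and should be chosen from your own QC plots. The reading of the `var` `min` filter (genes detected in fewer cells than the threshold are removed) is assumed by analogy with the `obs` filters described above.
+
+```yaml
+rna:
+  obs:
+    min:
+      n_genes_by_counts: 500   # illustrative value
+    max:
+      total_counts:            # blank: no filter on this column
+      n_genes_by_counts:
+      pct_counts_mt: 20        # illustrative value, in percent
+      pct_counts_rp:
+      doublet_scores: 0.25     # one scrublet threshold for all samples
+    bool:
+  var:
+    min:
+      n_cells_by_counts: 3     # illustrative value
+    max:
+      total_counts:
+      n_cells_by_counts:
+```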
+
+### Protein-specific filtering (prot)
+obs
+ Parameters for obs, i.e. cell level filtering:
+
+ - max
+ Filtering cells based on a maximum value in a column. Leave parameters blank if you do not want to filter by them.
+
+ - total_counts `Integer`
+ Cells with a total count greater than this value will be filtered out.
+
+### ATAC-specific filtering (atac)
+var
+ Parameters for var, i.e. feature (fragment) level filtering:
+
+ - nucleosome_signal
+
+## Intersecting cell barcodes
+intersect_mods `String`
+ Subsets cells in place, keeping only the barcodes present in every modality listed here, or in all modalities if left blank (None).
+ Provide a comma separated list of the modalities whose barcode intersection you want to keep, e.g. rna,prot
+
+## Downsampling cell barcodes
+downsample_n `Integer`
+ Number of cells to downsample to, leave blank to keep all cells.
+
+downsample_col `String`
+ If you want to equalise by dataset or sample_id, then specify a column in obs of the adata to downsample by here.
+ If specified, the data will be subset to n cells **per** downsample_col value.
+
+downsample_mods `String` (comma separated)
+ Specify which modalities you want to subsample.
+ If more than one modality is added then these will be intersected.
+ Provide as a comma separated String, e.g.: rna,prot
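+
+As a sketch of how the three downsampling parameters combine, the following fragment uses illustrative values and assumes that `sample_id` is a column in obs:
+
+```yaml
+downsample_n: 1000          # illustrative value
+downsample_col: sample_id   # subset to 1000 cells per sample_id value
+downsample_mods: rna,prot   # these modalities are intersected
+```
+
+With three distinct `sample_id` values, this configuration would keep at most 3 x 1000 = 3000 cells.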
+
+## Plotting variables
+plotqc
+ All metrics in this section should be provided as a comma separated string without spaces e.g. a,b,c
+ Leave blank to avoid plotting.
+
+ - grouping_var `String` (comma separated), Default: sample_id
+ Use these categorical variables to plot/split by.
+
+ - rna_metrics `String` (comma separated), Default: pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig,doublet_scores
+ Specify the metrics in the metadata of the RNA modality to plot.
+
+ - prot_metrics `String` (comma separated), Default: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype
+ Specify the metrics in the metadata of the Protein modality to plot.
+
+ - atac_metrics `String` (comma separated)
+ Specify the metrics in the metadata of the ATAC modality to plot.
+
+ - rep_metrics `String` (comma separated)
+ Specify the metrics in the metadata of the Rep modality to plot.
+
+## RNA preprocessing steps
+Currently, only standard preprocessing steps (sc.pp.normalize_total followed by sc.pp.log1p) are offered for the RNA modality.
+
+log1p `Boolean`, Default: True
+ If set to False, the log1p transformation is not applied to the RNA modality.
+
+hvg
+Options for the detection of highly variable genes (HVGs) in the RNA modality.
+
+ - flavor `String`, Default: seurat
+ Choose one of the supported hvg_flavor options: "seurat", "cell_ranger", "seurat_v3".
+ For the dispersion based methods "seurat" and "cell_ranger", you can specify parameters: `min_mean`, `max_mean`, `min_disp` (listed below).
+ For "seurat_v3" a different method is used, and you need to specify how many variable genes to find by specifying the parameter `n_top_genes`.
+ If you specify `n_top_genes`, then the other parameters (`min_mean`, `max_mean`, `min_disp`) are nulled.
+ For further reading on this, please refer to the [scanpy API](https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.highly_variable_genes.html).
+
+ - batch_key `String`
+ If `batch_key` is specified, highly-variable genes are selected within each batch separately and merged.
+ For details on this, please refer to the [scanpy API](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html#:~:text=or%20return%20them.-,batch_key,-%3A%20Optional%5B).
+ If you want to use more than one obs column as covariates, specify them as "covariate1,covariate2" (comma separated list).
+ Leave blank if no batch should be accounted for in the HVG detection (default behavior).
+
+ - n_top_genes `Integer`, Default: 2000
+ Number of highly-variable genes to keep. You must specify this parameter if flavor is "seurat_v3".
+
+ - min_mean `Float`
+ Minimum mean expression of genes to be considered as highly variable genes.
+ Ignored if `n_top_genes` is specified or if flavor is set to "seurat_v3".
+
+ - max_mean `Float`
+ Maximum mean expression of genes to be considered as highly variable genes.
+ Ignored if `n_top_genes` is specified or if flavor is set to "seurat_v3".
+
+ - min_disp `Float`
+ Minimum dispersion of genes to be considered as highly variable genes.
+ Ignored if `n_top_genes` is specified or if flavor is set to "seurat_v3".
+
+ - exclude_file `String` (Path)
+ It may be useful to exclude some genes from the HVG selection.
+ In this case, you can provide a file with a list of genes to exclude.
+ We provide an example for genes that could be excluded when analyzing immune cells [here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/qc_genelist_1.0.csv).
+ When examining this file, you will note that it has three columns, the first specifying the modality, the second one the gene id and the third the groups to which the respective gene belongs.
+ This workflow will exclude the genes that are marked accordingly by their group name.
+ By default, the workflows will remove the genes that are flagged as "exclude" in the group column from HVG detection.
+ You can customize the gene list and change the name of the gene group in the `exclude:` parameter (see below) accordingly.
+
+ - exclude `String`
+ This variable defines the group name tagging the genes to be excluded in the file specified in the previous parameter.
+ Leave empty if you don't want to exclude genes from HVG detection.
+
+ - filter `Boolean`, Default: False
+ Set to True if you want to filter the object to retain only Highly Variable Genes.
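+
+The two ways of configuring HVG detection described above can be sketched as follows. Apart from the documented defaults (`n_top_genes: 2000`, `filter: False`), the values are illustrative assumptions: `sample_id` is assumed to be an obs column, and 0.0125, 3 and 0.5 are the scanpy defaults for `min_mean`, `max_mean` and `min_disp`.
+
+```yaml
+# Option A: "seurat_v3", which requires n_top_genes
+hvg:
+  flavor: seurat_v3
+  batch_key: sample_id   # assumed obs column; leave blank for no batch
+  n_top_genes: 2000
+  min_mean:              # ignored for seurat_v3
+  max_mean:
+  min_disp:
+  exclude_file:
+  exclude:
+  filter: False
+
+# Option B: dispersion based "seurat", shown commented out
+# hvg:
+#   flavor: seurat
+#   batch_key:
+#   n_top_genes:         # left blank so that the thresholds below are used
+#   min_mean: 0.0125
+#   max_mean: 3
+#   min_disp: 0.5
+#   exclude_file:
+#   exclude:
+#   filter: False
+```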
+
+regress_variables `String`
+ Regression variables, specify the variables you want to regress out.
+ Leave blank if you don't want to regress out anything.
+ We recommend not regressing out anything unless you have good reason to.
+
+### Scaling
+Scaling has the effect that all genes are weighted equally for downstream analysis.
+Whether to apply scaling is still a matter of debate, as stated in the [Luecken et al Best Practices paper](https://doi.org/10.15252/msb.20188746):
+> "There is currently no consensus on whether or not to perform normalization over genes.
+ While the popular Seurat tutorials (Butler et al, 2018) generally apply gene scaling,
+ the authors of the Slingshot method opt against scaling over genes in their tutorial (Street et al, 2018).
+ The preference between the two choices revolves around whether all genes should be weighted equally for downstream analysis,
+ or whether the magnitude of expression of a gene is an informative proxy for the importance of the gene."
+
+
+run_scale `Boolean`, Default: True
+ Set to False if you do not want to scale the data.
+
+scale_max_value `Float`
+ Clip to this value after scaling.
+ If left blank, scaling is run with default parameters, as described in the [scanpy API](https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.scale.html).
+
+### RNA Dimensionality Reduction
+pca
+ Parameters for PCA dimensionality reduction.
+
+ - n_pcs `Integer`, Default: 50
+ Number of principal components to compute.
+
+ - solver `String`, Default: default
+ Setting this parameter to "default" will use the `arpack` solver.
+ If you want to use a different solver, you can specify it as described in the [scanpy API](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.pca.html).
+
+ - color_by `String`, Default: sample_id
+ The variable to color the PCA plot by. Should be a column in the obs of the adata.
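+
+Put together, the scaling and PCA parameters above could be filled in as in this sketch. The PCA values are the documented defaults; the clipping value of 10 is an illustrative assumption and can be left blank to use the scanpy defaults.
+
+```yaml
+run_scale: True
+scale_max_value: 10    # illustrative clipping value; leave blank for scanpy defaults
+
+pca:
+  n_pcs: 50
+  solver: default      # uses the arpack solver
+  color_by: sample_id
+```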
+
+
+## Protein (PROT) preprocessing steps
+prot
+ Parameters for the preprocessing of the protein modality.
+
+ - normalisation_methods `String` (comma-separated), Default: clr,dsb
+ Comma separated string of normalisation options.
+ Available options are: dsb and clr.
+ For more details, please refer to the [muon documentation](https://muon.readthedocs.io/en/latest/omics/citeseq.html).
+ Muon also provides separate information on [dsb normalisation](https://muon.readthedocs.io/en/latest/api/generated/muon.prot.pp.dsb.html)
+ and [clr normalisation](https://muon.readthedocs.io/en/latest/api/generated/muon.prot.pp.clr.html) methods.
+ The normalised count matrices are stored in layers called 'clr' and 'dsb', along with a layer called 'raw_counts'.
+ If you choose to run both (dsb and clr), then 'dsb' is stored in X as default.
+ For downstream visualisation, you can either specify the layer, or take the default stored in X.
+
+ - clr_margin `Integer` (0 or 1), Default: 1
+ Parameter for CLR normalisation.
+ The CLR margin determines whether you normalise per cell (as you would normalise RNA data), or by feature (recommended, due to the variable nature of protein assays).
+ Hence, CLR margin 1 is recommended for informative qc plots in this pipeline.
+ - 0 = normalise row-wise (per cell)
+ - 1 = normalise column-wise (per feature)
+
+ - background_obj `String` (Path)
+ Parameter for DSB normalisation.
+ You must specify the path to the background `MuData` (h5mu) object created in the ingest pipeline in order to run dsb normalisation.
+
+ - quantile_clipping `Boolean`, Default: True
+ Parameter for DSB normalisation.
+ Whether to perform quantile clipping on the normalised data.
+ Despite normalisation, some cells get extreme outliers which can be clipped as discussed [here](https://github.com/niaid/dsb).
+ The maximum value will be set at the 99.5% quantile value, applied per feature.
+ Please note that this feature is in the default muon `mu.prot.pp.dsb` code, but is implemented manually in panpipes.
+
+ - store_as_X `String`
+ If you choose to run more than one normalisation method, specify which normalisation method should be stored in the X slot.
+ If left blank, 'dsb' is the default that will be stored in X.
+
+ - save_norm_prot_mtx `Boolean`, Default: False
+ Specify if you want to save the prot normalised assay additionally as a txt file.
+
+ - pca `Boolean`, Default: False
+ Specify if you want to run PCA on the normalised protein data. This might be useful when you have more than 50 features in your protein assay.
+
+ - n_pcs `Integer`, Default: 50
+ Number of principal components to compute. Must satisfy n_pcs <= number of features - 1.
+
+ - solver `String`, Default: default
+ Which solver to use for PCA. If set to "default", the 'arpack' solver is used.
+
+ - color_by `String`, Default: sample_id
+ Column to be fetched from the protein layer .obs to color the PCA plot by.
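+
+A sketch of the normalisation part of the `prot` block, running both methods but storing 'clr' in X instead of the default 'dsb'. The background object path is a hypothetical placeholder; it must point to the background h5mu created by the ingest pipeline.
+
+```yaml
+prot:
+  normalisation_methods: clr,dsb
+  clr_margin: 1                             # normalise column-wise (per feature)
+  background_obj: /path/to/background.h5mu  # hypothetical path, needed for dsb
+  quantile_clipping: True
+  store_as_X: clr                           # overrides the default (dsb)
+  save_norm_prot_mtx: False
+```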
+
+## ATAC preprocessing steps
+atac
+ Parameters for the preprocessing of the ATAC modality.
+
+ - binarize `Boolean`, Default: False
+ If set to True, the data will be binarized.
+
+ - normalize `String`, Default: TFIDF
+ What normalisation method to use. Available options are "log1p" or "TFIDF".
+
+ - TFIDF_flavour `String`, Default: signac
+ TFIDF normalisation flavor. Leave blank if you don't use TFIDF normalisation.
+ Available options are: "signac", "logTF" or "logIDF".
+
+ - feature_selection_flavour `String`, Default: signac
+ Flavor for selecting highly variable features (HVF).
+ HVF selection either with scanpy's `pp.highly_variable_genes()` function or a `pseudo-FindTopFeatures()` function of the signac package.
+ Accordingly, available options are: "signac" or "scanpy".
+
+ - min_mean `Float`, Default: 0.05
+ Applicable if `feature_selection_flavour` is set to "scanpy".
+ You can leave this parameter blank if you want to use the default value.
+
+ - max_mean `Float`, Default: 1.5
+ Applicable if `feature_selection_flavour` is set to "scanpy".
+ You can leave this parameter blank if you want to use the default value.
+
+ - min_disp `Float`, Default: 0.5
+ Applicable if `feature_selection_flavour` is set to "scanpy".
+ You can leave this parameter blank if you want to use the default value.
+
+ - n_top_features `Integer`
+ Applicable if `feature_selection_flavour` is set to "scanpy".
+ Number of highly-variable features to keep.
+ If specified, overwrites previous defaults for HVF selection.
+
+ - filter_by_hvf `Boolean`, Default: False
+ Applicable if `feature_selection_flavour` is set to "scanpy".
+ Set to True if you want to filter the ATAC layer to retain only HVFs.
+
+ - min_cutoff `String`, Default: q5
+ Applicable if `feature_selection_flavour` is set to "signac".
+ Can be specified as follows:
+ - "q[x]": "q" followed by the minimum percentile, e.g. q5 will set the top 95% most common features as highly variable.
+ - "c[x]": "c" followed by a minimum cell count, e.g. c100 will set features present in > 100 cells as highly variable.
+ - "tc[x]": "tc" followed by a minimum total count, e.g. tc100 will set features with total counts > 100 as highly variable.
+ - "NULL": All features are assigned as highly variable.
+ - "NA": Highly variable features won't be changed.
+
+ - dimred `String`, Default: LSI
+ Available options are: PCA or LSI.
+ LSI will only be computed if TFIDF normalisation was used.
+
+ - n_comps `Integer`, Default: 50
+ Number of components to compute.
+
+ - solver `String`, Default: default
+ If using PCA, which solver to use. Setting this parameter to "default" will use the 'arpack' solver.
+
+ - color_by `String`, Default: sample_id
+ Specify the covariate you want to use to color the dimensionality reduction plot.
+
+ - dim_remove `TODO`
+ Whether to remove the component(s) associated with technical artifacts.
+ For instance, it is common to remove the first LSI component, as it is often associated with batch effects.
+ Leave blank to avoid removing any.
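+
+For orientation, an `atac` block using the documented defaults (TFIDF normalisation, signac feature selection and LSI) could look like the sketch below. The parameters that only apply to the "scanpy" flavour are omitted here, and `dim_remove` is left blank so that no component is removed.
+
+```yaml
+atac:
+  binarize: False
+  normalize: TFIDF
+  TFIDF_flavour: signac
+  feature_selection_flavour: signac
+  min_cutoff: q5         # applies because the flavour is signac
+  dimred: LSI            # only computed with TFIDF normalisation
+  n_comps: 50
+  solver: default
+  color_by: sample_id
+  dim_remove:            # blank: no component is removed
+```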
+
+
+
+
+
+
+
+
diff --git a/panpipes/panpipes/pipeline_preprocess/pipeline.yml b/panpipes/panpipes/pipeline_preprocess/pipeline.yml
index e150978a..11efe857 100644
--- a/panpipes/panpipes/pipeline_preprocess/pipeline.yml
+++ b/panpipes/panpipes/pipeline_preprocess/pipeline.yml
@@ -1,36 +1,26 @@
# ============================================================
# Preprocess workflow Panpipes (pipeline_preprocess.py)
# ============================================================
-# written by Charlotte Rich-Griffin and Fabiola Curion
+# This file contains the parameters for the preprocess workflow.
+# For full descriptions of the parameters, see the documentation at https://panpipes-pipelines.readthedocs.io/en/latest/yaml_docs/pipeline_preprocess_yml.html
-# ------------------------
-# compute resource options
-# ------------------------
+#--------------------------
+# Compute resources options
+#--------------------------
resources:
- # Number of threads used for parallel jobs
- # this must be enough memory to load your mudata and do computationally intensive tasks
threads_high: 2
- # this must be enough memory to load your mudata and do computationally light tasks
threads_medium: 2
- # this must be enough memory to load text files and do plotting, requires much less memory than the other two
threads_low: 1
-# path to conda env, leave blank if running native or your cluster automatically inherits the login node environment
+
condaenv:
-# allows for tweaking where the jobs get submitted, in case there is a special queue for long jobs or you have access to a gpu
-# leave as is if you do not want to use the alternative queues
-# --------------------------
-# Start
-# --------------------------
+#-------------------------------
+# General project specifications
+#-------------------------------
sample_prefix:
unfiltered_obj:
-# if running this on prefiltered data then
-#1. set unfiltered obj (above) to blank
-#2. rename your filtered file to match, the format PARAMS['sample_prefix'] + '.h5mu'
-#3. put renamed file in the same folder as this yml.
-#4. set filtering run: to False below.
modalities:
rna: True
@@ -38,291 +28,174 @@ modalities:
rep: False
atac: False
-# --------------------------
+# ----------------------------
# Filtering Cells and Features
-# --------------------------
+# ----------------------------
+# Filtering is done sequentially for all modalities, filtering first cells and then features.
+# In the following, you can specify the filtering parameters for each modality.
-# the filtering process in panpipes is sequential as it goes through the filtering dictionary.
-# for each modality, starting with rna, it will first filter on obs and then vars.
-# each modality has a dictionary in the following format.
-
-# MODALITY
-# obs:
- # min:
- # max:
- # bool
-# var:
- # min:
- # max:
- # bool
-
-# This format can be applied to any modality by editing the filtering dictionary
-# You are not restricted by the columns given as default.
-
-# This is fully customizable to any columns in the mudata.obs or var object.
-# When specifying a column name, make sure it exactly matches the column name in the h5mu object.
-
-#Example:
-
-#------------------
-# rna:
-#------------------
- # obs:
- # min: <-- Any column for which you want to run a minimum filter,
- # n_genes_by_counts: 500 <--- i.e. each cell must have a minimum of 500 in the n_genes_by_counts column
- # max: <-- Any column for which you want to run a maximum filter
- # pct_counts_mt: 20 <-- i.e. each cell may have a maximum of 25 in the pct_counts_mt column
- # be careful with any columns named after gene sets.
- # The column will be named based on the gene list input file,
- # so if the mitochondrial genes are in group "mt"
- # as in the example given in the resource file
- # then the column will be named "pct_counts_mt" .
-
- # bool:
- # is_doublet: False <--- if you have any boolean columns you want to filter on,
- # then use this section of the modality dictionary
- # in this case any obs['is_doublet'] that are False will be retained in the dataset.
-# --------------------------------------------------------------------------------------
filtering:
- # if set to false no filtering is applied to the mudata object
run: True
- # a file containing only cell barcodes you want to keep, leave blank if not applicable
keep_barcodes:
- #------------------------------------------------------
+
+ #------------------------
+ # RNA-specific filtering
rna:
- #------------------------------------------------------
- ## obs filtering: cell level filtering here
+ # obs, i.e. cell level filtering
obs:
min:
- n_genes_by_counts:
+ n_genes_by_counts:
max:
total_counts:
- n_genes_by_counts:
- # percent filtering:
- # this should be a value between 0 and 100%.
- # leave blank or set to 100 to avoid filtering for any of these param
+ n_genes_by_counts:
pct_counts_mt:
pct_counts_rp:
- # either one score for all samples e.g. 0.25,
- # or a csv file with two columns sample_id, and cut off
- # if you want to apply a custom scrublet threshold per input sample you can specify it as a max value here
doublet_scores:
- # you could add a new column to the mudata['rna'].obs with True False values, and list
- # that column under bool:, you can do this for any modality
bool:
- ## var filtering: (feature) gene level filtering here
+
+ # var, i.e. gene (feature) level filtering
var:
min:
n_cells_by_counts:
max:
total_counts:
n_cells_by_counts:
- #------------------------------------------------------
+
+ #------------------------
+ # Protein-specific filtering
prot:
- #------------------------------------------------------
- ## obs filtering: cell level filtering here
+ # obs, i.e. cell level filtering
obs:
max:
total_counts:
- ##var filtering: (feature) protein level filtering here
+
+ # var, i.e. protein (feature) level filtering
var:
max:
min:
- #------------------------------------------------------
+
+ #------------------------
+ # ATAC-specific filtering
atac:
- #------------------------------------------------------
- ## obs filtering: cell level filtering here
+ # obs, i.e. cell level filtering
obs:
max:
- ## var filtering: (feature) fragment level filtering here
+
+ # var, i.e. fragment (feature) level filtering
var:
nucleosome_signal:
-# ----------------------
+
+# ---------------------------
# Intersecting cell barcodes
-# ----------------------
+# ---------------------------
# Subset observations (cells) in-place by intersect
-# taking observations present only in modalities listed in mods, or all modalities if mods is None.
-# set a comma separated list where you want to keep only the intersection of barcodes. e.g. rna,prot
intersect_mods:
-# ----------------------
+
+# --------------------------
# Downsampling cell barcodes
-# ----------------------
-# how many cells to downsample to, leave blank to keep all cells.
-downsample_n:
-# if we want to equalise by dataset or sample_id then specifiy a column in obs
-# then the data will be subset to n cells **per** downsample_col value.
-downsample_col:
-# which modalities do we want to subsample
-# comma separated string e.g. rna,prot
-# if more than 1 modality is added then these will be intersected.
+# --------------------------
+downsample_n:
+downsample_col:
downsample_mods:
-# ----------------------
+
+# ------------------
# Plotting variables
-# ----------------------
-# all metrics should be inputted as a comma separated string without spaces e.g. a,b,c
+# ------------------
+# all metrics in this section should be provided as a comma separated string without spaces e.g. a,b,c
# leave blank to avoid plotting
plotqc:
- # use these categorical variables to plot/split by
grouping_var: sample_id
- # specify metrics in the metadata to plot:
rna_metrics: pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig,doublet_scores
prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype
atac_metrics:
rep_metrics:
-# --------------------------------------------------------------------------------------------------------
-# RNA steps
-# --------------------------------------------------------------------------------------------------------
-# currently only standard sc.pp.normalize_total(adata, target_sum=1e4) followed by sc.pp.log1p(adata) is offered for RNA
+
+
+# -----------------------
+# RNA preprocessing steps
+# -----------------------
+# Currently, only standard preprocessing steps (sc.pp.normalize_total followed by sc.pp.log1p) are offered for the RNA modality.
log1p: True
-# hvg_flavour options include "seurat", "cell_ranger", "seurat_v3", default; "seurat"
-# for dispersion based methods "seurat" and "cell_ranger", you can specify parameters: min_mean, max_mean, min_disp
-# for "seurat_v3" a different method is used, and you need to specify how many variable genes to find in n_top_genes
-# If you specify n_top_genes, then the other paramters are nulled.
-# details: https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.highly_variable_genes.html
hvg:
- flavor: seurat # "seurat", "cell_ranger", "seurat_v3"
- # If batch key is specified, highly-variable genes are selected within each batch separately and merged.
- # details: https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html#:~:text=or%20return%20them.-,batch_key,-%3A%20Optional%5B
- # If you want to use more than one obs column as a covariates, include it as covariate1,covariate2 (comma separated list)
- # Leave blank for no batch (default)
+ flavor: seurat # Options: seurat, cell_ranger, or seurat_v3
batch_key:
n_top_genes: 2000
min_mean:
max_mean:
min_disp:
- # It may be useful to exclude some genes from the HVG selection.
- # In the file resources/qc_genelist_1.0.csv, we include an example of genes that could be excluded when analysing immune cells,
- # Examine this file, it has a first column with gene ids and the second column identifying the groups to
- # which this genes belong.
- # This workflow will exclude the genes that you specify by their group name. when specifying "default", the workflows will
- # remove from hvg the genes that in the file are flagged "exclude". You can customize the gene list and change the name of the gene group in
- # the `exclude:` param accordingly.
+
exclude_file:
- exclude: # this is the variable that defines the genes to be excluded in the above file, leave empty if you're not excluding genes from HVG
- # Do you want to filter the object to retain only Highly Variable Genes?
+ exclude:
filter: False
-# Regression variables, what do you want to regress out, leave blank if nothing
-# We recommend not regressing out unless you have good reason to.
+
regress_variables:
-#----------------------------
+
+
+#---------
# Scaling
-#----------------------------
-# This scaling has the effect that all genes are weighted equally for downstream analysis.
-# discussion from Leucken et al Best Practices paper: https://doi.org/10.15252/msb.20188746
-# "There is currently no consensus on whether or not to perform normalization over genes.
-# While the popular Seurat tutorials (Butler et al, 2018) generally apply gene scaling,
-# the authors of the Slingshot method opt against scaling over genes in their tutorial (Street et al, 2018).
-# The preference between the two choices revolves around whether all genes should be weighted equally for downstream analysis,
-# or whether the magnitude of expression of a gene is an informative proxy for the importance of the gene."
run_scale: True
-# if blank defaults scale, clip values as per: https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.scale.html
scale_max_value:
-#----------------------------
-# RNA Dimensionality reduction
-#----------------------------
+
+#-----------------------------
+# RNA Dimensionality reduction
pca:
- # number of components to compute
n_pcs: 50
- # set "default" will use 'arpack'.
- # Otherwise specify a different solver, see https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.pca.html
- solver: default
- # color to use when plotting the PCA
+ solver: default
color_by: sample_id
-# --------------------------------------------------------------------------------------------------------
-# Protein (PROT) steps
-# --------------------------------------------------------------------------------------------------------
+
+# ----------------------------------
+# Protein (PROT) preprocessing steps
+# ----------------------------------
prot:
- # comma separated string of normalisation options
- # options: dsb,clr
- # more details in this vignette https://muon.readthedocs.io/en/latest/omics/citeseq.html
- # dsb https://muon.readthedocs.io/en/latest/api/generated/muon.prot.pp.dsb.html
- # clr https://muon.readthedocs.io/en/latest/api/generated/muon.prot.pp.clr.html
normalisation_methods: clr,dsb
- # the normalised matrices are stored in layers called 'clr' and 'dsb', along with a layer called 'raw_counts'
- # if you choose to run both then 'dsb' is stored in X as default.
- # In downstream visualisation, you can either specify the layer, or take the default.
# CLR parameters:
- # margin determines whether you normalise per cell (as you would RNA norm),
- # or by feature (recommended, due to the variable nature of prot assays).
- # CLR margin 0 is recommended for informative qc plots in this pipeline
- # 0 = normalise colwise (per feature)
- # 1 = normalise rowwise (per cell)
- clr_margin: 0
+ # 0 = normalise row-wise (per cell)
+ # 1 = normalise column-wise (per feature, recommended)
+ clr_margin: 1
# DSB parameters:
- # you must specify the path to the background h5mu created in pipeline ingest in order to run dsb.
background_obj:
- # quantile clipping,
- # even with normalisation, some cells get extreme outliers which can be clipped as discussed https://github.com/niaid/dsb
- # maximum value will be set at the value of the 99.5% quantile, applied per feature
- # note that this feature is in the default muon mu.pp.dsb code, but manually implemented in this code.
quantile_clipping: True
-
- # which normalisation method to be stored in the X slot. If you choose to run more than one normalisation method,
- # which one to you want to store in the X slot, if not specified 'dsb' is the default when run.
- store_as_X:
- # do you want to save the prot normalised assay additionally as a txt file:
+ store_as_X:
save_norm_prot_mtx: False
- #----------------------------
- # Prot Dimensionality reduction
- #----------------------------
- # it may be useful to run PCA on the protein assay, when you have more than 50 features.
- # Set to False by default
+
+ #---------------------------------
+ # Protein Dimensionality reduction
pca: False
- # number of components. Specify at least n_pcs <= number of features -1
n_pcs: 50
- # which solver to use, set "default" will use 'arpack'.
solver: default
- # column to be fetched from the protein layer .obs
color_by: sample_id
-# --------------------------------------------------------------------------------------------------------
-# ATAC steps
-# --------------------------------------------------------------------------------------------------------
+# ------------------------
+# ATAC preprocessing steps
+# ------------------------
atac:
binarize: False
- normalize: TFIDF # "log1p" or "TFIDF"
- # if normalize = "TFIDF", else leave blank:
- TFIDF_flavour: signac # "signac", "logTF" or "logIDF"
- # highly variable feature selection:
- # HVF selection either with scanpy's pp.highly_variable_genes() function or a pseudo-FindTopFeatures() function of the signac package
- feature_selection_flavour: signac # "signac" or "scanpy"
- # parameters for HVF flavour == "scanpy", leave the below blank to use defaults
- min_mean: #default 0.05
- max_mean: #default 1.5
- min_disp: #default 0.5
- # if n_top_features is specified, it overwrites previous defaults for HVF selection
- n_top_features:
- # Filter the atac layer to retain only HVF
+ normalize: TFIDF #"log1p" or "TFIDF"
+ TFIDF_flavour: signac #"signac", "logTF" or "logIDF"
+ feature_selection_flavour: signac #"signac" or "scanpy"
+
+ # parameters for feature_selection_flavour == "scanpy", leave blank to use defaults
+ min_mean: #default 0.05
+ max_mean: #default 1.5
+ min_disp: #default 0.5
+ n_top_features: #if specified, overwrites previous defaults for HVF selection
filter_by_hvf: False
- # parameter for HVF flavour == "signac"
+
+ # parameter for feature_selection_flavour == "signac"
min_cutoff: q5
- # min_cutoff can be specified as follows:
- # "q[x]": "q" followed by the minimum percentile, e.g. q5 will set the top 95% most common features as higly variable
- # "c[x]": "c" followed by a minimum cell count, e.g. c100 will set features present in > 100 cells as highly variable
- # "tc[x]": "tc" followed by a minimum total count, e.g. tc100 will set features with total counts > 100 as highly variable
- # "NULL": All features are assigned as highly variable
- # "NA": Highly variable features won't be changed
- #----------------------------
+
+ #------------------------------
# ATAC Dimensionality reduction
- #----------------------------
- dimred: LSI #PCA or LSI (LSI will only be computed if the normalize param is set to TFIDF)
- n_comps: 50 # how many components to compute
- # which dimension to exclude from further processing (sometimes useful to remove PC/LSI_1 if it's associated to tech factors)
- # leave blank to retain all
- # if using PCA, which solver to use. Default == 'arpack'
+ dimred: LSI #PCA or LSI
+ n_comps: 50
solver: default
- # what covariate to use to color the dimensionality reduction
color_by: sample_id
- # whether to remove the component(s) associated to technical effects, common to remove 1 for LSI
- # leave blank to avoid removing any
dim_remove:
-
-
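Two string formats in this config are easy to get wrong: the comma-separated metric lists under `plotqc` (blank means "do not plot") and the `min_cutoff` codes described in the comments removed above (`q[x]`, `c[x]`, `tc[x]`, `NULL`, `NA`). The sketch below shows one way such values could be interpreted. The helper names `parse_metrics` and `parse_min_cutoff` are hypothetical and are not part of the panpipes API; the actual pipeline may parse these values differently.

```python
import re


def parse_metrics(value):
    """Split a comma-separated metrics string such as "a,b,c" into a list.

    A blank or missing value yields an empty list, i.e. nothing to plot.
    (Hypothetical helper, not taken from panpipes.)
    """
    if value is None or not str(value).strip():
        return []
    return [item for item in str(value).split(",") if item]


def parse_min_cutoff(value):
    """Interpret a min_cutoff code as a (kind, threshold) tuple.

    kind is one of "percentile", "cell_count", "total_count",
    "all" (NULL) or "unchanged" (NA). (Hypothetical helper.)
    """
    if value == "NULL":
        return ("all", None)
    if value == "NA":
        return ("unchanged", None)
    match = re.fullmatch(r"(q|c|tc)(\d+)", value)
    if match is None:
        raise ValueError(f"unrecognised min_cutoff: {value!r}")
    kinds = {"q": "percentile", "c": "cell_count", "tc": "total_count"}
    return (kinds[match.group(1)], int(match.group(2)))


rna_metrics = parse_metrics("pct_counts_mt,pct_counts_rp,doublet_scores")
atac_metrics = parse_metrics(None)  # blank entry in the YAML
cutoff = parse_min_cutoff("q5")
```

With the defaults shown in this file, `atac_metrics` would be empty (no ATAC QC plots) and `min_cutoff: q5` would be read as a percentile threshold of 5.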