clustering yaml created #202

Merged
merged 8 commits into from
Apr 26, 2024
3 changes: 2 additions & 1 deletion docs/yaml_docs/index.rst
@@ -12,4 +12,5 @@ Workflows configuration files
spatial_qc
spatial_preprocess
spatial_deconvolution
pipeline_refmap_yml.md
pipeline_refmap_yml

66 changes: 41 additions & 25 deletions docs/yaml_docs/pipeline_clustering_yml.md
@@ -14,7 +14,10 @@ In this documentation, the parameters of the `clustering` configuration yaml fil
This file is generated running `panpipes clustering config`. <br>
The individual steps run by the pipeline are described in [clustering workflow](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/clustering.html)

When running the clustering workflow, panpipes provides a basic `pipeline.yml` file.
The `clustering` workflow works with outputs generated by the `integration` workflow, and expects a `MuData` object with
`neighbors` saved in the `.uns` of the global layer to run clustering on the multimodal embedding. If `neighbors` are calculated on each modality layer, these can be reused or re-calculated on the fly.

When running the clustering workflow, panpipes provides a basic `pipeline.yml` file to customize with parameters.
To run the workflow on your own data, you need to specify the parameters described below in the `pipeline.yml` file to meet the requirements of your data.

However, we do provide pre-filled versions of the `pipeline.yml` file for individual [tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html).
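
As a rough pre-flight check, the `neighbors` requirement described above could be verified before launching the workflow. This is a minimal sketch, assuming an object that exposes `.uns` and `.mod` the way a `MuData` object does; the helper name `find_neighbors` and the toy object are hypothetical and not part of panpipes.

```python
from types import SimpleNamespace


def find_neighbors(mdata, key="neighbors"):
    """Return the layers whose .uns contains `key`.

    "global" stands for the top-level (multimodal) layer.
    Hypothetical helper for illustration, not a panpipes function.
    """
    found = []
    if key in mdata.uns:
        found.append("global")
    for name, mod in mdata.mod.items():
        if key in mod.uns:
            found.append(name)
    return found


# Stand-in object for illustration; a real MuData exposes the same attributes.
toy = SimpleNamespace(
    uns={},
    mod={
        "rna": SimpleNamespace(uns={"neighbors": {}}),
        "prot": SimpleNamespace(uns={}),
    },
)
print(find_neighbors(toy))
```

Layers missing from the returned list would need `use_existing: False` so that neighbours are recalculated.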
@@ -62,24 +65,30 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th
Specify the full object if your scaled_obj contains only HVG. If your scaled_obj contains all the genes then leave full_obj blank.
panpipes will use the full object to do marker genes analysis (rank_gene_groups) and for plotting those genes.
- <span class="parameter">modalities</span><br>
- <span class="parameter">rna</span> `Boolean`, Default: True<br>
Which modalities to run clustering on.
- <span class="parameter">rna</span> `Boolean`, Default: True<br> If set to `True`, the workflow will stop if it doesn't find a modality named 'rna'
- <span class="parameter">prot</span> `Boolean`, Default: True<br>
If set to `True`, the workflow will stop if it doesn't find a modality named 'prot'
- <span class="parameter">atac</span> `Boolean`, Default: False<br>
If set to `True`, the workflow will stop if it doesn't find a modality named 'atac'

- <span class="parameter">spatial</span> `Boolean`, Default: False<br>
Run clustering on each individual modality.
If set to `True`, the workflow will stop if it doesn't find a modality named 'spatial'


- <span class="parameter">multimodal</span><br>
- <span class="parameter">rna_clustering</span> `Boolean`, Default: True<br>
- <span class="parameter">integration_method</span> `String`, Default: WNN<br>
Options here include WNN, mofa, and totalVI, and it tells us where to look for.
- <span class="parameter">run_clustering</span> `Boolean`, Default: False<br> If set to `True`, runs clustering on the multimodal embedding.
- <span class="parameter">integration_method</span> `String`, Default: None<br>
If you have run WNN and want to run clustering on the WNN embedding, specify "WNN" here. Only for WNN are the neighbours saved under a different `--neighbors_key`; for every other method (totalvi, multivi, mofa), leave this parameter blank.


## Parameters for finding neighbours

- <span class="parameter">neighbors:</span>
Sets the number of neighbors to use when calculating the graph for clustering and umap.
- <span class="parameter">rna:</span>

- <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
- <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
- <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
Defines which representation in .obsm to use for nearest neighbors
- <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -94,7 +103,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th

- <span class="parameter">prot:</span>

- <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
- <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
- <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
Defines which representation in .obsm to use for nearest neighbors
- <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -109,7 +118,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th

- <span class="parameter">atac:</span>

- <span class="parameter">use_existing </span> `Boolean`, Default: True<br>
- <span class="parameter">use_existing </span> `Boolean`, Default: True<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
- <span class="parameter">dim_red </span> `String`, Default: X_lsi<br>
Defines which representation in .obsm to use for nearest neighbors
- <span class="parameter">n_dim_red</span> `Integer`, Default: 1<br>
@@ -125,7 +134,7 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th

- <span class="parameter">spatial:</span>

- <span class="parameter">use_existing </span> `Boolean`, Default: False<br>
- <span class="parameter">use_existing </span> `Boolean`, Default: False<br> Use existing neighbours in .uns calculated in the `integration` workflow. If `False`, it will recalculate using the following parameters
- <span class="parameter">dim_red </span> `String`, Default: X_pca<br>
Defines which representation in .obsm to use for nearest neighbors
- <span class="parameter">n_dim_red</span> `Integer`, Default: 30<br>
@@ -142,51 +151,51 @@ Prefix for the sample that comes out of the filtering/ preprocessing steps of th

- <span class="parameter">umap:</span>

- <span class="parameter">run </span> `Boolean`, Default: True<br>
- <span class="parameter">run </span> `Boolean`, Default: True<br> If set to `True`, runs the UMAP calculation and plotting.
- <span class="parameter">rna:</span>
- <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
Can specify an array: 0.25,0.5
Can specify a single float or an array: 0.25,0.5
- <span class="parameter">prot:</span>
- <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
Can specify an array: 0.25,0.5,0.8
Can specify a single float or an array: 0.25,0.5,0.8
- <span class="parameter">atac:</span>
- <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
Can specify an array: 0.25,0.5,0.8
Can specify a single float or an array: 0.25,0.5,0.8
- <span class="parameter">multimodal:</span>
- <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
Can specify an array: 0.25,0.5,0.8
Can specify a single float or an array: 0.25,0.5,0.8
- <span class="parameter">rna:</span>
- <span class="parameter">mindist </span> `Float`, Default: 0.5<br>
Can specify an array: 0.25,0.5,0.8
Can specify a single float or an array: 0.25,0.5,0.8

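A YAML parser typically reads `0.5` as a float but `0.25,0.5` as a string, so a consumer of these values has to handle both forms. The following is a minimal sketch of such a normalisation; the helper name `parse_float_list` is hypothetical and is not how panpipes necessarily parses the file.

```python
def parse_float_list(value):
    """Normalise a config value to a list of floats.

    Accepts a single number (0.5), a comma-separated string
    ("0.25,0.5,0.8"), or an existing list ([0.25, 0.5]).
    Hypothetical helper for illustration only.
    """
    if isinstance(value, (int, float)):
        return [float(value)]
    if isinstance(value, str):
        # Skip empty pieces so a trailing comma does not raise an error.
        return [float(v) for v in value.split(",") if v.strip()]
    return [float(v) for v in value]


print(parse_float_list(0.5))
print(parse_float_list("0.25,0.5,0.8"))
```

The same normalisation would apply to the `resolutions` values of the clustering parameters.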
## Parameters for clustering

- <span class="parameter">clusterspecs:</span>
- <span class="parameter">rna:</span>
- <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
Can specify an array: 0.2,0.6,1
Can specify a single float or an array: 0.2,0.6,1
- <span class="parameter">algorithm</span> `String`, Default: leiden<br>
Options include louvain or leiden.
- <span class="parameter">prot:</span>
- <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
Can specify an array: 0.2,0.6,1
Can specify a single float or an array: 0.2,0.6,1
- <span class="parameter">algorithm</span> `String`, Default: leiden<br>
Options include louvain or leiden.

- <span class="parameter">atac:</span>
- <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
Can specify an array to compute in parallel: 0.2,0.6,1
Can specify a single float or an array to compute in parallel: 0.2,0.6,1
- <span class="parameter">algorithm</span> `String`, Default: leiden<br>
Options include louvain or leiden.
- <span class="parameter">multimodal:</span>
- <span class="parameter">resolutions </span> `Float`, Default: 0.5, 0.7<br>
Can specify an array to compute in parallel: 0.2,0.6,1
Can specify a single float or an array to compute in parallel: 0.2,0.6,1
- <span class="parameter">algorithm</span> `String`, Default: leiden<br>
Options include louvain or leiden.

- <span class="parameter">spatial:</span>
- <span class="parameter">resolutions </span> `Float`, Default: 0.2, 0.6, 1<br>
Can specify an array to compute in parallel: 0.2,0.6,1
Can specify a single float or an array to compute in parallel: 0.2,0.6,1
- <span class="parameter">algorithm</span> `String`, Default: leiden<br>
Options include louvain or leiden.

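One way to picture the `clusterspecs` block is as a list of independent clustering jobs, one per modality and resolution, which is what allows them to be computed in parallel. The sketch below only enumerates those combinations from a dictionary that mirrors some of the defaults listed above; the function name `clustering_jobs` is hypothetical and the snippet does not run any clustering.

```python
# Mirrors part of the documented defaults; other modalities follow the same shape.
clusterspecs = {
    "rna": {"resolutions": [0.2, 0.6, 1], "algorithm": "leiden"},
    "multimodal": {"resolutions": [0.5, 0.7], "algorithm": "leiden"},
}


def clustering_jobs(specs):
    """Expand clusterspecs into (modality, algorithm, resolution) tuples."""
    jobs = []
    for modality, spec in specs.items():
        for resolution in spec["resolutions"]:
            jobs.append((modality, spec["algorithm"], resolution))
    return jobs


for job in clustering_jobs(clusterspecs):
    print(job)
```
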
@@ -207,8 +216,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
Marker analysis is run for clusters >= mincells. If a cluster ncells < mincells , then the cluster is excluded from marker analysis
- <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
- <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations.
This parameter is mandatory if pseudo_seurat is set to True
- <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells.
This parameter is mandatory if pseudo_seurat is set to True
- <span class="parameter">prot:</span><br>
- <span class="parameter">run </span> `Boolean`, Default: True<br>
@@ -219,8 +230,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
- <span class="parameter">method </span> `String`, Default: wilcoxon<br>
- <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
- <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations.
This parameter is mandatory if pseudo_seurat is set to True
- <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells.
This parameter is mandatory if pseudo_seurat is set to True

- <span class="parameter">atac:</span><br>
@@ -234,8 +247,10 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
- <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
- <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations.
This parameter is mandatory if pseudo_seurat is set to True
- <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells.
This parameter is mandatory if pseudo_seurat is set to True


@@ -246,9 +261,9 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
- <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
- <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
This parameter is mandatory if pseudo_seurat is set to True
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True
- <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
This parameter is mandatory if pseudo_seurat is set to True
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True


- <span class="parameter">spatial:</span><br>
@@ -261,11 +276,12 @@ When pseudo_seurat is set to True then a [python implementation](https://github.
Marker analysis is run for clusters >= mincells. If a cluster ncells < mincells , then the cluster is excluded from marker analysis
- <span class="parameter">pseudo_seurat </span> `Boolean`, Default: False<br>
- <span class="parameter">minpct </span> `Float`, Default: 0.1<br>
This parameter is mandatory if pseudo_seurat is set to True
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True
- <span class="parameter">threshuse </span> `Float`, Default: 0.25<br>
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells.
This parameter is mandatory if pseudo_seurat is set to True
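
The three filters described above (`mincells`, `minpct`, `threshuse`) can be read as simple pre-filters applied before any statistical test. The following sketch illustrates that reading with plain Python values; the function names are hypothetical, and treating the log-scale difference as an absolute difference of group means is an assumption, not a statement about the panpipes implementation.

```python
def clusters_to_test(cluster_sizes, mincells):
    """Keep clusters with at least `mincells` cells."""
    return [c for c, n in cluster_sizes.items() if n >= mincells]


def passes_prefilter(pct_in, pct_out, mean_log_in, mean_log_out,
                     minpct=0.1, threshuse=0.25):
    """Decide whether a gene is tested at all.

    pct_in / pct_out: fraction of cells expressing the gene in each group.
    mean_log_in / mean_log_out: mean log-scale expression in each group.
    """
    # Detected in a minimum fraction of cells in either population.
    detected = max(pct_in, pct_out) >= minpct
    # Assumed here: absolute difference of the log-scale means.
    changed = abs(mean_log_in - mean_log_out) >= threshuse
    return detected and changed


print(clusters_to_test({"0": 500, "1": 8, "2": 40}, mincells=10))
print(passes_prefilter(0.30, 0.05, 1.0, 0.5))
```
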
## Plot specifications
Used to define which metadata columns are used in the visualizations
Define which layers are used in the markers visualization
- <span class="parameter">plotspecs:</span><br>
- <span class="parameter">layers: </span><br>
- <span class="parameter">rna </span> `String`, Default: logged_counts<br>
3 changes: 2 additions & 1 deletion panpipes/panpipes/pipeline_clustering.py
@@ -43,9 +43,10 @@ def set_up_dirs(log_file):
## Single modality scripts
## ------------------------------------

# -----------------------------------=
# --------------------------------------
# neighbors
# --------------------------------------
# TO DO create task to re-run neighbours on multimodal outer representations (this script can only read in each mod layer)
@follows(set_up_dirs)
@originate(PARAMS['mudata_with_knn'])
def run_neighbors(outfile):
7 changes: 6 additions & 1 deletion panpipes/panpipes/pipeline_clustering/pipeline.yml
@@ -29,7 +29,7 @@ modalities:
atac: False
spatial: False

# if True, will look for WNN, or totalVI output
# if True, will look for WNN, mofa, multivi, totalVI embeddings
multimodal:
run_clustering: True
integration_method:
@@ -40,22 +40,26 @@ multimodal:
# ---------------------------------------
#
# -----------------------------

neighbors:
rna:
#use the knn calculated in the integration workflow. If False it will recalculate
use_existing: True
dim_red: X_pca
n_dim_red: 30
k: 30
metric: euclidean
method: scanpy
prot:
#use the knn calculated in the integration workflow. If False it will recalculate
use_existing: True
dim_red: X_pca
n_dim_red: 30
k: 30
metric: euclidean
method: scanpy
atac:
#use the knn calculated in the integration workflow. If False it will recalculate
use_existing: True
dim_red: X_lsi
dim_remove: 1
@@ -64,6 +68,7 @@ neighbors:
metric: euclidean
method: scanpy
spatial:
#use the knn calculated in the integration workflow. If False it will recalculate
use_existing: False
dim_red: X_pca
n_dim_red: 30
2 changes: 1 addition & 1 deletion panpipes/python_scripts/run_umap.py
@@ -33,7 +33,7 @@
default=0.1,
help="no. neighbours parameters for sc.pp.neighbors()")
parser.add_argument("--neighbors_key",
default="neighbors", help="algortihm choice from louvain and leiden")
default="neighbors", help="name of the saved knn neighbors")

args, opt = parser.parse_known_args()
L.info(args)
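
For readers unfamiliar with `parse_known_args`, the sketch below shows how the corrected `--neighbors_key` option behaves in isolation. It is a standalone illustration, not the full `run_umap.py` parser; the value `wnn` and the flag `--some_other_flag` are made up for the example.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--neighbors_key",
                    default="neighbors", help="name of the saved knn neighbors")

# parse_known_args returns the parsed namespace plus any unrecognised arguments
# instead of exiting with an error.
args, unknown = parser.parse_known_args(
    ["--neighbors_key", "wnn", "--some_other_flag", "1"]
)
print(args.neighbors_key)
print(unknown)

# Without the flag, the default key is used.
defaults, _ = parser.parse_known_args([])
print(defaults.neighbors_key)
```
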