Update dataset loaders #909

Merged: 14 commits, Oct 18, 2024
3 changes: 2 additions & 1 deletion _viash.yaml
@@ -6,7 +6,7 @@ viash_version: 0.9.0
 description: |
   Open Problems is a living, extensible, community-guided benchmarking platform.
 license: MIT
-keywords: [openproblems, benchmarking, single-cell]
+keywords: [openproblems, benchmarking, single-cell omics]
 
 references:
   doi:
@@ -24,6 +24,7 @@ config_mods: |
   .runners[.type == "nextflow"].config.labels := { lowmem : "memory = 20.Gb", midmem : "memory = 50.Gb", highmem : "memory = 100.Gb", lowcpu : "cpus = 5", midcpu : "cpus = 15", highcpu : "cpus = 30", lowtime : "time = 1.h", midtime : "time = 4.h", hightime : "time = 8.h", veryhightime : "time = 24.h" }
   .runners[.type == "nextflow"].config.script := "process.errorStrategy = 'ignore'"
 
+
 info:
   test_resources:
     - type: s3
31 changes: 15 additions & 16 deletions src/datasets/api/comp_dataset_loader.yaml
@@ -1,16 +1,15 @@
-functionality:
-  namespace: "datasets/loaders"
-  info:
-    type: dataset_loader
-    type_info:
-      label: Dataset loader
-      summary: A component which generates a "Common dataset".
-      description: |
-        A dataset loader will typically have an identifier (e.g. a GEO identifier)
-        or URL as input argument and additional arguments to define where the script needs to download a dataset from and how to process it.
-  arguments:
-    - name: "--output"
-      __merge__: file_raw.yaml
-      direction: "output"
-      required: true
-  test_resources: []
+# namespace: "datasets/loaders"
+info:
+  type: dataset_loader
+  type_info:
+    label: Dataset loader
+    summary: A component which generates a "Common dataset".
+    description: |
+      A dataset loader will typically have an identifier (e.g. a GEO identifier)
+      or URL as input argument and additional arguments to define where the script needs to download a dataset from and how to process it.
+arguments:
+  - name: "--output"
+    __merge__: file_raw.yaml
+    direction: "output"
+    required: true
+test_resources: []
71 changes: 35 additions & 36 deletions src/datasets/api/comp_normalization.yaml
@@ -1,36 +1,35 @@
-functionality:
-  namespace: "datasets/normalization"
-  info:
-    type: dataset_normalization
-    type_info:
-      label: Dataset normalization
-      summary: |
-        A normalization method which processes the raw counts into a normalized dataset.
-      description:
-        A component for normalizing the raw counts as output by dataset loaders into a normalized dataset.
-  arguments:
-    - name: "--input"
-      __merge__: file_raw.yaml
-      direction: input
-      required: true
-    - name: "--output"
-      __merge__: file_normalized.yaml
-      direction: output
-      required: true
-    - name: "--normalization_id"
-      type: string
-      description: "The normalization id to store in the dataset metadata. If not specified, the functionality name will be used."
-      required: false
-    - name: "--layer_output"
-      type: string
-      default: "normalized"
-      description: The name of the layer in which to store the normalized data.
-    - name: "--obs_size_factors"
-      type: string
-      default: "size_factors"
-      description: In which .obs slot to store the size factors (if any).
-  test_resources:
-    - path: /resources_test/common/pancreas
-      dest: resources_test/common/pancreas
-    - type: python_script
-      path: /src/common/comp_tests/run_and_check_adata.py
+namespace: "datasets/normalization"
+info:
+  type: dataset_normalization
+  type_info:
+    label: Dataset normalization
+    summary: |
+      A normalization method which processes the raw counts into a normalized dataset.
+    description:
+      A component for normalizing the raw counts as output by dataset loaders into a normalized dataset.
+arguments:
+  - name: "--input"
+    __merge__: file_raw.yaml
+    direction: input
+    required: true
+  - name: "--output"
+    __merge__: file_normalized.yaml
+    direction: output
+    required: true
+  - name: "--normalization_id"
+    type: string
+    description: "The normalization id to store in the dataset metadata. If not specified, the functionality name will be used."
+    required: false
+  - name: "--layer_output"
+    type: string
+    default: "normalized"
+    description: The name of the layer in which to store the normalized data.
+  - name: "--obs_size_factors"
+    type: string
+    default: "size_factors"
+    description: In which .obs slot to store the size factors (if any).
+test_resources:
+  - path: /resources_test/common/pancreas
+    dest: resources_test/common/pancreas
+  - type: python_script
+    path: /common/component_tests/run_and_check_output.py
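Reviewer note (illustrative, not part of the diff): a minimal sketch of how a component behind this normalization API might use `--layer_output`, `--obs_size_factors`, and `--normalization_id`. The function name `normalize_counts`, the `component_name` fallback value, the median-scaling method, and the choice of `uns` as the metadata slot are all assumptions made for the example; plain dictionaries stand in for an AnnData object.

```python
from statistics import median

# Hypothetical helper, not taken from the PR. Median scaling is only a
# stand-in for whatever normalization a real component applies.
def normalize_counts(counts, normalization_id=None, component_name="median_scaling",
                     layer_output="normalized", obs_size_factors="size_factors"):
    """counts: one list of raw counts per cell."""
    totals = [sum(cell) for cell in counts]
    positive = [t for t in totals if t > 0]
    target = median(positive) if positive else 1.0
    # Cells with zero total counts get a size factor of 1.0 to avoid dividing by zero.
    size_factors = [t / target if t > 0 else 1.0 for t in totals]
    normalized = [[c / sf for c in cell] for cell, sf in zip(counts, size_factors)]
    return {
        "layers": {layer_output: normalized},
        "obs": {obs_size_factors: size_factors},
        # Fall back to the component name when no id is given (assumed to live in uns).
        "uns": {"normalization_id": normalization_id or component_name},
    }

result = normalize_counts([[1, 1], [2, 2]])
```

With the defaults above, the normalized values land under the `normalized` layer and the per-cell factors under `size_factors`, matching the argument defaults in the config.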
79 changes: 39 additions & 40 deletions src/datasets/api/comp_processor_hvg.yaml
@@ -1,40 +1,39 @@
-functionality:
-  namespace: "datasets/processors"
-  info:
-    type: dataset_processor
-    type_info:
-      label: HVG
-      summary: |
-        Computes the highly variable genes scores.
-      description: |
-        The resulting AnnData will contain both a boolean 'hvg' column in 'var', as well as a numerical 'hvg_score' in 'var'.
-  arguments:
-    - name: "--input"
-      __merge__: file_normalized.yaml
-      required: true
-      direction: input
-    - name: "--input_layer"
-      type: string
-      default: "normalized"
-      description: Which layer to use as input.
-    - name: "--output"
-      direction: output
-      __merge__: file_hvg.yaml
-      required: true
-    - name: "--var_hvg"
-      type: string
-      default: "hvg"
-      description: "In which .var slot to store whether a feature is considered to be hvg."
-    - name: "--var_hvg_score"
-      type: string
-      default: "hvg_score"
-      description: "In which .var slot to store the gene variance score (normalized dispersion)."
-    - name: "--num_features"
-      type: integer
-      default: 1000
-      description: "The number of HVG to select"
-  test_resources:
-    - path: /resources_test/common/pancreas
-      dest: resources_test/common/pancreas
-    - type: python_script
-      path: /src/common/comp_tests/run_and_check_adata.py
+namespace: "datasets/processors"
+info:
+  type: dataset_processor
+  type_info:
+    label: HVG
+    summary: |
+      Computes the highly variable genes scores.
+    description: |
+      The resulting AnnData will contain both a boolean 'hvg' column in 'var', as well as a numerical 'hvg_score' in 'var'.
+arguments:
+  - name: "--input"
+    __merge__: file_normalized.yaml
+    required: true
+    direction: input
+  - name: "--input_layer"
+    type: string
+    default: "normalized"
+    description: Which layer to use as input.
+  - name: "--output"
+    direction: output
+    __merge__: file_hvg.yaml
+    required: true
+  - name: "--var_hvg"
+    type: string
+    default: "hvg"
+    description: "In which .var slot to store whether a feature is considered to be hvg."
+  - name: "--var_hvg_score"
+    type: string
+    default: "hvg_score"
+    description: "In which .var slot to store the gene variance score (normalized dispersion)."
+  - name: "--num_features"
+    type: integer
+    default: 1000
+    description: "The number of HVG to select"
+test_resources:
+  - path: /resources_test/common/pancreas
+    dest: resources_test/common/pancreas
+  - type: python_script
+    path: /common/component_tests/run_and_check_output.py
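Reviewer note (illustrative, not part of the diff): a rough sketch of how `--num_features`, `--var_hvg`, and `--var_hvg_score` could fit together, assuming the component simply keeps the top-scoring features. The function name `select_hvg` is hypothetical, the scores are taken as given rather than computed, and a plain dictionary stands in for `.var`.

```python
# Hypothetical helper, not taken from the PR: flags the num_features
# highest-scoring features and stores both columns under the configured names.
def select_hvg(scores, num_features=1000, var_hvg="hvg", var_hvg_score="hvg_score"):
    # Rank feature indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:num_features])
    return {
        var_hvg: [i in selected for i in range(len(scores))],  # boolean column
        var_hvg_score: list(scores),                           # numerical column
    }

var = select_hvg([0.1, 2.0, 1.5, 0.3], num_features=2)
```

If fewer features exist than `num_features`, every feature is flagged in this sketch; how the real component handles that case is not stated in the diff.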
77 changes: 38 additions & 39 deletions src/datasets/api/comp_processor_knn.yaml
@@ -1,39 +1,38 @@
-functionality:
-  namespace: "datasets/processors"
-  info:
-    type: dataset_processor
-    type_info:
-      label: KNN
-      summary: |
-        Computes the k-nearest-neighbours for each cell.
-      description: |
-        The resulting AnnData will contain both the knn distances and the knn connectivities in 'obsp'.
-  arguments:
-    - name: "--input"
-      __merge__: file_pca.yaml
-      required: true
-      direction: input
-    - name: "--input_layer"
-      type: string
-      default: "normalized"
-      description: Which layer to use as input.
-    - name: "--output"
-      direction: output
-      __merge__: file_knn.yaml
-      required: true
-    - name: "--key_added"
-      type: string
-      default: "knn"
-      description: |
-        The neighbors data is added to `.uns[key_added]`,
-        distances are stored in `.obsp[key_added+'_distances']` and
-        connectivities in `.obsp[key_added+'_connectivities']`.
-    - name: "--num_neighbors"
-      type: integer
-      default: 15
-      description: "The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation."
-  test_resources:
-    - path: /resources_test/common/pancreas
-      dest: resources_test/common/pancreas
-    - type: python_script
-      path: /src/common/comp_tests/run_and_check_adata.py
+namespace: "datasets/processors"
+info:
+  type: dataset_processor
+  type_info:
+    label: KNN
+    summary: |
+      Computes the k-nearest-neighbours for each cell.
+    description: |
+      The resulting AnnData will contain both the knn distances and the knn connectivities in 'obsp'.
+arguments:
+  - name: "--input"
+    __merge__: file_pca.yaml
+    required: true
+    direction: input
+  - name: "--input_layer"
+    type: string
+    default: "normalized"
+    description: Which layer to use as input.
+  - name: "--output"
+    direction: output
+    __merge__: file_knn.yaml
+    required: true
+  - name: "--key_added"
+    type: string
+    default: "knn"
+    description: |
+      The neighbors data is added to `.uns[key_added]`,
+      distances are stored in `.obsp[key_added+'_distances']` and
+      connectivities in `.obsp[key_added+'_connectivities']`.
+  - name: "--num_neighbors"
+    type: integer
+    default: 15
+    description: "The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation."
+test_resources:
+  - path: /resources_test/common/pancreas
+    dest: resources_test/common/pancreas
+  - type: python_script
+    path: /common/component_tests/run_and_check_output.py
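Reviewer note (illustrative, not part of the diff): a small sketch of the slot naming described for `--key_added`, using a brute-force Euclidean neighbour search on a toy embedding. The function name `compute_knn`, the binary connectivities, and the contents of the `uns` entry are assumptions for the example; a real component would likely delegate to a library and produce weighted connectivities.

```python
import math

# Hypothetical helper, not taken from the PR. Nested dicts and lists stand in
# for the .uns and .obsp slots of an AnnData object.
def compute_knn(embedding, num_neighbors=15, key_added="knn"):
    n = len(embedding)
    k = min(num_neighbors, n - 1)  # a cell cannot have more neighbours than n - 1
    distances = [[0.0] * n for _ in range(n)]
    connectivities = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances from cell i to every other cell, nearest first.
        candidates = sorted(
            (math.dist(embedding[i], embedding[j]), j) for j in range(n) if j != i
        )
        for dist, j in candidates[:k]:
            distances[i][j] = dist
            connectivities[i][j] = 1.0  # binary weights: an assumption of this sketch
    return {
        "uns": {key_added: {"params": {"n_neighbors": num_neighbors}}},
        "obsp": {
            key_added + "_distances": distances,
            key_added + "_connectivities": connectivities,
        },
    }

adata_like = compute_knn([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]], num_neighbors=1)
```

With the default `key_added="knn"`, the two matrices end up under `knn_distances` and `knn_connectivities`, as the argument description states.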
95 changes: 47 additions & 48 deletions src/datasets/api/comp_processor_pca.yaml
@@ -1,49 +1,48 @@
-functionality:
-  namespace: "datasets/processors"
-  info:
-    type: dataset_processor
-    type_info:
-      label: PCA
-      summary: |
-        Computes a PCA embedding of the normalized data.
-      description:
-        The resulting AnnData will contain an embedding in obsm, as well as optional loadings in 'varm'.
-  arguments:
-    - name: "--input"
-      __merge__: file_hvg.yaml
-      required: true
-      direction: input
-    - name: "--input_layer"
-      type: string
-      default: "normalized"
-      description: Which layer to use as input.
-    - name: "--input_var_features"
-      type: string
-      description: Column name in .var matrix that will be used to select which genes to run the PCA on.
-      default: hvg
-    - name: "--output"
-      direction: output
-      __merge__: file_pca.yaml
-      required: true
-    - name: "--obsm_embedding"
-      type: string
-      default: "X_pca"
-      description: "In which .obsm slot to store the resulting embedding."
-    - name: "--varm_loadings"
-      type: string
-      default: "pca_loadings"
-      description: "In which .varm slot to store the resulting loadings matrix."
-    - name: "--uns_variance"
-      type: string
-      default: "pca_variance"
-      description: "In which .uns slot to store the resulting variance objects."
-    - name: "--num_components"
-      type: integer
-      example: 25
-      description: Number of principal components to compute. Defaults to 50, or 1 - minimum dimension size of selected representation.
-  test_resources:
-    - path: /resources_test/common/pancreas
-      dest: resources_test/common/pancreas
-    - type: python_script
-      path: /src/common/comp_tests/run_and_check_adata.py
+namespace: "datasets/processors"
+info:
+  type: dataset_processor
+  type_info:
+    label: PCA
+    summary: |
+      Computes a PCA embedding of the normalized data.
+    description:
+      The resulting AnnData will contain an embedding in obsm, as well as optional loadings in 'varm'.
+arguments:
+  - name: "--input"
+    __merge__: file_hvg.yaml
+    required: true
+    direction: input
+  - name: "--input_layer"
+    type: string
+    default: "normalized"
+    description: Which layer to use as input.
+  - name: "--input_var_features"
+    type: string
+    description: Column name in .var matrix that will be used to select which genes to run the PCA on.
+    default: hvg
+  - name: "--output"
+    direction: output
+    __merge__: file_pca.yaml
+    required: true
+  - name: "--obsm_embedding"
+    type: string
+    default: "X_pca"
+    description: "In which .obsm slot to store the resulting embedding."
+  - name: "--varm_loadings"
+    type: string
+    default: "pca_loadings"
+    description: "In which .varm slot to store the resulting loadings matrix."
+  - name: "--uns_variance"
+    type: string
+    default: "pca_variance"
+    description: "In which .uns slot to store the resulting variance objects."
+  - name: "--num_components"
+    type: integer
+    example: 25
+    description: Number of principal components to compute. Defaults to 50, or 1 - minimum dimension size of selected representation.
+test_resources:
+  - path: /resources_test/common/pancreas
+    dest: resources_test/common/pancreas
+  - type: python_script
+    path: /common/component_tests/run_and_check_output.py
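Reviewer note (illustrative, not part of the diff): the `--num_components` description is terse, so here is one plausible reading of its default, assuming it means "50, unless the smaller matrix dimension is 50 or less, in which case that dimension minus one". The function name `resolve_num_components` is hypothetical, and the interpretation itself is an assumption based on the wording.

```python
# Hypothetical helper, not taken from the PR: resolves the number of principal
# components when --num_components is left unset.
def resolve_num_components(n_obs, n_vars, num_components=None, default=50):
    if num_components is not None:
        return num_components  # an explicit value always wins
    min_dim = min(n_obs, n_vars)
    # Use the default when the data is large enough, otherwise min_dim - 1.
    return default if default < min_dim else min_dim - 1

large = resolve_num_components(n_obs=1000, n_vars=2000)
small = resolve_num_components(n_obs=30, n_vars=2000)
explicit = resolve_num_components(n_obs=1000, n_vars=2000, num_components=25)
```

The `example: 25` in the config corresponds to the explicit case; the other two calls show the fallback for large and small inputs under the stated assumption.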
