# OpenEBench-based Nextflow workflow for assessment of a bioinformatics tool's performance in quantifying poly(A) site usage from RNA-seq data
This README describes the APAeval (absolute) quantification benchmarking workflow. For a more general introduction to benchmarking workflows, see the main benchmarking workflow `README.md`. For the specification of metrics and of input and output file formats, see the quantification benchmarks specification.

See the description in the main benchmarking workflow `README.md`.
## Validation

- `input_file`: output file from a method workflow in BED6 format.
- Validation checks performed in `quantification_dockers/q_validation/validation.py` (a minimal sketch of these checks follows below the list):
  - The input file has to be a tab-separated file with 6 columns.
  - Start and end coordinates (columns 2 and 3) have to be of type int64.
  - The strand (column 6) has to be one of `[+,-]`.
  - The chromosome names (column 1) have to match the ones from the genome annotation (see `genome_dir` below); the genome file is checked for valid chromosome naming.
- The `validated_[participant].[challenge].[event].json` file is used in the consolidation step, but not in the metrics computation step. However, the workflow exits after the validation step if one or more of the input files don't comply with the specifications of the current benchmarking event.
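For illustration, the checks listed above could look roughly like the pandas-based sketch below. This is not the actual `validation.py` code; column handling and error messages are assumptions:

```python
import pandas as pd

BED6_COLS = ["chrom", "start", "end", "name", "score", "strand"]

def validate_bed6(path, valid_chroms):
    """Illustrative BED6 checks; the real validation.py may differ in detail."""
    df = pd.read_csv(path, sep="\t", header=None)
    if df.shape[1] != 6:
        raise ValueError(f"{path}: expected 6 tab-separated columns, found {df.shape[1]}")
    df.columns = BED6_COLS
    # start and end coordinates must be integers (int64)
    if not (pd.api.types.is_integer_dtype(df["start"]) and pd.api.types.is_integer_dtype(df["end"])):
        raise ValueError(f"{path}: start/end coordinates must be int64")
    # strand must be '+' or '-'
    if not df["strand"].isin(["+", "-"]).all():
        raise ValueError(f"{path}: strand must be one of '+', '-'")
    # chromosome names must occur in the genome annotation
    unknown = set(df["chrom"]) - set(valid_chroms)
    if unknown:
        raise ValueError(f"{path}: chromosomes not found in genome annotation: {sorted(unknown)}")
```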
## Metrics computation

The "input file" and "gold standard file" are compared in order to calculate the metrics.

- `input_file`: output file from a method workflow in BED6 format.
- `gold_standard`: BED6 file derived from 3' end sequencing on the same sample(s) as the RNA-seq data used in the challenge.
- `windows`: parameter used to compute metrics for a list of window sizes (see the example config below).
  - For running on OEB: the parameter is read from `nextflow.config`.
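As an illustration, a `windows` entry in `nextflow.config` might look like the following; the values are placeholders, not the settings used in an actual benchmarking event:

```
params {
    // list of window sizes (in nucleotides) used for metric computation
    windows = [10, 50, 100]
}
```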
- `genome_dir`: directory containing the genome annotation in GTF format with 9 fields, as specified here. The GTF is used for the relative PAS usage metric computation.
  - For running on OEB: the genome directory is specified in `nextflow.config`.
  - For the test data, challenge `challenge_1.mm10` with ground truth file `challenge_1.mm10.bed` will use genome file `gencode.test.mm10.gtf`, because both contain `mm10` between two dots in the filename.
  NOTE: the genome file needs to contain the same organism substring as the challenge. That is, challenge `[partone].[organism].[parttwo].bed` requires a genome annotation file like `[partone].[organism].[parttwo].gtf`, where `[organism]` starts with `mm` or `hg` (only these two are currently supported), and `[partone]` and `[parttwo]` can be arbitrary strings (or empty). A sketch of this matching is shown below.
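For example, the organism tag could be extracted and compared with a small helper like the one below. This is purely illustrative of the naming convention and is not the matching code used in the workflow:

```python
import re

# organism tag: a substring between two dots that starts with 'mm' or 'hg'
ORGANISM_RE = re.compile(r"\.((?:mm|hg)[^.]*)\.")

def organism_tag(filename: str) -> str:
    """Return the organism substring (e.g. 'mm10' or 'hg38') found between two dots."""
    match = ORGANISM_RE.search(filename)
    if match is None:
        raise ValueError(f"no mm*/hg* organism tag found in {filename}")
    return match.group(1)

# challenge file and genome annotation match because both yield 'mm10'
assert organism_tag("challenge_1.mm10.bed") == organism_tag("gencode.test.mm10.gtf") == "mm10"
```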
- `tpm_threshold`: expression filter for predictions. Poly(A) sites with transcripts per million (TPM) values smaller than or equal to the threshold will be removed before metric computation (see the sketch below).
- APAeval custom functions called in `quantification_dockers/q_metrics/compute_metrics.py` are defined in `utils/apaeval`.
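Conceptually, the TPM filter amounts to the following. This is a sketch assuming the TPM value is stored in the BED `score` column; it is not the actual `compute_metrics.py` implementation:

```python
import pandas as pd

def filter_by_tpm(predictions: pd.DataFrame, tpm_threshold: float) -> pd.DataFrame:
    """Drop poly(A) sites whose expression (TPM) is smaller than or equal to the threshold."""
    return predictions[predictions["score"] > tpm_threshold]
```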
- The `assessments_[participant].[challenge].[event].json` file is used in the consolidation step.
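For orientation, an assessment object typically holds one metric value per participant and challenge. The snippet below is only a sketch; field names and values are illustrative and do not reproduce the actual OEB schema used here:

```json
{
  "_id": "APAeval:challenge_1.mm10_some_metric_some_tool_A",
  "community_id": "APAeval",
  "challenge_id": "challenge_1.mm10",
  "participant_id": "some_tool",
  "type": "assessment",
  "metrics": {
    "metric_id": "some_metric",
    "value": 0.87
  }
}
```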
## Consolidation

- Gathers all `validated_[participant].[challenge].[event].json` files from the validation step, all `assessments_[participant].[challenge].[event].json` files from the metrics computation step, and, if available, existing aggregation data (currently imported from the `data/` directory; set via `aggregation_dir` in `nextflow.config`).
- Outputs an OEB-compatible `consolidated_result.json` file for the tested participant.
- "Aggregation" objects in the `consolidated_result.json` determine which metrics are to be plotted against each other on the OEB website.
- In order to specify which of the metrics present in the assessment objects should be plotted on OEB, the file `quantification_dockers/q_consolidation/aggregation_template.json` has to be modified (see the sketch below).
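As a rough illustration, an entry in the aggregation template pairs two metrics as plot axes. The field names below are only an assumption based on typical OEB aggregation objects; refer to the actual `aggregation_template.json` for the authoritative structure:

```json
{
  "challenge_ids": ["challenge_1.mm10"],
  "datalink": {
    "inline_data": {
      "visualization": {
        "type": "2D-plot",
        "x_axis": "metric_A",
        "y_axis": "metric_B"
      }
    }
  }
}
```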
Please check out the sections on building docker images and running the benchmarking workflow in the main APAeval benchmarking workflow README