APAeval Quantification

OpenEBench-based Nextflow workflow for assessing a bioinformatics tool's performance in quantifying poly(A) site usage from RNA-seq data.


This README describes the APAeval (absolute) quantification benchmarking workflow. For a more general introduction to benchmarking workflows, see the main benchmarking workflow README.md. For the specification of metrics and of input and output file formats, see the quantification benchmarks specification.

(File) naming requirements

See description in the main benchmarking workflow README.md.

Description of steps

1. Validation

  • input_file: output file from method workflow in bed6 format

  • Validation checks performed in quantification_dockers/q_validation/validation.py (a sketch of these checks follows after this list):

    • input file has to be a tab-separated file with 6 columns
    • start and end coordinates (col 2,3) have to be int64
    • strand (col 6) has to be one of [+,-]
    • chromosome names (col 1) have to match the ones from the genome annotation (see genome_dir below)
    • genome file is checked for valid chromosome naming
  • The validated_[participant].[challenge].[event].json file is used in the consolidation step, but not in the metrics computation step. However, the workflow exits after the validation step if one or more of the input files don't comply with the specifications of the current benchmarking event
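
A minimal sketch of these checks, assuming the input is read with pandas and that valid chromosome names are taken from the first column of the matching GTF file; the function name and details are illustrative, not the exact code in validation.py:

```python
import pandas as pd

def validate_bed6(bed_path, gtf_path):
    """Illustrative version of the checks listed above (not the exact code in validation.py)."""
    df = pd.read_csv(bed_path, sep="\t", header=None)

    # tab-separated file with exactly 6 columns
    assert df.shape[1] == 6, f"expected 6 columns, found {df.shape[1]}"
    df.columns = ["chrom", "start", "end", "name", "score", "strand"]

    # start and end coordinates have to be int64
    assert df["start"].dtype == "int64" and df["end"].dtype == "int64", \
        "start/end coordinates must be int64"

    # strand has to be one of [+,-]
    assert df["strand"].isin(["+", "-"]).all(), "strand must be '+' or '-'"

    # chromosome names have to match those in the genome annotation
    gtf = pd.read_csv(gtf_path, sep="\t", header=None, comment="#", usecols=[0])
    assert set(df["chrom"].astype(str)) <= set(gtf[0].astype(str)), \
        "input contains chromosomes not present in the genome annotation"
```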

2. Metrics Computation

  • "input file" and "gold standard file" will be compared in order to calculate the metrics

  • input_file: output file from method workflow in bed6 format

  • gold standard: bed6 file derived from 3' end sequencing on the same sample(s) as the RNA-seq data used in the challenge

  • windows parameter is used to compute metrics for a list of window sizes.

    • For running on OEB: the parameter is read from nextflow.config.
  • genome_dir: Directory containing the genome annotation in gtf format with 9 fields, as specified here. The gtf is used for computing the relative PAS usage metric.

    • For running on OEB: The genome directory is specified in nextflow.config
    • For the test data, challenge challenge_1.mm10 with ground truth file challenge_1.mm10.bed will use genome file gencode.test.mm10.gtf, because both filenames contain mm10 between two dots.

NOTE: the genome file needs to contain the same substring as the challenge. That is, challenge [partone].[organism].[parttwo].bed requires a genome annotation file like [partone].[organism].[parttwo].gtf, where [organism] starts with mm or hg (only these two are currently supported), and [partone] and [parttwo] can be arbitrary strings (or empty).
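
One way to express this matching rule, assuming the organism tag is simply the first dot-enclosed field starting with mm or hg; this helper is hypothetical and only illustrates the convention:

```python
import re

def organism_tag(filename):
    """Return the mm*/hg* substring enclosed by dots in a file name, or None."""
    match = re.search(r"\.((?:mm|hg)[^.]*)\.", filename)
    return match.group(1) if match else None

# challenge_1.mm10.bed and gencode.test.mm10.gtf share the tag 'mm10', so they are paired
assert organism_tag("challenge_1.mm10.bed") == organism_tag("gencode.test.mm10.gtf") == "mm10"
```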

  • tpm_threshold: Expression filter for predictions. Poly(A) sites with transcripts per million (tpm) smaller than or equal to the threshold will be removed before metric computation (see the sketch after this list).
  • APAeval custom functions called in quantification_dockers/q_metrics/compute_metrics.py are defined in utils/apaeval
  • The assessments_[participant].[challenge].[event].json file is used in the consolidation step
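
As an illustration of how tpm_threshold and windows could act on bed6 data loaded into pandas dataframes (columns chrom, start, end, name, score, strand), here is a hedged sketch; the actual metric functions live in utils/apaeval and may differ:

```python
import pandas as pd

def filter_by_tpm(prediction: pd.DataFrame, tpm_threshold: float) -> pd.DataFrame:
    """Drop poly(A) sites whose expression (score column, assumed to hold tpm)
    is smaller than or equal to the threshold."""
    return prediction[prediction["score"] > tpm_threshold]

def matched_sites_per_window(prediction: pd.DataFrame,
                             ground_truth: pd.DataFrame,
                             windows=(10, 50, 100)) -> dict:
    """Count, for each window size, how many predicted sites lie within +/- window nt
    of a ground-truth site on the same chromosome and strand (illustrative only)."""
    counts = {}
    for window in windows:
        n_matched = 0
        for _, site in prediction.iterrows():
            candidates = ground_truth[
                (ground_truth["chrom"] == site["chrom"])
                & (ground_truth["strand"] == site["strand"])
            ]
            if ((candidates["start"] - site["start"]).abs() <= window).any():
                n_matched += 1
        counts[window] = n_matched
    return counts
```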

3. Results Consolidation

  • Gathers all validated_[participant].[challenge].[event].json files from the validation step, all assessments_[participant].[challenge].[event].json files from the metrics computation step, and, if available, existing aggregation data (currently imported from the data/ directory; set as aggregation_dir in nextflow.config); a sketch of this gathering follows after this list
  • Outputs OEB compatible consolidated_result.json file for the tested participant
  • "aggregation" objects in the consolidated_result.json determine which metrics are to be plotted against each other on the OEB website
  • To specify which of the metrics present in the assessment objects should be plotted on OEB, the file quantification_dockers/q_consolidation/aggregation_template.json has to be modified.
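
A minimal sketch of the gathering described above, assuming every per-step JSON file holds either a single OEB object or a list of them; the directory layout and keys are assumptions, not the exact consolidation code:

```python
import glob
import json

def consolidate(validation_dir, assessment_dir, aggregation_dir,
                out_path="consolidated_result.json"):
    """Collect validation, assessment and (if present) aggregation objects into one list."""
    consolidated = []
    for json_dir in (validation_dir, assessment_dir, aggregation_dir):
        for path in sorted(glob.glob(f"{json_dir}/*.json")):
            with open(path) as fh:
                data = json.load(fh)
            # each file may contain a single object or a list of objects
            consolidated.extend(data if isinstance(data, list) else [data])
    with open(out_path, "w") as fh:
        json.dump(consolidated, fh, indent=2)
```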

Usage

Please check out the sections on building docker images and running the benchmarking workflow in the main APAeval benchmarking workflow README.