Make supplying hundreds of input files easier by standardizing on PEP #510

rhpvorderman · 2022-08-15T12:27:39Z

rhpvorderman
Aug 15, 2022
Collaborator

PEP, or Portable Encapsulated Projects, is a community effort to make sample metadata reusable.

A problem we have specifically as bioinformaticians at the LUMC is handling sample data. For each sample we (usually) have two FASTQ files, a sample ID, a library ID (for the FASTQ files), a readgroup ID (for the FASTQ files) and some other metadata. Since this is WDL, we can use structs to keep this information together. This means that an individual sample will look like this in JSON:

{
  "my_pipeline.samples": [
    {
      "id": "mysample",
      "readgroups": [
        {
          "id": "rg1",
          "lib_id": "lib1",
          "R1": "mysample-lib1-rg1-R1.fastq.gz",
          "R2": "mysample-lib1-rg1-R2.fastq.gz"
        }
      ]
    }
  ]
}

This makes adding multiple samples very tedious. But, hey, I can create a script to parse this directory and... Is this really how we want to do things in WDL? So in BioWDL we use a custom CSV format: https://biowdl.github.io/germline-DNA/develop/index.html#sample-configuration. But this has limitations, not to mention that it is specific to BioWDL.

PEP is a standardized data format that can be used across workflow systems. Snakemake and CWL already support it. It uses a YAML metadata file plus a CSV sample table. The newest specification even allows using just the CSV sample table. This simplifies providing inputs for WDL and means we adhere to a standard. Since there are already parsers for PEP, building on the community effort is easier than trying to solve this issue by modifying our own input format.
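To make the idea concrete, here is a minimal sketch of how a PEP-style CSV sample table could be turned into the nested JSON input shown above. It uses only the Python standard library rather than a real PEP parser. The `sample_name` column follows the PEP convention; the `library`, `readgroup`, `R1` and `R2` columns, the function name `sample_table_to_wdl_inputs` and the example file names are assumptions made for illustration, not part of any specification.

```python
import csv
import io
import json

# Hypothetical PEP-style sample table: one row per readgroup.
# Only `sample_name` is a PEP convention; the other column names are assumed.
SAMPLE_TABLE = """\
sample_name,library,readgroup,R1,R2
mysample,lib1,rg1,mysample-lib1-rg1-R1.fastq.gz,mysample-lib1-rg1-R2.fastq.gz
mysample,lib1,rg2,mysample-lib1-rg2-R1.fastq.gz,mysample-lib1-rg2-R2.fastq.gz
other,lib1,rg1,other-lib1-rg1-R1.fastq.gz,other-lib1-rg1-R2.fastq.gz
"""


def sample_table_to_wdl_inputs(table_text, input_key="my_pipeline.samples"):
    """Group readgroup rows by sample and return a WDL-style inputs dict."""
    samples = {}  # sample id -> sample struct; dicts keep insertion order
    for row in csv.DictReader(io.StringIO(table_text)):
        sample = samples.setdefault(
            row["sample_name"], {"id": row["sample_name"], "readgroups": []}
        )
        sample["readgroups"].append(
            {
                "id": row["readgroup"],
                "lib_id": row["library"],
                "R1": row["R1"],
                "R2": row["R2"],
            }
        )
    return {input_key: list(samples.values())}


inputs = sample_table_to_wdl_inputs(SAMPLE_TABLE)
print(json.dumps(inputs, indent=2))
```

Adding a sample or a readgroup then means adding one row to the table instead of hand-editing nested JSON.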

I think PEP support can be quite easily integrated in all of the workflow runners individually, but it would be good to also codify this in the spec to ensure interoperability between execution engines.

patmagee · 2022-08-26T19:50:53Z

patmagee
Aug 26, 2022
Maintainer

@rhpvorderman can you give an example of how we could incorporate PEP, or what a WDL supporting PEP would look like?


rhpvorderman · 2022-08-30T07:00:42Z

rhpvorderman
Aug 30, 2022
Collaborator Author

The current specification states that the input must be given as JSON. It would be useful if engines also supported PEP. This can be a feature of individual engines instead of a spec change if desired.

"A WDL implementation may choose to support any additional input and output mechanisms so long as they are documented, and/or tools are provided to interconvert between engine-specific input and the standard JSON format, to foster interoperability between tools in the WDL ecosystem."

So the spec leaves room for, say, miniwdl to implement PEP. But it will lead to some fragmentation in the WDL ecosystem if only miniwdl supports it, which is why I put it forward here.
In practice this will mostly be used by researchers in the biosciences, which is what we use WDL for as well. If this is true for the majority of WDL users, it might be useful to make it part of the spec.

Alternatively, I can make a few PRs to miniwdl to support PEP, and we can all see how it works out.
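
The interconversion the spec passage mentions could also run from standard JSON back to a sample table. The following is only a rough sketch of that direction, again using just the Python standard library. The column names other than `sample_name`, the function name `wdl_inputs_to_sample_table` and the `my_pipeline.samples` key are assumptions for illustration; this is not miniwdl's API or a real PEP writer.

```python
import csv
import io
import json

# Standard WDL JSON inputs in the shape used earlier in this thread.
INPUTS_JSON = """
{
  "my_pipeline.samples": [
    {
      "id": "mysample",
      "readgroups": [
        {
          "id": "rg1",
          "lib_id": "lib1",
          "R1": "mysample-lib1-rg1-R1.fastq.gz",
          "R2": "mysample-lib1-rg1-R2.fastq.gz"
        }
      ]
    }
  ]
}
"""

# Assumed column names; only `sample_name` is a PEP convention.
COLUMNS = ["sample_name", "library", "readgroup", "R1", "R2"]


def wdl_inputs_to_sample_table(inputs, input_key="my_pipeline.samples"):
    """Flatten nested sample structs into CSV text, one row per readgroup."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for sample in inputs[input_key]:
        for readgroup in sample["readgroups"]:
            writer.writerow(
                {
                    "sample_name": sample["id"],
                    "library": readgroup["lib_id"],
                    "readgroup": readgroup["id"],
                    "R1": readgroup["R1"],
                    "R2": readgroup["R2"],
                }
            )
    return buffer.getvalue()


table = wdl_inputs_to_sample_table(json.loads(INPUTS_JSON))
print(table)
```

A converter like this would let an engine that only accepts JSON still exchange inputs with tools that expect a sample table.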
