Make supplying hundreds of input files easier by standardizing on PEP #510
Replies: 2 comments
-
@rhpvorderman can you give an example of how we could incorporate PEP, or what a WDL supporting PEP would look like?
-
The current specification states that the input must be given as JSON. It would be useful if engines supported PEP as well. This can be a feature of individual engines instead of a spec change if desired.
So the spec leaves room for, say, miniwdl to implement PEP. But it will lead to some fragmentation in the WDL ecosystem if only miniwdl supports it. So that is why I put it forward here. Alternatively, I can make a few PRs to miniwdl to support PEP, and we can all see how it works out.
-
A problem we have specifically as bioinformaticians at the LUMC is handling sample data. For each sample we have (usually) two FASTQ files, a sample ID, a library ID (for the FASTQ files), a readgroup ID (for the FASTQ files) and some other metadata. Since this is WDL we can use structs to keep this information together. This means that an individual sample will look like this in JSON:
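A minimal sketch of that nesting, assuming hypothetical field names (`id`, `libraries`, `readgroups`, `R1`, `R2`), a hypothetical `Pipeline.samples` input key and made-up file paths rather than the actual BioWDL struct definitions. It builds one sample in Python and prints the JSON an engine would need for it:

```python
import json


def make_sample(sample_id, library_id, readgroup_id, r1, r2):
    """Build one sample as nested dicts: sample -> libraries -> readgroups -> FASTQ pair.

    The layout and key names are assumptions for illustration only.
    """
    return {
        "id": sample_id,
        "libraries": [
            {
                "id": library_id,
                "readgroups": [
                    {"id": readgroup_id, "R1": r1, "R2": r2},
                ],
            }
        ],
    }


sample = make_sample(
    "sample1",
    "lib1",
    "rg1",
    "/data/sample1_R1.fastq.gz",  # made-up path
    "/data/sample1_R2.fastq.gz",  # made-up path
)

# Hypothetical workflow input key; a real workflow defines its own name.
inputs = {"Pipeline.samples": [sample]}
print(json.dumps(inputs, indent=2))
```

Even in this reduced form, a single sample with one library and one readgroup takes more than a dozen lines of JSON.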
This makes adding multiple samples very tedious. But, hey, I can create a script to parse this directory and... Is this really how we want to do things in WDL? So in BioWDL we use a custom CSV format: https://biowdl.github.io/germline-DNA/develop/index.html#sample-configuration. But this has limitations, not to mention it being specific to BioWDL only.
PEP is a standardized data format. It can be used across workflow systems. Snakemake and CWL already support it. It uses a YAML metadata file plus a CSV sample table. The newest specification even allows using just the CSV sample table. This simplifies providing inputs for WDL and means we adhere to a standard. Since there are already parsers for PEP, building on that community effort is easier than trying to solve this issue by modifying our own input format.
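To make the idea concrete, here is a rough sketch of how an engine or wrapper script might turn a CSV sample table into WDL-style JSON inputs. It uses only the Python standard library rather than an existing PEP parser. `sample_name` is the identifier column PEP expects; the other column names (`library`, `readgroup`, `read1`, `read2`), the `Pipeline.samples` key and the `table_to_wdl_inputs` helper are assumptions made up for this example:

```python
import csv
import io
import json

# A tiny stand-in for a PEP sample table. Only `sample_name` comes from PEP;
# the other column names are hypothetical.
SAMPLE_TABLE = """\
sample_name,library,readgroup,read1,read2
s1,lib1,rg1,s1_rg1_R1.fastq.gz,s1_rg1_R2.fastq.gz
s1,lib1,rg2,s1_rg2_R1.fastq.gz,s1_rg2_R2.fastq.gz
s2,lib1,rg1,s2_rg1_R1.fastq.gz,s2_rg1_R2.fastq.gz
"""


def table_to_wdl_inputs(table_text, input_key="Pipeline.samples"):
    """Group sample-table rows by sample_name into a WDL-style inputs dict."""
    samples = {}  # keeps first-seen order of sample names
    for row in csv.DictReader(io.StringIO(table_text)):
        name = row["sample_name"]
        sample = samples.setdefault(name, {"id": name, "readgroups": []})
        sample["readgroups"].append(
            {
                "library": row["library"],
                "id": row["readgroup"],
                "R1": row["read1"],
                "R2": row["read2"],
            }
        )
    return {input_key: list(samples.values())}


inputs = table_to_wdl_inputs(SAMPLE_TABLE)
print(json.dumps(inputs, indent=2))
```

Adding a sample then means adding one row per readgroup to the table instead of hand-writing another nested JSON object.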
I think PEP support can be quite easily integrated in all of the workflow runners individually, but it would be good to also codify this in the spec to ensure interoperability between execution engines.