Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use eval output for tool versions #1115

Draft
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

bentsherman
Copy link

This PR uses the experimental cmd output type in nextflow-io/nextflow#4493 to simplify the collection of tool versions.

Once the topic channel support is merged into Nextflow, we can merge this PR with #1109 to simplify things further. Instead of emitting versions1, versions2, etc for processes with multiple tools, we can simply send them all to the 'versions' topic.

PR checklist

  • This comment contains a description of changes (with reason).
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@bentsherman

This comment was marked as outdated.

@bentsherman

This comment was marked as outdated.

Copy link

github-actions bot commented Nov 15, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit e68f451

+| ✅ 144 tests passed       |+
#| ❔   6 tests were ignored |#
!| ❗   5 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • nextflow_config - Config manifest.version should end in dev: 3.13.0
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in WorkflowRnaseq.groovy: Optionally add in-text citation tools to this list.

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
  • multiqc_config - multiqc_config

✅ Tests passed:

Run details

  • nf-core/tools version 2.10
  • Run at 2023-11-16 01:26:58

@bentsherman
Copy link
Author

@drpatelh @ewels now that Nextflow has channel topics, it occurred to me that we could actually simplify a lot by just using env outputs. See my comment here, but I will copy the example code to illustrate my point:

// current nf-core convention
process FOOBAR {
    output:
    path 'versions.yml', topic: versions

    """
    # ...

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        foo: \$(foo --version)
        bar: \$(bar --version)
    END_VERSIONS
    """
}

// env output
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), env(FOO_VERSION), topic: versions
    tuple val("${task.process}"), val('bar'), env(BAR_VERSION), topic: versions

    """
    # ...

    FOO_VERSION=\$(foo --version)
    BAR_VERSION=\$(bar --version)
    """
}

// cmd output
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd('foo --version'), topic: versions
    tuple val("${task.process}"), val('bar'), cmd('bar --version'), topic: versions

    """
    # ...
    """
}

I would love to hear what you guys think about (2) vs (3). Keep in mind that in all three cases, the tool version commands are executed in the task script in more or less the same way.

@ewels
Copy link
Member

ewels commented Dec 11, 2023

My preference is for option 3 - the new cmd style. I like keeping the version commands out of the script, as it makes the script commands much cleaner and easier to read.

NB: The versions1/versions2 stuff in the PR code diff can be simplified after #1109 is merged. This new syntax is shown in Ben's comment.

@maxulysse
Copy link
Member

My preference goes to version2, which I find more explicit, easier to read, but I do love the version3 that removes completely the version generation from the script itself.

@edmundmiller
Copy link
Contributor

I like three, my only concern is some of the commands to get the version get pretty long. In theory, we could do something like:

def foo_version = 'foo --version'

output:
    tuple val("${task.process}"), val('foo'), cmd("${foo_version}"), topic: versions
    

@mirpedrol
Copy link
Member

I like option 3!
But I wonder about modules with R or python scripts, where we use those languages to create the versions.yml instead of bash. Will this work? or do we have to continue using the old syntax for these cases?

@pinin4fjords
Copy link
Member

pinin4fjords commented Dec 11, 2023

I would really like the inputs/ outputs section to remain as concise as possible, and I like the separation of concerns where the command to produce the output happens in more or less the same place. I'd do a bit of a WTF if people suddenly started embedding extensive process stuff where I expect the I/O.

So I have a fairly strong dislike for option 3), I think some fairly horrific stuff could happen there and make the processes hard to understand.

So option 2) for me please!

@maxulysse
Copy link
Member

Agreeing with @pinin4fjords there, version3 looks beautiful as long as all works well, when it starts to bug, it's a mess to debug.

@ewels
Copy link
Member

ewels commented Dec 11, 2023

@pinin4fjords - note that one of the limitations of cmd (which will be documented) is that it doesn't support newlines.

That will hopefully prevent people from doing anything too horrendous 😆

We could have an nf-core modules linting rule that checks the string length and fails if it's too long, suggesting that people use env in that particular case instead.

@pinin4fjords
Copy link
Member

@pinin4fjords - note that one of the limitations of cmd (which will be documented) is that it doesn't support newlines.

That will hopefully prevent people from doing anything too horrendous 😆

There's plenty of evil to be done with pipes!

@ewels
Copy link
Member

ewels commented Dec 11, 2023

But I wonder about modules with R or python scripts, where we use those languages to create the versions.yml instead of bash. Will this work? or do we have to continue using the old syntax for these cases?

@mirpedrol - No it won't work. Suggestion would be to use env in these cases as in option 2 (no need for the old syntax with the cat <<-END_VERSIONS stuff). But there are relatively few of these non-bash modules, none in the rnaseq for example I think.

@pinin4fjords
Copy link
Member

pinin4fjords commented Dec 11, 2023

rnaseq has a couple of R modules actually, they're just not obvious because they're local- and we will hopefully fix that at some point, and they will then need templates etc.

@bentsherman
Copy link
Author

Thank you all for your feedback. I still prefer env myself, but Paolo is determined now to add the cmd type, so we will have both and you can use whichever one you prefer.

My preference goes to version2, which I find more explicit, easier to read, but I do love the version3 that removes completely the version generation from the script itself.

Note that the cmd type is still executed in the task script just like an env, it just inserts the command for you

I like three, my only concern is some of the commands to get the version get pretty long.

@emiller88 I don't think you can reference local variables in an output as in your example, but you could reference a global variable, for example:

foo_version = 'really | long | version | command'

process foo {
  output:
  cmd("${foo_version}")
}

But I wonder about modules with R or python scripts, where we use those languages to create the versions.yml instead of bash. Will this work? or do we have to continue using the old syntax for these cases?

@mirpedrol In this PR I changed all the processes to only emit the metadata and then the YAML is constructed at the end of the pipeline. If you usually generate the tool version from within a Python or R script, the cmd output could do something like python script.py --version to retrieve the version from Bash. If the process script itself is not Bash, however, then the cmd output won't work. So whenever cmd isn't supported or would be unwieldy to use, you can always fallback to an env

note that one of the limitations of cmd (which will be documented) is that it doesn't support newlines

You could have a multi-line command by using semi-colons for newlines 😅

Regarding multi-line outputs, we found a way to support them for both env and cmd. So whereas currently env outputs are squashed to a single line, both will support multi-line output going forward.

@ewels
Copy link
Member

ewels commented Dec 12, 2023

So I think everyone agrees that options 2 + 3 are both improvements ✅

For any processes with script blocks written in languages other than bash, we will have to use the env approach. For bash commands I see now three options, which maybe we can vote on in the nf-core Slack:

Option 1: env

// env output
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), env(FOO_VERSION), topic: versions
    tuple val("${task.process}"), val('bar'), env(BAR_VERSION), topic: versions

    """
    # ...

    FOO_VERSION=\$(foo --version)
    BAR_VERSION=\$(bar --version)
    """
}

Option 2: cmd

// cmd output
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd('foo --version'), topic: versions
    tuple val("${task.process}"), val('bar'), cmd('bar --version'), topic: versions

    """
    # ...
    """
}

Option 3: cmd + variable

// cmd + variable output
foo_version = 'foo --version'
bar_version = 'bar --version'
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd(foo_version), topic: versions
    tuple val("${task.process}"), val('bar'), cmd(bar_version), topic: versions

    """
    # ...
    """
}

@edmundmiller
Copy link
Contributor

I think the real thing this could open up is parsing the version string in groovy as another option

// cmd + variable output
foo_version = getVersionFromString('foo --version')
bar_version = 'bar --version'

process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd(foo_version), topic: versions
}

in a lib far far away:

def getVersionFromString(String text) {
    def matcher = text =~ /v(\d+\.\d+\.\d+)/
    return matcher ? matcher[0][1] : null
}

Just a thought.

@bentsherman
Copy link
Author

bentsherman commented Dec 12, 2023

The thing is that the command must be executed in the task environment, because Nextflow might not have access to the tool from outside the task.

You could just emit the raw output of the tool version command, remove the duplicates, and then parse the string in Groovy:

process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd('foo --version'), topic: versions
}

Channel.topic('versions') .map { process, tool, raw_version ->
    [ process, tool, getVersionFromString(tool, raw_version) ]
}

That comes down to whether you would rather parse the version with a Bash one-liner or Groovy code. Note that you have to write a custom parser for every tool, so putting it all in a lib far far away would break the modularity of your modules. Unless you have a way to "register" a parser from the module script.

@bentsherman bentsherman changed the title Use cmd output for tool versions Use eval output for tool versions Apr 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants