
Checkpoints for 'resuming' a pipeline at a fixed point without caching #5589

adamrtalbot opened this issue Dec 9, 2024 · 5 comments

@adamrtalbot
Collaborator

New feature

One of the earliest features of Nextflow is resume, which allows users to relaunch pipelines with the same or similar settings while skipping completed work. It works very well at reusing cached results while ensuring that anything which has changed or is non-deterministic is re-run properly.

However, there are many scenarios where resume does not work. When launching these pipelines, it often becomes apparent that the user expects resume to behave differently. Much of the time, people expect -resume to let them skip chunks of their pipeline, saving them cost and time, and they are disappointed when their updated input files or non-deterministic channels cause the entire pipeline to restart.

Rather than overloading the resume mechanism, we could imagine a system where a user codes specific checkpoints into their pipeline. The user could mark known workflow sections and restart the pipeline from a checkpoint, skipping the preceding steps with few (if any) checks at all.

Usage scenario

I'm going to imagine a checkpoint operator, which serialises the state of a channel and can use it as input for subsequent runs:

workflow {
    sw1_out = subworkflow_1(inputs)
    sw1_out.checkpoint("checkpoint1")

    sw2_out = subworkflow_2(sw1_out)
    sw2_out.checkpoint("checkpoint2")

    sw3_out = subworkflow_3(sw2_out)
    // outputs
}

This is what the pipeline looks like:

flowchart TD
    A[input] --> C(subworkflow 1)
    subgraph workflow
    C --> D(subworkflow 2)
    D --> E(subworkflow 3)
    end
    C -.-> checkpoint1([checkpoint 1])
    checkpoint1 -.-> D
    D -.-> checkpoint2([checkpoint 2])
    checkpoint2 -.-> E
    E --> outputs[outputs]

The first run of the pipeline would be as normal:

nextflow run main.nf

The second time, you could re-start from checkpoint 1:

nextflow run main.nf -checkpoint checkpoint1

This would bypass the first subworkflow.

etc...

nextflow run main.nf -checkpoint checkpoint2

The main challenge would be how to save the global state of the pipeline, including all the auxiliary files etc. If the pipeline has diverged dramatically, it might be too difficult to resume; we would need something to catch that scenario.

Under the hood, I'm imagining Nextflow saves the state of the pipeline as something serialisable in the .nextflow directory, which Nextflow could find from the second run onwards. If the checkpoint was missing, it would throw an error, in contrast to resume, which silently falls back to re-running each process.

The main benefit is that it clearly delineates resume and checkpoint as two strategies for reducing the runtime and cost of pipelines, with different purposes: resume is for restarting your pipeline exactly as it was, while checkpointing is for restarting your pipeline at a fixed location, ignoring how you got there.

Alternative implementations

Snakemake has one implementation of this which is similar to the above option: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution

The operator suggestion above is quite limited because it's a feature of the pipeline flow. An alternative could be something global which captures the state of the pipeline and allows you to jump back in without checking for consistency, but this feels more like the existing resume function.

Of course, many pipelines do this already. Probably the best-known implementation is Sarek, which has inputs for FASTQ, BAM, and VCF, allowing you to jump into the pipeline at multiple points. An alternative strategy could be to make this pattern simpler for users, so that coding these entry-point inputs and outputs into your pipeline takes less effort.

@bentsherman
Member

In many cases it should be possible to update your input files without breaking your cache. For example, you should be able to add or remove rows in a samplesheet and resume the cached rows while computing only the new ones, assuming that you execute a task for each row in the samplesheet. Tasks should only be re-executed if you modify a row.

This is because the glue logic between processes is always executed, whereas the process executions are cached.
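To make that concrete, here is a minimal sketch of the per-row pattern (names like `PROCESS_SAMPLE` and the samplesheet columns are illustrative, not a real pipeline): each row becomes an independent task, so appending rows and re-running with `-resume` only executes the new tasks.

```nextflow
// Hypothetical sketch: one task per samplesheet row. Unchanged rows
// produce identical task hashes on resume and hit the cache; only
// added or modified rows are executed.

process PROCESS_SAMPLE {
    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("${sample_id}.out")

    script:
    """
    do_analysis ${reads} > ${sample_id}.out
    """
}

workflow {
    Channel
        .fromPath(params.samplesheet)
        .splitCsv(header: true)
        .map { row -> tuple(row.sample, file(row.fastq)) }
        | PROCESS_SAMPLE
}
```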

Non-deterministic inputs are generally discouraged because they are inherently not reproducible. Most of the time when this happens, it is by accident or due to some bug in Nextflow. I'm not sure what the legitimate use case would be?

I can imagine a machine learning pipeline with a random initialization... but unless you are explicitly generating a random seed in the workflow logic, the resume should work. The task itself may generate a random seed, but if this is internal to the task, Nextflow doesn't see it as a task input. So a resumed run would simply cache the previous execution even if re-executing would have produced a different result.
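As a hedged illustration of the two seeding styles (process and tool names are made up), compare a seed generated in workflow logic, which becomes a task input and changes the task hash on every launch, with one generated inside the task script, which Nextflow never sees:

```nextflow
// Seed computed in workflow logic is a task input: it differs on
// every launch, so -resume re-executes the task every time.
process TRAIN_EXPLICIT_SEED {
    input:
    val seed

    script:
    """
    train_model --seed ${seed} > model.bin
    """
}

// Seed generated inside the task script is invisible to Nextflow:
// -resume returns the cached result even though re-execution
// could have produced a different model.
process TRAIN_INTERNAL_SEED {
    script:
    """
    train_model --seed \$RANDOM > model.bin
    """
}

workflow {
    TRAIN_EXPLICIT_SEED( Channel.of(new Random().nextInt()) )  // breaks caching
    TRAIN_INTERNAL_SEED()                                      // cache-stable
}
```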

@adamrtalbot
Collaborator Author

Sure, yet we still see many reports of users claiming, “Resume doesn’t work!” I believe this largely comes from a fundamental misunderstanding of what Nextflow’s resume feature is. Many users seem to think of it as a “jump forward” tool, enabling them to skip ahead in pipelines to save costs and time. In my view, however, it’s more of a caching system that allows you to re-run an existing pipeline almost exactly as it was.

My proposal is to introduce an explicit jump-forward tool—one that lets users pick up a pipeline partway through, even if the cache has been invalidated. For example, imagine you’ve just spent thousands of dollars on a pipeline run, only for it to crash near the end. Instead of starting over and hoping resume works, this tool would allow you to restart from, say, 80% completion.

Under the hood, it could leverage much of the existing caching mechanism. However, from a user perspective, it would offer a distinct alternative to resume.

The main downside to this approach is that it would compromise reproducibility.

@bentsherman
Member

When it comes to resuming the same run, I think users should expect it to just work, that is to resume as much as it can depending on what they changed in the code/inputs. Most of the time when resume doesn't work as expected, I think it's due to either a bug in Nextflow or the user doing something non-deterministic in their pipeline.

And to be fair, I think Nextflow makes it too easy in some cases to do non-deterministic things without realizing it. For example many operators -- buffer, collate, distinct, take, the list goes on -- can break resume if used improperly.
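For instance, here is a hedged sketch of an "improper" buffer use (the `ALIGN`/`MERGE` processes are invented for illustration): outputs of a parallel process arrive in completion order, which varies between runs, so the composition of each buffered group, and therefore the downstream task hashes, can differ on every launch.

```nextflow
process ALIGN {
    input:
    val sample

    output:
    val sample

    script:
    "sleep \$(( RANDOM % 3 ))"   // variable runtime => variable completion order
}

process MERGE {
    input:
    val group

    script:
    "echo merging ${group}"
}

workflow {
    samples = Channel.of('a', 'b', 'c', 'd', 'e', 'f')

    ALIGN(samples)          // tasks finish in unpredictable order
        .buffer(size: 3)    // groups depend on arrival order!
        | MERGE             // MERGE inputs can differ run-to-run, breaking -resume

    // A resume-safe alternative would impose a deterministic order
    // first (e.g. with toSortedList) before forming fixed-size groups.
}
```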

So I think we have work to do both in fixing bugs and in tightening the standard library to make it very hard to write non-deterministic code, and in educating users on best practices instead of just handing them a giant bag of operators to play with. I really believe we just need to make resume work as well as it promises to.

As for an explicit jump-forward mechanism, I would recommend that pipeline developers try to implement this explicitly in their pipelines. You can use params and workflow logic to say "if you give me the inputs for step 1 I'll start from the beginning; if you give me the inputs for step 2 I'll start from step 2", and so on. It might be messy, but it would give us a clearer picture of what it takes to implement that kind of checkpointing, and let us consider whether we could ever automate it under the hood.
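That params-based pattern might look something like the following sketch, in the spirit of Sarek's step selection (the `params.step` name and the `STEP_*` subworkflows, assumed to be defined elsewhere, are hypothetical):

```nextflow
// Hypothetical explicit entry points: later steps run only from the
// step the user selects, with inputs appropriate to that step.
params.step  = 'step1'    // one of: 'step1', 'step2', 'step3'
params.input = null       // inputs for the chosen entry point

workflow {
    ch = Channel.fromPath(params.input)

    if (params.step == 'step1') {
        ch = STEP_ONE(ch)              // full run from the beginning
    }
    if (params.step in ['step1', 'step2']) {
        ch = STEP_TWO(ch)              // skipped when starting at step3
    }
    STEP_THREE(ch)                     // always runs
}
```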

@adamrtalbot
Collaborator Author

Most of the time when resume doesn't work as expected, I think it's due to either a bug in Nextflow or the user doing something non-deterministic in their pipeline.

Or infrastructure failures: we often see giant pipelines with many thousands of samples fail. Often it's just hitting the limits of the infrastructure, which causes the cache to become invalidated. Sadly this is a double whammy: a giant run fails, and Nextflow is unable to resume it.

As for an explicit jump-forward mechanism, I would recommend that pipeline developers try to implement this explicitly in their pipelines. You can use params and workflow logic to say "if you give me the inputs for step 1 I'll start from the beginning; if you give me the inputs for step 2 I'll start from step 2", and so on. It might be messy, but it would give us a clearer picture of what it takes to implement that kind of checkpointing, and let us consider whether we could ever automate it under the hood.

I agree, and I always try to do this in my pipelines now. Sarek is the best example of this: you can basically jump in 80% of the way through. But it's overhead and annoying. In some ways, my request is that we make this a built-in feature of Nextflow, so it's quicker and easier for a pipeline developer to implement.

@bentsherman
Member

Often it's just hitting the limits of infrastructure and causing the cache to become invalidated

Can you elaborate on this? The cache should not fail in that way; it should be able to handle a large pipeline crashing and being resumed.
