Standardize clean up process for files generated when running a pipeline #23

Open
OliviaLynn opened this issue Apr 3, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@OliviaLynn
Member

Had a talk with Alex about the clean-up process for files generated when running a pipeline.

Storing some personal notes here while I work on this issue:

Existing clean up scripts

  • These are great
  • Should actually be put into notebooks
  • Have noticed & updated some syntax errors in at least one of them → might be worth taking a closer look at them

Mid-run clean up

  • Alex pointed out that a lot of these files could be cleaned up during the process of running
  • Maybe a "clean up" flag in the output file, so the stage that consumes the file can know “when I’m done with this I can delete it”
@OliviaLynn OliviaLynn self-assigned this Apr 3, 2023
@sschmidt23
Collaborator

For mid-run clean up, is deleting files during a run somewhat in conflict with the philosophy of ceci? I think one of the main ideas of ceci is to define inputs and outputs for each stage, so other stages expect those outputs to exist. If things are interrupted (e.g. Perlmutter crashes on day 3 of a big 10 day run of a pipeline with many stages), ceci can figure out which stages have already run and pick up mid-stream to complete. If we delete intermediate files before everything finishes, that may no longer work.

I could be misremembering, maybe @joezuntz can say if my interpretation is correct?

@joezuntz
Contributor

joezuntz commented Apr 3, 2023

@sschmidt23 yes, that's right. When you launch ceci with the resume flag, it looks for missing files and uses that to decide what needs re-running. Having said that, if that's not the behaviour that is useful to you, we could add options to customize it. I was thinking about this anyway, to deal with the case where you don't want to overwrite existing files. We could add an option to avoid re-generating intermediate files if their descendants all exist.
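
To make the two behaviours concrete, here is a simplified sketch of the decision logic. It is not ceci's actual implementation: the pipeline is modelled as a plain dict, and `stages_to_rerun` and `skip_if_descendants_exist` are hypothetical names for illustration only.

```python
def stages_to_rerun(stages, existing, skip_if_descendants_exist=False):
    """Return the names of stages that need to run, in declaration order.

    stages:   dict mapping stage name -> {"inputs": [...], "outputs": [...]}
    existing: set of file names currently on disk
    Assumes the pipeline graph has no cycles.
    """
    # Map each file to the stages that read it.
    consumers = {}
    for name, spec in stages.items():
        for f in spec["inputs"]:
            consumers.setdefault(f, []).append(name)

    memo = {}

    def must_run(name):
        if name not in memo:
            memo[name] = any(output_needed(f) for f in stages[name]["outputs"])
        return memo[name]

    def output_needed(f):
        if f in existing:
            return False
        if not skip_if_descendants_exist:
            # Resume-style behaviour: any missing output triggers a re-run.
            return True
        users = consumers.get(f, [])
        # A missing final product is always needed; a missing intermediate
        # is needed only if some stage that reads it still has to run.
        return not users or any(must_run(u) for u in users)

    return [name for name in stages if must_run(name)]


pipeline = {
    "A": {"inputs": ["raw"], "outputs": ["a.h5"]},
    "B": {"inputs": ["a.h5"], "outputs": ["b.h5"]},
    "C": {"inputs": ["b.h5"], "outputs": ["c.h5"]},
}
on_disk = {"raw", "b.h5", "c.h5"}  # a.h5 was cleaned up mid-run

print(stages_to_rerun(pipeline, on_disk))
print(stages_to_rerun(pipeline, on_disk, skip_if_descendants_exist=True))
```

With the default, the deleted `a.h5` forces stage A to run again; with the hypothetical option on, nothing re-runs because everything downstream of `a.h5` already exists.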

A few options within the existing framework though, in case they're useful:

  • run a script right at the end of the entire pipeline once it's complete to do clean up, so this doesn't matter
  • avoid declaring these as explicit output files and instead just cache them in a directory specified by an option, so they won't be counted as inputs or outputs. The downside then is that you have to manually keep track of them.
  • see if it's possible to remove intermediate files at the end of the same stage that creates them, so again they are never needed as outputs. Doesn't always work of course.
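
As an illustration of the first option, a minimal end-of-run clean up script might look like the sketch below. It assumes you keep your own lists of intermediate and final file names; `clean_up_intermediates` is a hypothetical helper, not something ceci provides.

```python
import tempfile
from pathlib import Path


def clean_up_intermediates(run_dir, intermediates, final_outputs):
    """Delete intermediate files, but only once every final output exists."""
    run_dir = Path(run_dir)
    missing = [f for f in final_outputs if not (run_dir / f).exists()]
    if missing:
        # The pipeline has not finished: leave everything in place so that
        # resuming the run still finds the files it expects.
        return []
    removed = []
    for f in intermediates:
        target = run_dir / f
        if target.exists():
            target.unlink()
            removed.append(f)
    return removed


# Usage with a throwaway directory standing in for a pipeline output folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.h5", "b.h5", "c.h5"):
        (Path(tmp) / name).write_text("placeholder")
    print(clean_up_intermediates(tmp, ["a.h5", "b.h5"], ["c.h5"]))
```

Because nothing is removed until the final outputs are present, an interrupted run can still be resumed in the usual way.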

@drewoldag
Contributor

This probably makes more sense as a commissioning priority, not necessarily as a release v1.0 priority.

@eacharles eacharles transferred this issue from LSSTDESC/rail_attic Jun 13, 2023
@aimalz aimalz added the enhancement New feature or request label Jul 14, 2023
@OliviaLynn
Member Author

Moving this out of 1.0, per the previous comment and the lack of objections

@OliviaLynn OliviaLynn removed their assignment Sep 26, 2023