Save all files in a task at the same time to avoid recomputing intermediate results #2522

Open
wants to merge 33 commits into
base: main
from

Conversation

bouweandela
Member

@bouweandela bouweandela commented Sep 11, 2024

Description

Save all files in a task at the same time to avoid recomputing intermediate results.

This change is not backward compatible because it changes the return value of esmvalcore.preprocessor.save, which is part of the public API. Previously this function returned the filename; now it returns None for immediate saves and a dask.delayed.Delayed for delayed saves, which can be requested with the compute=False argument.
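
To illustrate why collecting delayed saves and computing them together helps, here is a minimal sketch using plain Dask. It does not call esmvalcore; `expensive_intermediate` and `fake_save` are hypothetical stand-ins for a shared preprocessing step and a delayed save.

```python
import dask
from dask import delayed

calls = []  # records every time the expensive step actually runs


@delayed
def expensive_intermediate():
    # Hypothetical stand-in for a costly shared preprocessing step.
    calls.append("run")
    return list(range(5))


@delayed
def fake_save(data, filename):
    # Hypothetical stand-in for a delayed save; nothing is written to disk.
    return filename, sum(data)


intermediate = expensive_intermediate()
delayed_saves = [fake_save(intermediate, "a.nc"), fake_save(intermediate, "b.nc")]

# Computing each save on its own re-runs the shared step every time.
for item in delayed_saves:
    item.compute()
runs_separately = len(calls)

# Computing all saves in one call lets Dask run the shared step only once.
calls.clear()
results = dask.compute(*delayed_saves)
runs_together = len(calls)
```

In this sketch the shared step runs twice when the saves are computed one by one and once when they are passed to a single `dask.compute` call, which is the effect the pull request aims for at the task level.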

Closes #2521
Closes #2042

Links to documentation:


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@bouweandela bouweandela added preprocessor Related to the preprocessor dask related to improvements using Dask labels Sep 11, 2024

codecov bot commented Sep 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.72%. Comparing base (d8ad2f0) to head (e60520a).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2522      +/-   ##
==========================================
+ Coverage   94.67%   94.72%   +0.05%     
==========================================
  Files         251      252       +1     
  Lines       14302    14436     +134     
==========================================
+ Hits        13540    13675     +135     
+ Misses        762      761       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bouweandela bouweandela marked this pull request as ready for review October 14, 2024 16:12
Contributor

@valeriupredoi valeriupredoi left a comment

@bouweandela this is very nice, but I have serious concerns related to store and locking, there are many known issues with HDF5 and threads, and even the Dask folk are looking at this type of IO issue, see eg dask/distributed#780 - inherently what iris are doing is also not thread-safe, so I am doubly concerned

@bouweandela
Member Author

there are many known issues with HDF5 and threads, and even the Dask folk are looking at this type of IO issue, see eg dask/distributed#780 -

The linked issue is not related to writing to HDF5 files.

inherently what iris are doing is also not thread-safe, so I am doubly concerned

If you have concerns about Iris's capability, I would recommend playing around with it to see if you can get it to crash, and reporting any issues you find on the Iris GitHub page. The code that handles saving to NetCDF with distributed lives in iris.fileformats.netcdf, in case you would like to have a look.

I have serious concerns related to store and locking

Could you elaborate on those and provide an example of a case where it does not work?

@valeriupredoi
Contributor

valeriupredoi commented Oct 22, 2024

Could you elaborate on those and provide an example of a case where it does not work?

I need to hatch me a few used cases and test for stress points. Not on the priority list though, so let's see if any issues pop up naturally, am not gonna block this PR, just wanted to see if you have any concerns too 🍺

@valeriupredoi
Contributor

it seems that a Lock object is indeed in dask.distributed, so first hurdle I was afraid of is alleviated dask/dask#1892 (comment)
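
As a rough illustration of what a lock buys here, the sketch below serializes writes from several threads to one file using the standard library's `threading.Lock`. This is an assumption-laden stand-in, not the code Iris or Dask use: with a distributed scheduler the workers may live in separate processes, where a cross-process lock such as `distributed.Lock` would be needed instead. The helper `write_chunk` is hypothetical.

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

write_lock = threading.Lock()


def write_chunk(path, index):
    # Only one thread at a time may hold the lock and touch the file.
    with write_lock:
        with open(path, "a", encoding="utf-8") as file:
            file.write(f"chunk {index}\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "output.txt"
    # Four worker threads compete to append twenty chunks to the same file.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda i: write_chunk(target, i), range(20)))
    lines = target.read_text(encoding="utf-8").splitlines()
```

The chunks may arrive in any order, but each line is written whole because no two threads are ever inside the locked section at once.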

@valeriupredoi
Contributor

this, though, is a bit scary dask/dask#2488

@bouweandela
Member Author

this, though, is a bit scary dask/dask#2488

We're not using the to_hdf5 method, so I'm not worried about that issue.

I need to hatch me a few used cases and test for stress points. Not on the priority list though, so let's see if any issues pop up naturally, am not gonna block this PR, just wanted to see if you have any concerns too 🍺

It would be great if you could give it a try. I've tested this with the recipe in #2300 and there it seems to work well.

@schlunma schlunma added this to the v2.12.0 milestone Oct 30, 2024
@schlunma
Contributor

There are a few things you can do to check whether stdout is going to an interactive terminal or not, such as with the tty coreutils command.

We could only output the progress bar if we are connected to an interactive input.

That sounds like a very reasonable solution to me! @bouweandela what do you think?

@bouweandela
Member Author

It is a nice idea and would be better than not having a progress bar at all, but it requires running the tool in interactive mode to see progress. Running recipes interactively on an HPC machine without the program crashing if your internet connection is temporarily lost requires setting up something like a tmux session, which may not be convenient for users either. Having a progress bar that is available even when not running in interactive mode would be nicer, e.g. I typically run tail -f on the SLURM log files to see progress and that case wouldn't be covered.

We do occasionally get questions from users if the tool is actually doing something, see e.g. ESMValGroup/ESMValTool#3738, so having progress bars (perhaps also for downloading data), seems desirable.

@jfrost-mo
Contributor

jfrost-mo commented Nov 15, 2024

For a non interactive use case like that, it may be desirable to do something that doesn't overwrite old output, as that doesn't always work in non-interactive systems.

For example, when attached to a non-interactive terminal you could print the percentage completion as

  1%
  2%
  3%
 ...

Or do what many data transfer tools do and just print dots to show progress is happening.
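
A minimal sketch of that idea, using only the standard library: overwrite a single line when the stream is an interactive terminal, and otherwise print one percentage per line. The function `report_progress` is hypothetical and is not part of Dask or ESMValCore.

```python
import io
import sys


def report_progress(done, total, stream=sys.stdout):
    """Write progress in a form suited to the kind of stream."""
    percent = 100 * done // total
    if stream.isatty():
        # Interactive terminal: redraw the same line with a carriage return.
        stream.write(f"\r{percent:3d}%")
        if done == total:
            stream.write("\n")
    else:
        # Log file or pipe: never overwrite, emit one line per update.
        stream.write(f"{percent:3d}%\n")
    stream.flush()


# An in-memory text stream is not a terminal, so it behaves like a log file.
log = io.StringIO()
for step in range(1, 5):
    report_progress(step, 4, stream=log)
```

Because the non-interactive branch only appends lines, the output stays readable when followed with tail -f on a SLURM log file.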

@bouweandela
Member Author

Also a nice idea, but that would require implementing our own progress bars instead of the ones provided by Dask, which is more work than I would be willing to put into this.

Contributor

@valeriupredoi valeriupredoi left a comment

I am now happy with the IO handling - had doubts about thread-safety, but should be fine since there is an internal Lock(), so I'll approve - the progress bar saga still needs resolution, but I think Manu put a Request Changes in for that. Thanks a lot, @bouweandela 🍺

    dask.compute(futures)
else:
    with dask.diagnostics.ProgressBar():
        dask.compute(delayeds)
Contributor

maybe tweak it via tqdm which is a lot more progress-y bar-y? See tqdm/tqdm#278 @schlunma have a look at that too

Member Author

tqdm's progress bars look nice, but they didn't work for me when I tried to use them to track progress for the ESGF downloads a few years ago. The maintainers of that package are not very responsive to the issue either: tqdm/tqdm#1272 (comment). I would be happy to try again though.

Contributor

let's see if it does the trick, can't be any worse than that picture Manu posted in #2522 (comment) 🤣

Member Author

I implemented some fancy progress bars for you using rich

@valeriupredoi
Contributor

this, though, is a bit scary dask/dask#2488

let's see if this hits us after we merge - no point holding up a performance improvement for something that's putative 👍

@bouweandela
Member Author

The conclusion from community consultation at the workshop is that we would like to display progress bars (also for downloads) and will provide a configuration option to disable them.

@bouweandela
Member Author

For example, when attached to a non-interactive terminal you could print the percentage completion

@jfrost-mo I implemented this solution for the case where a progress bar is not desired.

@bouweandela
Member Author

bouweandela commented Nov 29, 2024

@schlunma It is now possible to disable the progress bar by setting a value other than 0 for log_progress_interval in the configuration, e.g.

logging:
  log_progress_interval: 10s

If max_parallel_tasks is set to a value other than 1, the progress bar is also disabled to avoid writing multiple overlapping progress bars.

Does it look better now?
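
The rule described in this comment can be sketched roughly as follows. This is only an illustration of the stated behaviour, not the code in the pull request: `parse_interval` and `use_progress_bar` are hypothetical helpers, and the accepted unit suffixes (s, m, h) are an assumption.

```python
import re


def parse_interval(value):
    """Convert a value such as '10s' or 0 to seconds (hypothetical helper)."""
    if isinstance(value, (int, float)):
        return float(value)
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(s|m|h)?\s*", value)
    if match is None:
        raise ValueError(f"Cannot parse interval: {value!r}")
    number, unit = match.groups()
    # A bare number is treated as seconds (assumption).
    return float(number) * {"s": 1, "m": 60, "h": 3600}[unit or "s"]


def use_progress_bar(log_progress_interval, max_parallel_tasks):
    """Show a bar only if the interval is 0 and tasks run one at a time."""
    return parse_interval(log_progress_interval) == 0 and max_parallel_tasks == 1


bar_by_default = use_progress_bar(0, 1)
bar_with_interval = use_progress_bar("10s", 1)
bar_with_parallel_tasks = use_progress_bar(0, 4)
```

With these assumptions, the progress bar is shown only in the first case; a non-zero interval or more than one parallel task switches to the alternative.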

Contributor

@schlunma schlunma left a comment

Thanks Bouwe, that looks great! I really love the new progress bar ❤️. It is also nicely rendered in the SLURM log now (I tried VSCode, vim, and cat):

[Screenshot]

The only "issue" I found is that with a distributed scheduler, the progress bar sometimes does not finish with "100%" (I didn't see this with the default scheduler):

[Screenshot]

One "issue" with the progress logger is that with the default scheduler, there is a new line character after the bar, this does not happen with a distributed scheduler:

[Screenshot]

I wonder if it would make sense to add an option to disable the progress logging completely? Maybe with log_progress_interval=-1?

Apart from that, I only have minor comments on the doc. Thank you!!

Labels
backwards incompatible change dask related to improvements using Dask preprocessor Related to the preprocessor
Projects
No open projects
4 participants