Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save all files in a task at the same time to avoid recomputing intermediate results #2522

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

bouweandela
Copy link
Member

@bouweandela bouweandela commented Sep 11, 2024

Description

Save all files in a task at the same time to avoid recomputing intermediate results.

This change is not backward compatible because it changes the return value of esmvalcore.preprocessor.save, which is part of the public API. Previously this function returned the filename, not it returns None on immediate saves and a dask.delayed.Delayed for delayed saves that can be requested with the compute=False argument.

Closes #2521
Closes #2042

Link to documentation: https://esmvaltool--2522.org.readthedocs.build/projects/ESMValCore/en/2522/api/esmvalcore.preprocessor.html#esmvalcore.preprocessor.save


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

@bouweandela bouweandela added preprocessor Related to the preprocessor dask related to improvements using Dask labels Sep 11, 2024
Copy link

codecov bot commented Sep 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.79%. Comparing base (e9fccbd) to head (85725a9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2522      +/-   ##
==========================================
+ Coverage   94.77%   94.79%   +0.02%     
==========================================
  Files         251      251              
  Lines       14266    14293      +27     
==========================================
+ Hits        13520    13549      +29     
+ Misses        746      744       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bouweandela bouweandela marked this pull request as ready for review October 14, 2024 16:12
Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bouweandela this is very nice, but I have serious concerns related to store and locking, there are many known issues with HDF5 and threads, and even the Dask folk are looking at this type of IO issue, see eg dask/distributed#780 - inherently what iris are doing is also not thread-safe, so I am doubly concerned

@bouweandela
Copy link
Member Author

there are many known issues with HDF5 and threads, and even the Dask folk are looking at this type of IO issue, see eg dask/distributed#780 -

The linked issue is not related to writing to HDF5 files.

inherently what iris are doing is also not thread-safe, so I am doubly concerned

If you have concerns about Iris's capability, I would recommend playing around with it and see if you can get it to crash and report any issues you find on the Iris GitHub page. The code that handles saving to NetCDF with distributed lives in `iris.fileformats.netcdf in case you would like to have a look.

I have serious concerns related to store and locking

Could you elaborate on those and provide an example of a case where it does not work?

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Oct 22, 2024

Could you elaborate on those and provide an example of a case where it does not work?

I need to hatch me a few used cases and test for stress points. Not on the priority list though, so let's see if any issues pop up naturally, am not gonna block this PR, just wanted to see if you have any concerns too 🍺

@valeriupredoi
Copy link
Contributor

it seems that a Lock object is indeed in dask.distributed, so first hurdle I was afraid of is alleviated dask/dask#1892 (comment)

@valeriupredoi
Copy link
Contributor

this, though, is a bit scary dask/dask#2488

@bouweandela
Copy link
Member Author

this, though, is a bit scary dask/dask#2488

We're not using the to_hdf5 method, so I'm not worried about that issue.

I need to hatch me a few used cases and test for stress points. Not on the priority list though, so let's see if any issues pop up naturally, am not gonna block this PR, just wanted to see if you have any concerns too 🍺

It would be great if you could give it a try. I've tested this with the recipe in #2300 and there is seems to work well.

@schlunma schlunma added this to the v2.12.0 milestone Oct 30, 2024
@schlunma
Copy link
Contributor

Thanks Bouwe, looks good to me! I will test this with some real recipes and merge if successful 🚀

@schlunma
Copy link
Contributor

I successfully tested this with many recipes with the default scheduler, a distributed LocalCluster and a distributed SLURMCluster. Everything looks and works as expected 🚀

The only remaining comment I have is regarding the progress bar. This messes up the log files a lot, e.g.

2024-10-31 16:42:07,321 UTC [841493] WARNING /work/bd0854/b309141/micromamba/envs/esm/lib/python3.11/site-packages/iris/fileformats/netcdf/loader.py:422: IrisFactoryCoordNotFoundWarning: Unable to find coordinate for variable 'ps'
  warnings.warn(

^M[#################                       ] | 42% Completed |  7.4s2024-10-31 16:42:07,522 UTC [841207] INFO    Progress: 10 tasks running, 6 tasks waiting for ancestors, 68/84 done
^M[#################                       ] | 44% Completed |  7.5s^M[##################                      ] | 46% Completed |  7.6s^M[###################                     ] | 48% Completed |  7.7s^M[###################                     ] | 49% Completed |  7.8s^M[####################                    ] | 51% Completed |  8.0s^M[#####################                   ] | 53% Completed |  8.1s^M[#####################                   ] | 54% Completed |  8.2s^M[######################                  ] | 55% Completed |  8.3s^M[######################                  ] | 55% Completed |  8.4s^M[######################                  ] | 55% Completed |  8.5s^M[######################                  ] | 55% Completed |  8.6s2024-10-31 16:42:08,633 UTC [841493] WARNING /work/bd0854/b309141/micromamba/envs/esm/lib/python3.11/site-packages/iris/fileformats/netcdf/loader.py:422: IrisFactoryCoordNotFoundWarning: Unable to find coordinate for variable 'ps'
  warnings.warn(

^M[######################                  ] | 55% Completed |  8.7s^M[######################                  ] | 56% Completed |  8.8s^M[######################                  ] | 56% Completed |  8.9s^M[######################                  ] | 56% Completed |  9.0s^M[######################                  ] | 56% Completed |  9.1s^M[######################                  ] | 56% Completed |  9.2s^M[######################                  ] | 56% Completed |  9.3s^M[######################                  ] | 57% Completed |  9.4s^M[######################                  ] | 57% Completed |  9.5s^M[######################                  ] | 57% Completed |  9.6s^M[######################                  ] | 57% Completed |  9.7s2024-10-31 16:42:09,744 UTC [841479] INFO    Computing and saving data for task diag_assess_zhai_constraint_cmip5_x/cl
2024-10-31 16:42:09,751 UTC [841490] INFO    Computing and saving data for task diag_x_zhai_cmip5/cl
^M[######################                  ] | 57% Completed |  9.9s^M[                                        ] | 0% Completed |  1.0s^M[                                        ] | 0% Completed |  1.1s^M[#######################                 ] | 57% Completed | 10.8s^M[                                        ] | 0% Completed |  1.1s^M[                                        ] | 0% Completed |  1.2s^M[#######################                 ] | 58% Completed | 10.9s^M[                                        ] | 1% Completed |  1.2s^M[                                        ] | 1% Completed |  1.3s^M[#######################                 ] | 58% Completed | 11.0s^M[                                        ] | 1% Completed |  1.3s^M[                                        ] | 1% Completed |  1.4s^M[#######################                 ] | 59% Completed | 11.1s^M[                                        ] | 1% Completed |  1.4s^M[                                        ] | 1% Completed |  1.5s^M[#######################                 ] | 59% Completed | 11.2s^M[                                        ] | 1% Completed |  1.5s^M[                                        ] | 1% Completed |  1.6s^M[#######################                 ] | 59% Completed | 11.3s^M[                                        ] | 1% Completed |  1.6s^M[                                        ] | 1% Completed |  1.7s^M[#######################                 ] | 59% Completed | 11.4s^M[                                        ] | 1% Completed |  1.7s^M[                                        ] | 1% Completed |  1.8s^M[#######################                 ] | 59% Completed | 11.5s^M[                                        ] | 1% Completed |  1.8s^M[                                        ] | 1% Completed |  1.9s^M[#######################                 ] | 59% Completed | 11.6s^M[                                        ] | 1% Completed |  1.9s^M[                                        ] | 1% Completed |  2.0s^M[#######################                 ] | 59% Completed | 11.7s^M[                                        ] | 2% Completed |  2.0s^M[                                        ] | 2% Completed |  2.1s^M[#######################                 ] | 59% Completed | 11.8s^M[                                        ] | 2% Completed |  2.1s^M[                                        ] | 2% Completed |  2.2s^M[#######################                 ] | 59% Completed | 11.9s^M[                                        ] | 2% Completed |  2.2s^M[                                        ] | 2% Completed |  2.3s^M[#######################                 ] | 59% Completed | 12.0s^M[                                        ] | 2% Completed |  2.3s^M[                                        ] | 2% Completed |  2.4s^M[#######################                 ] | 59% Completed | 12.1s^M[                                        ] | 2% Completed |  2.4s^M[#                                       ] | 2% Completed |  2.5s^M[########################                ] | 60% Completed | 12.2s^M[#                                       ] | 2% Completed |  2.5s^M[#                                       ] | 2% Completed |  2.6s^M[########################                ] | 60% Completed | 12.3s^M[#                                       ] | 3% Completed |  2.6s^M[#                                       ] | 3% Completed |  2.7s^M[########################                ] | 60% Completed | 12.4s^M[#                                       ] | 3% Completed |  2.7s^M[########################                ] | 60% Completed | 12.5s^M[#                                       ] | 3% Completed |  2.8s^M[#                                       ] | 3% Completed |  2.8s^M[#                                       ] | 3% Completed |  2.9s^M[########################                ] | 60% Completed | 12.6s^M[#                                       ] | 3% Completed |  2.9s^M[#                                       ] | 3% Completed |  3.0s^M[########################                ] | 60% Completed | 12.7s^M[#                                       ] | 3% Completed |  3.1s^M[#                                       ] | 3% Completed |  3.1s^M[########################                ] | 60% Completed | 12.8s^M[#                                       ] | 3% Completed |  3.2s^M[#                                       ] | 3% Completed |  3.2s^M[########################                ] | 60% Completed | 12.9s^M[#                                       ] | 4% Completed |  3.3s^M[#                                       ] | 4% Completed |  3.3s^M[########################                ] | 60% Completed | 13.0s^M[#                                       ] | 4% Completed |  3.4s^M[##                                      ] | 5% Completed |  3.4s^M[########################                ] | 60% Completed | 13.1s^M[##                                      ] | 5% Completed |  3.5s^M[##                                      ] | 5% Completed |  3.5s^M[########################                ] | 60% Completed | 13.2s^M[##                                      ] | 5% Completed |  3.6s^M[##                                      ] | 5% Completed |  3.6s^M[########################                ] | 60% Completed | 13.3s^M[##                                      ] | 6% Completed |  3.7s^M[##                                      ] | 6% Completed |  3.7s^M[########################                ] | 60% Completed | 13.4s^M[##                                      ] | 6% Completed |  3.8s^M[##                                      ] | 6% Completed |  3.8s^M[########################                ] | 60% Completed | 13.5s^M[##                                      ] | 6% Completed |  3.9s^M[##                                      ] | 7% Completed |  3.9s^M[########################                ] | 61% Completed | 13.6s^M[##                                      ] | 7% Completed |  4.0s^M[##                                      ] | 7% Completed |  4.0s^M[########################                ] | 61% Completed | 13.7s^M[###                                     ] | 7% Completed |  4.1s^M[###                                     ] | 7% Completed |  4.1s^M[########################                ] | 61% Completed | 13.8s^M[###                                     ] | 7% Completed |  4.2s^M[###                                     ] | 7% Completed |  4.2s^M[########################                ] | 61% Completed | 13.9s^M[###                                     ] | 8% Completed |  4.3s^M[###                                     ] | 8% Completed |  4.3s^M[########################                ] | 61% Completed | 14.0s^M[###                                     ] | 8% Completed |  4.4s^M[###                                     ] | 8% Completed |  4.4s^M[########################                ] | 61% Completed | 14.1s^M[###                                     ] | 9% Completed |  4.5s^M[###                                     ] | 9% Completed |  4.5s^M[########################                ] | 61% Completed | 14.2s^M[####                                    ] | 10% Completed |  4.6s^M[####                                    ] | 10% Completed |  4.6s^M[########################                ] | 61% Completed | 14.3s^M[####                                    ] | 10% Completed |  4.7s^M[####                                    ] | 11% Completed |  4.7s^M[########################                ] | 61% Completed | 14.4s^M[####                                

In addition, it also sometimes produces overlapping lines in the terminal when the tool is run interactively (I guess caused by multiple parallel processes?):

2024-10-31 17:01:05,150 UTC [1258106] INFO    Computing and saving data for task map/tas
[                                        ] | 0% Completed |  0.9s2024-10-31 17:01:06,120 UTC [1258108] INFO    Computing and saving data for task timeseries/tas_global
2024-10-31 17:01:37,377 UTC [1258106] INFO    Successfully completed task map/tas (priority 1) in 0:00:32.497795
[####                                    ] | 11% Completed | 31.3s2024-10-31 17:01:37,493 UTC [1258080] INFO    Progress: 2 tasks running, 2 tasks waiting for ancestors, 1/5 done

I don't think there is an easy fix for that, so my suggestion would be to just remove the progress bar. Although I quite like that actually 😢

@valeriupredoi
Copy link
Contributor

Can the progress bar not be piped to file?

@bouweandela
Copy link
Member Author

Thanks a lot for testing and reviewing! As far as I can see, the progress bar does not end up in the log files, only in the stdout output that SLURM records. Would it help to disable the progress bar if max_parallel_tasks is larger than 1 to avoid overlapping progress bars?

@schlunma
Copy link
Contributor

schlunma commented Nov 4, 2024

That's right! Sorry, I only checked the SLURM log and was just assuming that the main_log.txt will look the same.

I guess disabling it for max_parallel_tasks > 1 would help for stdout, but not for the SLURM log. I am not sure if many people use the SLURM log (I do for example), but if many people do that, it would be nice to have a version that's actually readable. The current one is definitely not.

@bouweandela
Copy link
Member Author

The SLURM log still looks OK-ish with max_parallel_tasks: 1 if you use a viewer that renders the backspaces used to 'update' the progress bar in place, e.g. if you view the log with cat or less -r. Have you tried any of these options?

@schlunma
Copy link
Contributor

schlunma commented Nov 4, 2024

Well, when I open this file with my editor (VSCode), it looks like this:

grafik

Vim looks much worse:

grafik

I guess that is what many people do. I wouldn't exactly call this readable...

@bouweandela
Copy link
Member Author

Indeed that does not work. So how do you suggest we proceed?

@schlunma
Copy link
Contributor

schlunma commented Nov 5, 2024

I am not sure. The easiest solution would probably be to drop it entirely, and I cannot think of another solution at the moment...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backwards incompatible change dask related to improvements using Dask preprocessor Related to the preprocessor
Projects
None yet
3 participants