Add a tutorial on configuring Dask for use with ESMValTool #332
_episodes/11-dask-configuration.md
Outdated
>
> In the config-user.yml file, there is a setting called ``max_parallel_tasks``.
> Any variable or diagnostic script in the recipe is considered a 'task' in this
> context and when settings this to a value larger than 1, these will be
settings ->set
I rewrote the sentence, does it look better now?
_episodes/11-dask-configuration.md
Outdated
>> threads_per_worker: 2
>> memory_limit: 2GiB
>>```
>> Note that the bars representing the memory use turn
A few lines from the log are below. Is it possible to say something about the Key Errors seen in the message and also about workers restarting which I think could be due to limiting the memory? If the users can get a sense of what these messages mean and how to adjust the quota/number of workers that would be helpful. I am using the pre-installed module on JASMIN for this and the 2GiB memory limit as specified in the change above.
2024-06-25 11:07:41,296 - distributed.dashboard.components.scheduler - ERROR - 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
Traceback (most recent call last):
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in update_layout
x = max(xs[dep] for dep in dependencies[tg]) + 1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in <genexpr>
x = max(xs[dep] for dep in dependencies[tg]) + 1
~~^^^^^
KeyError: 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
2024-06-25 11:07:41,297 - distributed.dashboard.components.scheduler - ERROR - 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
Traceback (most recent call last):
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 4697, in tg_graph_doc
tg_graph.update()
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/core/property/validation.py", line 95, in func
return input_function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2687, in update
self.nodes_layout, self.arrows_layout = self.update_layout()
^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/core/property/validation.py", line 95, in func
return input_function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in update_layout
x = max(xs[dep] for dep in dependencies[tg]) + 1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in <genexpr>
x = max(xs[dep] for dep in dependencies[tg]) + 1
~~^^^^^
KeyError: 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
2024-06-25 11:07:41,299 - tornado.application - ERROR - Uncaught exception GET /groups (127.0.0.1)
HTTPServerRequest(protocol='http', host='127.0.0.1:8787', method='GET', uri='/groups', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/tornado/web.py", line 1786, in _execute
result = await result
^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/server/views/doc_handler.py", line 54, in get
session = await self.get_session()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/server/views/session_handler.py", line 145, in get_session
session = await self.application_context.create_session_if_needed(session_id, self.request, token)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/server/contexts.py", line 240, in create_session_if_needed
self._application.initialize_document(doc)
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/application/application.py", line 190, in initialize_document
h.modify_document(doc)
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/application/handlers/function.py", line 140, in modify_document
self._func(doc)
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 4697, in tg_graph_doc
tg_graph.update()
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/core/property/validation.py", line 95, in func
return input_function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2687, in update
self.nodes_layout, self.arrows_layout = self.update_layout()
^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/core/property/validation.py", line 95, in func
return input_function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in update_layout
x = max(xs[dep] for dep in dependencies[tg]) + 1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in <genexpr>
x = max(xs[dep] for dep in dependencies[tg]) + 1
~~^^^^^
KeyError: 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
2024-06-25 12:07:51,734 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 1.62 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:07:52,033 - distributed.worker.memory - WARNING - Worker is at 73% memory usage. Resuming worker. Process memory: 1.46 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:07:52,945 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 1.63 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:07:53,246 - distributed.worker.memory - WARNING - Worker is at 76% memory usage. Resuming worker. Process memory: 1.52 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:06,841 - distributed.worker.memory - WARNING - Worker is at 85% memory usage. Pausing worker. Process memory: 1.70 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:07,132 - distributed.worker.memory - WARNING - Worker is at 75% memory usage. Resuming worker. Process memory: 1.51 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:11,265 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 1.62 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:11,296 - distributed.worker.memory - WARNING - Worker is at 79% memory usage. Resuming worker. Process memory: 1.59 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:18,942 - distributed.worker.memory - WARNING - Worker is at 82% memory usage. Pausing worker. Process memory: 1.66 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:19,122 - distributed.worker.memory - WARNING - Worker is at 72% memory usage. Resuming worker. Process memory: 1.44 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:40,154 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 1.63 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:40,420 - distributed.worker.memory - WARNING - Worker is at 77% memory usage. Resuming worker. Process memory: 1.55 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:10:19,017 - distributed.worker.memory - WARNING - Worker is at 83% memory usage. Pausing worker. Process memory: 1.66 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:10:19,248 - distributed.worker.memory - WARNING - Worker is at 75% memory usage. Resuming worker. Process memory: 1.51 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:10:35,455 - distributed.worker.memory - WARNING - Worker is at 82% memory usage. Pausing worker. Process memory: 1.66 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 11:10:35,597 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43621 (pid=18651) exceeded 95% memory budget. Restarting...
2024-06-25 11:10:36,331 - distributed.nanny - WARNING - Restarting worker
ERROR 1: PROJ: proj_create_from_database: Open of /apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/share/proj failed
2024-06-25 12:12:08,160 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 1.63 GiB -- Worker memory limit: 2.00 GiB
Finally my program appears to hang with the following:
KeyError: <TaskState ('getitem-058a474f7282a87a89a6178a2607261b', 5, 0, 0) queued>
2024-06-25 12:16:46,395 - distributed.nanny - ERROR - Worker process died unexpectedly
Should this give the user an indication that things are not working with the allocated resources? If I am correct, can we add some text to guide the user on this?
I added a note on line 164, but it seems easier and more educational to see this from the Dashboard.
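For readers of this thread, the pause and restart warnings in the log above line up with the default memory thresholds of Dask Distributed, which are fractions of the worker `memory_limit`. The sketch below only does the arithmetic for the 2 GiB limit used here. The fractions are the library defaults for the `distributed.worker.memory.*` settings and may have been overridden on a given system; the helper name `memory_thresholds` is hypothetical.

```python
# Default fractions of memory_limit at which Dask Distributed reacts
# (settings distributed.worker.memory.target/spill/pause/terminate).
# Assumption: the site configuration has not changed these defaults.
DEFAULT_FRACTIONS = {
    "target": 0.60,     # worker starts spilling stored data to disk
    "spill": 0.70,      # spilling based on total process memory
    "pause": 0.80,      # worker stops starting new tasks ("Pausing worker")
    "terminate": 0.95,  # nanny kills and restarts the worker
}


def memory_thresholds(memory_limit_gib, fractions=DEFAULT_FRACTIONS):
    """Return the threshold in GiB for each setting (hypothetical helper)."""
    return {name: round(frac * memory_limit_gib, 2)
            for name, frac in fractions.items()}


thresholds = memory_thresholds(2.0)
for name, gib in thresholds.items():
    print(f"{name:>9}: {gib:.2f} GiB")
```

With a 2 GiB limit the worker pauses at about 1.6 GiB and is restarted at about 1.9 GiB, which is consistent with the "Pausing worker" messages around 1.62 GiB in the log. Repeated pausing and restarts suggest that the memory limit per worker is too low for the chunks being processed.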
[dask_jobqueue interactive use](https://jobqueue.dask.org/
en/latest/interactive.html)
for more information.
Can the example below be part of a Pro-tip? We put some more advanced material as tips like that in other episodes.
Setting up the Dask Distributed cluster separately seems very useful for educational purposes, as it allows the training participants to play around with the dashboard at their leisure. Do you think the Python code is too intimidating?
I think it might be but it is also useful, so perhaps good for a Pro-tip?
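To give a sense of how much Python is involved, here is a minimal sketch of starting a Dask Distributed cluster separately. It is not the code from the tutorial under review: it assumes the `distributed` package is installed, reuses the `threads_per_worker` and `memory_limit` values from the diff above, and picks `n_workers=2` purely for illustration. The helper names `cluster_settings` and `start_cluster` are hypothetical.

```python
def cluster_settings(n_workers=2, threads_per_worker=2, memory_limit="2GiB"):
    """Collect keyword arguments for distributed.LocalCluster.

    n_workers is an illustrative assumption; the other two values are
    taken from the configuration quoted earlier in this review.
    """
    return {
        "n_workers": n_workers,
        "threads_per_worker": threads_per_worker,
        "memory_limit": memory_limit,
    }


def start_cluster(**overrides):
    """Start a local cluster and report where to reach it."""
    # Imported here so the settings helper works without distributed installed.
    from distributed import LocalCluster

    cluster = LocalCluster(**cluster_settings(**overrides))
    print("Scheduler address:", cluster.scheduler_address)
    print("Dashboard:", cluster.dashboard_link)
    return cluster


settings = cluster_settings()
print(settings)

# Usage (keeps running until the cluster is closed):
#     cluster = start_cluster()
#     ...
#     cluster.close()
```

Keeping the cluster running in a separate session lets participants open the dashboard link and watch memory use while a recipe runs.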
Hi @bouweandela - I have added some comments based on following your instructions in the tutorial as a first pass in the review.
Thanks for reviewing @rswamina! Could you have another look, please?
Friendly ping @rswamina, would you have time for another look at this?
Sorry for the delay. Will do so.
> ## On using ``max_parallel_tasks``
>
> In the config-user.yml file, there is a setting called ``max_parallel_tasks``.
> Any variable or diagnostic script in the recipe is considered a 'task' in this
Suggestion: variable ->variable to be preprocessed.
>
> In the config-user.yml file, there is a setting called ``max_parallel_tasks``.
> Any variable or diagnostic script in the recipe is considered a 'task' in this
> context and this is set to a value larger than 1, these will be
did you mean "and when this is" ?
@bouweandela - I have resolved some and left a few comments. Let me know if these are ok and easily made.
Add a tutorial on configuring Dask for use with ESMValTool
Pull Request checklist
We appreciate your time and effort to improve the tutorial. Please keep in mind that lesson maintainers are volunteers and it may be some time before they can respond to your contribution.
Before you start
Tasks
If you are contributing to existing lesson materials, please make sure the content conforms to the Lesson development section in CONTRIBUTING.md and does not contain any spelling or grammatical errors.
Lesson organization and Lesson formatting sections in CONTRIBUTING.md and does not contain any spelling or grammatical errors.
Preferably Codacy checks pass. Status can be seen below your pull request. If there is an error, click the link to find out why.
make serve, alternatively make docker-serve. Please see the Previewing your changes locally section in CONTRIBUTING.md for installation instructions.
If you need help with any of the tasks above, please do not hesitate to ask by commenting in the issue or pull request.
Closes #323