Add a tutorial on configuring Dask for use with ESMValTool #332

Open: bouweandela wants to merge 9 commits into base: main

bouweandela
Member

@bouweandela bouweandela commented May 29, 2024

Add a tutorial on configuring Dask for use with ESMValTool

Pull Request checklist

We appreciate your time and effort to improve the tutorial. Please keep in mind that lesson maintainers are volunteers and it may be some time before they can respond to your contribution.


Before you start

  • Read CONTRIBUTING.md.
  • Create an issue to discuss your idea. This allows your contributions to be incorporated into the tutorial.

Tasks

  • Give this pull request a descriptive title.
  • If you are contributing to existing lesson materials, please make sure the content conforms to the Lesson development section in CONTRIBUTING.md and does not contain any spelling or grammatical errors.
  • If you are making a new episode, please make sure the content conforms to the Lesson organization and Lesson formatting sections in CONTRIBUTING.md and does not contain any spelling or grammatical errors.
  • Preferably Codacy checks pass. Status can be seen below your pull request. If there is an error, click the link to find out why.
  • Preview changes on your machine before pushing them to GitHub by running make serve, alternatively make docker-serve. Please see the Previewing your changes locally section in CONTRIBUTING.md for installation instructions.
  • All code instructions have been tested.

If you need help with any of the tasks above, please do not hesitate to ask by commenting in the issue or pull request.


Closes #323

>
> In the config-user.yml file, there is a setting called ``max_parallel_tasks``.
> Any variable or diagnostic script in the recipe is considered a 'task' in this
> context and when settings this to a value larger than 1, these will be
Contributor

"settings" -> "set"

Member Author

I rewrote the sentence. Does it look better now?

>> threads_per_worker: 2
>> memory_limit: 2GiB
>>```
>> Note that the bars representing the memory use turn
Contributor

@rswamina rswamina Jun 25, 2024

A few lines from the log are below. Could the tutorial say something about the KeyErrors seen in these messages, and about the workers restarting, which I think could be due to limiting the memory? It would be helpful if users could get a sense of what these messages mean and how to adjust the quota or the number of workers. I am using the pre-installed module on JASMIN and the 2GiB memory limit specified in the change above.

2024-06-25 11:07:41,296 - distributed.dashboard.components.scheduler - ERROR - 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
Traceback (most recent call last):
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in update_layout
    x = max(xs[dep] for dep in dependencies[tg]) + 1
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in <genexpr>
    x = max(xs[dep] for dep in dependencies[tg]) + 1
            ~~^^^^^
KeyError: 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
2024-06-25 11:07:41,297 - distributed.dashboard.components.scheduler - ERROR - 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
Traceback (most recent call last):
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 4697, in tg_graph_doc
    tg_graph.update()
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/core/property/validation.py", line 95, in func
    return input_function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2687, in update
    self.nodes_layout, self.arrows_layout = self.update_layout()
                                            ^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/core/property/validation.py", line 95, in func
    return input_function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in update_layout
    x = max(xs[dep] for dep in dependencies[tg]) + 1
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in <genexpr>
    x = max(xs[dep] for dep in dependencies[tg]) + 1
            ~~^^^^^
KeyError: 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
2024-06-25 11:07:41,299 - tornado.application - ERROR - Uncaught exception GET /groups (127.0.0.1)
HTTPServerRequest(protocol='http', host='127.0.0.1:8787', method='GET', uri='/groups', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/tornado/web.py", line 1786, in _execute
    result = await result
             ^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/server/views/doc_handler.py", line 54, in get
    session = await self.get_session()
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/server/views/session_handler.py", line 145, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/server/contexts.py", line 240, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/application/application.py", line 190, in initialize_document
    h.modify_document(doc)
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/application/handlers/function.py", line 140, in modify_document
    self._func(doc)
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 4697, in tg_graph_doc
    tg_graph.update()
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/core/property/validation.py", line 95, in func
    return input_function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2687, in update
    self.nodes_layout, self.arrows_layout = self.update_layout()
                                            ^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/bokeh/core/property/validation.py", line 95, in func
    return input_function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/utils.py", line 838, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in update_layout
    x = max(xs[dep] for dep in dependencies[tg]) + 1
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/lib/python3.11/site-packages/distributed/dashboard/components/scheduler.py", line 2638, in <genexpr>
    x = max(xs[dep] for dep in dependencies[tg]) + 1
            ~~^^^^^
KeyError: 'any-aggregate-9f4d871819cbc8d47661deff0adaa5e9'
2024-06-25 12:07:51,734 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 1.62 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:07:52,033 - distributed.worker.memory - WARNING - Worker is at 73% memory usage. Resuming worker. Process memory: 1.46 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:07:52,945 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 1.63 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:07:53,246 - distributed.worker.memory - WARNING - Worker is at 76% memory usage. Resuming worker. Process memory: 1.52 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:06,841 - distributed.worker.memory - WARNING - Worker is at 85% memory usage. Pausing worker.  Process memory: 1.70 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:07,132 - distributed.worker.memory - WARNING - Worker is at 75% memory usage. Resuming worker. Process memory: 1.51 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:11,265 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 1.62 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:11,296 - distributed.worker.memory - WARNING - Worker is at 79% memory usage. Resuming worker. Process memory: 1.59 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:18,942 - distributed.worker.memory - WARNING - Worker is at 82% memory usage. Pausing worker.  Process memory: 1.66 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:19,122 - distributed.worker.memory - WARNING - Worker is at 72% memory usage. Resuming worker. Process memory: 1.44 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:40,154 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 1.63 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:09:40,420 - distributed.worker.memory - WARNING - Worker is at 77% memory usage. Resuming worker. Process memory: 1.55 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:10:19,017 - distributed.worker.memory - WARNING - Worker is at 83% memory usage. Pausing worker.  Process memory: 1.66 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:10:19,248 - distributed.worker.memory - WARNING - Worker is at 75% memory usage. Resuming worker. Process memory: 1.51 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 12:10:35,455 - distributed.worker.memory - WARNING - Worker is at 82% memory usage. Pausing worker.  Process memory: 1.66 GiB -- Worker memory limit: 2.00 GiB
2024-06-25 11:10:35,597 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43621 (pid=18651) exceeded 95% memory budget. Restarting...
2024-06-25 11:10:36,331 - distributed.nanny - WARNING - Restarting worker
ERROR 1: PROJ: proj_create_from_database: Open of /apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/share/proj failed
2024-06-25 12:12:08,160 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 1.63 GiB -- Worker memory limit: 2.00 GiB

Finally, my program appears to hang with the following:

KeyError: <TaskState ('getitem-058a474f7282a87a89a6178a2607261b', 5, 0, 0) queued>
2024-06-25 12:16:46,395 - distributed.nanny - ERROR - Worker process died unexpectedly

Should this give the user an indication that things are not working with the allocated resources? If I am correct, can we add some text to guide the user on this?

Member Author

I added a note on line 164, but it seems easier and more educational to see this from the Dashboard.
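
For reference, the pausing and restarting warnings in the log above correspond to fractions of the worker memory limit. A minimal sketch of that arithmetic, assuming the Dask Distributed default thresholds: the 80% pause and 95% restart values match the log, while the 60% and 70% values and all names in the snippet are assumptions made for illustration.

```python
# Fractions of the per-worker memory limit at which Dask Distributed acts.
# Assumption: these are the library defaults; only 0.80 and 0.95 are
# confirmed by the log messages quoted in this thread.
ASSUMED_THRESHOLDS = {
    "target": 0.60,     # start moving data to disk
    "spill": 0.70,      # spill based on process memory
    "pause": 0.80,      # "Pausing worker"
    "terminate": 0.95,  # "exceeded 95% memory budget. Restarting..."
}


def thresholds_in_gib(memory_limit_gib):
    """Return the memory use (in GiB) at which each threshold is reached."""
    return {
        name: round(fraction * memory_limit_gib, 2)
        for name, fraction in ASSUMED_THRESHOLDS.items()
    }


# The 2GiB limit from the lesson example under review.
for name, gib in thresholds_in_gib(2.0).items():
    print(f"{name:<10} {gib:.2f} GiB")
```

Under these assumptions, a worker with a 2GiB limit pauses at about 1.6 GiB and is restarted by the nanny at about 1.9 GiB, so a task that needs more than that will probably keep triggering restarts until `memory_limit` is raised.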

[dask_jobqueue interactive use](https://jobqueue.dask.org/en/latest/interactive.html)
for more information.

Contributor

Can the example below be part of a Pro-tip? In other episodes, we put more advanced material in tips like that.

Member Author

Setting up the Dask Distributed cluster separately seems very useful for educational purposes, as it allows the training participants to play around with the dashboard at their leisure. Do you think the Python code is too intimidating?

Contributor

I think it might be, but it is also useful, so perhaps it would be good as a Pro-tip?
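
For readers of this thread without the diff open, here is a minimal sketch of what starting a cluster separately could look like. It assumes a `LocalCluster` from Dask Distributed rather than the dask_jobqueue setup linked above, and reuses the `threads_per_worker` and `memory_limit` values quoted earlier in this review; `n_workers=2` and the function names are hypothetical.

```python
import time

# Worker settings: threads_per_worker and memory_limit are the values quoted
# earlier in this review; n_workers is an assumed value for illustration.
CLUSTER_SETTINGS = {
    "n_workers": 2,
    "threads_per_worker": 2,
    "memory_limit": "2GiB",
}


def cluster_settings(**overrides):
    """Return the default settings, updated with any overrides."""
    return {**CLUSTER_SETTINGS, **overrides}


def start_cluster(**overrides):
    """Start a local Dask Distributed cluster and keep it running."""
    # Imported here so the settings can be inspected without Dask installed.
    from dask.distributed import LocalCluster

    cluster = LocalCluster(**cluster_settings(**overrides))
    print("Scheduler address:", cluster.scheduler_address)
    print("Dashboard:", cluster.dashboard_link)
    try:
        # Keep the cluster alive until interrupted with Ctrl+C.
        while True:
            time.sleep(60)
    except KeyboardInterrupt:
        cluster.close()


# Show the settings that start_cluster() would use by default.
print(cluster_settings())
```

Calling `start_cluster()` from a script, inside an `if __name__ == "__main__":` guard, prints the scheduler address and the dashboard link, so participants can keep the dashboard open while they experiment.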

Contributor

@rswamina rswamina left a comment

Hi @bouweandela, as a first pass in the review, I have added some comments based on following your instructions in the tutorial.

@bouweandela
Member Author

Thanks for reviewing @rswamina! Could you have another look, please?

@bouweandela
Member Author

Friendly ping @rswamina, would you have time for another look at this?

@rswamina
Contributor

> Friendly ping @rswamina, would you have time for another look at this?

Sorry for the delay. Will do so.

> ## On using ``max_parallel_tasks``
>
> In the config-user.yml file, there is a setting called ``max_parallel_tasks``.
> Any variable or diagnostic script in the recipe is considered a 'task' in this
Contributor

Suggestion: "variable" -> "variable to be preprocessed".

>
> In the config-user.yml file, there is a setting called ``max_parallel_tasks``.
> Any variable or diagnostic script in the recipe is considered a 'task' in this
> context and this is set to a value larger than 1, these will be
Contributor

Did you mean "and when this is"?

Contributor

@rswamina rswamina left a comment

@bouweandela, I have resolved some comments and left a few new ones. Let me know if the suggested changes are OK and easy to make.

Development

Successfully merging this pull request may close these issues.

Add a chapter on configuring Dask for use with ESMValTool
2 participants