Rework parallel doc example using CalcJobs #288
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main     #288      +/-   ##
==========================================
+ Coverage   75.75%   80.64%   +4.89%
==========================================
  Files          70       66       -4
  Lines        4615     5147     +532
==========================================
+ Hits         3496     4151     +655
+ Misses       1119      996     -123

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Hi @agoscinski, thanks for your efforts on this. Yes, using CalcJobs is the right way to demonstrate parallel execution. The old example, which utilizes a calcfunction, always runs sequentially.
Besides, in most cases, after these tasks run in parallel, we need to gather and process the results. The old example is quite valuable for realistic applications, as already mentioned in this post. Therefore, I'd like to suggest keeping the old example but updating it to replace the calcfunction with a CalcJob.
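A minimal sketch of the gather pattern described above (the task names `double` and `gather_sum` and the keyword-wiring via `add_task` are assumptions based on the aiida-workgraph API, not code from this PR; calcfunctions are used only to keep the sketch self-contained, while real parallel speedup needs CalcJobs, as this PR shows):

```python
from aiida import load_profile, orm
from aiida_workgraph import WorkGraph, task

load_profile()


@task.calcfunction()
def double(x):
    return orm.Int(x.value * 2)


@task.calcfunction()
def gather_sum(a, b, c):
    # runs only once all three upstream tasks have finished
    return orm.Int(a.value + b.value + c.value)


wg = WorkGraph("gather_example")
t1 = wg.add_task(double, name="double1", x=orm.Int(1))
t2 = wg.add_task(double, name="double2", x=orm.Int(2))
t3 = wg.add_task(double, name="double3", x=orm.Int(3))
# wiring outputs to inputs creates the dependencies that delay gather_sum
wg.add_task(
    gather_sum,
    name="gather",
    a=t1.outputs["result"],
    b=t2.outputs["result"],
    c=t3.outputs["result"],
)
wg.submit(wait=True)
```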
I wanted to put the for loop back into the notebook, but compare it to executing WorkGraphs separately, so I can relate it a bit to what is written here: https://aiida.discourse.group/t/run-only-one-job-on-local-machine/459/2
This shouldn't happen. Could you double-check?
Ah, it is because of caching. After I disabled caching with verdi, it somehow remains active; I am not sure why. I will try to reproduce it later and open an issue on aiida-core.
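For reference, a hedged sketch of disabling caching using aiida-core's documented caching controls (whether these cover the stale-cache behaviour seen here is exactly what is unclear above):

```python
# Globally, from the shell:
#   verdi config set caching.default_enabled False
# Or temporarily, from Python:
from aiida import load_profile
from aiida.manage.caching import disable_caching

load_profile()

with disable_caching():
    ...  # processes launched inside this context will not reuse cached nodes
```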
Force-pushed from 45eb592 to c977986.
I had to rename the parallel file so that no cached version is used from Read the Docs. I am still confused why this happened, since I enforced a rerun on the sphinx-gallery side using …
Hi @agoscinski, thanks for the work. I do have one concern: the current execution time of 2 minutes and 4.786 seconds is quite long.
Here are my suggestions:
- The "Parallelizing WorkGraphs" section is not needed. Take one use case as an example, I have PwRelax workgraph and a large set of structures to relax. Users just need to directly submit multiple workgraphs using a simple loop, without waiting for each to finish. If users write another workgraph to manage parallel execution, users need to think about how to passing input into the sub-workgraph, and handling potential interruptions of the top-level workgraph. While tracking the provenance of 100 workgraphs submitted together might be a potential benefit, it’s not something most users would require.
- Add a description of the workgraph execution mechanism at the beginning of the notebook: workgraphs are based on dependency-driven execution, meaning tasks run in parallel automatically if they have no dependencies on each other and there are sufficient resources. I recall you demonstrated this well in the EuroSciPy tutorial, with a workgraph showing how the processes run in parallel (the execution order). You could include that example here, and also show the GUI for better visualization.
- Better to show the GUI for every workgraph: this will help users easily see which tasks will run in parallel.
- Mention the result-gathering process: typically, users will want to gather the results after the parallel tasks are complete. You can add a link to the aggregate notebook; if it's not ready yet, you can include a comment that it will be added later.
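The simple-loop pattern from the first suggestion, as a minimal sketch (the `build_relax_wg` factory and the `structures` list are hypothetical placeholders, not code from this PR):

```python
from aiida import load_profile

load_profile()

# Submitting returns immediately, so the daemon works on all graphs
# concurrently, and an interrupted script does not affect graphs that
# were already submitted.
for structure in structures:        # hypothetical list of StructureData nodes
    wg = build_relax_wg(structure)  # hypothetical factory returning a WorkGraph
    wg.submit()
```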
```python
@task.graph_builder(
    inputs=[{"name": "integer"}], outputs=[{"name": "sum", "from": "sum_task.result"}]
)
def add10_wg(integer):
```
Why use this name `add10_wg`?
`add10` because it adds 10 to a number. `wg` is, I think, a bit general, and it points to the problem I have when I want to refer to a graph builder: it can be a function as defined here, a task when integrated into a WorkGraph, and it is actually a WorkGraph that is executed. Because the whole tutorial is about parallelizing workgraphs, I added `wg` to the name, but we can also remove it.
Renamed to `add10`; I think that is more consistent with the rest.
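For context, a hedged sketch of how the decorated graph builder might be used as a task inside a parent WorkGraph, which is what makes the function/task/WorkGraph distinction blurry (it assumes the renamed `add10` from the snippet above, whose `sum_task` is defined inside the builder):

```python
from aiida import load_profile, orm
from aiida_workgraph import WorkGraph

load_profile()

wg = WorkGraph("parallel_add10")
for i in range(3):
    # each graph-builder task expands into its own sub-WorkGraph at runtime
    wg.add_task(add10, name=f"add10_{i}", integer=orm.Int(i))
wg.submit(wait=True)
```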
I would not rely on any QE-dependent documentation pages, as I think we should move them to aiida-tutorials. Also, this page is more focused on the parallel execution, measuring timings, and talking about daemons. If there is a user base for which both use cases might be useful, then we should keep it (at least until more feedback from outside). If it is the running time, then we should work on improving this. I will reduce the sleeps and the iterations at the end to bring the running time down to 1 minute.
This is mentioned in the second sentence of the example. I will include the GUI and write comments in the first example emphasizing this. Maybe you can note in the code where I should also mention it.
Included!
Okay, I reference it at the end, but the Sphinx reference does not work for the moment, since we need to wait for #287 to be merged.
I decreased the time to 1 min 11 s by reusing the graph builder runs from before for the daemon part. Also, I now keep just 2 iterations, as this is enough to showcase the effect of daemons, and I reduced the sleep time to 3 seconds.
Hi @agoscinski, the docs failed to build and pre-commit failed. Could you please fix them? I will review as soon as they pass.
Note that RTD somehow caches the other file, which is why I renamed it. I don't fully understand this.
The previous example did rely on calcfunctions, which always run sequentially. This example now uses CalcJobs to actually achieve parallel execution.
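To illustrate the difference, a hedged sketch (not the PR's actual notebook code; it assumes a pre-configured `add@localhost` code and the `add_task` API of aiida-workgraph):

```python
from aiida import load_profile, orm
from aiida.calculations.arithmetic.add import ArithmeticAddCalculation
from aiida_workgraph import WorkGraph

load_profile()
code = orm.load_code("add@localhost")  # assumed to be set up beforehand

wg = WorkGraph("parallel_calcjobs")
for i in range(3):
    # CalcJobs go through the daemon, so these three independent tasks can
    # actually run at the same time; calcfunctions would instead run inline,
    # one after the other, in the interpreter executing the WorkGraph.
    wg.add_task(
        ArithmeticAddCalculation,
        name=f"add{i}",
        x=orm.Int(i),
        y=orm.Int(10),
        code=code,
    )
wg.submit(wait=True)
```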
Commit "…re run in parallel" (Co-authored-by: Xing Wang <[email protected]>); force-pushed from 2d3cc3f to d7089cc.
```python
# Be aware that for the moment AiiDA can only run 200 WorkGraphs at the same time.
# To increase that limit one can set this variable to a higher value.
```
Suggested change:

```python
# Be aware that for the moment, AiiDA can only run 200 processes (WorkGraph, CalcJob, etc.) at the same time.
# To increase that limit, one can set this variable to a higher value.
```
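For context, a hedged note on where this limit likely comes from, assuming current aiida-core behaviour: the config option `daemon.worker_process_slots` defaults to 200 processes per daemon worker. CLI commands are shown as comments, matching the gallery-script style used in this PR:

```python
# Raise the number of processes each daemon worker may handle, then restart:
#   verdi config set daemon.worker_process_slots 400
#   verdi daemon restart
```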
```python
# Since each daemon worker can only manage one WorkGraph (handling the results)
# at a time, one can experience slow downs when running many jobs that can be
# run in parallel. The optimal number of workers depends highly on the jobs
```
Suggested change:

```python
# One can experience slow downs when running many jobs (e.g., 100 jobs) that can be
# run in parallel. The optimal number of workers depends highly on the jobs
```
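Relatedly, a hedged sketch of adjusting the number of daemon workers with aiida-core's CLI (again shown as comments, in the gallery-script style):

```python
# Start the daemon with several workers, or add workers to a running daemon:
#   verdi daemon start 2
#   verdi daemon incr 1
#   verdi daemon status
```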
```python
# The overhead time has shortens a bit as the handling of the CalcJobs and
# WorkGraphs could be parallelized. One can increase the number of iterations
```
Are you sure about "the overhead time has shortens a bit"? Looking at the times, there is no improvement:

Time for running parallelized graph builder: 0:00:11.262496
Time for running parallelized graph builder with 2 daemons: 0:00:11.264958
```python
# verdi daemon restart
```
It would be good to add a link to the performance page.
> The previous example did rely on calcfunctions that are always run sequentially. This example now uses CalcJobs to actually achieve parallel executions.
@superstar54 I removed the old example since it did not make much sense to me, as it was using calcfunctions, but I might be missing something you actually wanted to show with it.