-
I have a Virtual Machine with SUSE Tumbleweed under Parallels on my Mac Mini and I gave it 8 CPUs. How can I best control OpenMP? Should I put `cpu, tpool_nthreads=1` into the startup script of GDL, or is it better to compile GDL without OpenMP support?
Replies: 20 comments 6 replies
-
It is quite difficult for me to give "the" answer without code, details on the hardware, and without being sure that "Parallels" behaves perfectly ... The facts I have:
If I understand correctly, in your case you are running 8 different independent sessions of GDL, each of which may activate OpenMP? It is easy to predict that the tasks which can use more than one core will try to do so, but will then block the other tasks (swapping in and out of the cores ...). Generally speaking, it is not good to run, on one node, several codes that may each request all the cores at any time ... And yes, if you really want to run 8 jobs in parallel on the same node, it would be much better to limit each session to a single thread, or to recompile GDL with OpenMP switched off. HTH A.
PS: I do have some experience on multi-core machines with Slurm, where we request only a limited number of cores (say 1 or 4) in a big node (36 or 48 cores). And yes, if there is not too much I/O, you can run several independent tasks on a few cores each without losing performance (the scalability is OK).
PS: it has been a long time since I last tried to run heavy multi-core GDL codes in parallel. When I tried, it was clearly not a good way: the performance clearly decreased.
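The advice above (run the sessions, but force each one down to a single OpenMP thread) can be sketched as a small launcher. `OMP_NUM_THREADS` is the standard OpenMP environment variable; the `gdl` invocation in the comment is a hypothetical placeholder, not a command from this thread:

```python
import os
import subprocess

def launch_jobs(n_jobs, cmd):
    """Start n_jobs independent processes, each restricted to a single
    OpenMP thread via its environment, and wait for all of them."""
    procs = []
    for _ in range(n_jobs):
        # Override (or set) OMP_NUM_THREADS so each job uses one thread.
        env = dict(os.environ, OMP_NUM_THREADS="1")
        procs.append(subprocess.Popen(cmd, env=env))
    return [p.wait() for p in procs]  # list of exit codes

# e.g. launch_jobs(8, ["gdl", "-e", "my_batch_job"])  # hypothetical invocation
```

This way the 8 sessions compete for 8 cores one-to-one instead of each trying to grab all of them.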
-
Thank you for your fast answer. Indeed I am running 8 different jobs in parallel in 8 different GDL sessions. I have now switched OpenMP off, and now all of them get 100%, and to my feeling each of them runs much faster than the jobs that got 300% before. I will do more measurements, but to my feeling OpenMP is not working well in my scenario. BTW, I wrote `cpu, tpool_nthreads=1` into my startup script and the jobs still got 300%. Can this be?
-
Your remark is very helpful!
Indeed I use sp_plan_dft_1d from Intel to do my FFT; I call it with call_external.
Maybe that causes the confusion. What do you think?
… On 24. Sep 2021, at 18:30, Alain ***@***.***> wrote:
I just tested on my laptop, with a 4-core CPU, running the benchmark after setting cpu,tpool_nthreads=1
and, following it with top, it always stays around 99 or 100%. Without that, it can peak above 200 and up to 395% 😄
(and the figures are in line with plot_all_benchmark)
Do you use a specific procedure or function in GDL which may be using OpenMP directly without following our instruction? Interesting! Or is the !cpu value maybe reset somewhere?
-
Hi, I'm surprised there are no occurrences of …
-
If you add "MKL" to your searches, you will find a few answers ... this is Intel MKL related. And yes, I am curious whether it is sensitive to `export OMP_NUM_THREADS=1` outside GDL.
-
Sorry for the late response. I am quite busy right now.
What I can definitely say is the following:
-> if compiled with -DOPENMP=ON, the processes run at 100% for quite a while, but at some point they go up to (in my case) 300-400%.
This happens even with `cpu, tpool_nthreads=1` inside the startup script, and it seems not to be related to the Intel MKL.
-> if compiled with -DOPENMP=OFF, they never go above 100%.
-> with -DOPENMP=ON everything is much slower than with -DOPENMP=OFF. I cannot yet say what my program is doing at the moment it uses 400% of CPU time (it is a complex processing, I need to debug further), but it seems to be somehow stuck doing "nothing".
I will continue debugging and keep you updated.
… On 26. Sep 2021, at 10:47, Alain ***@***.***> wrote:
If you add "MKL" to your searches, you will find a few answers ... this is Intel MKL related.
And yes, I am curious whether it is sensitive to `export OMP_NUM_THREADS=1` outside GDL
-
Thanks for the clarification.
Last night I ran a benchmark test with my programs. Here is the setup:
SUSE Tumbleweed inside Parallels on a Mac Mini (6-core Intel i7).
This CPU has 12 threads and I dedicate 9 to the virtual machine.
My programs have 24 (main) jobs to do. A controller starts up to 8 jobs in parallel (leaving one CPU free).
When one job finishes, a new job is started automatically in a new bash.
Theoretically there are 3 times 8 jobs to be executed in parallel.
The "jobs" to do are:
- read and write big files (some GB), line by line
- read and write ASCII files (up to 100 MB each)
- do FFT calculations (with call_external)
Here are the statistics:
1) IDL (v 7.0.1):
execution time: 23 min
remarks: each job uses between 100% and around 300% CPU
2) GDL (git) compiled with -DOPENMP=OFF:
execution time: 28 min
remarks: each job uses up to 100% CPU
3) GDL (git) compiled with -DOPENMP=ON and no limitation in nthreads:
execution time: 222 min (3 hours and 42 min)
remarks: each job uses between 100% and 300% CPU
So clearly there is something wrong with OpenMP (at least with my setup).
In case 3) I see six of the jobs running for at least 30 min at around 130% CPU (and one at 20%), while literally doing almost nothing.
How should we proceed?
Maybe I should try to reproduce this with a much simpler set of programs, so that we can test it easily with different setups.
… On 27. Sep 2021, at 10:53, Giloo ***@***.***> wrote:
-DOPENMP=OFF is not equivalent to -DOPENMP=ON + "CPU, TPOOL_NTHREADS=1" in many parts of GDL. In the first case (-DOPENMP=OFF) the #pragma omp sections are not interpreted at all, while in the second case they are interpreted and may construct a (rather) complicated parallel thread structure just to call a simple piece of code once. Some parts of the code are thus:
if OpenMP is present and num_threads is > 1 and the TPOOL_XXX values are such that we will need parallelizing, then run a parallel section of code; else run the same code, but without the parallel hassle.
However, the above test on the need to resort to parallelizing may itself be a problem if it sits inside a loop (that happens), since a test inside a loop is a burden.
And some parts of the code have not been reviewed with the above two aspects in mind.
So you have probably hit one or more of those numerous cases where parallelizing in the GDL code is not, eh, optimal, and it will be interesting to know which they are.
-
I now run with -DOPENMP=ON and GDL> CPU, TPOOL_NTHREADS=1, and the processing time is 50 min.
The jobs stay at 100% for a long time, but also stay at around 300-400% for a significant time.
… On 28. Sep 2021, at 15:11, Giloo ***@***.***> wrote:
Well, OpenMP is OK; it's us GDL programmers that are not.
You have not tested -DOPENMP=ON with GDL> CPU, TPOOL_NTHREADS=1, which could also be interesting.
Depending on the performance in that case, it will be rather easy to find the commands that are not well 'parallelized'.
As GDL performance without OMP is not very different from IDL's, this points to an overzealous OMP usage in a GDL function that does not need, or cannot support, parallelization.
Accordingly, it would be interesting to know what IDL does with IDL> CPU, TPOOL_NTHREADS=1.
-
Update:
I am back on a machine with 80 CPUs and did more tests regarding OpenMP.
I ran this simple program with different configurations:

```idl
pro test
  na = 3e5
  a = findgen(na)
again:
  time1 = systime(1)
  print, systime(0)
  for i = 0l, na - 2 do x = where(a eq i)
  time2 = systime(1)
  print, '------------------>', long(time2 - time1), ' sec'
  goto, again
end
```
Here are the results:

| TPOOL_NTHREADS | GDL (openmp=on) %CPU | GDL time | IDL %CPU | IDL time |
|---|---|---|---|---|
| 1 | 100 | 115 sec | 100 | 149 sec |
| 2 | 200 | 58 sec | 200 | 78 sec |
| 4 | 400 | 31 sec | 300 | 67 sec |
| 8 | 800 | 18 sec | 500 | 41 sec |
| 16 | 1600 | 24 sec | 700 | 53 sec |
| 32 | 3200 | 16 sec | 1000 | 92 sec |
| 64 | 6400 | 17 sec | 1700 | 200 sec |
| 80 | 8000 | 26 sec | 2000 | 267 sec |

where, e.g., the last IDL row means `cpu, TPOOL_NTHREADS=80` was set and the process used 2000% CPU.
Here are the plots: [plot images not reproduced here]
What does this mean?
1) we should avoid loops where possible
2) GDL behaves much better than IDL in this example
3) there is almost no difference between using 80 CPUs or just 4 CPUs in GDL
What is GDL doing with those 8000% of CPU?
Well, as @giloo said, OpenMP is OK, it's us GDL programmers that are not.
I am just wondering whether there are some rules to avoid 80 CPUs doing the same work as 4 CPUs…
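On point 1), the benchmark's pattern (calling where(a eq i) once per value of i) scans the whole array n times. The same information can be gathered in a single pass; here is a plain-Python sketch of the idea, not GDL code:

```python
def index_positions(a):
    """One pass over `a`: map each value to the list of indices where it
    occurs. Same answer as looping where(a eq i) over every i, but
    O(n) instead of O(n^2), with no parallelism needed at all."""
    idx = {}
    for pos, val in enumerate(a):
        idx.setdefault(val, []).append(pos)
    return idx
```

No amount of threading recovers the factor lost to a quadratic algorithm, which is one reason the thread count barely matters past a few cores here.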
-
…and now it comes…
Starting two of these jobs in parallel gives the following result:
CPU80/4000 6990 sec
CPU80/4000 7800 sec
compared to a single job:
CPU80/8000 26 sec
I expected a time of 52 seconds but got on average 7400 seconds, so around 142 times slower!
Again, the program is just a single loop with a where; running alone it takes 26 seconds, compared to more than two hours when running in parallel.
What could we do here?
-
@brandy125 there may be several reasons for these erratic results.
-
Yes, I am alone on that machine, observing the CPU consumption all the time. The results are very reliable and reproducible.
I will now limit the number of CPUs to a much smaller number (8 in my case), just to avoid that strange behaviour.
What I observed as well is that when starting the second job, the number of free CPUs was reported as 79, while it should have been 40.
Maybe GDL should average over a much shorter interval when calculating the number of free CPUs.
Anyway, I will do some more experiments and keep you informed.
… On 28. Oct 2021, at 18:47, Giloo ***@***.***> wrote:
@brandy125 <https://github.com/brandy125> there may be several reasons for these erratic results.
First, are you alone on this 80-cpu machine? If not, you suffer from thread competition with other programs.
Parallelizing divides the work into small chunks, each with its own thread; at some point the overhead of threading becomes larger than the benefit of computing a smaller number of values in each thread.
In the example, where() will treat 80 subsets of 3750 points, but painstakingly put the results back in order each time using memory copies; those will be much more limiting than the computation itself. Some other functions like total() do not have this memory impact and should behave better.
In fact, I realize now that several parallel constructs in GDL should somehow be capped to a reasonable max number of threads, so as to optimise both the number of threads and the size of the problem.
In other words, an 80-cpu machine is good for doing 80 different things, not 80 times 1/80th of one thing.
I'm no specialist, but I suspect IDL (hence GDL) provides these thread pool CPU commands to fine-tune this or that job, but it has to be tuned by hand.
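The capping idea mentioned in this reply (never spawn more threads than the data can usefully feed) might look like the sketch below; the per-thread minimum is an assumed tuning constant for illustration, not a value from GDL:

```python
MIN_ELTS_PER_THREAD = 50_000  # assumed tuning constant, not a GDL value

def capped_nthreads(requested, n_elements):
    """Cap the thread count so each thread gets a worthwhile chunk.
    An 80-way split of 3e5 points leaves only 3750 per thread, where
    the merge overhead dominates the computation itself."""
    useful = max(1, n_elements // MIN_ELTS_PER_THREAD)
    return min(max(1, requested), useful)
```

With this rule, a request for 80 threads on a 3e5-element array would quietly fall back to a handful of threads, matching the observation that 4 and 80 CPUs perform about the same here.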
-
Yes, thanks. It would be interesting to know whether IDL also gives 79, or 40.
-
Yes, IDL gives 79 as well.
… On 29. Oct 2021, at 17:47, Giloo ***@***.***> wrote:
Yes, thanks. It is interesting to know if IDL gives also 79 , or 40.
-
So IDL also gives 80 CPUs for the second job, even after waiting for some minutes. But the results are much better:
two jobs in parallel (no limitation in nthreads):
GDL:
CPU80/4000 7800 sec
CPU80/4000 6990 sec
IDL:
cpu80/1000 200 sec
cpu80/1000 198 sec
Remember that for a single job the times are:
GDL:
CPU80/8000 26 sec
IDL:
cpu80/2000 267 sec
It seems that IDL uses only 2000% CPU for this kind of calculation, independent of the number of jobs, even though it could get 8000%. Very weird. What else can I test?
I just want to avoid the situation where someone runs a simple program twice in parallel and the program gets stuck.
… On 29. Oct 2021, at 17:47, Giloo ***@***.***> wrote:
Yes, thanks. It is interesting to know if IDL gives also 79 , or 40.
-
I would say IDL (always the clever boy) has some way to avoid the problem of diminishing returns due to many small, useless threads, as I described here, so it 'caps' itself.
-
I found that in many places where TPOOL_MIN_ELTS would trigger OpenMP parallelism, GDL does not honor the number of threads given by TPOOL_NTHREADS; it just uses all the available threads. This is being corrected, but it obviously renders the above discussion somewhat moot.
-
Hi @brandy125, if you have access to a many-core machine, #1624 tentatively introduces a more balanced use of threads, triggered by the gdl option '--smart-tpool'.
-
Hi @GillesDuvert I will be happy to test that!
-
Hi @GillesDuvert I did a first test with that feature, and here comes the result. I have a 20-core machine and a processing run that uses 12 cores at the same time. I did 5 runs each with and without --smart-tpool enabled. Without smart-tpool I got: [timings not reproduced here]
With smart-tpool I got: [timings not reproduced here]
So this is a very clear 13% increase in performance! Thank you!