Performance observations on dual-socket AMD Rome system #29
Comments
Original comment by Thomas Herault (Bitbucket: herault, GitHub: therault). Currently, PaRSEC does not automatically detect that two or more PaRSEC processes are running on the same node, and in that case the default binding policy binds multiple threads to the same cores. Moreover, with the -c 63 passed in the 2-process-per-node setup, you are explicitly asking the runtime to create 63 compute threads per process, so oversubscription is inevitable. This is probably the source of the performance drop you observe. If you want to run with 2 or more processes per node, you should give each process a bitmask of the cores it is allowed to bind to, through the PaRSEC command line argument --parsec_bind 0xffffff.
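A minimal sketch of that suggestion, assuming Open MPI exports `OMPI_COMM_WORLD_LOCAL_RANK`, that the tester forwards `--parsec_bind` on to PaRSEC, and that the hex masks match a 2x64-core node (cores 0-63 on socket 0, 64-127 on socket 1); the wrapper name and masks are illustrative only:

```sh
#!/bin/sh
# bind_wrapper.sh -- illustrative only. Gives each local rank on a node its own
# core mask before launching the real binary, so the two PaRSEC processes do
# not bind their threads to the same cores.
# Assumptions: Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK, --parsec_bind
# accepts a hex bitmask (as in the 0xffffff example above), and the node has
# 2x64 cores.
if [ "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" -eq 0 ]; then
    mask=0xffffffffffffffff                    # cores 0-63  (socket 0)
else
    mask=0xffffffffffffffff0000000000000000    # cores 64-127 (socket 1)
fi
exec "$@" --parsec_bind "$mask"
```

It could then be launched along the lines of `mpirun -np 2 --map-by socket ./bind_wrapper.sh ./testing_dpotrf …` (remaining tester arguments elided).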
Original comment by Joseph Schuchart (Bitbucket: jschuchart, GitHub: jschuchart). Mhh, doesn’t the binding policy of OMPI (…
Original comment by Thomas Herault (Bitbucket: herault, GitHub: therault). I think the goal of cpuset_allowed_mask was to restrict where threads can bind, but I don’t see in the code where we set cpuset_allowed_mask by default: parsec_init sets it to NULL, and parsec_parse_binding_parameters modifies it, but only to allocate a new one. It is easy to check with htop where the threads are running and how many cores are busy. I think that today PaRSEC overrides the binding from the runtime system.

How many hardware threads does your node support? 128? It would be interesting to look at why PaRSEC detects 256: it should not create threads for hyperthreads unless requested…

As for the 63/64: yes, by default we let the comm thread float between all the cores and do not bind it, and there are many cases where it is better for performance to dedicate a core to the comm thread. There is a command line option to force that and bind it to a specific core: --parsec-bind-comm
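A few ways to check this, assuming hwloc (`lstopo`) and util-linux (`taskset`) are available on the node; `testing_dpotrf` is the tester name from the report and is used here only to find the running process:

```sh
# Count physical cores vs. hardware threads (PUs) as hwloc sees them;
# 128 cores but 256 PUs would mean SMT is enabled and could explain the 256.
lstopo --only core | wc -l
lstopo --only pu   | wc -l

# Print the allowed-CPU list of every thread of a running tester process,
# to see whether threads from the two ranks overlap on the same cores.
pid=$(pgrep -f testing_dpotrf | head -1)
for tid in /proc/"$pid"/task/*; do
    taskset -cp "$(basename "$tid")"
done
```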
Original comment by Joseph Schuchart (Bitbucket: jschuchart, GitHub: jschuchart). OK, I have to correct my statements about HT: I was running with the OMPI … I tried to pass --parsec-bind-comm to testing_dpotrf, but it complains that it doesn’t know that option. I guess I’m still not familiar with how PaRSEC handles option arguments…
Original report by Joseph Schuchart (Bitbucket: jschuchart, GitHub: jschuchart).
I’m running different DPLASMA tests (built from current master with Open MPI 4.0.5) on our new AMD Rome system (2x64 cores per node, ConnectX-6 fabric). I’m observing that performance is within expectations when running one PaRSEC process per node, bound to the first socket:
(Note that the binding output of Open MPI is truncated; when running with current master, the process seems to be correctly bound to all cores on the first package.) The 6.6 TF above is about 87% of the max DGEMM performance I’m observing on a single socket:
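(For scale, 6.6 TF at roughly 87% efficiency implies a single-socket DGEMM peak of about 6.6 / 0.87 ≈ 7.6 TF.)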
Now, if I run with 2 PaRSEC processes per node (each bound to one socket), I actually observe a drop in performance:
I should note that this does not seem to be a hardware problem: if I run one PaRSEC process across the full node, I see reasonable scaling:
Any idea what may cause the performance drop when running two PaRSEC processes per node? Is it caused by MPI transfers not being offloaded to hardware?
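For what it’s worth, one way to double-check what Open MPI itself does for the 2-process-per-node case (standard Open MPI options; the tester arguments are placeholders, and whether PaRSEC honours this binding is exactly what the comments above discuss):

```sh
# Two ranks per node, one per socket, with Open MPI printing the binding it
# applied; compare this against where the PaRSEC threads actually end up.
mpirun -np 2 --map-by ppr:1:socket --bind-to socket --report-bindings \
       ./testing_dpotrf -c 63 ...
```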