question about the ior-hard-write result when using MPIIO+collective #68
Comments
Software versions: Lustre cluster
It looks like the MPIIO run wrote about 860 GiB in 355 s (Using actual aggregate bytes moved = 923312332800) and the POSIX run wrote about 7153 GiB in 1897 s (Using actual aggregate bytes moved = 7680164783616), so the POSIX run wrote 8.3x as much data in 5.35x as much time, meaning its bandwidth is 1.55x as high.
I've always thought that MPIIO collective IO *should* be faster for the ior-hard-write phase, but for some reason it is not. That is far outside my area of expertise, so I can't speculate why it would be slower, but it *should* be aggregating the IO into large chunks and writing them linearly to the storage.
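For anyone who wants to reproduce the arithmetic, here is a small standalone C snippet that recomputes those ratios from the byte counts and runtimes quoted above (numbers copied from the IOR output in this thread):

```c
/* Standalone check of the ratios quoted above; the byte counts and runtimes
 * are copied verbatim from the two IOR runs discussed in this thread. */
#include <stdio.h>

int main(void) {
    const double mpiio_bytes = 923312332800.0,  mpiio_secs = 355.0;   /* MPIIO+collective run */
    const double posix_bytes = 7680164783616.0, posix_secs = 1897.0;  /* POSIX run */
    const double gib = 1024.0 * 1024.0 * 1024.0;

    printf("MPIIO: %.0f GiB in %.0f s -> %.2f GiB/s\n",
           mpiio_bytes / gib, mpiio_secs, mpiio_bytes / gib / mpiio_secs);
    printf("POSIX: %.0f GiB in %.0f s -> %.2f GiB/s\n",
           posix_bytes / gib, posix_secs, posix_bytes / gib / posix_secs);
    printf("data ratio %.2fx, time ratio %.2fx, bandwidth ratio %.2fx\n",
           posix_bytes / mpiio_bytes, posix_secs / mpiio_secs,
           (posix_bytes / posix_secs) / (mpiio_bytes / mpiio_secs));
    return 0;
}
```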
OK, this explains the question, thanks a lot @adilger. I thought they both should finish the IOR test and write the same file size (Expected aggregate file size = 67691520000000), but the IOR tests can be stopped early, maybe caused by deadlineForStonewalling.
Yes, according to my tests. For a small-scale cluster (1 Gb TCP network, several HDD OSTs/MDTs) the ior-hard-write phase bandwidth ranking is POSIX < ROMIO < OMPIO. But for a large-scale cluster (100 Gb IB network, dozens of NVMe OSTs/MDTs), the ranking is POSIX > ROMIO >> OMPIO.
It seems the communication overhead of the MPI processes matters. Some research may explain why:
http://aturing.umcs.maine.edu/~phillip.dickens/pubs/Poster1.doc.pdf
https://phillipmdickens.github.io/pubs/paper1.pdf
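The aggregation behaviour mentioned here differs between the MPI-IO implementations, and on the ROMIO side the collective-buffering machinery is at least tunable. Below is a minimal sketch of setting the relevant hints from MPI code, offered only as a starting point for experiments; the hint keys are standard MPI-IO/ROMIO hints, but the values shown are arbitrary example numbers, not recommendations:

```c
/* Minimal sketch: tuning collective buffering through MPI_Info hints.
 * The hint keys are standard MPI-IO/ROMIO hints; the values are example
 * numbers only and would need tuning for a real system. */
#include <mpi.h>

MPI_File open_with_cb_hints(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");    /* force collective buffering for writes */
    MPI_Info_set(info, "cb_nodes", "16");              /* number of aggregator ranks (example) */
    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MiB aggregation buffer (example) */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```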
The reason for the time and performance difference is that with MPI-IO the I/O is synchronized. The benchmark runs on each process independently for 300 s, then the processes exchange information about how many I/Os were done, and in the final stage each process writes the same number of I/Os. This is the stonewalling feature with wear-out.
With MPI-IO, all processes have completed the same number of I/Os in every iteration, so they finish quickly after the 300 s. With independent I/O, some processes might be much faster than others, leading to a long wear-out period. Unfortunately, MPI-IO performance with collectives is only good in some cases.
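To make that mechanism concrete, here is a rough C/MPI sketch of the stonewalling-with-wear-out logic. It is not IOR's actual code, and write_one_block() is a hypothetical placeholder for one I/O of the configured transfer size:

```c
/* Rough sketch of stonewalling with wear-out (not IOR's actual code). */
#include <mpi.h>
#include <stdint.h>

/* Hypothetical placeholder for one I/O of the configured transfer size. */
static void write_one_block(uint64_t i) { (void)i; /* ...issue the write... */ }

static void stonewalled_write(double deadline_secs)
{
    double start = MPI_Wtime();
    uint64_t done = 0, max_done = 0;

    /* Phase 1: every rank writes independently until the deadline (-D). */
    while (MPI_Wtime() - start < deadline_secs) {
        write_one_block(done);
        done++;
    }

    /* Phase 2 (wear-out): ranks exchange their counts and everyone catches up
     * to the fastest rank, so all ranks end up with the same number of I/Os. */
    MPI_Allreduce(&done, &max_done, 1, MPI_UINT64_T, MPI_MAX, MPI_COMM_WORLD);
    while (done < max_done) {
        write_one_block(done);
        done++;
    }
}
```

With collective I/O every rank reaches roughly the same `done` count at the deadline, so the catch-up loop is short; with independent I/O the fastest rank can drive `max_done` far ahead, which is exactly the long wear-out period described above.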
Thanks, @JulianKunkel, for the explanation, I have a clearer understanding now.
For independent I/O, after running 300 s, all the processes still need to wait for the fsyncs to flush data to disk, right? Because the IOR test specifies the option -e (perform fsync upon POSIX write close), does the whole time depend on how long the fsyncs take to finish?
With -e fsync, it depends on the file system. If it does not have a client-side write cache (your Lustre setup shouldn't), then the fsync doesn't add much: the data was already transferred to the servers during each I/O.
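Illustrative only: on the POSIX path, -e roughly amounts to flushing before close, so any client-side-cached data is pushed out inside the timed region. A simplified sketch, not IOR's code:

```c
/* Simplified sketch of what "-e" adds on the POSIX write path (not IOR's code). */
#include <stddef.h>
#include <unistd.h>

static void write_and_sync(int fd, const char *buf, size_t len, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i++)
        write(fd, buf, len);  /* with no client-side write cache, data already went to the servers */

    fsync(fd);                /* cheap in that case; expensive only if data was cached locally */
    close(fd);
}
```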
When I run ior-hard-write with the MPIIO+collective API (OpenMPI ROMIO) with the same np=144, I find that although the running time is reduced a lot, the bandwidth is smaller than with the POSIX API.
result (API: MPIIO+collective)
result (API: POSIX)
Why is the running time better but the bandwidth worse?
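For context on what the two configurations do underneath (a rough illustration, not IOR's actual code): the POSIX backend issues independent writes from every rank, while MPIIO with collective I/O goes through MPI_File_write_at_all(), where the ranks coordinate and aggregator ranks write large contiguous chunks.

```c
/* Rough illustration of the difference between the two paths (not IOR's code). */
#include <mpi.h>

static void write_block(MPI_File fh, MPI_Offset off, const void *buf, int len, int collective)
{
    MPI_Status st;
    if (collective)
        MPI_File_write_at_all(fh, off, buf, len, MPI_BYTE, &st);  /* ranks synchronize; aggregators write big chunks */
    else
        MPI_File_write_at(fh, off, buf, len, MPI_BYTE, &st);      /* each rank writes its small block on its own */
}
```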