
poll/close error storm after adding just one more to fanout (for me 496) #123

smemsh opened this issue May 13, 2020 · 19 comments

smemsh commented May 13, 2020

some time ago, pdsh became very slow for me when connecting to even just 50-100 machines, running something simple like /bin/true. years ago this would take much less than a second (with persistent ssh connections). now it's taking quite a while, more than ten seconds to do this.

even using -R exec to run echo, the following takes 12 seconds on just 57 hosts; it's not even logging in anywhere, so I'm not sure what it could be doing:

cat << % > testgenders
web[100-150] phy,prod,web,site1
db[100-105] phy,db,site1
%
time -p \
src/pdsh/pdsh -F ./testgenders -R exec -g site1 echo %h |
wc -l

on a dual-core 8th-gen Intel with 16 GB RAM:

57
real 12.86
user 16.66
sys 8.02

built using --without-rsh --without-ssh --with-exec --with-genders. I would normally use ssh with this and only used exec for testing, but I'd expect that to be essentially instant; this used to be very much faster. running it under strace, it takes about 10 minutes:

strace -qcf src/pdsh/pdsh -F ./testgenders -R exec -g site1 \
  echo %h 2>&1 >/dev/null | head

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 87.16 6423.893472    37348217       172           poll
  3.15  232.366846      355845       653        73 futex
  3.15  232.241463   232241463         1         1 rt_sigtimedwait
  3.15  232.206853     4073804        57           wait4
  2.07  152.765258      585307       261           nanosleep
  1.31   96.749412           1  59769314  59763728 close
  0.00    0.037793         331       114           socketpair
  0.00    0.015195          44       343       285 execve

dang, seems to want to close something an awful lot. the fd numbers it's trying to close get out of hand rapidly:

strace -qqf -e close \
src/pdsh/pdsh -F ./testgenders -R exec -g site1 \
echo %h 2>&1 >/dev/null |
grep --line-buffered unfinished |
awk '{
fd = substr($3, index($3, "(") + 1);
if (fd > top) {
  top = fd
  if (fd % 10 == 0) {
    print fd; fflush();
  }
}
}'

10
90
990
9990
99990

thinking this would need a backtrace, I wanted to get just one thread running, so I tried FANOUT=1... but then PROBLEM GONE!!! found the transition point:

 $ time -p FANOUT=495 src/pdsh/pdsh -F ./testgenders -R exec -g site1 echo %h >/dev/null
real 0.04
user 0.06
sys 0.02

 $ time -p FANOUT=496 src/pdsh/pdsh -F ./testgenders -R exec -g site1 echo %h >/dev/null
real 12.72
user 16.61
sys 8.22

ok so is it wrong to have a high fanout? what transition occurs at 496? isn't this number just a limit? shouldn't the fanout work to an arbitrary extent if the machine has enough resources to handle it? maybe we have 10k machines in the cluster and got a big controller to handle them in one batch (our cluster's not that large, but maybe next year ;-)

originally I had tried to find a FANOUT=0 option to mean "as many threads as needed to do all nodes in one batch," but since that doesn't seem possible to specify I had just used FANOUT=9999. that was set innocuously some time ago; apparently it's what has been making pdsh slow, which has been bugging me for a while, but I didn't see the link between the two.

grondo commented May 13, 2020

Nice debugging! No, it isn't wrong to have a large fanout AFAIK, and I have no idea what could be going on here at the moment.

Can you include the output of pdsh -V and what OS you are running it on, and I will see if I can reproduce.

smemsh commented May 13, 2020

it's just master from last night on a mostly updated debian unstable

smemsh commented May 13, 2020

git clone -q https://github.com/chaos/pdsh &&
cd pdsh && (
sh bootstrap &&
sh configure --without-rsh --without-ssh --with-exec --with-genders &&
make -j2
) &>/dev/null

smemsh commented May 13, 2020

libc is from libc6_2.31-0experimental0_amd64.deb

interesting, it does not reproduce on an updated ubuntu16

grondo commented May 13, 2020

Here's what I suspect is happening. In src/common/pipecmd.c the closeall() function fetches the max fd using

int fdlimit = sysconf (_SC_OPEN_MAX);

I wonder if, after creating a certain number of threads, the OS returns a huge number here and the children spend all this time closing fds.

Try this patch as an experiment and see if it helps:

diff --git a/src/common/pipecmd.c b/src/common/pipecmd.c
index 904725c..1ea7f1b 100644
--- a/src/common/pipecmd.c
+++ b/src/common/pipecmd.c
@@ -252,7 +252,7 @@ const char * pipecmd_target (pipecmd_t p)
 
 static void closeall (int fd)
 {
-    int fdlimit = sysconf (_SC_OPEN_MAX);
+    int fdlimit = 1024;
 
     while (fd < fdlimit)
         close (fd++);

(Be sure to do make clean && make after applying the patch. I've had problems with the build recently where pdsh doesn't get remade when libcommon is rebuilt.)

smemsh commented May 13, 2020

I had already suspected that, and confirmed the value is already 1024 on my system. I also tried hardcoding it, but unfortunately that had no effect, so I don't think it is the issue.

grondo commented May 13, 2020

Are you sure that pdsh was rebuilt after the change? (I ran into that when adding the following debugging).

I added a print of the current value of _SC_OPEN_MAX in closeall(); you do seem to have found the magic number:

 grondo@asp:~/git/pdsh$ src/pdsh/pdsh -R exec -f 495 -w foo[0-500] true 2>&1 | tail -1
foo492: sysconf (_SC_OPEN_MAX) = 1024
 grondo@asp:~/git/pdsh$ src/pdsh/pdsh -R exec -f 496 -w foo[0-500] true 2>&1 | tail -1
foo336: sysconf (_SC_OPEN_MAX) = 1048576

smemsh commented May 13, 2020

interesting, it does happen on ubuntu18, just another data point.

I thought I had done a distclean but will reconfirm

smemsh commented May 13, 2020

you are right, I had relied on the make dependencies without a distclean; a forced rebuild and it's fixed. so I had found the cause last night and then chased it a while longer after wrongly concluding that was not it, oh well ;-) FWIW, I had looked into it and think the best way is just to ifdef linux, enumerate the FDs in /proc/self/fd/, and close them that way; it seems ugly but is probably fine, and other platforms likely have a sane limit for that config.

I don't get why the value is high though, since a test program reports 1024 on my system, as does getconf(1)

grondo commented May 13, 2020

I was able to reproduce on an ubuntu 18.04 VM. Setting an upper limit on fdlimit fixed the slowness for me. Probably the "fix" here is to check for a ridiculous return value from sysconf (_SC_OPEN_MAX) and adjust it to something sane. Ensuring FD_CLOEXEC is set on fds opened from within pdsh would also reduce the chance of fds leaking into child processes.
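
For illustration, a minimal sketch of that clamp in closeall() (the 1024 cap is an arbitrary placeholder here, not the actual fix):

static void closeall (int fd)
{
    long fdlimit = sysconf (_SC_OPEN_MAX);

    /* sysconf() can fail (-1) or report the raised hard limit
     * (e.g. 1048576); clamp to something sane so the child does
     * not spend its time closing a million descriptors.
     */
    if (fdlimit < 0 || fdlimit > 1024)
        fdlimit = 1024;

    while (fd < fdlimit)
        close (fd++);
}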

smemsh commented May 13, 2020

so this is the same system that exhibits the issue, and it has 1024; I don't get it:

 $ cat test.c
#include <stdio.h>
#include <unistd.h>
int
main (void)
{
        long max = sysconf (_SC_OPEN_MAX);
        printf ("%ld\n", max);
        return 0;
}

 $ make test
cc     test.c   -o test

 $ ./test
1024

grondo commented May 13, 2020

I don't get why the value is high though, since a test program reports 1024 on my system, as does getconf(1)

Pdsh creates a socketpair for each subcommand it runs. With fanout large enough, more than 1024 fds will be opened at a time and the soft limit will get bumped to the hard limit (ulimit -H -n). On my system the hard limit is 1048576, so that explains the large jump.
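
As an aside (illustrative only, not pdsh code), both limits can be read with getrlimit(2); on glibc the soft limit is what sysconf (_SC_OPEN_MAX) reports:

#include <stdio.h>
#include <sys/resource.h>

int main (void)
{
    struct rlimit rl;

    if (getrlimit (RLIMIT_NOFILE, &rl) != 0)
        return 1;
    /* rlim_cur is the soft limit (e.g. 1024), rlim_max the hard limit
     * (e.g. 1048576) that a process may raise the soft limit up to.
     */
    printf ("soft=%llu hard=%llu\n",
            (unsigned long long) rl.rlim_cur,
            (unsigned long long) rl.rlim_max);
    return 0;
}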

grondo commented May 13, 2020

FWIW, I had looked into it and think the best way is just to ifdef linux, enumerate the FDs in /proc/self/fd/, and close them that way; it seems ugly but is probably fine, and other platforms likely have a sane limit for that config.

Yes, this could work, thanks. If you are able, feel free to submit a PR. If not I will get to it later in the week or perhaps early next week.

BTW, thanks for the great find! (and you would have had the solution too, if not for the broken pdsh build system!)

smemsh commented May 13, 2020

ok, I see now, that makes sense. great! sorry I didn't catch the rebuild issue, that would have saved us both time! I think enumerating /proc/self/fd/ on linux is best, but it's a pain to be doing i/o just to get the fds; it seems silly. probably just use min(1024, foo) on it I guess; I don't see how it could possibly get this large in practice anyway.

smemsh commented May 13, 2020

btw this means it's preallocating according to fanout, even if the actual target list is small, so setting FANOUT=<largenum> is not a good idea. it would be good if it could figure out the target list size before determining the thread count; that way it would limit itself automatically. but I have not looked at the code so maybe that's not realistic. either way it's good enough since I know the issue now... thanks!

grondo commented May 13, 2020

it would be good if it could figure out the target list size before determining the thread count; that way it would limit itself automatically.

Ah, yes, you are right. The nofile limit is increased automatically in src/pdsh/dsh.c: _increase_nofile_limit() using the following calculation:

    int nfds = (2 * opt->fanout) + 32;

This should probably be:

    int nfds = (2 * MIN(opt->fanout, hostlist_count (opt->wcoll))) + 32;
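
That would also explain the exact transition point: with the default soft limit of 1024, a fanout of 495 gives nfds = 2*495 + 32 = 1022, which still fits, while 496 gives 2*496 + 32 = 1024, which presumably is what triggers raising the limit (and the jump in sysconf (_SC_OPEN_MAX) to the hard limit).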

grondo commented May 13, 2020

BTW, in case you take a look at a fix, in another project we used fdwalk from webrtc:

https://github.com/flux-framework/flux-core/blob/master/src/common/libutil/fdwalk.c

I think the license would be compatible to bring it into pdsh as well. (It already has a fallback for non-Linux systems.)

smemsh commented May 13, 2020

is closing fds needed at all if internally opened fds are set close-on-exec? (there's a common fd function to set it on already-opened ones, although I can't find anywhere it's used). naively, could the program maybe just set close-on-exec and re-execute itself on start before doing anything else?
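
for reference, setting it on an already-open descriptor is just an fcntl() call; a hypothetical helper (illustrative name, not the pdsh function I mean above) would be something like:

#include <fcntl.h>

/* set FD_CLOEXEC on an already-open descriptor, preserving other flags */
static int set_cloexec (int fd)
{
    int flags = fcntl (fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl (fd, F_SETFD, flags | FD_CLOEXEC);
}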

smemsh commented May 13, 2020

the fdwalk solution seems perfect; I'm not sure I like using SYS_getdents() directly, but it's already ifdef linux so same difference. I can try to create a PR at some point; the workaround (lower fanout) is sane for the time being. thanks a lot for your help, I've been using pdsh for many years and it's still the best for a lot of use cases!
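
roughly what I have in mind for the /proc/self/fd/ fallback, just as a sketch (names are illustrative; opendir/readdir can allocate, which is presumably why fdwalk uses getdents directly):

#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* close every fd >= lowfd by enumerating /proc/self/fd (Linux only) */
static void closeall_proc (int lowfd)
{
    DIR *dirp = opendir ("/proc/self/fd");
    struct dirent *d;

    if (dirp == NULL)
        return;  /* would fall back to the sysconf() loop here */

    while ((d = readdir (dirp)) != NULL) {
        int fd;
        if (d->d_name[0] == '.')        /* skip "." and ".." */
            continue;
        fd = atoi (d->d_name);
        if (fd >= lowfd && fd != dirfd (dirp))
            close (fd);
    }
    closedir (dirp);
}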
