
When sending "poweroff" command, process hangs #14

Open
amituttam opened this issue Mar 31, 2014 · 22 comments

Comments

@amituttam

I use the following code to send a poweroff command to a remote machine. I catch the resulting exception, since otherwise I get a RunProcessError. I am testing poweroff + wake-on-LAN on a unit repeatedly, so I run this in a loop about 100-200 times. I usually see the process hang around the 30th iteration:

def run_command(cmd, host):
    shell = spur.SshShell(hostname=host, username="test",
            password="test", missing_host_key=spur.ssh.MissingHostKey.accept)

    with shell:
        try:
            result = shell.run(cmd)
            log.info(result.output)
        except spur.results.RunProcessError:
            log.info("Probably a poweroff command")
        else:
            return result.output

for i in range(count):
    log.info("Powering off") # LAST MESSAGE SEEN ON LOG
    run_command(["sudo", "poweroff"], host) # GETS STUCK IN THIS FUNCTION
    time.sleep(60) # wait for unit to shutdown properly

    # Wake on lan and wait
    log.info("Sending wol magic packet")
    wol.send_magic_packet("00:1f:f2:09:43:e5")
    time.sleep(60) # wait for unit to come up
@mwilliamson
Owner

Hello! A few questions that might help me understand the problem a bit better.

What's the actual stack trace that you get when RunProcessError is raised? I'm not sure I understand why the exception is being raised.

You mentioned the process gets stuck in run_command, but that contains a few different calls. At what point in run_command is the process getting stuck? More generally, can you get a stack trace of where the process gets stuck?

What happens if you execute a different command (e.g. echo hello) just before running sudo poweroff? Does it also hang, or does it complete successfully?

What happens if you only run a different command (e.g. echo hello) in the loop, rather than using poweroff and WoL?

@amituttam
Author

Thanks for the quick response. Without the try..except..else block:

...
    with shell:
        result = shell.run(cmd)
        log.info(result.output)
        return result.output

The following exception is raised:

Mon, 31 Mar 2014 10:40:25 - INFO - START: Iteration 0
Mon, 31 Mar 2014 10:40:25 - INFO - Getting systemd-analyze
Mon, 31 Mar 2014 10:40:30 - INFO - Startup finished in 3.825s (kernel) + 16.107s (userspace) = 19.932s
Mon, 31 Mar 2014 10:40:30 - INFO - Powering off
Traceback (most recent call last):
  File "test.py", line 79, in <module>
    run_command(["sudo", "poweroff"], host)
  File "test.py", line 43, in run_command
    result = shell.run(cmd)
  File "/home/amit/projects/zxi-boot-test/local/lib/python2.7/site-packages/spur/ssh.py", line 71, in run
    return self.spawn(*args, **kwargs).wait_for_result()
  File "/home/amit/projects/zxi-boot-test/local/lib/python2.7/site-packages/spur/ssh.py", line 280, in wait_for_result
    self._result = self._generate_result()
  File "/home/amit/projects/zxi-boot-test/local/lib/python2.7/site-packages/spur/ssh.py", line 292, in _generate_result
    stderr_output
  File "/home/amit/projects/zxi-boot-test/local/lib/python2.7/site-packages/spur/results.py", line 9, in result
    raise result.to_error()
spur.results.RunProcessError: return code: -1
output:
stderr output:

I do have a call before powering off (as shown in the above log output) and this basically just calls systemd-analyze and gets the boot time. This command always succeeds.

@mwilliamson
Owner

If you're expecting an error, then using allow_error=True in the call to run when calling sudo poweroff might be a more precise means of achieving the same thing. The return code of -1 suggests that the SSH channel is closed without the server providing a return code.

Given that turning the machine off is likely to break the connection, I'd suggest you'd be better off using spawn rather than run.

If you could provide a stack trace of where the command gets stuck (even if using spawn rather than run fixes your problem), that would be great.
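To make the suggestion above concrete, here is a minimal sketch of the fire-and-forget pattern: spawn the command and deliberately skip wait_for_result(), since poweroff is expected to sever the connection. The function name power_off is illustrative (not part of spur), and the shell argument is assumed to expose spur's SshShell interface:

```python
def power_off(shell, command=("sudo", "poweroff")):
    """Issue a shutdown command without waiting for it to finish.

    `shell` is assumed to provide spur's interface, i.e. a
    spawn(command_list, allow_error=...) method that returns a
    process object. Because poweroff tears down the SSH connection,
    we never call wait_for_result() on the returned process.
    """
    # allow_error=True stops spur from raising RunProcessError when
    # the channel closes without a proper exit status (return code -1).
    return shell.spawn(list(command), allow_error=True)
```

The caller can simply ignore the returned process object and sleep until the machine is down.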

@amituttam
Author

Thanks. I will try those suggestions out.

Also, how do I provide the stack trace when the command is stuck? Do I attach to the process using pdb/gdb?

@mwilliamson
Owner

Yup, pdb/gdb is probably your best bet.

@amituttam
Author

Process is currently hung, using gdb to attach to the process:

Note: this is with shell.run(cmd, allow_error=True)

$ sudo gdb python 15357
Reading symbols from /usr/bin/python2.7...Reading symbols from /usr/lib/debug/usr/bin/python2.7...done.
done.
Attaching to program: /usr/bin/python, process 15357
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib/x86_64-linux-gnu/ld-2.18.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff30d24000
0x00007fe7a2506000 in ?? ()
(gdb) bt
#0  0x00007fe7a2506000 in ?? ()
#1  0x0000000001c30b20 in ?? ()
#2  0x000000000054590e in type_traverse.25017 (type=0x14fd0a0, visit=0x1, arg=0x7fe797254170) at ../Objects/typeobject.c:2712
#3  0x0000000000000003 in ?? ()
#4  0x00007fe788003aa0 in ?? ()
#5  0x00000000014fd0a0 in ?? ()
#6  0x00007fe7a28bc050 in ?? ()
#7  0x00007fe7a2892b36 in ?? ()
#8  0x00007fe7971f3128 in ?? ()
#9  0x000000000051f18d in BaseException_init.12117 (self=0x7fe79721f530, args=<optimized out>, kwds=<optimized out>) at ../Objects/exceptions.c:65
#10 0x00007fe7a2892ad4 in ?? ()
#11 0x00007fe788003a58 in ?? ()
#12 0x00000000014fd0a0 in ?? ()
#13 0x00007fe7a2752558 in ?? ()
#14 0x0000000000000000 in ?? ()
(gdb) py-bt
(gdb)

The Python debug symbols are there, but I'm not sure what the system-supplied DSO at 0x7fff30d24000 means.

I will leave the process at this state, so if you need me to run any other debugging commands I can run them.

@mwilliamson
Owner

Hmm, not sure why py-bt isn't giving any useful output. If you can work out why that isn't giving any output then that would be great, but I'm afraid I probably can't provide much help beyond what's on the Python wiki and whatnot.

@mwilliamson
Owner

Oh, and I forgot to ask: did using spawn instead of run help at all?

@amituttam
Author

With allow_error=True, I did not need the try/except block and the test ran for 57 iterations before it hung. With spawn, I was able to run the test for 76 iterations before it hung.


@mwilliamson
Owner

Did you have any luck finding out where the program was hanging?

@amituttam
Author

Haven't tried it out since our last conversation. However, I'm working on it in the next couple of days, so hopefully I will have better luck.


@amituttam
Author

Running gdb on a virtualenv script doesn't give much info, so I ran it outside the virtualenv; here is the backtrace.

Looks like it is blocked on a lock.

(gdb) py-bt
#4 Frame 0x7f7618005150, for file /usr/lib/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f7625eb6fd0>, acquire=<built-in method acquire of thread.lock object at remote 0x7f7625eb6fd0>, _Condition__waiters=[<thread.lock at remote 0x7f7625eb6f70>], release=<built-in method release of thread.lock object at remote 0x7f7625eb6fd0>) at remote 0x7f761c666c10>, timeout=None, waiter=<thread.lock at remote 0x7f7625eb6f70>, saved_state=None)
waiter.acquire()
#8 Frame 0x7f761c642b90, for file /usr/lib/python2.7/threading.py, line 949, in join (self=<Thread(_Thread__ident=140145185519360, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f7625eb6fd0>, acquire=<built-in method acquire of thread.lock object at remote 0x7f7625eb6fd0>, _Condition__waiters=[<thread.lock at remote 0x7f7625eb6f70>], release=<built-in method release of thread.lock object at remote 0x7f7625eb6fd0>) at remote 0x7f761c666c10>, _Thread__name='Thread-35', _Thread__daemonic=True, _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f7625eb6e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7f7625eb6e10>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f7625eb6e10>) at remote 0x7f761c666f10>) at remote 0x7f761c666150>, _Thread__stderr=<file at remote 0x7f76297791e0>, _Thread__target=...(truncated)
self.__block.wait()
#12 Frame 0x7f761c63dc20, for file /usr/local/lib/python2.7/dist-packages/spur/io.py, line 27, in wait (self=<OutputHandler(_file_in=<ChannelFile(_wbuffer=<cStringIO.StringO at remote 0x7f76208aec00>, _closed=False, _bufsize=8192, _pos=2, _at_trailing_cr=False, _size=0, _rbuffer='', _flags=17, _realpos=2, channel=<Channel(origin_addr=None, in_window_size=65536, in_window_threshold=6553, lock=<thread.lock at remote 0x7f7625eb6c10>, _pipe=None, eof_received=0, in_max_packet_size=34816, out_buffer_cv=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f7625eb6c10>, acquire=<built-in method acquire of thread.lock object at remote 0x7f7625eb6c10>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f7625eb6c10>) at remote 0x7f761c666fd0>, event=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f7625eb6db0>, acquire=<built-in method acquire of thread.lock obj...(truncated)
self._thread.join()
#15 Frame 0x7f761c63cb00, for file /usr/local/lib/python2.7/dist-packages/spur/io.py, line 13, in wait (self=<IoHandler(_handlers=[<OutputHandler(_file_in=<ChannelFile(_wbuffer=<cStringIO.StringO at remote 0x7f76208aec00>, _closed=False, _bufsize=8192, _pos=2, _at_trailing_cr=False, _size=0, _rbuffer='', _flags=17, _realpos=2, channel=<Channel(origin_addr=None, in_window_size=65536, in_window_threshold=6553, lock=<thread.lock at remote 0x7f7625eb6c10>, _pipe=None, eof_received=0, in_max_packet_size=34816, out_buffer_cv=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f7625eb6c10>, acquire=<built-in method acquire of thread.lock object at remote 0x7f7625eb6c10>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f7625eb6c10>) at remote 0x7f761c666fd0>, event=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f7625eb6db0>, acquire=<built-in method acqu...(truncated)
return [handler.wait() for handler in self._handlers]
#18 Frame 0x7f761c647bc0, for file /usr/local/lib/python2.7/dist-packages/spur/ssh.py, line 296, in _generate_result (self=<SshProcess(_result=None, _shell=<SshShell(_username='logicube', _missing_host_key=<AcceptParamikoPolicy at remote 0x7f7625eb2b10>, _closed=False, _port=22, _private_key_file=None, _hostname='zxi-unknown.local', _password='logicube', _connect_timeout=None, _client=<SSHClient(_system_host_keys=<HostKeys(_entries=[<HostKeyEntry(valid=True, key=<RSAKey(e=35L, d=None, n=26179945252132621254728311942885731986439322927124596014323469999177076743909390585749170925978690963581104740549754193713396793634074244903905366003332324566504240057953230873512865614821458742307475940534842440913631105522892387118122966592114183884097399337584202018451068108942964573508320162567273885802569302673293933420244371016883295514006976951501827766551645845558633636308475857029578719329342350669719046258432504811818481529023883819157606925293593972677109635645033739869089186796077238532933760846266209682128986667053415...(truncated)
output, stderr_output = self._io.wait()
#22 Frame 0x7f761c64c050, for file /usr/local/lib/python2.7/dist-packages/spur/ssh.py, line 291, in wait_for_result (self=<SshProcess(_result=None, _shell=<SshShell(_username='logicube', _missing_host_key=<AcceptParamikoPolicy at remote 0x7f7625eb2b10>, _closed=False, _port=22, _private_key_file=None, _hostname='zxi-unknown.local', _password='logicube', _connect_timeout=None, _client=<SSHClient(_system_host_keys=<HostKeys(_entries=[<HostKeyEntry(valid=True, key=<RSAKey(e=35L, d=None, n=261799452521326212547283119428857319864393229271245960143234699991770767439093905857491709259786909635811047405497541937133967936340742449039053660033323245665042400579532308735128656148214587423074759405348424409136311055228923871181229665921141838840973993375842020184510681089429645735083201625672738858025693026732939334202443710168832955140069769515018277665516458455586336363084758570295787193293423506697190462584325048118184815290238838191576069252935939726771096356450337398690891867960772385329337608462662096821289866670534154...(truncated)
self._result = self._generate_result()
#26 Frame 0x2c08200, for file test.py, line 44, in run_command (cmd=['sudo', 'poweroff'], host='zxi-unknown.local', shell=<SshShell(_username='logicube', _missing_host_key=<AcceptParamikoPolicy at remote 0x7f7625eb2b10>, _closed=False, _port=22, _private_key_file=None, _hostname='zxi-unknown.local', _password='logicube', _connect_timeout=None, _client=<SSHClient(_system_host_keys=<HostKeys(_entries=[<HostKeyEntry(valid=True, key=<RSAKey(e=35L, d=None, n=261799452521326212547283119428857319864393229271245960143234699991770767439093905857491709259786909635811047405497541937133967936340742449039053660033323245665042400579532308735128656148214587423074759405348424409136311055228923871181229665921141838840973993375842020184510681089429645735083201625672738858025693026732939334202443710168832955140069769515018277665516458455586336363084758570295787193293423506697190462584325048118184815290238838191576069252935939726771096356450337398690891867960772385329337608462662096821289866670534154970348553037435780702858615703882...(truncated)
Python Exception <type 'exceptions.IOError'> (2, 'No such file or directory', 'test.py'):
Error occurred in Python command: (2, 'No such file or directory', 'test.py')
(gdb) info threads
Id Target Id Frame
5 Thread 0x7f761dfa7700 (LWP 10807) "python" 0x00007f76287a8c33 in select () at ../sysdeps/unix/syscall-template.S:81
4 Thread 0x7f761ce78700 (LWP 8064) "python" 0x00007f76287a472d in poll () at ../sysdeps/unix/syscall-template.S:81
3 Thread 0x7f7617fff700 (LWP 8088) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
2 Thread 0x7f76177fe700 (LWP 8089) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

* 1 Thread 0x7f7629798700 (LWP 10789) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
(gdb) py-list
 334            waiter.acquire()
 335            self.__waiters.append(waiter)
 336            saved_state = self._release_save()
 337            try:    # restore state no matter what (e.g., KeyboardInterrupt)
 338                if timeout is None:
>339                    waiter.acquire()
 340                    if __debug__:
 341                        self._note("%s.wait(): got it", self)
 342                else:
 343                    # Balancing act:  We can't afford a pure busy loop, so we
 344                    # have to sleep; but if we sleep the whole timeout time,


@mwilliamson
Owner

Could you get a backtrace for the other threads as well?

It looks like it's stuck waiting for the process to terminate, but I'm not quite sure why it's doing that if you're using spawn rather than run, unless you're also calling wait_for_result() on the returned process? It seems to me that you just want to use spawn, and ignore the result.

@amituttam
Author

Yes, I'm calling wait_for_result(). I have the following function:

def run_command(cmd, host):
    shell = spur.SshShell(hostname=host, username="test",
            password="pass", missing_host_key=spur.ssh.MissingHostKey.accept)

    with shell:
        proc = shell.spawn(cmd, allow_error=True)
        result = proc.wait_for_result()
        log.info(result.output)
        return result.output

I call run_command with various commands and then finally with ['sudo', 'poweroff']. The script runs a loop 100 times, and sometimes calling poweroff works until about the 60th or 70th iteration, so it seems the deadlock doesn't happen every time.

However, I can create a different function just for the poweroff without
waiting for the result if that is the correct way to do it.
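If the intermittent hang can't be eliminated, one way to keep the test loop moving is to bound the wait. This is a sketch of a hypothetical wait_with_timeout helper (not part of spur): it calls wait_for_result() on a daemon worker thread and joins with a timeout, so a wedged SSH channel costs at most `timeout` seconds instead of stalling the whole run:

```python
import threading

def wait_with_timeout(proc, timeout):
    """Call proc.wait_for_result(), giving up after `timeout` seconds.

    Returns the result, re-raises any exception raised inside
    wait_for_result(), or returns None on timeout. The worker is a
    daemon thread, so a permanently stuck wait cannot keep the
    interpreter alive at exit.
    """
    outcome = {}

    def worker():
        try:
            outcome["result"] = proc.wait_for_result()
        except Exception as exc:  # e.g. spur.results.RunProcessError
            outcome["error"] = exc

    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
    t.join(timeout)
    if t.is_alive():
        return None  # still waiting: treat as a hang and move on
    if "error" in outcome:
        raise outcome["error"]
    return outcome["result"]
```

Note the abandoned thread still holds its SSH resources until the process exits, so this is a workaround for a test harness, not a fix for the underlying deadlock.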


@mwilliamson
Owner

However, I can create a different function just for the poweroff without waiting for the result if that is the correct way to do it.

That would seem to me to be the most appropriate method. Using spawn and wait_for_result doesn't get you anywhere over using run, since that's exactly how run is implemented.

Backtraces for the other threads would still be handy in working out what the issue is, though.

@amituttam
Author

Backtraces for the other threads would still be handy in working out what the issue though.

(gdb) info threads
Id Target Id Frame
5 Thread 0x7f4f592d4700 (LWP 7760) "python" 0x00007f4f63ad5c33 in select () at ../sysdeps/unix/syscall-template.S:81
4 Thread 0x7f4f57964700 (LWP 30772) "python" 0x00007f4f63ad172d in poll () at ../sysdeps/unix/syscall-template.S:81
3 Thread 0x7f4f581a5700 (LWP 30779) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
2 Thread 0x7f4f57163700 (LWP 30780) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

* 1 Thread 0x7f4f64ac5700 (LWP 7757) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

Is this the backtrace you were talking about? I will leave the process
in this state so if you want more backtraces, I can get them to you.

@mwilliamson
Owner

py-bt shows the backtrace for the current thread, so to see the backtraces for all threads, you can run t a a py-bt (short for thread apply all py-bt)
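For anyone following along, the same capture can be done non-interactively with gdb's batch mode (assuming the python-gdb.py extensions load as in the sessions above; $PID stands for the hung interpreter's pid):

```shell
# Attach to the hung process, dump Python-level backtraces for every
# thread in one shot, then detach without killing the process.
sudo gdb -p "$PID" -batch \
    -ex "thread apply all py-bt" \
    -ex detach
```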

@amituttam
Author

Output of 't a a py-bt':

http://paste.debian.net/plain/96602

@mwilliamson
Owner

Thanks! I'll take a look when I get a chance, but hopefully just using spawn will fix your specific issue.

@amituttam
Author

Thanks, will do that. Also, you might want to save the backtrace; I think the paste on paste.debian.net expires in 3 days.


@mwilliamson
Owner

Thanks for letting me know, I've slapped it in a gist:

https://gist.github.com/mwilliamson/2cdf38bee1fa68f0800e

@amituttam
Author

Thanks!

