Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix certbot process getting stuck in dstack-proxy #2143

Merged
merged 1 commit into from
Dec 26, 2024

Conversation

jvstme
Copy link
Collaborator

@jvstme jvstme commented Dec 25, 2024

Problem: if obtaining a certificate times out, subprocess.run cancels the certbot command with
SIGKILL. However, the command is run with sudo, so SIGKILL actually kills the sudo process, while its child certbot process is adopted by init and keeps running indefinitely. This breaks adding or
updating services on the gateway, since certbot
refuses to run if another certbot process exists.

Attempts to implement graceful cancelling by
sending SIGTERM to the sudo process were not
successful - for some reason sudo ignores SIGTERM
originating from its parent.

The solution in this commit uses the timeout
shell command to set the timeout on the actual
certbot process instead of the sudo process.

#1595

**Problem**: if obtaining a certificate times out,
`subprocess.run` cancels the certbot command with
SIGKILL. However, the command is run with sudo, so
SIGKILL actually kills the sudo process, while its
child certbot process is adopted by init and keeps
running indefinitely. This breaks adding or
updating services on the gateway, since certbot
refuses to run if another certbot process exists.

Attempts to implement graceful cancelling by
sending SIGTERM to the sudo process were not
successful - for some reason sudo ignores SIGTERM
originating from its parent.

The solution in this commit uses the `timeout`
shell command to set the timeout on the actual
certbot process instead of the sudo process.
@jvstme jvstme requested a review from un-def December 25, 2024 21:46
@un-def
Copy link
Collaborator

un-def commented Dec 26, 2024

I think the core reason why subprocess.run() won't terminate the whole process subtree is that on timeouts it calls Popen.kill(), which, as the name implies, sends SIGKILL, and sudo has no chance to propagate the signal to its children: https://github.com/python/cpython/blob/067145177975eadd61a0c907d0d177f7b6a5a3de/Lib/subprocess.py#L554-L558

man 8 sudo (emphasis added):

Signal handling

When the command is run as a child of the sudo process, sudo will relay signals it receives to the command. [...] Some signals, such as SIGSTOP and SIGKILL, cannot be caught and thus will not be relayed to the command. [...]

Popen() + communiate() + terminate() instead of the wrapper (run()) should fix the issue without external tools, I believe, but I'm not insisting that is should be fixed in that way, just providing some context; the current solution is fine.

@jvstme
Copy link
Collaborator Author

jvstme commented Dec 26, 2024

I tried it first, but it didn't work for some reason.

Attempts to implement graceful cancelling by
sending SIGTERM to the sudo process were not
successful - for some reason sudo ignores SIGTERM
originating from its parent.

I can terminate the sudo process manually by sending SIGTERM from a shell, but when the proxy application was trying to do the same thing with terminate() nothing would happen.

@jvstme jvstme merged commit 3c678b0 into master Dec 26, 2024
24 checks passed
@jvstme jvstme deleted the issue_1595_fix_certbot_stuck branch December 26, 2024 08:59
pranitnaik43 pushed a commit to bahaal-tech/dstack that referenced this pull request Feb 9, 2025
**Problem**: if obtaining a certificate times out,
`subprocess.run` cancels the certbot command with
SIGKILL. However, the command is run with sudo, so
SIGKILL actually kills the sudo process, while its
child certbot process is adopted by init and keeps
running indefinitely. This breaks adding or
updating services on the gateway, since certbot
refuses to run if another certbot process exists.

Attempts to implement graceful cancelling by
sending SIGTERM to the sudo process were not
successful - for some reason sudo ignores SIGTERM
originating from its parent.

The solution in this commit uses the `timeout`
shell command to set the timeout on the actual
certbot process instead of the sudo process.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants