-
Notifications
You must be signed in to change notification settings - Fork 129
blockdev: implement signal-safe loopback device cleanup helper #1449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
blockdev: implement signal-safe loopback device cleanup helper #1449
Conversation
Add fork+exec based cleanup helper to prevent loopback device leaks when bootc install --via-loopback is interrupted by signals like SIGINT. - Add loopback-cleanup-helper CLI subcommand - Implement run_loopback_cleanup_helper() with PR_SET_PDEATHSIG - Update LoopbackDevice to spawn cleanup helper process - Add tests for spawn mechanism
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a signal-safe cleanup mechanism for loopback devices, which is a solid improvement for robustness. It uses a helper process that listens for parent death and cleans up resources. The implementation is mostly sound, with a robust strategy for finding the helper binary. I've made a couple of suggestions to improve logging in the helper process for better debuggability and to clarify a comment about signal handling.
4571471
to
7074ca6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I like to in situations like this to aid testing is to add an "internals" wrapper that directly calls this code. See an example of this in b696395
So we could add e.g. bootc internals allocate-cleanup-loopback
or something which would make it easy to exercise the code without doing a full install run.
]) | ||
.arg(path) | ||
.run_get_string()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the AI tooling doesn't know about our helper run_get_string()
which handles the .success()
and error printing.
IOW this code was probably fine as is, can we just keep it that way please?
/// Find the bootc binary using multiple strategies | ||
fn find_bootc_binary() -> Result<PathBuf> { | ||
// Strategy 1: Try /proc/self/exe (works in most cases) | ||
if let Ok(exe_path) = std::fs::read_link("/proc/self/exe") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so this is trying to handle the original failure...but it's really unclear to me why it was failing and specifically how that test case run is different than the existing loopback tests we have.
(original issue in #1439 )
As far as I know /proc/self/exe
should really always work; we're using it in reexec_with_guardenv
.
I think what would be good here is that if read_link
fails, let's show what value it points to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm random idea as to the root cause - what if this is happening just when we need to re-exec the current process twice? And inside the container env?
I'm starting to suspect it has to do with that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Def could be. Adding some changes that handles the case where /proc/self/exe
fails (which could happen during re-execution in container environments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes it's this almost for sure
Line 91 in f4b01ab
// OK now, we always copy our binary to a tempfile, set its security context |
In some cases we're hitting that fallback, but not others.
So...this is getting a bit complicated. Your fallback code here will definitely work in general....but I have some ideas to do something more sophisticated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more complex fix I think is to make an new internal "out of process drop helper" crate that we initialize before we do any re-exec dances. I.e. we fork/re-exec that helper and talk to it over IPC or so. It would be more involved though.
// Shared backend for our `close` and `drop` implementations. | ||
fn impl_close(&mut self) -> Result<()> { | ||
// SAFETY: This is the only place we take the option | ||
let Some(dev) = self.dev.take() else { | ||
tracing::trace!("loopback device already deallocated"); | ||
return Ok(()); | ||
}; | ||
|
||
// Kill the cleanup helper since we're cleaning up normally |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. We could also send it SIGTERM
and let it do the cleanup. Feels like it's perhaps better that way as then we're always exercising the cleanup via child process.
crates/blockdev/src/blockdev.rs
Outdated
@@ -194,6 +194,10 @@ struct LoopbackCleanupHandle { | |||
impl LoopbackDevice { | |||
// Create a new loopback block device targeting the provided file path. | |||
pub fn new(path: &Path) -> Result<Self> { | |||
let path = Utf8PathBuf::from_path_buf(path.to_path_buf()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we wanted to impose this requirement it'd be cleaner to make the API just take a Utf8Path
instead. But I don't think we need the requirement so I think we should drop this change.
crates/blockdev/src/blockdev.rs
Outdated
.args([ | ||
"--show", | ||
format!("--direct-io={direct_io}").as_str(), | ||
"-P", | ||
"--find", | ||
path.as_str(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The utf8 thing probably came from this, but I don't think the change is necessary.
Ok(_) => { | ||
tracing::error!("Failed to clean up loopback device {}", device_path); | ||
Ok(output) => { | ||
let stderr = String::from_utf8_lossy(&output.stderr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This child process has its stdout/stderr going to /dev/null by default so all of this is just lost unfortunately.
- Add robust binary path resolution with multiple fallback strategies - Use /proc/self/exe, argv[0], common paths, and PATH search - Add graceful fallback when cleanup helper can't be spawned - Improve error handling and logging - Add comprehensive tests for binary finding logic This fixes the 'Failed to spawn loopback cleanup helper' error that was causing issues in packaged distributions where the binary path was not easily discoverable.
7074ca6
to
ff004c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK let's run with this, we can do the drop helper crate as a followup!
One thing that you could do though is (until #1359 lands) is build and push an image from this PR and then run the e2e test from https://gitlab.com/fedora/bootc/tests/bootc-workflow-test on it
Adds a signal-safe loopback device cleanup mechanism for
bootc install --via-loopback
to prevent device leaks on interruption (e.g., SIGINT). It introduces aloopback-cleanup-helper
subcommand and a robust helper binary path resolution strategy to ensure reliable spawning across environments. The update improves error handling, logging, and includes comprehensive tests.Fixes: #799