Skip to content

blockdev: implement signal-safe loopback device cleanup helper #1449

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 28, 2025

Conversation

gursewak1997
Copy link
Contributor

@gursewak1997 gursewak1997 commented Jul 24, 2025

Adds a signal-safe loopback device cleanup mechanism for bootc install --via-loopback to prevent device leaks on interruption (e.g., SIGINT). It introduces a loopback-cleanup-helper subcommand and a robust helper binary path resolution strategy to ensure reliable spawning across environments. The update improves error handling, logging, and includes comprehensive tests.
Fixes: #799

Add fork+exec based cleanup helper to prevent loopback device leaks when
bootc install --via-loopback is interrupted by signals like SIGINT.

- Add loopback-cleanup-helper CLI subcommand
- Implement run_loopback_cleanup_helper() with PR_SET_PDEATHSIG
- Update LoopbackDevice to spawn cleanup helper process
- Add tests for spawn mechanism
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a signal-safe cleanup mechanism for loopback devices, which is a solid improvement for robustness. It uses a helper process that listens for parent death and cleans up resources. The implementation is mostly sound, with a robust strategy for finding the helper binary. I've made a couple of suggestions to improve logging in the helper process for better debuggability and to clarify a comment about signal handling.

@gursewak1997 gursewak1997 force-pushed the fix-loopback-cleanup-helper branch from 4571471 to 7074ca6 Compare July 24, 2025 20:36
Copy link
Collaborator

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I like to in situations like this to aid testing is to add an "internals" wrapper that directly calls this code. See an example of this in b696395

So we could add e.g. bootc internals allocate-cleanup-loopback or something which would make it easy to exercise the code without doing a full install run.

])
.arg(path)
.run_get_string()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the AI tooling doesn't know about our helper run_get_string() which handles the .success() and error printing.

IOW this code was probably fine as is, can we just keep it that way please?

/// Find the bootc binary using multiple strategies
fn find_bootc_binary() -> Result<PathBuf> {
// Strategy 1: Try /proc/self/exe (works in most cases)
if let Ok(exe_path) = std::fs::read_link("/proc/self/exe") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so this is trying to handle the original failure...but it's really unclear to me why it was failing and specifically how that test case run is different than the existing loopback tests we have.

(original issue in #1439 )

As far as I know /proc/self/exe should really always work; we're using it in reexec_with_guardenv.

I think what would be good here is that if read_link fails, let's show what value it points to.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm random idea as to the root cause - what if this is happening just when we need to re-exec the current process twice? And inside the container env?

I'm starting to suspect it has to do with that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Def could be. Adding some changes that handles the case where /proc/self/exe fails (which could happen during re-execution in container environments)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes it's this almost for sure

// OK now, we always copy our binary to a tempfile, set its security context

In some cases we're hitting that fallback, but not others.

So...this is getting a bit complicated. Your fallback code here will definitely work in general....but I have some ideas to do something more sophisticated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more complex fix I think is to make an new internal "out of process drop helper" crate that we initialize before we do any re-exec dances. I.e. we fork/re-exec that helper and talk to it over IPC or so. It would be more involved though.

// Shared backend for our `close` and `drop` implementations.
fn impl_close(&mut self) -> Result<()> {
// SAFETY: This is the only place we take the option
let Some(dev) = self.dev.take() else {
tracing::trace!("loopback device already deallocated");
return Ok(());
};

// Kill the cleanup helper since we're cleaning up normally
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. We could also send it SIGTERM and let it do the cleanup. Feels like it's perhaps better that way as then we're always exercising the cleanup via child process.

@@ -194,6 +194,10 @@ struct LoopbackCleanupHandle {
impl LoopbackDevice {
// Create a new loopback block device targeting the provided file path.
pub fn new(path: &Path) -> Result<Self> {
let path = Utf8PathBuf::from_path_buf(path.to_path_buf())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to impose this requirement it'd be cleaner to make the API just take a Utf8Path instead. But I don't think we need the requirement so I think we should drop this change.

.args([
"--show",
format!("--direct-io={direct_io}").as_str(),
"-P",
"--find",
path.as_str(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The utf8 thing probably came from this, but I don't think the change is necessary.

Ok(_) => {
tracing::error!("Failed to clean up loopback device {}", device_path);
Ok(output) => {
let stderr = String::from_utf8_lossy(&output.stderr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This child process has its stdout/stderr going to /dev/null by default so all of this is just lost unfortunately.

- Add robust binary path resolution with multiple fallback strategies
- Use /proc/self/exe, argv[0], common paths, and PATH search
- Add graceful fallback when cleanup helper can't be spawned
- Improve error handling and logging
- Add comprehensive tests for binary finding logic

This fixes the 'Failed to spawn loopback cleanup helper' error that
was causing issues in packaged distributions where the binary path
was not easily discoverable.
@gursewak1997 gursewak1997 force-pushed the fix-loopback-cleanup-helper branch from 7074ca6 to ff004c9 Compare July 25, 2025 17:37
@gursewak1997 gursewak1997 marked this pull request as ready for review July 28, 2025 19:06
Copy link
Collaborator

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK let's run with this, we can do the drop helper crate as a followup!

One thing that you could do though is (until #1359 lands) is build and push an image from this PR and then run the e2e test from https://gitlab.com/fedora/bootc/tests/bootc-workflow-test on it

@cgwalters cgwalters merged commit b803ea7 into bootc-dev:main Jul 28, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

drop loopback out-of-process
2 participants