
loopback: fix race condition opening loopback device #2039

Merged

Conversation

giuseppe
Member

The loopback device file could already be in use, or removed, by another process. Since the operation is inherently racy, just grab the next available index and try again until it succeeds.

Closes: #2038

Move the stat call to after the file has been opened, so it is less
vulnerable to the file being removed between the stat and the open
syscalls.

Signed-off-by: Giuseppe Scrivano <[email protected]>
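
A minimal Go sketch of the approach the commit message describes, added here for orientation: the helper names, signatures, and error handling are illustrative assumptions, not the literal diff in pkg/loopback/attach_loopback.go.

package loopback

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

// getNextFreeLoopbackIndex asks the kernel for a free loop index via
// /dev/loop-control (LOOP_CTL_GET_FREE).
func getNextFreeLoopbackIndex() (int, error) {
    f, err := os.OpenFile("/dev/loop-control", os.O_RDONLY, 0o644)
    if err != nil {
        return 0, err
    }
    defer f.Close()
    return unix.IoctlRetInt(int(f.Fd()), unix.LOOP_CTL_GET_FREE)
}

// openNextAvailableLoopback asks for a free index, tries to open that device,
// and retries with a new index whenever another process wins the race.
func openNextAvailableLoopback() (*os.File, error) {
    for {
        index, err := getNextFreeLoopbackIndex()
        if err != nil {
            return nil, err
        }
        loopFile, err := os.OpenFile(fmt.Sprintf("/dev/loop%d", index), os.O_RDWR, 0o644)
        if err != nil {
            if os.IsNotExist(err) {
                // The device node vanished between the lookup and the open;
                // grab the next free index and try again.
                continue
            }
            return nil, err
        }
        // Stat only after the open succeeded, so removal of the file between
        // a stat and the open can no longer trip us up.
        st, err := loopFile.Stat()
        if err != nil || st.Mode()&os.ModeDevice == 0 {
            loopFile.Close()
            continue
        }
        return loopFile, nil
    }
}

Attaching the backing file (LOOP_SET_FD) would follow on the successfully opened device; it is omitted from the sketch for brevity.
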
@rhatdan
Member

rhatdan commented Jul 23, 2024

LGTM
@nalind @mtrmac @saschagrunert PTAL

@rhatdan
Member

rhatdan commented Jul 23, 2024

@edsantiago PTAL

Member

@edsantiago edsantiago left a comment

This is not my area of expertise, but I would say that this entire function needs a huge step back and refactoring. For instance, attachLoopDevice() calls getNextFreeLoopbackIndex(), and I don't think that's necessary: it should be done in openNextAvailableLoopback() itself, no? And lose the index arg? And the index++ therein? And possibly a lot more cleanup.

Also... again, the commit sequence is hard to follow.

@cgwalters
Contributor

A few notes:

  • If we do this then it's likely that composefs (which also does the same thing as we do here, but we don't link to libcomposefs I guess because linking Go and C is painful?) needs the same changes
  • systemd has a lot of similar logic https://github.com/systemd/systemd/blob/ae2bf016bc946f47dd8c027ccf76d19143e2e9c5/src/shared/loop-util.c#L528 - one thing it does (and documents well) is keeping a flock() around the loop control device (see the sketch after this list)
  • I think we should keep retrying with getNextFreeLoopbackIndex() and not just calling that once and thereafter incrementing the index, right?
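
A rough Go illustration of the flock()-on-/dev/loop-control idea from the second point above, with assumed names; it is not code from this repository or from systemd, and it only serializes processes that cooperate by taking the same lock.

package loopback

import (
    "os"

    "golang.org/x/sys/unix"
)

// allocateLoopIndexLocked looks up a free loop index while holding an
// exclusive BSD lock on /dev/loop-control, roughly the pattern documented
// in systemd's loop-util.c.
func allocateLoopIndexLocked() (int, error) {
    ctl, err := os.OpenFile("/dev/loop-control", os.O_RDWR, 0o644)
    if err != nil {
        return 0, err
    }
    defer ctl.Close()

    // Serialize the lookup against other cooperating processes; the lock is
    // dropped when ctl is closed, but release it explicitly as well.
    if err := unix.Flock(int(ctl.Fd()), unix.LOCK_EX); err != nil {
        return 0, err
    }
    defer unix.Flock(int(ctl.Fd()), unix.LOCK_UN)

    return unix.IoctlRetInt(int(ctl.Fd()), unix.LOOP_CTL_GET_FREE)
}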

@cgwalters
Contributor

(Also xref containers/composefs#144 for the larger better fix)

@edsantiago
Member

I think we should keep retrying with getNextFreeLoopbackIndex() and not just calling that once and thereafter incrementing the index, right?

The index++ is now a NOP, and super misleading, hence my earlier comment about a nice fresh cleanup

@giuseppe
Member Author

I agree the module requires a refactoring; the error handling is quite unusual: everything is logged to stderr and it just reports ErrAttachLoopbackDevice. But I don't think that must be done as part of this PR.

The goal of this PR is to address a specific bug, and the diff (except the new test) is very small, so it can be easily backported if needed.

Also... again, the commit sequence is hard to follow

what exactly should be changed?

The index++ is now a NOP, and super misleading, hence my earlier comment about a nice fresh cleanup

how is that a NOP? It is part of a for loop. The index, err = getNextFreeLoopbackIndex() is done only on the error path.

  • I think we should keep retrying with getNextFreeLoopbackIndex() and not just calling that once and thereafter incrementing the index, right?

I am fine with changing that, but it is still racy (another process can use that device) and it is slightly slower, as we need an additional syscall.

@edsantiago
Member

index++ is a NOP because index is never again referenced after that point. It is redefined on error, but never read.

As for what should be changed, openNextAvailableLoopback() should be rewritten as

func openNextAvailableLoopback(sparseName string, sparseFile ...) {
    index := getNextFreeLoopbackIndex()

There is, AFAICT, no purpose whatsoever to attachLoopDevice() calling getNext.... It does not do anything with startIndex other than pass it as an arg to the function that needs it. As a reader, I find it unnecessary and confusing.

Perhaps I've missed something, though?

@edsantiago
Member

Never mind my index++ comment, I missed the continues in various other cases. I agree with @cgwalters, the code should probably getNext on each iteration even if it's slow.

@giuseppe giuseppe force-pushed the fix-loopback-race-condition branch from 9879117 to 6e2eba6 on July 23, 2024 14:47
@giuseppe
Member Author

changed to always use getNextFreeLoopbackIndex()

@cgwalters
Contributor

/approve

I like this one better, thanks. I do think there's more cleanup to do here but it can come as a followup.

Contributor

openshift-ci bot commented Jul 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Collaborator

@mtrmac mtrmac left a comment

ACK overall.

pkg/loopback/attach_loopback.go
The loopback device file could already be in use, or removed, by another
process.  Since the operation is inherently racy, just grab the next
available index and try again until it succeeds.

Closes: containers#2038

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe giuseppe force-pushed the fix-loopback-race-condition branch from 6e2eba6 to 998e6d4 on July 23, 2024 17:43
@mtrmac
Collaborator

mtrmac commented Jul 23, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 23, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 10cff2a into containers:main Jul 23, 2024
18 checks passed
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Aug 29, 2024
It actually has been around for years: For containers and other sandbox
use cases, there will be thousands (and even more) of authenticated
(sub)images running on the same host, unlike OS images.

Of course, all scenarios can use the same EROFS on-disk format, but
bdev-backed mounts just work well for OS images since golden data is
dumped into real block devices.  However, it's somewhat hard for
container runtimes to manage and isolate so many unnecessary virtual
block devices safely and efficiently [1]: they just look like a burden
to orchestrators and file-backed mounts are preferred indeed.  There
were already enough attempts such as Incremental FS, the original
ComposeFS and PuzzleFS acting in the same way for immutable fses.  As
for current EROFS users, ComposeFS, containerd and Android APEXs will
directly benefit from it.

On the other hand, previous experimental feature "erofs over fscache"
was once also intended to provide a similar solution (inspired by
Incremental FS discussion [2]), but the following facts show file-backed
mounts will be a better approach:
 - Fscache infrastructure has recently been moved into new Netfslib
   which is an unexpected dependency to EROFS really, although it
   originally claims "it could be used for caching other things such as
   ISO9660 filesystems too." [3]

 - It takes an unexpectedly long time to upstream Fscache/Cachefiles
   enhancements.  For example, the failover feature took more than
   one year, and the daemonless feature is still far behind now;

 - Ongoing HSM "fanotify pre-content hooks" [4] together with this will
   perfectly supersede "erofs over fscache" in a simpler way since
   developers (mainly containerd folks) could leverage their existing
   caching mechanism entirely in userspace instead of strictly following
   the predefined in-kernel caching tree hierarchy.

After "fanotify pre-content hooks" lands upstream to provide the same
functionality, "erofs over fscache" will be removed then (as an EROFS
internal improvement and EROFS will not have to bother with on-demand
fetching and/or caching improvements anymore.)

[1] containers/storage#2039
[2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
[3] https://docs.kernel.org/filesystems/caching/fscache.html
[4] https://lore.kernel.org/r/[email protected]
Closes: containers/composefs#144
Signed-off-by: Gao Xiang <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Aug 29, 2024
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Aug 30, 2024
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 2, 2024
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 5, 2024
roxell pushed a commit to roxell/linux that referenced this pull request Sep 6, 2024
ToolmanP pushed a commit to ToolmanP/erofs-rs-linux that referenced this pull request Sep 10, 2024
Successfully merging this pull request may close these issues.

composefs? Opening loopback device: open /dev/loop13: no such device or address"