
drop FUSE detection workaround for flatpak #2900

Open

smcv opened this issue Jun 23, 2023 · 43 comments · Fixed by #2901

Comments

@smcv
Contributor

smcv commented Jun 23, 2023

As reported on flatpak/flatpak#5452, multiple Fedora users seem to be seeing Flatpak update failures since their OS was upgraded from libostree 2023.3 to 2023.4. @yuntaz0 reports in flatpak/flatpak#5452 (comment) that rolling back to a previous OS image resolves this for them; only a few packages were involved in that rollback, of which libostree seems like the most likely cause for a Flatpak regression.

I haven't packaged 2023.4 for my OS (Debian) yet, so I have no particular insight into this myself.

@Jan200101

I've bisected this down to 59a653c; reverting it on 2023.4 fixes this behavior.

@dbnicholson
Member

Interesting. Can you tell exactly where in flatpak this starts to fail? Flatpak does some interesting things with temporary and parent/child repos that might not get good test coverage in ostree.

@Jan200101

I have no experience with all the different GLib abstractions, so I haven't found anything yet.

The best I have found so far is that it happens somewhere in or after write_content_object(), since that is where the error prefix "Writing content object" is set.
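
For reference, that prefix is the usual GLib error-prefixing idiom; a minimal sketch (hypothetical helper names, not the actual write_content_object() body) of the pattern that produces messages like "Writing content object: Need more input":

```c
#include <gio/gio.h>

/* Hypothetical failing step, standing in for the real write path. */
static gboolean
do_write (GError **error)
{
  g_set_error_literal (error, G_IO_ERROR, G_IO_ERROR_PARTIAL_INPUT,
                       "Need more input");
  return FALSE;
}

static gboolean
write_content_object_example (GError **error)
{
  if (!do_write (error))
    {
      /* Prepends the prefix seen in the logs to whatever error occurred. */
      g_prefix_error (error, "Writing content object: ");
      return FALSE;
    }
  return TRUE;
}
```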

@dbnicholson
Member

The error comes from GZlibDecompressor. Why that would happen from creating the file in the repo's tmp directory rather than in /var/tmp... I don't know. @cgwalters any ideas?

It would be interesting to run flatpak with G_MESSAGES_DEBUG=all (or maybe just --verbose --ostree-verbose) and try to see where the repo is and if there's anything special about it.

@Jan200101

Running with both G_MESSAGES_DEBUG=all and --verbose --ostree-verbose gives:

Partial Log
F: fetch of 298c2271055571ec60ba06b071c49b7f3d6e1ca3eca3d14beccd8ffd093e8965.file complete
F: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
F: Request caught error: Writing content object: Need more input
Installing 1/8… ██                    10%  12.8 MB/s
F: fetch static delta part c000a200f6ec1a27427e6de616dacbfb5d0ed6102bf8e4a868f1aaad5bd73008 complete
F: _ostree_fetcher_should_retry_request: error: 231:34 Input buffer too small, n_retries_remaining: 4
F: Should retry request (remaining: 4 retries), due to transient error: Input buffer too small
F: starting fetch of deltapart deltas/i0/7kHM6DKth+aoBv2DWawpXqI3aM9RXYjFAqDcq1TKI/0
Installing 1/8… ██▏                   11%  16.1 MB/s
F: fetch of 86ab61f731e70d825dc9483843be73d86e89357465ea6490212422f8f478edcc.file complete
F: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
F: Request caught error: Writing content object: Need more input
F: fetch static delta part c000a200f6ec1a27427e6de616dacbfb5d0ed6102bf8e4a868f1aaad5bd73008 complete
F: _ostree_fetcher_should_retry_request: error: 231:34 Input buffer too small, n_retries_remaining: 3
F: Should retry request (remaining: 3 retries), due to transient error: Input buffer too small
F: starting fetch of deltapart deltas/i0/7kHM6DKth+aoBv2DWawpXqI3aM9RXYjFAqDcq1TKI/0
F: fetch of 2b76da1f9708c36269e41004e9178b012f56a0ebabcb3c631a7c120906f2dede.file complete
F: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
F: Request caught error: Writing content object: Need more input
Installing 1/8… ██▍                   12%  19.6 MB/s
F: fetch static delta part c000a200f6ec1a27427e6de616dacbfb5d0ed6102bf8e4a868f1aaad5bd73008 complete
F: _ostree_fetcher_should_retry_request: error: 231:34 Input buffer too small, n_retries_remaining: 2
F: Should retry request (remaining: 2 retries), due to transient error: Input buffer too small
F: starting fetch of deltapart deltas/i0/7kHM6DKth+aoBv2DWawpXqI3aM9RXYjFAqDcq1TKI/0
Installing 1/8… ██▌                   13%  11.6 MB/s
Installing 1/8… ██▌                   13%  13.3 MB/s
F: fetch static delta part c000a200f6ec1a27427e6de616dacbfb5d0ed6102bf8e4a868f1aaad5bd73008 complete
F: _ostree_fetcher_should_retry_request: error: 231:34 Input buffer too small, n_retries_remaining: 1
F: Should retry request (remaining: 1 retries), due to transient error: Input buffer too small
F: starting fetch of deltapart deltas/i0/7kHM6DKth+aoBv2DWawpXqI3aM9RXYjFAqDcq1TKI/0
Installing 1/8… ██▊                   14%  15.0 MB/s
F: fetch static delta part c000a200f6ec1a27427e6de616dacbfb5d0ed6102bf8e4a868f1aaad5bd73008 complete
F: _ostree_fetcher_should_retry_request: error: 231:34 Input buffer too small, n_retries_remaining: 0
F: Request caught error: Input buffer too small
Installing 1/8… ███                   15%  11.1 MB/s
Installing 1/8… ███                   15%  12.1 MB/s
F: fetch of e8ab10dbc3306a082c482a18ba0fedd86776bf710b2faeec0bee34e787e51e28.file complete
F: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
F: Request caught error: Writing content object: Need more input
Installing 1/8… ███▏                  16%  13.3 MB/s
Installing 1/8… ███▍                  17%  14.3 MB/s
Installing 1/8… ███▍                  17%  11.5 MB/s  00:19
F: fetch of f9f0ca8060bcc8956a61545256788fc4e6f633bbe0fec697798f433cb54b8df0.file complete
F: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
F: Request caught error: Writing content object: Need more input
F: pull: idle, exiting mainloop
F: Aborting transaction in repository 0x55f0e5bf1540
F: Calling system helper: CancelPull
F: Calling system helper: CancelPull
Warning: While pulling runtime/org.freedesktop.Platform.GL.default/x86_64/22.08 from remote flathub: Writing content object: Need more input

The log includes a different error message, which comes from _ostree_lzma_return:

F: _ostree_fetcher_should_retry_request: error: 231:34 Input buffer too small, n_retries_remaining: 1
F: Should retry request (remaining: 1 retries), due to transient error: Input buffer too small

g_set_error_literal (error, G_IO_ERROR, G_IO_ERROR_PARTIAL_INPUT, "Input buffer too small");

I think the error message might be misleading, at least with LZMA.
"Input buffer too small" only gets set when LZMA_BUF_ERROR is returned by lzma_code, and the documentation for LZMA_BUF_ERROR reads:

This error code is returned when the coder cannot consume any new input and produce any new output. The most common reason for this error is that the input stream being decoded is truncated or corrupt.
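
To illustrate how that mapping can mislabel truncated input, here is a minimal sketch (assumed, not ostree's actual ostree-lzma-decompressor code) of a convert step that translates lzma_code() results into GLib errors:

```c
#include <gio/gio.h>
#include <lzma.h>

/* Hedged sketch only: 'strm' is assumed to be an already-initialized
 * decoder.  The point is the LZMA_BUF_ERROR branch, which reports
 * "Input buffer too small" even though the usual cause (per the docs
 * quoted above) is truncated or corrupt input. */
static gboolean
decompress_chunk (lzma_stream *strm,
                  const guint8 *inbuf, gsize inbuf_size,
                  guint8 *outbuf, gsize outbuf_size,
                  gboolean input_at_end, GError **error)
{
  strm->next_in = inbuf;
  strm->avail_in = inbuf_size;
  strm->next_out = outbuf;
  strm->avail_out = outbuf_size;

  lzma_ret ret = lzma_code (strm, input_at_end ? LZMA_FINISH : LZMA_RUN);
  switch (ret)
    {
    case LZMA_OK:
    case LZMA_STREAM_END:
      return TRUE;
    case LZMA_BUF_ERROR:
      g_set_error_literal (error, G_IO_ERROR, G_IO_ERROR_PARTIAL_INPUT,
                           "Input buffer too small");
      return FALSE;
    default:
      g_set_error (error, G_IO_ERROR, G_IO_ERROR_FAILED,
                   "lzma returned %d", (int) ret);
      return FALSE;
    }
}
```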

Both glnx_open_tmpfile_linkable_at and glnx_open_anonymous_tmpfile end up calling open_tmpfile_core, which does the heavy lifting, so in theory the logic should be pretty much the same. Why does one work but the other doesn't?

The only obvious difference is that one uses the dfd and a relative path while the other just uses whatever is in $TMPDIR or /var/tmp.
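
Side by side, assuming the current libglnx signatures (this is a sketch, not code from ostree), the two entry points differ only in where the temporary file ends up:

```c
#include <fcntl.h>
#include <gio/gio.h>
#include "glnx-fdio.h"   /* libglnx */

/* Hedged comparison sketch; dfd is assumed to be an open directory fd
 * for the repo's tmp/ directory (which may sit on a FUSE mount). */
static gboolean
open_both_ways (int dfd, GError **error)
{
  g_auto(GLnxTmpfile) linkable = { 0, };
  g_auto(GLnxTmpfile) anonymous = { 0, };

  /* Resolved relative to dfd, so the file lives inside the repo. */
  if (!glnx_open_tmpfile_linkable_at (dfd, ".", O_RDWR | O_CLOEXEC,
                                      &linkable, error))
    return FALSE;

  /* Ignores dfd entirely; falls back to $TMPDIR or /var/tmp. */
  if (!glnx_open_anonymous_tmpfile (O_RDWR | O_CLOEXEC, &anonymous, error))
    return FALSE;

  return TRUE;
}
```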

@dbnicholson
Member

The only obvious difference is that one uses the dfd and a relative path while the other just uses whatever is in $TMPDIR or /var/tmp.

Exactly. Do you have the full log, or at least the earlier parts? I don't know if it will help, but I'd like to see exactly where the repo is. In order to use the flatpak system repo, there's a funky two-stage setup where (I believe) it uses a temporary repo writable by the unprivileged user as the intermediary. I'd like to know what's special about this configuration, because I don't think the patch breaks typical ostree usage.

@dbnicholson
Member

BTW, the two errors are essentially the same thing. In one you're getting a .file object, which is zlib compressed. Eventually GZlibDecompressor returns G_IO_ERROR_PARTIAL_INPUT with the message set as "Need more input". The other is a static delta part, which is LZMA compressed. In that case, ostree's in-tree LZMA decompressor throws G_IO_ERROR_PARTIAL_INPUT with the likely wrong message "Input buffer too small".

In both cases, though, the decompression library is saying it's reached the end of the stream without seeing the expected end. I don't know why that's happening, but I think both errors likely come from the same cause.
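
On the zlib side, the same symptom is easy to reproduce: wrap a truncated compressed stream in a GConverterInputStream over GZlibDecompressor and the read fails with G_IO_ERROR_PARTIAL_INPUT. A minimal sketch (the compression format here is just illustrative, not necessarily ostree's object encoding):

```c
#include <gio/gio.h>

/* Hedged illustration: read a compressed stream to the end through
 * GZlibDecompressor.  If 'compressed_src' ends before the zlib stream
 * does, the read fails with G_IO_ERROR_PARTIAL_INPUT ("Need more input"). */
static gboolean
read_all_decompressed (GInputStream *compressed_src, GError **error)
{
  g_autoptr(GZlibDecompressor) decomp =
      g_zlib_decompressor_new (G_ZLIB_COMPRESSOR_FORMAT_GZIP);
  g_autoptr(GInputStream) conv =
      g_converter_input_stream_new (compressed_src, G_CONVERTER (decomp));

  guint8 buf[8192];
  gssize n;
  while ((n = g_input_stream_read (conv, buf, sizeof buf, NULL, error)) > 0)
    ;  /* discard; we only care whether the stream terminates cleanly */

  return n == 0;
}
```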

@Jan200101

Full logs

I've also gone ahead and captured the output of ls -l /proc/$(pidof flatpak)/fd

(all anon inodes and sockets removed for clarity)

lrwx------ 1 sentry sentry 64 Jun 24 12:44 0 -> /dev/pts/1
lrwx------ 1 sentry sentry 64 Jun 24 12:44 1 -> /dev/pts/1
lr-x------ 1 sentry sentry 64 Jun 24 12:44 10 -> /var/lib/flatpak/repo
lr-x------ 1 sentry sentry 64 Jun 24 12:44 11 -> /var/lib/flatpak/repo/objects
lr-x------ 1 sentry sentry 64 Jun 24 12:44 12 -> /var/lib/flatpak/repo/tmp
lr-x------ 1 sentry sentry 64 Jun 24 12:44 13 -> /home/sentry/.cache/flatpak/system-cache
lr-x------ 1 sentry sentry 64 Jun 24 12:44 14 -> /home/sentry/.local/share/flatpak/repo
lr-x------ 1 sentry sentry 64 Jun 24 12:44 15 -> /home/sentry/.local/share/flatpak/repo/objects
lr-x------ 1 sentry sentry 64 Jun 24 12:44 16 -> /home/sentry/.local/share/flatpak/repo/tmp
lr-x------ 1 sentry sentry 64 Jun 24 12:44 17 -> /home/sentry/.local/share/flatpak/repo/tmp/cache
lr-x------ 1 sentry sentry 64 Jun 24 12:44 18 -> /var/lib/flatpak/repo
lr-x------ 1 sentry sentry 64 Jun 24 12:44 19 -> /var/lib/flatpak/repo/objects
lrwx------ 1 sentry sentry 64 Jun 24 12:44 2 -> /dev/pts/1
lr-x------ 1 sentry sentry 64 Jun 24 12:44 20 -> /var/lib/flatpak/repo/tmp
lr-x------ 1 sentry sentry 64 Jun 24 12:44 21 -> /home/sentry/.cache/flatpak/system-cache
lr-x------ 1 sentry sentry 64 Jun 24 12:44 27 -> /var/tmp/flatpak-cache-276Y61/org.kde.Platform-RLRC71/repo-IoyxB2/tmp/staging-f58b91f7-d759-49ae-a2da-7fd2e7f19457-QCx7X6
lrwx------ 1 sentry sentry 64 Jun 24 12:44 28 -> /var/tmp/flatpak-cache-276Y61/org.kde.Platform-RLRC71/repo-IoyxB2-lock
lrwx------ 1 sentry sentry 64 Jun 24 12:44 29 -> /var/tmp/flatpak-cache-276Y61/org.kde.Platform-RLRC71/repo-IoyxB2/tmp/staging-f58b91f7-d759-49ae-a2da-7fd2e7f19457-QCx7X6-lock
lrwx------ 1 sentry sentry 64 Jun 24 12:44 3 -> 'anon_inode:[eventfd]'
lr-x------ 1 sentry sentry 64 Jun 24 12:44 33 -> /var/tmp/flatpak-cache-276Y61/org.kde.Platform-RLRC71/repo-IoyxB2
lr-x------ 1 sentry sentry 64 Jun 24 12:44 34 -> /var/tmp/flatpak-cache-276Y61/org.kde.Platform-RLRC71/repo-IoyxB2/objects
lr-x------ 1 sentry sentry 64 Jun 24 12:44 35 -> /var/tmp/flatpak-cache-276Y61/org.kde.Platform-RLRC71/repo-IoyxB2/tmp
lr-x------ 1 sentry sentry 64 Jun 24 12:44 36 -> /var/lib/flatpak/repo
lr-x------ 1 sentry sentry 64 Jun 24 12:44 37 -> /var/lib/flatpak/repo/objects
lr-x------ 1 sentry sentry 64 Jun 24 12:44 38 -> /var/lib/flatpak/repo/tmp
lr-x------ 1 sentry sentry 64 Jun 24 12:44 39 -> /home/sentry/.cache/flatpak/system-cache
lrwx------ 1 sentry sentry 64 Jun 24 12:44 41 -> /var/tmp/flatpak-cache-276Y61/org.kde.Platform-RLRC71/repo-IoyxB2/tmp/tmp.TTGAn5
lrwx------ 1 sentry sentry 64 Jun 24 12:44 42 -> /var/tmp/flatpak-cache-276Y61/org.kde.Platform-RLRC71/repo-IoyxB2/tmp/tmp.o9vVcd

@yuntaz0

yuntaz0 commented Jun 24, 2023

Another Fedora user, azarok, has also mentioned that they can install and update Flatpak apps by using sudo privileges. I have also tested this and can confirm that using sudo with Flatpak install and update commands works in my case. While I understand that Flatpak shouldn't normally require sudo, I thought it worth mentioning this observation.

@Jan200101

That's interesting.

Running flatpak in the same way I did before, but this time as root, shows different paths being used.

lrwx------ 1 root root 64 Jun 24 15:16 0 -> /dev/pts/1
lrwx------ 1 root root 64 Jun 24 15:16 1 -> /dev/pts/1
lr-x------ 1 root root 64 Jun 24 15:16 10 -> /var/lib/flatpak/repo/objects
lr-x------ 1 root root 64 Jun 24 15:16 11 -> /var/lib/flatpak/repo/tmp
lr-x------ 1 root root 64 Jun 24 15:16 12 -> /var/lib/flatpak/repo/tmp/cache
lr-x------ 1 root root 64 Jun 24 15:16 13 -> /root/.local/share/flatpak/repo
lr-x------ 1 root root 64 Jun 24 15:16 14 -> /root/.local/share/flatpak/repo/objects
lr-x------ 1 root root 64 Jun 24 15:16 15 -> /root/.local/share/flatpak/repo/tmp
lr-x------ 1 root root 64 Jun 24 15:16 16 -> /root/.local/share/flatpak/repo/tmp/cache
lr-x------ 1 root root 64 Jun 24 15:16 17 -> /var/lib/flatpak/repo
lr-x------ 1 root root 64 Jun 24 15:16 18 -> /var/lib/flatpak/repo/objects
lr-x------ 1 root root 64 Jun 24 15:16 19 -> /var/lib/flatpak/repo/tmp
lrwx------ 1 root root 64 Jun 24 15:16 2 -> /dev/pts/1
lr-x------ 1 root root 64 Jun 24 15:16 20 -> /var/lib/flatpak/repo/tmp/cache
lrwx------ 1 root root 64 Jun 24 15:16 26 -> /var/lib/flatpak/repo-lock
lrwx------ 1 root root 64 Jun 24 15:16 27 -> /var/lib/flatpak/repo/.lock
lrwx------ 1 root root 64 Jun 24 15:16 30 -> /var/lib/flatpak/repo/tmp/staging-f58b91f7-d759-49ae-a2da-7fd2e7f19457-YzU43z-lock
lrwx------ 1 root root 64 Jun 24 15:16 34 -> '/var/lib/flatpak/repo/tmp/#32375049 (deleted)'
lrwx------ 1 root root 64 Jun 24 15:16 35 -> '/var/lib/flatpak/repo/tmp/#32375062 (deleted)'
lrwx------ 1 root root 64 Jun 24 15:16 36 -> '/var/lib/flatpak/repo/tmp/#32375063 (deleted)'
lrwx------ 1 root root 64 Jun 24 15:16 37 -> '/var/lib/flatpak/repo/tmp/#32375065 (deleted)'
lrwx------ 1 root root 64 Jun 24 15:16 38 -> '/var/lib/flatpak/repo/tmp/#32375068 (deleted)'
lrwx------ 1 root root 64 Jun 24 15:16 39 -> '/var/lib/flatpak/repo/tmp/#32375067 (deleted)'
lrwx------ 1 root root 64 Jun 24 15:16 40 -> '/var/lib/flatpak/repo/tmp/#32375066 (deleted)'
lrwx------ 1 root root 64 Jun 24 15:16 41 -> '/var/lib/flatpak/repo/tmp/#32375069 (deleted)'
lrwx------ 1 root root 64 Jun 24 15:16 7 -> /dev/tty
lr-x------ 1 root root 64 Jun 24 15:16 9 -> /var/lib/flatpak/repo

@dbnicholson
Member

@Jan200101 Thanks for all that info. I'm starting to suspect the error is actually in the system helper, which pulls the objects from the user's temporary download repo into the /var/lib/flatpak/repo system repo.

I'd bet if you do a --user install it also works fine. I think the next test is to watch the system helper during the failed run.

@dbnicholson
Member

Can you run the system helper directly with sudo /usr/libexec/flatpak-system-helper --replace --no-idle-exit --verbose --ostree-verbose and see if there are any errors?

@Jan200101

I'd bet if you do a --user install it also works fine. I think the next test is to watch the system helper during the failed run.

I can confirm that using --user works without a problem

Can you run the system helper directly with sudo /usr/libexec/flatpak-system-helper --replace --no-idle-exit --verbose --ostree-verbose and see if there are any errors?

Sure thing
https://gist.github.com/Jan200101/23a00076dbf92e37beb443a2383f60be

FH: CancelPull default 1 /var/lib/flatpak/repo/tmp/flatpak-cache-7DC461 seems fishy in there.
If the pulls were prematurely canceled, this would explain why various compression functions report incomplete data.

@dbnicholson
Member

FH: CancelPull default 1 /var/lib/flatpak/repo/tmp/flatpak-cache-7DC461 seems fishy in there. If the pulls were prematurely canceled, this would explain why various compression functions report incomplete data.

I actually think that's expected. At the end of your earlier verbose log you could see the user side instructing the system helper to cancel because of failures.

So, I think it's the user side in this interaction that's failing, as shown in your earlier logs. There must be something different about the temporary proxy repo the user sets up in /var/tmp that causes this error.

@alexlarsson do you have some time to look at this? I think you probably understand the interaction with the flatpak system helper better than anyone.

@billyblackburn

This comment was marked as duplicate.

@leafi

This comment was marked as duplicate.

@dvzrv

dvzrv commented Jun 25, 2023

@leafi thanks for the ping. I'll revert the commit mentioned in #2900 (comment) in the hopes that it'll fix this for Arch Linux. Please check whether 2023.4-2 in extra-testing fixes this for you.

@justanotherinternetguy

This comment was marked as duplicate.

@yochananmarqos

@justanotherinternetguy

2023.4-2 in extra-testing

😉

@leafi

leafi commented Jun 25, 2023

@dvzrv Hey, thanks. I just enabled extra-testing, updated, and flatpak update ran successfully. Full console log in attached .txt file in case you wish to see it.

updated-ostree-session.txt

@dbnicholson
Member

@dvzrv From your Arch change, it appears reverting f7f6f87 fixes the problem, right? If so, care to turn it into a PR? I think that's the wise thing to do for now while the root cause is investigated.

@sarayourfriend

@dvzrv Also confirming 2023.4-2 in extra-testing fixed the issue for me, FWIW.

@dvzrv

dvzrv commented Jun 26, 2023

@dvzrv From your Arch change, it appears reverting f7f6f87 fixes the problem, right? If so, care to turn it into a PR? I think that's the wise thing to do for now while the root cause is investigated.

Sure, I can do that today!

@dvzrv Also confirming 2023.4-2 in extra-testing fixed the issue for me, FWIW.

Thanks for verifying! Moved to the stable repos.

@shellheim

This comment was marked as duplicate.

cgwalters added a commit to cgwalters/ostree that referenced this issue Jun 26, 2023
This reverts commit f7f6f87.

This seems to have broken flatpak, so we'll revert and then
investigate.

Closes: ostreedev#2900
@cgwalters
Member

Revert PR up in #2901

@ZeddieXX

This comment was marked as duplicate.

@dbnicholson
Member

Having only looked at this for 3 minutes so far, one immediate likely thing that jumps out to me here is the interaction with revokefs-fuse.

Oh yeah, revokefs-fuse. I knew there was something unusual about that setup I was forgetting. I'm going to try to setup this environment today and see if I can tease out the real issue.

@supakeen

This comment was marked as duplicate.

@travier
Member

travier commented Jun 26, 2023

Fedora 38 update (revert) to test: https://bodhi.fedoraproject.org/updates/FEDORA-2023-464fae1680

@dbnicholson
Member

Having only looked at this for 3 minutes so far, one immediate likely thing that jumps out to me here is the interaction with revokefs-fuse.

Oh yeah, revokefs-fuse. I knew there was something unusual about that setup I was forgetting. I'm going to try to setup this environment today and see if I can tease out the real issue.

I mangled together a setup with revokefs-fuse and got the following interesting responses when trying to use glnx_open_tmpfile_linkable_at in a directory where the revokefs-fuse target is mounted. First it gets EOPNOTSUPP:

openat(AT_FDCWD, "rfdst", O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = -1 EOPNOTSUPP (Operation not supported)

So libglnx tries to fall back to a regular temporary file and gets EIO:

openat(AT_FDCWD, "rfdst/tmp.0T0oh5", O_RDWR|O_CREAT|O_EXCL|O_NOCTTY|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EIO (Input/output error)

Similarly, if I use glnx_open_anonymous_tmpfile but set TMPDIR=/path/to/revokefs-fuse/mount, I get the same errors.

I'm not sure if I'm driving it right yet. I'm also on Linux 6.1, which is a bit old at this point. I might have to try this out in a VM with a newer kernel.
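
For context, the two strace lines above correspond to the usual try-O_TMPFILE-then-fall-back sequence. A rough sketch of that sequence (not libglnx's actual implementation, and with a fixed fallback name where the real code uses a random one):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Hedged sketch of the sequence seen in strace: O_TMPFILE in the target
 * directory first, then a named temporary file.  On the revokefs-fuse
 * mount the first call failed with EOPNOTSUPP and the fallback with EIO. */
static int
open_tmpfile_in (int dfd)
{
  int fd = openat (dfd, ".", O_RDWR | O_CLOEXEC | O_TMPFILE, 0600);
  if (fd >= 0 || errno != EOPNOTSUPP)
    return fd;

  /* Fallback: a plain O_CREAT|O_EXCL file ("tmp.example" is a placeholder;
   * the real code generates a random tmp.XXXXXX-style name). */
  return openat (dfd, "tmp.example",
                 O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_NOFOLLOW | O_CLOEXEC,
                 0600);
}
```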

@dbnicholson
Member

I haven't figured out what's going on yet, but something is going wrong as the data is being read from the fd. Here the LZMA decompressor is being given some data to convert but being told the size is 0:

#0  _ostree_lzma_decompressor_convert
    (converter=0x5555557110e0, inbuf=0x555555a1aef0, inbuf_size=0, outbuf=0x555555856250, outbuf_size=8192, flags=G_CONVERTER_INPUT_AT_END, bytes_read=0x7fffffffa5d8, bytes_written=0x7fffffffa5e0, error=0x7fffffffa5e8) at src/libostree/ostree-lzma-decompressor.c:132
(gdb) x/20x inbuf
0x555555a1aef0:	0xfc	0xd0	0x70	0xb1	0x8a	0x09	0xd3	0x4f
0x555555a1aef8:	0xd6	0x5b	0xb5	0xd5	0x3c	0x5e	0x69	0xb4
0x555555a1af00:	0xe6	0x6b	0x98	0x3e

The inbuf_size=0 is passed along to lzma, which is why it's surprised that it hasn't gotten to the end yet.

Teasing out the actual fd and current position:

(gdb) frame 6
#6  0x00007ffff7c63bb6 in _ostree_static_delta_part_open (part_in=0x55555570baa0 [GUnixInputStream], inline_part_bytes=0x0, flags=OSTREE_STATIC_DELTA_OPEN_FLAGS_NONE, 
    expected_checksum=0x555555dc9bc0 "f17640ea4024efd2053d82e8ed16d806d50d4018c051e847a70fb80c31d9573a", out_part=0x7fffffffc8b0, cancellable=0x0, error=0x7fffffffc8a8)
    at src/libostree/ostree-repo-static-delta-core.c:702
702	        g_autoptr (GBytes) buf = ot_map_anonymous_tmpfile_from_content (convin, cancellable, error);
(gdb) call g_unix_input_stream_get_fd (part_in)
$21 = 39
(gdb) call lseek(39, 0, 1)
$22 = 24576

Taking a look at the temporary file outside gdb it looks like it was downloaded successfully and is definitely larger than 24576 bytes.

dan@debian:~/ostree$ readlink /proc/37267/fd/39
/var/tmp/flatpak-cache-4M4X61/org.freedesktop.Platform.openh264-OS1C71/repo-2UuF16/tmp/tmp.QrmRWU
dan@debian:~/ostree$ ls -l /var/tmp/flatpak-cache-4M4X61/org.freedesktop.Platform.openh264-OS1C71/repo-2UuF16/tmp/tmp.QrmRWU
-rw-r--r-- 1 dan _flatpak 269433 Jun 26 13:48 /var/tmp/flatpak-cache-4M4X61/org.freedesktop.Platform.openh264-OS1C71/repo-2UuF16/tmp/tmp.QrmRWU
dan@debian:~/ostree$ sha256sum /var/tmp/flatpak-cache-4M4X61/org.freedesktop.Platform.openh264-OS1C71/repo-2UuF16/tmp/tmp.QrmRWU
f17640ea4024efd2053d82e8ed16d806d50d4018c051e847a70fb80c31d9573a  /var/tmp/flatpak-cache-4M4X61/org.freedesktop.Platform.openh264-OS1C71/repo-2UuF16/tmp/tmp.QrmRWU

dbnicholson pushed a commit to cgwalters/ostree that referenced this issue Jun 27, 2023
This reverts commit f7f6f87.

This seems to have broken flatpak, so we'll revert and then
investigate.

Closes: ostreedev#2900
@dbnicholson
Member

It appears that this is a kind of race, or at least the reader is going too fast for FUSE. When I was breaking in _ostree_lzma_decompressor_convert and therefore artificially slowing down the read loop, it completed successfully. I also looked at the FUSE read callback for revokefs-fuse, and it doesn't really seem like much could be going wrong.

So, I'm starting to suspect this is a bug in a corner case in GConverterInputStream, whose read loop is a bit complex since it needs to manage a kind of double-buffering scheme. Or maybe a bug in FUSE, which certainly wouldn't be the first time I've seen one of those.

The simple reproducer I'm using is to install a flatpak runtime that has no dependencies. I've settled on the openh264 runtime from flathub since it's pretty small:

flatpak uninstall runtime/org.freedesktop.Platform.openh264/x86_64/2.2.0
flatpak install -y --noninteractive flathub runtime/org.freedesktop.Platform.openh264/x86_64/2.2.0

You could probably go even smaller with one of the theme extensions. In my earlier "hacked up revokefs-fuse" test, I was trying to use the demo, but it didn't appear to be doing the right thing since any IO on the mounted target was failing.

@travier
Member

travier commented Jun 27, 2023

It appears that this is a kind of race, or at least the reader is going too fast for FUSE.

I had tried doing an update in a loop and parts of it succeeded after a while, which indeed points to a race.

@cgwalters cgwalters reopened this Jun 27, 2023
@cgwalters
Member

Let's keep this open until we figure out how we can safely re-land 59a653c because we do still want to make that change.

Probably to be conservative we may need to do something hacky like detecting whether we're being executed by flatpak, or maybe if we're writing to a FUSE mount?

@cgwalters cgwalters changed the title 2023.4 appears to have made Flatpak regress 59a653c breaks flatpak w/system helper Jun 27, 2023
@dbnicholson
Member

Probably to be conservative we may need to do something hacky like detecting whether we're being executed by flatpak, or maybe if we're writing to a FUSE mount?

I think detecting if we're going through FUSE. I think that should be pretty straightforward with an fstatfs call on the dfd and looking for f_type == FUSE_SUPER_MAGIC. I think you'd prefer to do that once per dfd instead of every call into _ostree_fetcher_tmpf_from_flags, so perhaps reusing the OSTREE_FETCHER_REQUEST_LINKABLE flag but making the caller do the fstatfs call and clearing it on FUSE.
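
A minimal sketch of that detection (hedged: this is the proposal being discussed, not code that exists in ostree yet), checking the filesystem type of the directory fd once:

```c
#include <glib.h>
#include <sys/vfs.h>
#include <linux/magic.h>   /* FUSE_SUPER_MAGIC */

/* Hedged sketch of the proposed check, done once per dfd. */
static gboolean
dfd_is_on_fuse (int dfd)
{
  struct statfs stbuf;
  if (fstatfs (dfd, &stbuf) < 0)
    return FALSE;   /* real code would propagate the error instead */
  return stbuf.f_type == FUSE_SUPER_MAGIC;
}
```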

@dbnicholson
Member

One thing dawned on me about the revokefs-fuse setup. Per the background, the whole point is to let the unprivileged user write to a shared repo and then revoke that ability before the privileged user reads from it.

Before this change, the flow was 99% write-based from the unprivileged user's perspective. It was either downloading metadata objects to memory and then writing them into the temporary repo's objects directory, or it was downloading file or delta parts to /var/tmp and then decompressing and writing them into the temporary repo's objects directory. The reading happens on the privileged side and happens from the underlying directory, not through FUSE. The only unprivileged-user reading I saw in strace was very early on, to read the config file in the repo. So, I suspect revokefs-fuse has never seen any large file reads until this change.

@dbnicholson
Member

I was finally able to get the demo going after having it not close the socket the frontend needs. I thought you'd always end up in the pread branch of the read callback. However, what's unusual about this case is that ostree opens the temporary file O_RDWR, writes all the downloaded content into it, seeks to the beginning, and then starts reading through it for processing and actually copying to the staging directory. Since it's opened read-write, the frontend passes the request to the backend, which means the reads are all being proxied over a socket. I think that path is relatively unused, since the usual pattern is either write-only files (a well-tested proxy to the backend) or read-only files (not proxied to the backend).
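
Stripped down to the syscall level, the access pattern being described looks roughly like this (a POSIX-only sketch with a hypothetical process_chunk(), not ostree's code); it's the rewind-and-read-back half that pushes reads through the FUSE frontend:

```c
#include <unistd.h>

/* Hypothetical consumer standing in for decompression/object writing. */
static void
process_chunk (const char *buf, ssize_t len)
{
  (void) buf;
  (void) len;
}

/* Hedged sketch: one O_RDWR fd gets the whole download written to it,
 * is rewound, and is then read back for processing. */
static int
write_then_reread (int fd, const char *payload, size_t payload_len)
{
  if (write (fd, payload, payload_len) < 0)
    return -1;

  if (lseek (fd, 0, SEEK_SET) < 0)
    return -1;

  char buf[8192];
  ssize_t n;
  while ((n = read (fd, buf, sizeof buf)) > 0)
    process_chunk (buf, n);

  return n < 0 ? -1 : 0;
}
```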

@dbnicholson
Member

Found the problem. Since revokefs-fuse is using FUSE's default cached mode, you sometimes get page cache results from read instead of having the request actually passed to the FUSE process. In other words, you really do start getting 0 from read before you get to the end of the file sometimes. The fix is simple and I'll follow up on the flatpak issue, but I want to test a bit more first.
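
The actual fix landed on the flatpak side in flatpak/flatpak#5454. Purely to illustrate the class of workaround for "page cache answers instead of real FUSE reads" (a generic high-level libfuse sketch, not revokefs-fuse's code and not necessarily what the flatpak fix does), a FUSE filesystem can opt a file out of the page cache in its open handler:

```c
#define FUSE_USE_VERSION 31
#include <fuse.h>

/* Generic illustration only: with direct_io set, the kernel stops
 * serving reads for this file from the page cache, so every read is
 * forwarded to the FUSE process. */
static int
example_open (const char *path, struct fuse_file_info *fi)
{
  (void) path;
  fi->direct_io = 1;
  return 0;
}
```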

@Freyon77

Is this issue related to
"/usr/lib/systemd/system-generators/ostree-system-generator failed with exit status 1"?
Because it still happens on Arch.

@dbnicholson
Member

Is this issue related to "/usr/lib/systemd/system-generators/ostree-system-generator failed with exit status 1"? Because it still happens on Arch.

I don't know what that issue is, but I doubt it's related. Please open a separate issue with details about it.

@dbnicholson
Member

FYI, I think I fixed the actual bug in flatpak/flatpak#5454. As requested there, the preference is not to reinstate f7f6f87 (or an equivalent) until flatpak releases with that fix have been made.

cgwalters added a commit to cgwalters/ostree that referenced this issue Jun 29, 2023
This reverts commit 4e61e6f
and re-instates the fix for ensuring that we download temporary
files into the repository location.

However in order to ensure we don't re-introduce
ostreedev#2900
we detect the case where we're writing to a FUSE mount
and keep the prior behavior.

I've verified that this works with flatpak.

Note a downside of this is the change needs to be triplicated
across the 3 http backends.

This then again
Closes: ostreedev#2571
@cgwalters
Member

#2913 is queued; I'm going to retitle this issue to track when we can drop the workaround. Since it's not high priority, I'd give it, say, 6 months.

@cgwalters cgwalters changed the title 59a653c breaks flatpak w/system helper drop FUSE detection workaround for flatpak Jun 29, 2023
cgwalters added a commit to cgwalters/ostree that referenced this issue Jul 5, 2023
This reverts commit 4e61e6f
and re-instates the fix for ensuring that we download temporary
files into the repository location.

However in order to ensure we don't re-introduce
ostreedev#2900
we detect the case where we're writing to a FUSE mount
and keep the prior behavior.

I've verified that this works with flatpak.

Note a downside of this is the change needs to be triplicated
across the 3 http backends.

This then again
Closes: ostreedev#2571