RFC: VFS #377

bugaevc · 2018-02-21T07:31:30Z

Summary

Bring back VirtualPrefix and add more stuff on top of it, getting rid of mount namespaces and overlayfs

(sounds like a regression, doesn't it?)

Background

(I'm trying to both make a proposal and document what goes on, hence this section)

What is this all about?

Linux and macOS filesystem layouts, while similar, differ significantly enough that we can't present the host filesystem to programs running under Darling, and when they do agree (e.g. both put executables under /bin & /usr/bin) we want programs to see our versions of those directories.

No matter what exact mechanism we use for that, we deal with the above by having our own macOS-like "chroot" in libexec/darling/, on top of which we overlay so-called "prefixes" aka dprefixes (more about them in the wiki).

What is VirtualPrefix?

Darling used to implement chroot emulation for macOS executables by making use of the fact that we ship a complete libSystem (libc + other stuff) library instead of linking to the one used by the host. While we try not to modify most of libSystem (compared to what Apple ships/publishes), we do change libkernel/libsyscall to bridge between Darwin and Linux syscalls, and that gives us a chance to transform paths passed between libSystem and the Linux kernel.

The overlaying was achieved by just copying libexec/darling into each prefix each time anything changed (this is also what Wine does).
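As a rough illustration of the path transformation described above, here is a minimal sketch, not the original VirtualPrefix code. The function name guest_to_host, the HOST_ROOT_ALIAS constant, and the rule that /Volumes/SystemRoot maps back to the host's root are assumptions made for this example. It only handles absolute paths and does no ".." or symlink resolution, which is where a real implementation gets hard.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: guest paths under this directory refer to the host's root. */
#define HOST_ROOT_ALIAS "/Volumes/SystemRoot"

/*
 * Translate an absolute guest (Darwin-side) path into a host (Linux-side) path.
 * `prefix` is the host directory acting as the guest's root, no trailing slash.
 * Returns 0 on success, -1 if the input is relative or the result does not fit.
 */
int guest_to_host(const char *prefix, const char *guest, char *out, size_t outlen)
{
    size_t alias_len = strlen(HOST_ROOT_ALIAS);
    int n;

    assert(outlen > 0);
    if (guest[0] != '/')
        return -1;

    if (strncmp(guest, HOST_ROOT_ALIAS, alias_len) == 0 &&
        (guest[alias_len] == '/' || guest[alias_len] == '\0')) {
        /* "/Volumes/SystemRoot/etc" becomes "/etc"; the alias itself becomes "/". */
        const char *rest = guest + alias_len;
        n = snprintf(out, outlen, "%s", *rest ? rest : "/");
    } else {
        /* Everything else lives under the prefix directory. */
        n = snprintf(out, outlen, "%s%s", prefix, guest);
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A syscall wrapper would call something like this on every path argument before handing it to the Linux kernel, and would apply the reverse mapping to paths coming back (for example from getcwd or readlink).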

Why was VirtualPrefix removed?

I've found a few bugs in how VirtualPrefix worked but decided that it'd be easier to do away with it altogether than debug and fix those bugs. We were just moving to a mount namespace + overlayfs mechanism for overlaying prefix contents over libexec/darling, and chrooting into the resulting directory was a very natural and simple extension of that idea.

To quote myself from #197,

If we're using mount namespaces, do we really need the hackish fakechroot/VirtualPrefix implementation in syscall emulation? The kernel is much better at doing this

It seemed the only "small problem" with this new layout was that ld-linux, Linux's dynamic loader, was unable to find native ELF libraries when invoked from inside the container. We worked around that by setting up a copy of /etc/ld.so.conf and /etc/ld.so.cache at installation time.

See #222 for the merge request that removed VirtualPrefix.

Was it a good idea to remove VirtualPrefix?

I'm not so sure anymore. Using Linux's native mount namespaces and chroot/pivot_root implementation is indeed a much cleaner solution than reimplementing all that logic ourselves, but it has caused us a lot more headaches since then.

Besides ld.so, other things that we had or would have to patch or work around in some way, because native files are not where they are expected to be, are: the X11 and Wayland sockets (we would have to symlink them); fontconfig config files and the fonts themselves (which we hacked around by, again, making a modified copy of the config at installation time); Mesa "drivers" (relatively cleanly fixed by symlinking /usr/lib64/ to that of the host); GTK+ and Qt themes (and icons, and cursors, and the rest of /usr/share; could be fixed by modifying XDG_DATA_DIRS); native open/save file dialogs displaying prefixes instead of the host's filesystem layout (my latest idea was to ask the DE to open an appropriate dialog over D-Bus using the org.freedesktop.portal API); and probably many more that we haven't thought about or stumbled upon yet.

As you can see, these are spread throughout the stack and are all worked around differently — and yet, incidentally or not, they would all go away if only the native libc saw the host's filesystem layout (which was the case with VirtualPrefix).

Do all those justify bringing VirtualPrefix back? I'm not sure of that either. VirtualPrefix still is an ugly & buggy hack and native solutions are still nicer.

But wait,

There's more to the story

There are other filesystem-related things we have/want to tackle in some way:

Case (in)sensitivity (I believe VirtualPrefix handled that)
Different /dev layout (currently we symlink /dev from the host)
Host mounts appearing directly in /Volumes (see New /Volumes design #220)
File ownership (which we currently fake at syscall level)
Per-thread CWD (same)

And if you think about it, mount namespaces and overlayfs are parts of the filesystem story too, which brings us to the

Proposal

(well, maybe not a proposal, but an idea)

Let's revive [the idea of] VirtualPrefix, rewrite it from scratch to be fully correct and turn it into a more complete Virtual File System (VFS) implementation that would handle all of the above, including case-insensitivity, faking ownership, tracking CWD, tweaking directory layout, mounting sub-/Volumes the way we want them and overlaying the prefix on top of libexec/darling.

Now that sounds like a much cleaner solution than what we have today.
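To make the case-insensitivity item more concrete, here is a minimal sketch of how a userspace VFS layer might look up a single path component while ignoring case. The function name lookup_nocase is made up for this example, and strcasecmp only folds ASCII in the default C locale, whereas macOS filesystems use Unicode-aware folding, so treat this as an approximation of the idea rather than a compatible implementation.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>
#include <strings.h>

/*
 * Look up `name` in directory `dir`, ignoring ASCII case.
 * On a match, copy the on-disk spelling into `out` and return 0.
 * Return -1 if the directory cannot be read, nothing matches,
 * or the matching name does not fit into `out`.
 */
int lookup_nocase(const char *dir, const char *name, char *out, size_t outlen)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int rc = -1;

    assert(outlen > 0);
    if (!d)
        return -1;
    while ((e = readdir(d)) != NULL) {
        if (strcasecmp(e->d_name, name) == 0) {
            if (strlen(e->d_name) < outlen) {
                strcpy(out, e->d_name);
                rc = 0;
            }
            break;
        }
    }
    closedir(d);
    return rc;
}
```

Resolving a full path this way means one directory scan per component whenever the exact spelling is not found, which is the performance cost that makes filesystem-level support attractive.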

Why reimplement overlayfs functionality?

...we don't have lots of hacks because of overlayfs, do we?

Not as many as we have because of other things, no, but there are a few problems with it. Firstly, modifying underlying filesystems while overlayfs is mounted is undefined behavior (meaning we can't update our libexec files while a container is running, and putting .init.pid inside the prefix directory is/was UB too). Secondly, it doesn't support encrypted home folders (see #242) nor some other interesting filesystems. Last but not least, we can't make it support case-insensitivity without basically reimplementing it.

Reimplementing overlayfs would mean that we would also no longer need mount namespaces.
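For illustration, the read side of such an overlay can be sketched as "try the prefix first, then fall back to the base tree". The function name overlay_resolve and its parameters are hypothetical, and the sketch ignores everything that makes overlayfs complicated (copy-up on write, whiteouts for deleted files, merged directory listings).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Resolve guest path `rel` (absolute within the guest, e.g. "/usr/bin/true")
 * against two host directories: `upper` (the writable prefix) wins over
 * `lower` (the read-only base tree, e.g. libexec/darling).
 * Returns 0 and fills `out` on a hit, -1 if neither layer has the entry
 * or the candidate path does not fit into `out`.
 */
int overlay_resolve(const char *upper, const char *lower,
                    const char *rel, char *out, size_t outlen)
{
    const char *layers[2] = { upper, lower };
    struct stat st;

    assert(rel[0] == '/');
    for (int i = 0; i < 2; i++) {
        int n = snprintf(out, outlen, "%s%s", layers[i], rel);
        if (n < 0 || (size_t)n >= outlen)
            return -1;
        if (stat(out, &st) == 0)
            return 0; /* first layer that has the entry wins */
    }
    return -1;
}
```

Because the lookup happens on every call instead of at mount time, files in the base tree can be updated while a container is running, which is one of the limitations of overlayfs mentioned above.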

Unresolved issues

We do want to mount separate /proc and /dev/shm, but with no kernel-level-chroot, there's nowhere to mount them. Ooops.
- We could still unshare the mount namespace and mount proc to e.g. /proc2 (or $DPREFIX/proc). I don't quite understand how /dev/shm works even now with /dev being a symlink.
That would make our filesystem operations not atomic.
- Do we really care about that?

Alternatives

Do nothing; what we have now works well and we have a pretty good understanding of how to work around the filesystem layout issues we foresee. If it ain't broke, don't fix it.

(Note: RFC stands for 'Request For Comments')


CuriousTommy · 2019-09-16T15:23:32Z

Case (in)sensitivity (I believe VirtualPrefix handled that)

On a side note, it seems like some filesystems now provide the ability to be case-insensitive (ext4 and F2FS).

CuriousTommy · 2019-12-15T20:50:19Z

@bugaevc I hope you don't mind me asking, but has there been any progress with adding back VFS? If not, I would like to help get the ball rolling (so to speak).

Skimming through #222, I believe that we need to modify the syscall in src/kernel/emulation/linux/ to use the future recreation of VirtualPrefix, correct? Do we also need to change anything in the dyld directory?

Now, about those features you want.

Case (in)sensitivity (I believe VirtualPrefix handled that)

Like I said in my earlier post, some file systems already offer this feature. If we need to, we can implement it, but I would rather have the OS take care of this (mainly for performance reasons).

We also need to make sure there is a way to disable this. iOS is case-sensitive by default.

File ownership (which we currently fake at syscall level)

I haven't looked into how Darling handles this, but I was thinking of having a hidden file in each directory (we could call it .__darling_owner) that holds information on who owns what.

This file would only be created if you are in the Darling virtual drive. If you are accessing the Linux file system through Darling or adding a file to a writable DMG, this file won't be created.

Different /dev layout (currently we symlink /dev from the host)
Host mounts appearing directly in /Volumes (see #220)

I am not sure how to properly implement this part since I don't have any experience with VFS.

With that being said, I was brainstorming some ideas for how the /Volumes directory would work. We could have a generic Volume class that can be extended to support SystemRoot, DMG mountpoints, and any other volumes.

If this is a good idea, we could do something similar for /dev, such as having a generic Disk class and so on.

Per-thread CWD (same)

Dumb question, but I am going to assume that this allows each process to keep track of its own CWD, right?

We do want to mount separate /proc and /dev/shm, but with no kernel-level-chroot, there's nowhere to mount them. Ooops.

From what I understand, /proc and /dev/shm do not exist on a real Mac. What would be the benefit of giving Mac applications access to them?

bugaevc · 2019-12-15T21:19:42Z

has there been any progress with adding back VFS?

The current plan is to do something like this with help from the kernel, dubbed vchroot. There was even a WIP branch.

I believe that we need to modify the syscall in src/kernel/emulation/linux/ to use the future recreation of VirtualPrefix, correct?

Yes

Do we also need to change anything in the dyld directory?

No, dyld should use the same libkernel stuff to access the file system as everything else uses.

I haven't looked into how Darling handles this, but I was thinking of having a hidden file in each directory (we could call it .__darling_owner) that holds information on who owns what.

I don't think we need to track the fake owners. I don't remember the exact details, but IIRC it's pretty clear what the fake owner needs to be, it's faking itself that is important.

Dumb question, but I am going to assume that this allows each process to keep track of its own CWD, right?

That's basically how per-thread CWD is implemented now, yes. With shared CWD the kernel keeps track of it (i.e. we change the CWD of the Linux process), and when a thread wants to switch to a per-thread CWD the userspace implementation takes over.
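A minimal sketch of that split, assuming a thread-local buffer holds the private CWD; the names thread_cwd, set_thread_cwd and apply_thread_cwd are invented for this example and are not Darling's actual symbols. Relative paths are rewritten in userspace only for threads that have switched to a private CWD; everything else is left for the kernel to resolve.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CWD_MAX 4096

/* Empty string means "no private CWD": the thread shares the process-wide one. */
static _Thread_local char thread_cwd[CWD_MAX];

/* Give the calling thread its own CWD (must be an absolute path). */
int set_thread_cwd(const char *dir)
{
    if (dir[0] != '/' || strlen(dir) >= sizeof thread_cwd)
        return -1;
    strcpy(thread_cwd, dir);
    return 0;
}

/*
 * Rewrite `path` so that the kernel resolves it as this thread expects.
 * Absolute paths and threads without a private CWD pass through unchanged.
 * Returns 0 on success, -1 if the result does not fit into `out`.
 */
int apply_thread_cwd(const char *path, char *out, size_t outlen)
{
    int n;

    assert(outlen > 0);
    if (path[0] == '/' || thread_cwd[0] == '\0')
        n = snprintf(out, outlen, "%s", path);
    else
        n = snprintf(out, outlen, "%s/%s", thread_cwd, path);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```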

From what I understand, /proc and /dev/shm do not exist on a real Mac. What would be the benefit of giving Mac applications access to them?

They do not exist, so it's unlikely that Darwin software would look for them, so it's not hurting compatibility to expose them. The reason we want them in our container is simple: /proc is Linux's API to access process info; Darwin has its own API for that which we implement on top of Linux's API, /proc, so that's why we need it there.
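As a small example of the kind of lookup this enables, the sketch below reads a process's short command name from /proc/<pid>/comm. The function name proc_comm is hypothetical and this is not Darling's actual code; it only shows why the emulation layer needs a working /proc to answer Darwin-side process queries.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the short command name of process `pid` from /proc/<pid>/comm.
 * Returns 0 and stores the name (without the trailing newline) in `name`,
 * or -1 if the file cannot be read, e.g. because /proc is not mounted.
 */
int proc_comm(pid_t pid, char *name, size_t len)
{
    char path[64];
    FILE *f;

    assert(len > 0);
    snprintf(path, sizeof path, "/proc/%ld/comm", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(name, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    name[strcspn(name, "\n")] = '\0';
    return 0;
}
```

Calling proc_comm(getpid(), buf, sizeof buf) fills buf with the caller's own command name when /proc is mounted, and fails cleanly when it is not.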

CuriousTommy · 2019-12-21T00:41:18Z

While the VFS stuff might be too advanced for me, I would love to at least try and help you or LubosD with this.

There was even a WIP branch.

Thanks for the link. I am going to assume that this code is outdated and will probably be removed eventually, right? Regardless, I am going to take a look at the first commit to see what LubosD changed.

I don't think we need to track the fake owners. I don't remember the exact details, but IIRC it's pretty clear what the fake owner needs to be, it's faking itself that is important.

I guess the majority of applications don't need to change ownership. Most of the files and folders on a real Mac are owned by the system account (with the main exception being the users that live in the /Users folder).

LubosD · 2020-01-15T15:24:27Z

The vchroot branch is based on a kernel-based "acceleration", instead of doing relatively slow userspace resolution.

At the moment, the kernel code is buggy (or simply wrong). It is based on some tricks with dentries, but this area of the Linux kernel isn't very well documented (just like the rest of it), so it works only in specific scenarios and needs a rework.

LubosD · 2020-01-27T22:07:58Z

I have resumed my work on this and hopefully already figured out some of the troubles - once again caused by the kernel not exporting some very useful functions... :-( Such as d_absolute_path().

Case Insensitivity

For now, darling.c could set the ext4-specific inode flag (EXT4_CASEFOLD_FL) to enable case insensitivity on ~/.darling.

As a long-term solution (if we choose to fork overlayfs, for instance), we could maybe reuse some of the insensitive lookup logic from sdcardfs.
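For the short-term option, a minimal sketch of setting the casefold flag from C is shown below. It uses the generic FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls, which is the same route chattr +F takes; the function name enable_casefold is made up for this example. As far as I know, the call only succeeds on an empty directory, on a filesystem created with the casefold feature, and on a reasonably recent kernel (ext4 gained this in Linux 5.2), so expect it to fail elsewhere.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef FS_CASEFOLD_FL
#define FS_CASEFOLD_FL 0x40000000 /* same bit as EXT4_CASEFOLD_FL; for older headers */
#endif

/*
 * Set the casefold inode flag on directory `dir`.
 * Returns 0 on success, -1 on failure with errno set by open() or ioctl().
 */
int enable_casefold(const char *dir)
{
    int flags = 0;
    int rc = -1;
    int saved_errno;
    int fd = open(dir, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
        flags |= FS_CASEFOLD_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0)
            rc = 0;
    }
    saved_errno = errno; /* keep the ioctl error across close() */
    close(fd);
    errno = saved_errno;
    return rc;
}
```

Since the flag has to be set while the directory is still empty, this would need to happen when the prefix is first created rather than on an existing ~/.darling.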

LubosD · 2020-02-05T12:05:32Z

Small change of plans: I keep finding scenarios where my kernel-based vchroot just doesn't work. I'm starting to believe that what we need cannot be implemented by making a few simple Linux kernel API calls (or I'm just doing something wrong).

Either way, to speed things up, I'll now implement a user-space based vchroot, with a possible later upgrade to a kernel implementation. Because we really need this issue to be resolved - the sooner the better!

LubosD · 2020-02-08T14:42:31Z

The branch now works to the point that darling shell (including launchd) seems to run OK.

For the first time in god knows how long, I can just start HelloWorld.app without any additional hassle. But it hangs after the first click - does it ring any bells, @bugaevc?

bugaevc · 2020-02-08T15:00:52Z

For the first time in god knows how long, I can just start HelloWorld.app without any additional hassle.

🎉 🎉

But it hangs after the first click - does it ring any bells, @bugaevc?

#368 :)

bugaevc · 2020-02-08T15:06:57Z

Please test various stuff with readlink, pwd, realpath and stat around real mount points and /Volumes/SystemRoot — this is what the old VirtualPrefix has had issues with.
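One way to automate part of such testing is to check that a path and its realpath() refer to the same file, by comparing device and inode numbers. This is only a sketch; the function name realpath_agrees is invented here, and it covers realpath and stat but not readlink or pwd.

```c
#define _XOPEN_SOURCE 700 /* for realpath() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>

/*
 * Check that `path` and its realpath() name the same file,
 * by comparing device and inode numbers from stat().
 * Returns 0 if they agree, 1 if they differ, -1 if a call fails.
 */
int realpath_agrees(const char *path)
{
    struct stat direct, via_realpath;
    char *resolved = realpath(path, NULL); /* buffer is malloc'd by libc */
    int rc;

    if (!resolved)
        return -1;
    if (stat(path, &direct) != 0 || stat(resolved, &via_realpath) != 0)
        rc = -1;
    else
        rc = (direct.st_dev == via_realpath.st_dev &&
              direct.st_ino == via_realpath.st_ino) ? 0 : 1;
    free(resolved);
    return rc;
}
```

Running this inside the container on paths such as "..", "/Volumes/SystemRoot" and real mount points should return 0 everywhere; a result of 1 or -1 would point at an inconsistency in the path translation.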

LubosD · 2020-02-08T15:33:55Z

#368 :)

Hmmmmm, but the test case I wrote back then now works (the vchroot branch uses the new XNU).

I did some quick grepping, and I can't find CFRunLoopAddCommonMode(CFRunLoopGetMain(), NSEventTrackingRunLoopMode) being called in Cocotron. Based on my understanding, this is required for common mode sources to work under that event tracking mode.

What do you think?

bugaevc · 2020-02-08T15:49:02Z

Hmmmmm (2), I see I've changed how this works relatively recently: darlinghq/darling-cocotron@56ba4b6 The change itself makes sense, and I remember wanting to do it.

So I guess we now need to add NSEventTrackingRunLoopMode to common modes (and others defined by AppKit? can you check which ones Apple does add?)

TheBrokenRail · 2020-02-08T18:04:06Z

As of the moment you can't build the vchroot branch because src/CMakeLists.txt includes libelfloader/fakechroot which doesn't exist. Also, the Debian packaging postinst, install, and prerm still try to run the scripts in src which no longer exist.

TheBrokenRail · 2020-02-08T18:08:43Z

Also, where is the source for HelloWorld.app?

LubosD · 2020-02-08T18:13:09Z

@TheBrokenRail I fixed the fakechroot stuff.
I don't have the source handy, but here's the binary of HelloWorld.app.
HelloWorld.app.zip

LubosD · 2020-02-08T18:44:01Z

So I guess we now need to add NSEventTrackingRunLoopMode to common modes (and others defined by AppKit? can you check which ones Apple does add?)

They add NSModalPanelRunLoopMode and NSEventTrackingRunLoopMode, nothing more. Tried it out, and it works!

Making a commit now.

TheBrokenRail · 2020-02-08T21:51:26Z

When building the LKM I got:

/var/lib/dkms/darling-mach/0.1/build/lkm/duct/osfmk/dummy-kern-task.c:139:10: fatal error: mach/security_server.h: No such file or directory
  139 | #include <mach/security_server.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~

LubosD · 2020-02-08T21:54:20Z

@TheBrokenRail Strange, the file gets generated during build as build/src/lkm/osfmk/mach/security_server.h.

Try cleaning your build directory. I've seen that the LKM build sometimes forgets to generate new files.

LubosD · 2020-02-08T21:55:26Z

TODO: 32-bit binaries are broken in the branch.

TheBrokenRail · 2020-02-08T22:04:22Z

It seems security_server.h was removed from the LKM in xnu-upgrade. https://github.com/darlinghq/darling-newlkm/blob/xnu-upgrade/osfmk/mach/Makefile

LubosD · 2020-02-08T22:08:37Z

@TheBrokenRail Looks like it's me who should clean the build tree. Fixed. As well as the 32-bit binaries.

TheBrokenRail · 2020-02-09T00:48:07Z

The LKM build is now failing with:

make[2]: *** No rule to make target '/var/lib/dkms/darling-mach/0.1/build/lkm/../miggen/osfmk/mach/memory_object_name_server.o', needed by '/var/lib/dkms/darling-mach/0.1/build/lkm/darling-mach.o'.  Stop.

TheBrokenRail · 2020-02-09T00:57:06Z

I created a PR to fix it: darlinghq/darling-newlkm#10.

TheBrokenRail · 2020-02-09T01:10:30Z

When I tried to run HelloWorld.app, it just gave me:

Darling [~/Documents]$ HelloWorld.app/Contents/MacOS/HelloWorld 
Segmentation fault: 11 (core dumped)

TheBrokenRail · 2020-02-09T03:31:13Z

.. also seems to not work:

Darling [/Users]$ ls ..
ls: ..: No such file or directory
Darling [/Users]$ cd /Volumes/SystemRoot/
Darling [/Volumes/SystemRoot]$ ls ..
ls: ..: No such file or directory

and running strace on the previous error shows:

execve("/home/<USERNAME>/Documents/HelloWorld.app/Contents/MacOS/HelloWorld", ["HelloWorld.app/Contents/MacOS/He"...], 0x7fd2864071f0 /* 59 vars */) = -1 ENOENT (No such file or directory)

despite the file existing:

$ file /home/<USERNAME>/Documents/HelloWorld.app/Contents/MacOS/HelloWorld
/home/<USERNAME>/Documents/HelloWorld.app/Contents/MacOS/HelloWorld: Mach-O 64-bit x86_64 executable, flags:<NOUNDEFS|DYLDLINK|TWOLEVEL|PIE>

LubosD · 2020-02-09T10:16:26Z

I don't know about strace, but I hopefully fixed the .. bugs.

TheBrokenRail · 2020-02-09T14:45:40Z

It seems Darling is unable to launch anything when being strace'd.

TheBrokenRail · 2020-02-09T14:56:39Z

Core dump for HelloWorld.app (no strace):
core.HelloWorld.zip
Also, whenever I start Darling, audio stops working and I can no longer start new windows of gnome-terminal from the GNOME dock. The volume keys also no longer show the volume dialog. When starting PulseAudio manually it gives this:

W: [pulseaudio] server-lookup.c: Unable to contact D-Bus: org.freedesktop.DBus.Error.FileNotFound: Failed to connect to socket /run/user/1000/bus: No such file or directory
W: [pulseaudio] main.c: Unable to contact D-Bus: org.freedesktop.DBus.Error.FileNotFound: Failed to connect to socket /run/user/1000/bus: No such file or directory

LubosD · 2020-02-09T17:29:42Z

I can confirm that something is wiping /run/user/XXXX clean. I'll look into it, because it also breaks my XFCE4 testing environment.

I think strace doesn't work because it interferes with the parent/child relationship as seen by the LKM and this relationship is required for passing down the vchroot information.

I say we should provide a guaranteed working LLDB build for debugging purposes.

bugaevc · 2020-02-09T17:33:27Z

I think strace doesn't work because it interferes with the parent/child relationship as seen by the LKM and this relationship is required for passing down the vchroot information.

If you're running strace -p from outside of the container, there should be no interference. If you're running it from the inside (doesn't it choke on Mach-Os anymore?), then that could be a problem.

TheBrokenRail · 2020-02-09T17:41:01Z

I am using:

sudo strace -fp <PID>

LubosD · 2020-02-09T20:53:23Z

Also, whenever I start Darling, audio stops working and I can no longer start new windows of gnome-terminal from the GNOME dock. The volume keys also no longer show the volume dialog. When starting PulseAudio manually it gives this:

Fixed.

ahyattdev · 2020-02-09T21:05:19Z

@LubosD I agree that including a working LLDB would be very useful!

bugaevc · 2020-02-10T09:17:12Z

I guess we can close this, #600, and #415?

@TheBrokenRail please open separate GitHub issues if there still are issues with this.

TheBrokenRail · 2020-02-10T12:04:27Z

I have opened #652 and #653.

bugaevc added the Container label ("The emulation container is configured incorrectly") on Feb 21, 2018

bugaevc mentioned this issue on Aug 21, 2018: Fakechroot ELF code #415 (Closed)

ahyattdev mentioned this issue on Jun 2, 2019: SteamCMD fails to execute (dyld: Symbol not found: _kCFProxyAutoConfigurationURLKey) #507 (Closed)

ahyattdev added the Discussion label ("Relating to Darling Project strategy") on Jan 4, 2020

CuriousTommy mentioned this issue on Jan 4, 2020: AppKit fails to load Mesa drivers #600 (Closed)

TheBrokenRail mentioned this issue on Jan 14, 2020: [Suggestion] GSOC 2020 #584 (Closed)

TheBrokenRail mentioned this issue on Feb 10, 2020: Cannot run basic GUI programs #653 (Closed)

bugaevc closed this as completed on Feb 14, 2020
