RFC: VFS #377

Closed
bugaevc opened this issue Feb 21, 2018 · 35 comments
Labels
Container The emulation container is configured incorrectly Discussion Relating to Darling Project strategy

Comments

@bugaevc
Member

bugaevc commented Feb 21, 2018

Summary

Bring back VirtualPrefix and add more stuff on top of it, getting rid of mount namespaces and overlayfs.

(sounds like a regression, doesn't it?)

Background

(I'm trying to both make a proposal and document what goes on, hence this section)

What is this all about?

Linux and macOS filesystem layouts, while similar, differ significantly enough that we can't present the host filesystem to programs running under Darling, and when they do agree (e.g. both put executables under /bin & /usr/bin) we want programs to see our versions of those directories.

No matter what exact mechanism we use for that, we deal with the above by having our own macOS-like "chroot" in libexec/darling/, on top of which we overlay so-called "prefixes" aka dprefixes (more about them in the wiki).

What is VirtualPrefix?

Darling used to implement chroot emulation for macOS executables by making use of the fact that we ship a complete libSystem (libc + other stuff) library instead of linking to the one used by the host. While we try not to modify most of libSystem (compared to what Apple ships/publishes), we do change libkernel/libsyscall to bridge between Darwin and Linux syscalls, and that gives us a chance to transform paths passed between libSystem and the Linux kernel.
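To illustrate the kind of path transformation described above, here is a minimal Python sketch (not Darling's actual code, which lives in the C syscall emulation layer). The prefix location and the function names are hypothetical, and it assumes that the host root is exposed to guest programs under /Volumes/SystemRoot. Paths are resolved purely lexically and symlinks are ignored, which is exactly where a real implementation has to be much more careful.

```python
import posixpath

# Hypothetical locations, chosen only for illustration.
PREFIX = "/home/user/.darling"
HOST_ROOT_MOUNT = "/Volumes/SystemRoot"


def _normalize(path: str) -> str:
    # Collapse ".", ".." and duplicate slashes lexically. Leading ".."
    # components of an absolute path are dropped, so "/../etc" becomes "/etc".
    return posixpath.normpath("/" + path.lstrip("/"))


def to_host_path(guest_path: str) -> str:
    """Translate an absolute guest (macOS-side) path to a host (Linux-side) path."""
    norm = _normalize(guest_path)
    if norm == HOST_ROOT_MOUNT:
        return "/"
    if norm.startswith(HOST_ROOT_MOUNT + "/"):
        return norm[len(HOST_ROOT_MOUNT):]
    return PREFIX if norm == "/" else PREFIX + norm


def to_guest_path(host_path: str) -> str:
    """Translate an absolute host path back to what the guest should see."""
    norm = _normalize(host_path)
    if norm == PREFIX:
        return "/"
    if norm.startswith(PREFIX + "/"):
        return norm[len(PREFIX):]
    return HOST_ROOT_MOUNT if norm == "/" else HOST_ROOT_MOUNT + norm
```

Paths going into the kernel (open, stat, execve) would pass through something like to_host_path, and paths coming back (getcwd, readlink) through something like to_guest_path. Code that calls the native libc directly skips both and keeps seeing the host layout.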

The overlaying was achieved by just copying libexec/darling into each prefix each time anything changed (this is also what Wine does).

Why was VirtualPrefix removed?

I've found a few bugs in how VirtualPrefix worked but decided that it'd be easier to do away with it altogether than to debug and fix those bugs. We were just moving to a mount namespace + overlayfs mechanism for overlaying prefix contents over libexec/darling, and chrooting into the resulting directory was a very natural and simple extension of that idea.

To quote myself from #197,

If we're using mount namespaces, do we really need the hackish fakechroot/VirtualPrefix implementation in syscall emulation? The kernel is much better at doing this

It seemed the only "small problem" with this new layout was that ld-linux, Linux's dynamic loader, was unable to find native ELF libraries when invoked from inside the container. We worked around that by setting up a copy of /etc/ld.so.conf and /etc/ld.so.cache at installation time.

See #222 for the merge request that removed VirtualPrefix.

Was it a good idea to remove VirtualPrefix?

I'm not so sure anymore. Using Linux's native mount namespaces and chroot/pivot_root implementation is indeed a much cleaner solution than reimplementing all that logic ourselves, but it has caused us a lot more headaches since then.

Besides ld.so, other things that we had or would have to patch or work around in some way, because native files are not where they are expected to be, are:

  • the X11 and Wayland sockets (we would have to symlink them);
  • fontconfig config files and the fonts themselves (which we hacked around by again making a modified copy of the config at installation time);
  • Mesa "drivers" (relatively cleanly fixed by symlinking /usr/lib64/ to that of the host);
  • GTK+ and Qt themes (and icons, and cursors, and the rest of /usr/share; this could be fixed by modifying XDG_DATA_DIRS);
  • native open/save file dialogs displaying prefixes instead of the host's filesystem layout (my latest idea was to ask the DE to open an appropriate dialog over D-Bus using the org.freedesktop.portal API);
  • and probably many more that we haven't thought about or stumbled upon yet.

As you can see, these are spread throughout the stack and are all worked around differently — and yet, incidentally or not, they would all go away if only the native libc saw the host's filesystem layout (which was the case with VirtualPrefix).

Do all those justify bringing VirtualPrefix back? I'm not sure of that either. VirtualPrefix still is an ugly & buggy hack and native solutions are still nicer.

But wait,

There's more to the story

There are other filesystem-related things we have/want to tackle in some way:

  • Case (in)sensitivity (I believe VirtualPrefix handled that)
  • Different /dev layout (currently we symlink /dev from the host)
  • Host mounts appearing directly in /Volumes (see New /Volumes design #220)
  • File ownership (which we currently fake at syscall level)
  • Per-thread CWD (same)

And if you think about it, mount namespaces and overlayfs are also parts of the filesystem story, which brings us to the

Proposal

(well, maybe not a proposal, but an idea)

Let's revive [the idea of] VirtualPrefix, rewrite it from scratch to be fully correct and turn it into a more complete Virtual File System (VFS) implementation that would handle all of the above, including case-insensitivity, faking ownership, tracking CWD, tweaking directory layout, mounting sub-/Volumes the way we want them and overlaying the prefix on top of libexec/darling.
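As one example of what such a VFS layer would have to do, here is a minimal Python sketch of case-insensitive lookup done in userspace. The function name is hypothetical and this is not taken from the old VirtualPrefix code. It walks the path one component at a time, prefers an exact match, and otherwise scans the directory for a name that matches when case is ignored.

```python
import os


def resolve_case_insensitive(root, guest_path):
    """Return the on-disk path under `root` matching `guest_path` with case
    ignored, or None if some component does not exist."""
    current = root
    for component in guest_path.split("/"):
        if not component or component == ".":
            continue
        if component == "..":
            # Parent traversal needs root-escape checks; left out of this sketch.
            raise ValueError("'..' is not supported in this sketch")
        exact = os.path.join(current, component)
        if os.path.lexists(exact):
            # Fast path: the name exists with exactly this spelling.
            current = exact
            continue
        try:
            entries = os.listdir(current)
        except OSError:
            # Missing directory, not a directory, or unreadable.
            return None
        wanted = component.casefold()
        # Sort so that the choice is deterministic if several names collide.
        match = next((e for e in sorted(entries) if e.casefold() == wanted), None)
        if match is None:
            return None
        current = os.path.join(current, match)
    return current
```

The slow path costs one directory scan per mismatched component, which is the performance concern with doing this outside the kernel. It also leaves open what to do when two names that differ only in case already exist on the host.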

Now that sounds like a much cleaner solution than what we have today.

Why reimplement overlayfs functionality?

...we don't have lots of hacks because of overlayfs, do we?

Not as many as we have because of other things, no, but there are a few problems with it. Firstly, modifying underlying filesystems while overlayfs is mounted is undefined behavior (meaning we can't update our libexec files while a container is running, and putting .init.pid inside the prefix directory is/was UB too). Secondly, it doesn't support encrypted home folders (see #242) nor some other interesting filesystems. Last but not least, we can't make it support case-insensitivity without basically reimplementing it.
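For a sense of what reimplementing the overlay part in userspace involves, here is a minimal Python sketch of the read side only. The function names are hypothetical; "upper" stands for the prefix and "lower" for libexec/darling. It leaves out copy-up on write and any way to record deletions, both of which a real implementation would need.

```python
import os


def overlay_lookup(upper, lower, rel_path):
    """Return the path of `rel_path` in the first layer that has it, or None."""
    rel = rel_path.lstrip("/")
    for layer in (upper, lower):
        candidate = os.path.join(layer, rel)
        if os.path.lexists(candidate):
            return candidate
    return None


def overlay_listdir(upper, lower, rel_path):
    """List a directory as the sorted union of the entries in both layers."""
    rel = rel_path.lstrip("/")
    names = set()
    for layer in (upper, lower):
        try:
            names.update(os.listdir(os.path.join(layer, rel)))
        except (FileNotFoundError, NotADirectoryError):
            # This layer has no such directory; the other one may.
            pass
    return sorted(names)
```

Because each lookup consults the layers at call time, updating files in the lower layer while programs are running does not have the undefined-behavior problem mentioned above, at the price of the atomicity concerns raised under "Unresolved issues".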

Reimplementing overlayfs would mean that we would also no longer need mount namespaces.

Unresolved issues

  • We do want to mount separate /proc and /dev/shm, but with no kernel-level-chroot, there's nowhere to mount them. Ooops.
    • We could still unshare the mount namespace and mount proc to e.g. /proc2 (or $DPREFIX/proc). I don't quite understand how /dev/shm works even now with /dev being a symlink.
  • That would make our filesystem operations not atomic.
    • Do we really care about that?

Alternatives

  • Do nothing; what we have now works well and we have a pretty good understanding of how to work around the filesystem layout issues we foresee. If it ain't broke, don't fix it.

(Note: RFC stands for 'Request For Comments')

@CuriousTommy
Contributor

Case (in)sensitivity (I believe VirtualPrefix handled that)

On a side note, it seems that some filesystems now provide the ability to be case-insensitive (ext4 and F2FS).

@CuriousTommy
Contributor

@bugaevc I hope you don't mind me asking, but has there been any progress with adding back VFS? If not, I would like to help get the ball rolling (so to speak).

Skimming through #222, I believe that we need to modify the syscall in src/kernel/emulation/linux/ to use the future recreation of VirtualPrefix, correct? Do we also need to change anything in the dyld directory?


Now, about those features you want.

Case (in)sensitivity (I believe VirtualPrefix handled that)

Like I said in my earlier post, some file systems already offer this feature. If we need to, we can implement it, but I would rather have the OS take care of this (mainly for performance reasons).

We also need to make sure there is a way to disable this. iOS is case-sensitive by default.

File ownership (which we currently fake at syscall level)

I haven't looked into how Darling handles this, but I was thinking of having a hidden file in each directory (we could call it .__darling_owner) that holds information on who owns what.

This file would only be created if you are in the Darling virtual drive. If you are accessing the Linux file system through Darling or adding a file to a writable DMG, this file won't be created.

Different /dev layout (currently we symlink /dev from the host)
Host mounts appearing directly in /Volumes (see #220)

I am not sure how to properly implement this part since I don't have any experience with VFS.

With that being said, I was brainstorming some ideas for how the /Volumes directory would work. We could have a generic Volume class that can be extended to support SystemRoot, DMG mountpoints, and any other volumes.

If this is a good idea, we could do something similar for /dev, such as having a generic Disk class and so on.

Per-thread CWD (same)

Dumb question, but I am going to assume that this allows each process to keep track of its own CWD, right?

We do want to mount separate /proc and /dev/shm, but with no kernel-level-chroot, there's nowhere to mount them. Ooops.

From what I understand, /proc and /dev/shm do not exist on a real Mac. What would be the benefit of giving Mac applications access to them?

@bugaevc
Member Author

bugaevc commented Dec 15, 2019

has there been any progress with adding back VFS?

The current plan is to do something like this with help from the kernel, dubbed vchroot. There was even a WIP branch.

I believe that we need to modify the syscall in src/kernel/emulation/linux/ to use the future recreation of VirtualPrefix, correct?

Yes

Do we also need to change anything in the dyld directory?

No, dyld should use the same libkernel stuff to access the file system as everything else uses.

I haven't looked into how Darling handles this, but I was thinking of having a hidden file in each directory (we could call it .__darling_owner) that holds information on who owns what.

I don't think we need to track the fake owners. I don't remember the exact details, but IIRC it's pretty clear what the fake owner needs to be, it's faking itself that is important.

Dumb question, but I am going to assume that this allows each process to keep track of its own CWD, right?

That's basically how per-thread CWD is implemented now, yes. With a shared CWD the kernel keeps track of it (i.e. we change the CWD of the Linux process), and when a thread wants to switch to a per-thread CWD the userspace implementation takes over.
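The split described here can be sketched in Python as follows. This is a minimal illustration with hypothetical names, not Darling's implementation. A thread that has no private CWD hands relative paths to the kernel untouched, so they resolve against the process-wide CWD; a thread that has opted in gets its relative paths rewritten in userspace first.

```python
import posixpath
import threading

# Each thread sees its own attributes on this object.
_thread_state = threading.local()


def set_thread_cwd(path):
    """Give the calling thread a private working directory."""
    if not posixpath.isabs(path):
        raise ValueError("per-thread CWD must be an absolute path")
    _thread_state.cwd = posixpath.normpath(path)


def clear_thread_cwd():
    """Return the calling thread to the shared, kernel-tracked CWD."""
    _thread_state.cwd = None


def effective_path(path):
    """Return the path that should actually be passed to the kernel."""
    if posixpath.isabs(path):
        return path
    base = getattr(_thread_state, "cwd", None)
    if base is None:
        # Shared CWD: the kernel resolves the relative path itself.
        return path
    return posixpath.join(base, path)
```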

From what I understand, /proc and /dev/shm do not exist on a real Mac. What would be the benefit of giving Mac applications access to them?

They do not exist, so it's unlikely that Darwin software would look for them, so exposing them doesn't hurt compatibility. The reason we want them in our container is simple: /proc is Linux's API to access process info; Darwin has its own API for that, which we implement on top of /proc, so that's why we need it there.
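As an illustration of building a process-info call on top of /proc, here is a minimal Python sketch that extracts a few fields from /proc/<pid>/stat. The function names are hypothetical and the sketch does not correspond to any particular Darwin API. The main subtlety is that the command name is wrapped in parentheses and may itself contain spaces or parentheses, so the line has to be split at the last closing parenthesis.

```python
def parse_proc_stat(text):
    """Parse the leading fields of a /proc/<pid>/stat line."""
    # Everything after the LAST ')' is a plain space-separated list.
    left, _, right = text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    fields = right.split()
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": fields[0],
        "ppid": int(fields[1]),
    }


def read_process_info(pid):
    """Read and parse /proc/<pid>/stat; only works where /proc is mounted."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_proc_stat(stat_file.read())
```

A call such as read_process_info(os.getpid()) only works if a /proc belonging to the right PID namespace is visible at that path, which is why the container needs its own /proc mounted somewhere.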

@CuriousTommy
Contributor

While the VFS stuff might be too advanced for me, I would love to at least try to help you or LubosD with this.

There was even a WIP branch.

Thanks for the link. I assume that this code is outdated and will probably be removed eventually, right? Regardless, I am going to take a look at the first commit to see what LubosD changed.

I don't think we need to track the fake owners. I don't remember the exact details, but IIRC it's pretty clear what the fake owner needs to be, it's faking itself that is important.

I guess the majority of applications don't need to change ownership. Most of the files and folders on a real Mac are owned by the system account (the main exception being the users that live in the /Users folder).

@ahyattdev ahyattdev added the Discussion Relating to Darling Project strategy label Jan 4, 2020
@LubosD
Member

LubosD commented Jan 15, 2020

The vchroot branch is based on a kernel-based "acceleration", instead of doing relatively slow userspace resolution.

At the moment, the kernel code is buggy (or simply wrong). It is based on some tricks with dentries, but this area of the Linux kernel isn't very well documented (just like the rest of it), so it works only in specific scenarios and needs a rework.

@LubosD
Member

LubosD commented Jan 27, 2020

I have resumed my work on this and hopefully already figured out some of the troubles - once again caused by the kernel not exporting some very useful functions... :-( Such as d_absolute_path().

Case Insensitivity

For now, darling.c could set the ext4-specific inode flag (EXT4_CASEFOLD_FL) to enable case insensitivity on ~/.darling.
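A minimal Python sketch of setting that flag follows; it is not what darling.c does, and the function names are hypothetical. EXT4_CASEFOLD_FL is the inode flag that chattr +F sets, and it is changed through the FS_IOC_GETFLAGS / FS_IOC_SETFLAGS ioctls. The numeric values below are the x86-64 ones from <linux/fs.h>. As far as I know, the kernel only accepts the flag on an empty directory, on a filesystem that was created with the casefold feature enabled, so the call can legitimately fail with an OSError.

```python
import array
import fcntl
import os

# Constants from <linux/fs.h>; the ioctl numbers are for x86-64.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_CASEFOLD_FL = 0x40000000  # same bit as EXT4_CASEFOLD_FL


def with_casefold(flags):
    """Return the inode flags with the casefold bit added."""
    return flags | FS_CASEFOLD_FL


def enable_casefold(directory):
    """Try to mark `directory` as case-insensitive; raises OSError on failure."""
    fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # The kernel reads and writes a 32-bit flags word through this buffer.
        buf = array.array("I", [0])
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf, True)
        buf[0] = with_casefold(buf[0])
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, buf, True)
    finally:
        os.close(fd)
```

Since files created before the flag is set are not covered, this would have to happen when the prefix directory is first created, before anything is copied into it.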

As a long-term solution (if we choose to fork overlayfs, for instance), we could maybe reuse some of the insensitive lookup logic from sdcardfs.

@LubosD
Member

LubosD commented Feb 5, 2020

Small change of plans: I keep finding scenarios where my kernel-based vchroot just doesn't work. I'm starting to believe that what we need cannot be implemented by making a few simple Linux kernel API calls (or I'm just doing something wrong).

Either way, to speed things up, I'll now implement a user-space based vchroot, with a possible later upgrade to a kernel implementation. Because we really need this issue to be resolved - the sooner the better!

@LubosD
Member

LubosD commented Feb 8, 2020

The branch now works to the point that darling shell (including launchd) seems to run OK.

For the first time in god knows how long, I can just start HelloWorld.app without any additional hassle. But it hangs after the first click - does it ring any bells, @bugaevc?

@bugaevc
Member Author

bugaevc commented Feb 8, 2020

For the first time in god knows how long, I can just start HelloWorld.app without any additional hassle.

🎉 🎉

But it hangs after the first click - does it ring any bells, @bugaevc?

#368 :)

@bugaevc
Member Author

bugaevc commented Feb 8, 2020

Please test various stuff with readlink, pwd, realpath and stat around real mount points and /Volumes/SystemRoot — this is what the old VirtualPrefix has had issues with.
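One way to automate part of such testing is a small consistency checker like the Python sketch below. The function name and the particular checks are only a suggestion, not an existing Darling test. The idea is that, whatever path translation happens underneath, a path, its realpath, its symlink target and the CWD reported after chdir-ing into it must all refer to the same file.

```python
import os


def check_path_consistency(path):
    """Return a list of human-readable inconsistencies found for `path`."""
    problems = []
    original = os.stat(path)

    # realpath() must name the same file that stat() sees through the path.
    real = os.path.realpath(path)
    if not os.path.samestat(original, os.stat(real)):
        problems.append(f"realpath mismatch: {path} -> {real}")

    # A symlink's target, resolved relative to the link's directory, must
    # lead to the same file as following the link.
    if os.path.islink(path):
        target = os.path.join(os.path.dirname(path), os.readlink(path))
        if not os.path.samestat(original, os.stat(target)):
            problems.append(f"readlink mismatch: {path} -> {target}")

    # After chdir, the reported CWD must be the directory we are actually in.
    if os.path.isdir(path):
        saved = os.getcwd()
        try:
            os.chdir(path)
            if not os.path.samestat(os.stat(os.getcwd()), os.stat(".")):
                problems.append(f"getcwd mismatch inside {path}")
        finally:
            os.chdir(saved)
    return problems
```

Running it over real mount points and /Volumes/SystemRoot from inside the container, and expecting an empty list every time, would cover the readlink, pwd, realpath and stat cases mentioned above.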

@LubosD
Member

LubosD commented Feb 8, 2020

#368 :)

Hmmmmm, but the test case I wrote back then now works (the vchroot branch uses the new XNU).

I did some quick grepping, and I can't find CFRunLoopAddCommonMode(CFRunLoopGetMain(), NSEventTrackingRunLoopMode) being called in Cocotron. Based on my understanding, this is required for common mode sources to work under that event tracking mode.

What do you think?

@bugaevc
Member Author

bugaevc commented Feb 8, 2020

Hmmmmm (2), I see I've changed how this works relatively recently: darlinghq/darling-cocotron@56ba4b6 The change itself makes sense, and I remember wanting to do it.

So I guess we now need to add NSEventTrackingRunLoopMode to common modes (and others defined by AppKit? can you check which ones Apple does add?)

@TheBrokenRail
Contributor

TheBrokenRail commented Feb 8, 2020

At the moment you can't build the vchroot branch because src/CMakeLists.txt includes libelfloader/fakechroot, which doesn't exist. Also, the Debian packaging postinst, install, and prerm still try to run the scripts in src, which no longer exist.

@TheBrokenRail
Contributor

Also, where is the source for HelloWorld.app?

@LubosD
Member

LubosD commented Feb 8, 2020

@TheBrokenRail I fixed the fakechroot stuff.
I don't have the source handy, but here's the binary of HelloWorld.app.
HelloWorld.app.zip

@LubosD
Member

LubosD commented Feb 8, 2020

So I guess we now need to add NSEventTrackingRunLoopMode to common modes (and others defined by AppKit? can you check which ones Apple does add?)

They add NSEventTrackingRunLoopMode and NSEventTrackingRunLoopMode, not more. Tried it out, and it works!

Making a commit now.

@TheBrokenRail
Contributor

When building the LKM I got:

/var/lib/dkms/darling-mach/0.1/build/lkm/duct/osfmk/dummy-kern-task.c:139:10: fatal error: mach/security_server.h: No such file or directory
  139 | #include <mach/security_server.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~

@LubosD
Member

LubosD commented Feb 8, 2020

@TheBrokenRail Strange, the file gets generated during build as build/src/lkm/osfmk/mach/security_server.h.

Try cleaning your build directory. I've seen that the LKM build sometimes forgets to generate new files.

@LubosD
Member

LubosD commented Feb 8, 2020

TODO: 32-bit binaries are broken in the branch.

@TheBrokenRail
Contributor

TheBrokenRail commented Feb 8, 2020

It seems security_server.h was removed from the LKM in xnu-upgrade. https://github.com/darlinghq/darling-newlkm/blob/xnu-upgrade/osfmk/mach/Makefile

@LubosD
Member

LubosD commented Feb 8, 2020

@TheBrokenRail Looks like it's me who should clean the build tree. Fixed. As well as the 32-bit binaries.

@TheBrokenRail
Contributor

The LKM build is now failing with:

make[2]: *** No rule to make target '/var/lib/dkms/darling-mach/0.1/build/lkm/../miggen/osfmk/mach/memory_object_name_server.o', needed by '/var/lib/dkms/darling-mach/0.1/build/lkm/darling-mach.o'.  Stop.

@TheBrokenRail
Contributor

I created a PR to fix it: darlinghq/darling-newlkm#10.

@TheBrokenRail
Contributor

When I tried to run HelloWorld.app, it just gave me:

Darling [~/Documents]$ HelloWorld.app/Contents/MacOS/HelloWorld 
Segmentation fault: 11 (core dumped)

@TheBrokenRail
Contributor

.. also seems not to work:

Darling [/Users]$ ls ..
ls: ..: No such file or directory
Darling [/Users]$ cd /Volumes/SystemRoot/
Darling [/Volumes/SystemRoot]$ ls ..
ls: ..: No such file or directory

and running strace on the previous error shows:

execve("/home/<USERNAME>/Documents/HelloWorld.app/Contents/MacOS/HelloWorld", ["HelloWorld.app/Contents/MacOS/He"...], 0x7fd2864071f0 /* 59 vars */) = -1 ENOENT (No such file or directory)

despite the file existing:

$ file /home/<USERNAME>/Documents/HelloWorld.app/Contents/MacOS/HelloWorld
/home/<USERNAME>/Documents/HelloWorld.app/Contents/MacOS/HelloWorld: Mach-O 64-bit x86_64 executable, flags:<NOUNDEFS|DYLDLINK|TWOLEVEL|PIE>

@LubosD
Member

LubosD commented Feb 9, 2020

I don't know about strace, but I hopefully fixed the .. bugs.

@TheBrokenRail
Contributor

It seems Darling is unable to launch anything while being straced.

@TheBrokenRail
Contributor

Core dump for HelloWorld.app (no strace):
core.HelloWorld.zip
Also, whenever I start Darling, audio stops working and I can no longer start new windows of gnome-terminal from the GNOME dock. The volume keys also no longer show the volume dialog. When starting PulseAudio manually, it gives this:

W: [pulseaudio] server-lookup.c: Unable to contact D-Bus: org.freedesktop.DBus.Error.FileNotFound: Failed to connect to socket /run/user/1000/bus: No such file or directory
W: [pulseaudio] main.c: Unable to contact D-Bus: org.freedesktop.DBus.Error.FileNotFound: Failed to connect to socket /run/user/1000/bus: No such file or directory

@LubosD
Member

LubosD commented Feb 9, 2020

I can confirm that something is wiping /run/user/XXXX clean. I'll look into it, because it also breaks my XFCE4 testing environment.

I think strace doesn't work because it interferes with the parent/child relationship as seen by the LKM and this relationship is required for passing down the vchroot information.

I say we should provide a guaranteed working LLDB build for debugging purposes.

@bugaevc
Member Author

bugaevc commented Feb 9, 2020

I think strace doesn't work because it interferes with the parent/child relationship as seen by the LKM and this relationship is required for passing down the vchroot information.

If you're running strace -p from outside of the container, there should be no interference. If you're running it from the inside (doesn't it choke on Mach-Os anymore?), then that could be a problem.

@TheBrokenRail
Contributor

I am using:

sudo strace -fp <PID>

@LubosD
Member

LubosD commented Feb 9, 2020

Also, whenever I start Darling, audio stops working and I can no longer start new windows of gnome-terminal from the GNOME dock. The volume keys also no longer show the volume dialog. When starting PulseAudio manually, it gives this:

Fixed.

@ahyattdev
Member

@LubosD I agree that including a working LLDB would be very useful!

@bugaevc
Member Author

bugaevc commented Feb 10, 2020

I guess we can close this, #600, and #415?

@TheBrokenRail please open separate GitHub issues if there still are issues with this.

@TheBrokenRail
Contributor

TheBrokenRail commented Feb 10, 2020

I have opened #652 and #653.

@bugaevc bugaevc closed this as completed Feb 14, 2020
5 participants