
fix: Add workarounds to prevent filesystem corruption on Apple Virtualization VMs. #5919

Merged: 6 commits into utmapp:main, May 24, 2024

Conversation

@gnattu (Contributor) commented Nov 23, 2023

This is a workaround for #4840. Apple introduced a new VZNVMExpressControllerDeviceConfiguration in macOS 14 Sonoma's Virtualization framework, which emulates an NVMe disk for Linux guests. In my initial testing this emulated disk is slower than the virtio disk, but it does not exhibit the filesystem corruption present on the default virtio interface with most Linux kernels. macOS guests cannot use this device configuration, but they do not experience the filesystem corruption in the first place. On macOS 12+ hosts that cannot use VZNVMExpressControllerDeviceConfiguration, the cache mode of virtio drives in Linux VMs is changed to .cached, which is verified to also fix the filesystem corruption: #4840 (comment)

This PR:

  • Adds a UI option letting users choose whether a disk should use the NVMe interface. It only affects Linux VMs; on macOS VMs this setting does nothing.
  • Uses NVMe as the default for new Linux VMs on macOS 14+.
  • Adds localization for Chinese (Simplified and Traditional) and Japanese.
  • Changes the caching mode of virtio drives in Linux VMs to .cached on macOS 12+ hosts.
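As an illustration of how these two workarounds map onto Apple's Virtualization framework, here is a minimal sketch; the helper name and the `useNvme` flag are hypothetical stand-ins for the PR's actual wizard code, not UTM's real API:

```swift
import Foundation
import Virtualization

// Illustrative sketch, not UTM's actual code: pick a storage device
// configuration for a Linux guest based on the chosen disk interface.
// `useNvme` stands in for the new UI setting described above.
func makeLinuxStorageDevice(imageURL: URL, useNvme: Bool) throws -> VZStorageDeviceConfiguration {
    if useNvme, #available(macOS 14, *) {
        // macOS 14+ only: the emulated NVMe disk avoids the virtio corruption.
        let attachment = try VZDiskImageStorageDeviceAttachment(url: imageURL, readOnly: false)
        return VZNVMExpressControllerDeviceConfiguration(attachment: attachment)
    } else if #available(macOS 12, *) {
        // macOS 12+: force .cached on virtio, the other verified workaround.
        let attachment = try VZDiskImageStorageDeviceAttachment(
            url: imageURL,
            readOnly: false,
            cachingMode: .cached,
            synchronizationMode: .full)
        return VZVirtioBlockDeviceConfiguration(attachment: attachment)
    } else {
        // Pre-12 hosts: plain virtio with the framework-default caching mode.
        let attachment = try VZDiskImageStorageDeviceAttachment(url: imageURL, readOnly: false)
        return VZVirtioBlockDeviceConfiguration(attachment: attachment)
    }
}
```

Either branch returns a `VZStorageDeviceConfiguration` that can be appended to a `VZVirtualMachineConfiguration`'s `storageDevices` array.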

@wpiekutowski

This successfully works around the file corruption problem. I haven't noticed any new issues with NVMe so far.

@gnattu gnattu changed the title config(apple): use nvme as default disk interface on macOS 14+ fix: Add workarounds to prevent filesystem corruption on Apple Virtualization VMs. Nov 24, 2023
@gnattu (Contributor, Author) commented Nov 24, 2023

Updated to include setting the caching mode to .cached for macOS 12 hosts. Now there is some uncertainty about whether to keep NVMe as the default for new VMs, since .cached virtio drives seem to fix the filesystem problem and perform better than NVMe drives.

@wpiekutowski commented Nov 24, 2023

I'd say for Linux we use uncached NVMe when it's available (macOS >= 14) and virtio + cached otherwise. Your PR is spot on.

Edit: the above was about doing all of this behind the scenes, i.e. having good defaults for new users. Otherwise, if NVMe isn't enabled by default on macOS >= 14 and caching on older macOS, people will still experience corruption and will either have to live with it (and blame Linux) or research what to do. Either way they waste their time dealing with this.

My reasoning: uncached seems to be the de facto default even on 32 GB RAM systems. I couldn't find any information about the logic behind automatic or when it switches between cached and uncached; it might depend on total RAM. Using uncached makes sense because it avoids spending host RAM on caching the guest disk, which the guest kernel already caches itself, so we avoid double caching of guest disk access. Some workloads might benefit from host caching, but I think we should keep the defaults as close as possible to the underlying framework.
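For reference, a hedged sketch of where this caching-mode choice is expressed in the Virtualization framework (the attachment initializer with `cachingMode`/`synchronizationMode` is macOS 12+; the throwaway temp-file image below is purely illustrative):

```swift
import Foundation
import Virtualization

// Create a small scratch raw image so the attachment has a real file to open.
// This path is illustrative only.
let imageURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("scratch.img")
FileManager.default.createFile(atPath: imageURL.path, contents: Data(count: 1 << 20))

// VZDiskImageCachingMode options:
//   .automatic — the framework decides (observed to behave like uncached)
//   .uncached  — bypass the host page cache; avoids double caching, since the
//                guest kernel already caches its own disk access
//   .cached    — go through the host page cache; the mode verified to avoid
//                the virtio filesystem corruption on Linux guests
let attachment = try VZDiskImageStorageDeviceAttachment(
    url: imageURL,
    readOnly: false,
    cachingMode: .uncached,
    synchronizationMode: .full)
```

The resulting attachment is then wrapped in a `VZVirtioBlockDeviceConfiguration` (or, on macOS 14+, a `VZNVMExpressControllerDeviceConfiguration`).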

BTW, it might be a good idea to let the user decide which mode is best for their use case, but that would have to be an advanced setting with some explanation. The same applies to synchronization modes. I think that should first be discussed with @osy, since UI changes are needed. Looking at QEMU support, where one can adjust almost everything at a very low level, I'd say this would be consistent with the general look and feel of UTM.

@wpiekutowski commented Nov 24, 2023

I see that NVMe defaults to false. I think we should have good defaults for new users. Otherwise, if NVMe isn't enabled by default on macOS >= 14 and caching on older macOS, people will still experience corruption and will either have to live with it (and blame Linux or UTM) or research what to do. Either way they waste their time dealing with this.

@wpiekutowski

Defaulting to NVMe isn't great because it might render some VMs unbootable. It might be a good idea to figure out how to add more info about NVMe and why it was introduced and pros/cons.

@gnattu (Contributor, Author) commented Nov 24, 2023

> Defaulting to NVMe isn't great because it might render some VMs unbootable. It might be a good idea to figure out how to add more info about NVMe and why it was introduced and pros/cons.

This PR does not change the options of existing disks. Instead, when a user creates a new VM, the new disk created with this VM will be set as an NVMe drive on macOS 14+ hosts by the following lines in the wizard:

```swift
if #available(macOS 14, *) {
    // Use NVMe as the default disk interface to avoid filesystem corruption on macOS 14+, only available for Linux
    useNvmeAsDiskInterface = true
}
```

This is what I meant by "default": not magically switching every disk to NVMe, but deciding what to use for newly created VMs.

@wpiekutowski

Thanks, this makes sense.

@wpiekutowski

I did some more testing, and for some workloads, like unit tests, I see about a 15-20% slowdown with the NVMe drive, which is significant for my use case. I'm going to run my VMs in virtio mode with host caching.

@gnattu (Contributor, Author) commented Dec 3, 2023

Now I'm reconsidering using NVMe as the default for new VMs. @wpiekutowski, have you experienced any issues with the forced cached mode on virtio yet? If virtio is good enough, we can probably default to that for better performance.

@wpiekutowski

I haven't used UTM much this past week, so I can't say one way or the other. stress-ng and parallel cp work fine with caching + virtio, so that's probably a good compromise between stability and performance. We might want to add a note somewhere that host disk image caching is enabled for Linux guests when NVMe is off.

@wangqiang1588 commented Dec 19, 2023

[Screenshot: 2023-12-19 09:42:02]

The NVMe disk works fine on Linux kernel 5.x (Ubuntu 22.04.3) but always fails on macOS Sonoma 14.2 with Debian 12.4.0 (Linux kernel 6.1).

It works with caching + virtio, though, so this is a new problem.

@gnattu (Contributor, Author) commented Dec 19, 2023

When I tested with Fedora, it was using the upstream Fedora kernel (the 6.5 branch kernel) and did not exhibit this issue. It now appears that macOS 14.2 might have modified the NVMe device. Experiencing a straightforward NVMe I/O error is something we haven't encountered before. I'm re-testing in the same environment using stress-ng to see if I can reproduce it. It's possible that the NVMe disk also requires the caching mode to be set to cached now.

@gnattu (Contributor, Author) commented Dec 19, 2023

Strange, my pre-14.2 environment cannot reproduce this with a quick 10-minute stress-ng run. @wangqiang1588, does this always occur, or does it happen randomly?

@wangqiang1588 commented Dec 19, 2023

> Strange, my pre-14.2 environment cannot reproduce this with a quick 10-minute stress-ng run. @wangqiang1588, does this always occur, or does it happen randomly?

I installed Debian 12.4.0 twice, and it happened both times.

  1. Install Debian from the ISO with 24 GB memory, 12 CPU cores, and a 500 GB disk
  2. Send a 113 GB Android AOSP zip file over scp (success)
  3. Unzip the 113 GB AOSP file (success)
  4. repo init -b android-security-11.0.0_r72 (success)
  5. repo sync -j4 --fail-fast -l (locally syncs a lot of files, then fails here, both times)

It always fails on the "repo sync" command.

Then I changed to caching + virtio and repeated the same steps: success.

The same steps on Ubuntu 22.04.3 (Linux kernel 5.15) with an NVMe disk never fail (same AOSP zip file).

It looks like some NVMe driver change after the 5.15 kernel causes this failure.

@wpiekutowski

On NixOS unstable channel + Linux kernel 6.6.7 using uncached NVMe all my tests are passing. No disk corruption during stress-ng, parallel cp or system upgrade.

@wangqiang1588 commented Dec 20, 2023

> On NixOS unstable channel + Linux kernel 6.6.7 using uncached NVMe all my tests are passing. No disk corruption during stress-ng, parallel cp or system upgrade.

Maybe the number of files triggers this bug: "repo sync" creates a lot of files in a very short time (.h, .c, .cpp, .java, ...).

@wangqiang1588

Tested Fedora 39 and did not find this bug; it looks like it only happens on Linux kernel 6.1. Let me try upgrading Debian's kernel.

@wangqiang1588 commented Dec 20, 2023

Today I upgraded macOS Sonoma from 14.2 to 14.2.1, and everything works again.

Both Linux kernel 6.1 and 6.5 work fine on Debian 12.4.

Maybe this was a bug in macOS Sonoma 14.2.

@wrmack commented Jan 7, 2024

Can confirm tart works perfectly for me (see reference to cirruslabs/tart above). No disk corruption after two days. Tart uses NVMe disks. Host: Sonoma 14.2.1. Guest: Ubuntu 24.04 (Noble Numbat), kernel 6.6.7.

@osy (Contributor) commented Feb 26, 2024

@gnattu Is this ready to be merged? Can you rebase from the main branch?

@osy osy added this to the Future milestone Feb 26, 2024
@gnattu gnattu reopened this Mar 30, 2024
@gnattu (Contributor, Author) commented Mar 30, 2024

@osy I just rebased from main.

One change I've made is that I reverted NVMe as the default and now prefer virtio with cached mode, because there are new reports that NVMe mode still causes errors in some cases where virtio + cached works just fine.

@slycke commented Mar 30, 2024

🎉 Can't wait.
Please include this important fix for Linux VMs in 4.5.

@erazemk commented Apr 27, 2024

Any news on when this might get merged and released?

@osy osy modified the milestones: Future, v4.5 Apr 27, 2024
@gkyildirim

An early beta release with this fix would be much appreciated, as some other projects have done. Otherwise it is practically impossible to run a stable Linux VM with Apple Virtualization, no?

@gkyildirim

For reference, this seems to have been solved some time ago in lima-vm/lima#2026, and there seem to be no corruption reports since.

@alanfranz

@osy I apologize for pinging, but do you think this could be merged? The UTM experience is just so buggy without this.

@osy (Contributor) commented May 16, 2024

This will be included in the next release. If you need this now, you can follow the build instructions along with the prebuilt dependencies. Having multiple people ask about a release isn't going to speed it up. If you want to help maintain this project please let me know.

@alanfranz

> This will be included in the next release. If you need this now, you can follow the build instructions along with the prebuilt dependencies. Having multiple people ask about a release isn't going to speed it up. If you want to help maintain this project please let me know.

I'm already using a version I built myself with the patch, but I have to keep it up to date and it isn't signed.

As I said, I apologize for pinging; I just wanted to make sure this didn't accidentally slip your mind. I understand the effort an open source maintainer needs to put into their projects, so thank you! No claim or demand on my part.

If I can help with testing this (and reporting my test findings) I'll do it. But I don't think I have the bandwidth to become a maintainer for this project.

@gkyildirim

I guess this is a fairly helpful thread: people have contributed info, solutions, and test results. I don't think there is any need to apologize; this is a corruption issue, there is a fix, and there are sincere people trying to be helpful. The fix is four months old, and since people tend to forget things among everything else, we are just trying to leave a kind reminder.

@osy osy modified the milestones: Future, v4.5 May 21, 2024
@osy osy merged commit 8e84a1e into utmapp:main May 24, 2024
27 checks passed