
Progress Report: January / February 2021

Most drivers and devices would break with write-gathering and re-ordering enabled, so those modes are seldom used except by very specific drivers. However, early write completion is actually the standard on PCs, because it is mandated by the PCI specification. Therefore, almost every driver is written to account for this, and AArch64 Linux also defaults to mapping all I/O memory as nGnRE, with early write completion enabled. On most devices, this poses no problem. Many of those devices may not support posted writes as such, but in that case they would simply treat the accesses as nGnRnE. Devices are always allowed to provide stricter guarantees than what the software requests; as long as the device behaves at least as strictly as the software requires, there is no problem.

As we found out, the M1’s internal bus fabric actively enforces that all accesses use nGnRnE mode. If you try to use nGnRE mode, the write is dropped, and instead the system signals an SError (System Error). We were not seeing these SErrors initially due to a CPU configuration setting that had been inadvertently pulled in from another project, which was incorrectly disabling error reporting (and though we wouldn’t have been able to see the errors anyway, since the UART was broken, they at least would’ve caused the system to stop working after UART writes instead of silently dropping them and continuing).

Astute readers might have noticed an interesting detail here: the M1 SoC has PCIe! In fact, some internal devices are PCIe devices (such as Ethernet on the Mac Mini), and, thanks to Thunderbolt, M1 Macs can be connected to any PCIe device. Don’t those use posted writes? Indeed, they do! In fact, the M1 requires nGnRE mappings for PCI devices, rejecting nGnRnE writes.

This poses a conundrum. Linux has no framework for mapping memory as nGnRnE. We could introduce a one-off quirk to use nGnRnE instead of nGnRE mode everywhere, but then that would make it impossible to support PCIe devices which require nGnRE. This became our first real test of upstream interaction: we had to develop a completely bespoke mechanism for mapping memory as nGnRnE, and then a way to instruct Linux to use it for non-PCI devices on Apple Silicon platforms, while still allowing PCI drivers to use nGnRE mode. And we had to do it in a clean, well-engineered way, that balances being non-intrusive to existing code and being potentially useful to other non-Apple devices, and that we could agree on with the maintainers responsible for these subsystems.

In the end, after several weeks of discussion with kernel maintainers across multiple subsystems and multiple patch revisions, we have largely settled on this approach: Linux gains a new ioremap_np() variant that explicitly requests non-posted (nGnRnE on ARM64) mappings, the generic resource-mapping helpers select it automatically for devices sitting on buses that the devicetree marks as requiring non-posted MMIO, and plain ioremap() keeps defaulting to nGnRE.

This does require some minor driver re-factoring for drivers that use ioremap() directly, but since this is only necessary for hardware that is built into the M1, only a few drivers need to be changed. The vast majority of PCI drivers use a raw ioremap() these days, and all of them could be used with M1 computers via a Thunderbolt adapter; none of those drivers need to be changed, as the default ioremap() will work properly for those by still requesting nGnRE mode.
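To make this concrete, here’s a rough sketch of what the result looks like from a driver’s point of view. The probe function is invented for illustration, but ioremap_np() and devm_platform_ioremap_resource() are the relevant kernel APIs:

```c
#include <linux/err.h>
#include <linux/io.h>
#include <linux/platform_device.h>

/* Hypothetical probe function for a device built into the M1. */
static int example_probe(struct platform_device *pdev)
{
	void __iomem *regs;

	/*
	 * The generic helper picks the right mode automatically: on
	 * Apple Silicon, where the devicetree marks the SoC's buses as
	 * requiring non-posted MMIO, this yields an nGnRnE mapping
	 * with no driver-specific code.
	 */
	regs = devm_platform_ioremap_resource(pdev, 0);
	if (IS_ERR(regs))
		return PTR_ERR(regs);

	/*
	 * A driver that maps memory by hand would use the explicit
	 * non-posted variant instead:
	 *
	 *	regs = ioremap_np(res->start, resource_size(res));
	 *
	 * Plain ioremap() is unchanged and still requests nGnRE, so
	 * PCI drivers keep working as-is.
	 */
	return 0;
}
```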

As part of this change, we also realized that documentation on all the different ioremap() modes in Linux was sorely lacking, as was complete documentation on the I/O read and write functions (which are related, and of which there are many subtle variants). I worked together with Arnd Bergmann to add these missing docs, which you can find here (this will be here once the changes are merged upstream).

Interestingly, since this change applies to the generic “simple-bus” device, it means we had to contribute patches to the core DeviceTree specification and its schemas. Thankfully, as DeviceTree is an open community-driven project, all it takes is a couple GitHub PRs!

You See, It’s AIC

A modern CPU’s job isn’t just to run instructions in order, but also to react to changes in the environment that might require it to stop what it is doing and go do something else. These are often called “exceptions”. You might know these from high-level programming languages as an error of some sort, but in CPUs they are also used to indicate when there is an external need for attention (similar to signals like SIGCHLD and SIGALRM in POSIX userspace programs).

The most important of these is the interrupt request (IRQ), which is used by hardware peripherals to request the attention of the CPU. The CPU then runs some OS code which is in charge of figuring out which peripheral needs attention and handling the request.

On AArch64 CPUs, there is a single IRQ input. That means that something needs to gather together the interrupt requests from all devices in the system, distribute them to the correct CPU cores (as configured by the OS), and tell the OS which underlying devices need attention when an IRQ fires. This is the job of the interrupt controller, or “irqchip” in Linux terminology.

On systems with more than one core, the IRQ controller also has another job: handling inter-processor interrupts (IPIs). Sometimes, software running on one core needs to get the attention of another core. IPIs make this possible: the interrupt controller provides some kind of mechanism where one core can send it a request, which it will then forward as an interrupt to another core. Without IPIs, multi-core systems cannot work properly.

Most AArch64 systems use a standard interrupt controller, called the Generic Interrupt Controller (GIC). This is a rather complex and fairly full-featured interrupt controller, with advanced features such as interrupt priority, virtualization, and more. This is great, because it means Linux does not need to implement proprietary irqchips as the main interrupt controller on most AArch64 systems.

As you’ve probably guessed by now, Apple decided to go their own way. They have their very own, custom Apple Interrupt Controller (AIC). We had to reverse engineer this hardware and build our own irqchip driver for Linux to support it! Thankfully for us, AIC is actually quite simple. By using the few scraps of outdated documentation that exist in the open source portion of macOS/iOS (XNU), and probing the hardware via trial and error, we were able to figure out everything we needed to make interrupts work and write a Linux driver.
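To give you an idea of just how simple AIC is, here’s a sketch of the heart of such a driver. This is illustrative rather than the real code (treat the register offset and helper names as stand-ins), but the core idea is real: a single “event” register that acknowledges and returns the next pending interrupt when read.

```c
#include <linux/io.h>
#include <linux/irqdomain.h>

#define AIC_EVENT		0x2004	/* event register (offset illustrative) */
#define AIC_EVENT_TYPE_HW	1	/* hardware IRQ */
#define AIC_EVENT_TYPE_IPI	4	/* inter-processor interrupt */

static void __iomem *aic_base;
static struct irq_domain *aic_domain;
static void aic_handle_ipi(void);	/* hypothetical; IPIs are the next topic */

static void aic_handle_irq(void)
{
	u32 event;

	/* Reading the event register both fetches and acknowledges the
	   next pending interrupt: type in the top half, number in the
	   bottom. Loop until nothing is left pending. */
	while ((event = readl_relaxed(aic_base + AIC_EVENT))) {
		u16 type = event >> 16;
		u16 irq  = event & 0xffff;

		if (type == AIC_EVENT_TYPE_HW)
			generic_handle_domain_irq(aic_domain, irq);
		else if (type == AIC_EVENT_TYPE_IPI)
			aic_handle_ipi();
	}
}
```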

Alas, there was one additional wrinkle. Linux needs IPIs to work properly. Specifically, Linux uses 7 different types of IPI: it expects to be able to send 7 different kinds of independent requests from one CPU core to another, and treat them as distinct events. Every other IRQ controller used on AArch64 systems supports this kind of fine-grained IPI separation. Unfortunately, AIC does not: it only supports 2, and in fact was designed to have them be used in different ways (one is meant to be sent to other CPUs, while the other is for “self-IPIs” from one core to itself, which is sometimes necessary). To make this work for Linux, we had to implement a “virtual” interrupt controller. The AIC driver internally manages up to 32 different events that can be pending for any given CPU core, and funnels them all through a single hardware IPI for that core. When the IPI arrives at that core, it checks to see which events are pending, and delivers them to Linux as if they were separate IPIs. The rest of Linux sees an interrupt controller that can handle up to 32 IPIs per CPU, even though the hardware only supports 2 (and we only use one). Phew!
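In code, the trick boils down to a per-CPU atomic bitmask. Here’s a minimal sketch of the concept, not the actual driver; aic_hw_send_ipi() and handle_one_vipi() are hypothetical stand-ins:

```c
#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/percpu.h>

#define AIC_NR_SWIPI	32

static void aic_hw_send_ipi(int cpu);		/* hypothetical hardware poke */
static void handle_one_vipi(unsigned int vipi);	/* hypothetical dispatch */

/* One pending-vIPI bitmask per CPU core. */
static DEFINE_PER_CPU(atomic_t, vipi_pending);

static void vipi_send(int cpu, unsigned int vipi)
{
	/* Mark the virtual IPI pending for the target core... */
	atomic_or(BIT(vipi), per_cpu_ptr(&vipi_pending, cpu));
	/* ...and make sure that bit is visible before the hardware IPI
	   can arrive (plain atomic_or() imposes no memory ordering). */
	smp_mb__after_atomic();
	aic_hw_send_ipi(cpu);
}

static void vipi_receive(void)
{
	/* Atomically grab and clear everything pending for this CPU;
	   xchg is fully ordered, so a concurrent send is never lost. */
	unsigned long pending = atomic_xchg(this_cpu_ptr(&vipi_pending), 0);
	unsigned int vipi;

	/* Deliver each pending event as if it were its own IPI. */
	for_each_set_bit(vipi, &pending, AIC_NR_SWIPI)
		handle_one_vipi(vipi);
}
```

This is exactly the kind of code where memory ordering bites: drop that barrier, and an IPI can occasionally arrive at a core before its pending bit is visible there, leaving the request stranded.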

Writing drivers for even simple interrupt controllers like AIC is complex. There are many subtleties to interrupt handling, and if the code is slightly wrong it can cause frustrating heisenbugs that only appear under rare sequences of events – but can hang your entire OS, making debugging nearly impossible. Printing debug info from interrupt handlers is tricky, because changing the timing can make bugs go away, and it can also make everything too slow to be usable. Adding a software IPI multiplexer further complicates things, as we now have to emulate in software what is typically handled by the hardware: getting it wrong could cause things like IPIs going missing due to race conditions.

While trying to understand these details to ensure that the AIC code is correct, I found myself deep in a rabbit hole researching the details of memory ordering and memory barriers on AArch64, and even found a subtle bug in the ARM64 Linux implementation of atomic operations! Talking about this subject would be an entire multi-part saga, but if you are brave enough to want to learn more, I recommend Will Deacon’s talks, such as this one and this one. In particular, this commit answered a lot of questions, and Will also helped clear up some of my remaining doubts. Being confident about the memory model and the soundness of the AIC code will help avoid frustrating debugging sessions much further down the line. Just imagine if we had to trace a subtle GPU hang issue that only happens when you do certain things in a game (but only sometimes, and it takes an hour to reproduce) down to an AIC driver race condition!

For better or for worse, the M1 is particularly good at exposing these kinds of subtle bugs. It is such a highly out-of-order machine that it tickles race conditions which you would never hit on other CPUs. While debugging an earlier m1n1 issue, we even saw it (legitimately) speculating its way out of an interrupt handler… while to the code it seemed like it was still halfway through the handler printing debug info! The underlying problem there turned out to have been caused by a subtle misconfiguration of the MMU, which gives you an idea of just how inextricably tied together all these core systems are, and how tricky to debug they can be.

Interestingly, the M1 chip actually has a bit of the standard GIC in it – specifically, it supports natively virtualizing the low-level bits of a GIC to VM guests! This allows for much higher performance interrupt handling, since otherwise the VM hypervisor has to emulate every little detail of the interrupt controller, which means every interrupt requires many calls into hypervisor code and back. Oddly enough… the macOS Hypervisor Framework does not support this (at the time of writing), requiring VM hypervisors to do full GIC emulation in software! We have already tested it and it works, and I’ve been working with Marc Zyngier on the peculiarities of running VMs on these chips; he already has Linux virtual machines booting on top of KVM running on the Asahi Linux kernel on M1 Macs. It’s too early for benchmarks, but since macOS lacks that support, we expect that, once other bits and pieces are in place, native Linux-on-Linux VMs will be faster than Linux-on-macOS VMs, especially for IPI-heavy workloads.

Finicky FIQs

Next up, every OS needs a system timer to work. When your computer runs multiple applications, the OS needs to be able to switch between them on the same CPU core, to make multitasking work. It also needs to be able to schedule things to be done at certain points in time, from writing buffered data to disk to showing the next frame of a YouTube video to making the clock in your task bar tick forward. All of this is accomplished with some kind of timer hardware, which the OS can program to deliver an IRQ at some point in the future.

AArch64 includes a specification for system timers, and the M1 implements these standard timers as you would expect. But there is a platform-specific bit: the timers need to deliver their interrupt via some IRQ controller. On GIC systems, that is of course via GIC (though the specific interrupt numbers used can vary from system to system). On Apple Silicon, therefore, you’d expect this to end up in AIC.
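Driving these standard timers really is simple; here’s a bare-metal sketch (not Linux code), using the architectural register names:

```c
#include <stdint.h>

#define CNTV_CTL_ENABLE	(1UL << 0)	/* timer enabled */
#define CNTV_CTL_IMASK	(1UL << 1)	/* interrupt masked */

/* Arm the ARMv8 virtual timer to fire `delta` counter ticks from now. */
static void timer_arm(uint64_t delta)
{
	uint64_t now;

	__asm__ volatile("mrs %0, cntvct_el0" : "=r"(now));
	__asm__ volatile("msr cntv_cval_el0, %0" : : "r"(now + delta));
	__asm__ volatile("msr cntv_ctl_el0, %0" : : "r"(CNTV_CTL_ENABLE));
	__asm__ volatile("isb");	/* make the register writes take effect */
	/* When the counter passes the compare value, the timer raises
	   its interrupt; how that interrupt reaches the OS is the
	   platform-specific part. */
}
```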

Yet, making the timers fire and asking AIC to tell us about pending interrupts yielded… nothing. What gives? Apple had yet another surprise for us… you see, the M1’s timers cannot deliver IRQs at all. Instead, they only deliver FIQs.

When we said that AArch64 CPUs only have a single IRQ line, we didn’t mention its oft-neglected sister, the FIQ line. FIQs, or “Fast Interrupt Requests”, are a secondary interrupt mechanism. The “fast” bit refers to how they worked a bit more efficiently on older AArch32 systems, but on AArch64 this is now obsolete: FIQs and IRQs are effectively equal. On GIC systems, the OS can configure individual interrupts to go via IRQ or FIQ – and most AArch64 systems reserve FIQ for use by the secure monitor (TrustZone), so Linux cannot use it. And so, Linux does not use FIQs. At all. AArch64 Linux will panic if it gets a FIQ, as it never expects them.

Without FIQ support, there are no timers on the M1, so supporting FIQs isn’t optional. This became yet another major change to the Linux AArch64 support needed by Apple Silicon. Simply adding support for FIQs is easy (at its simplest, it just involves mechanically copying the way IRQs are handled to handle FIQs in a similar way), but there are many different ways to go about the finer details, including deciding how to handle FIQs for systems that don’t need them, and whether to keep FIQs enabled everywhere or disable them on systems that don’t use them.

In the end, after considering several alternatives and iterating through several approaches, Mark Rutland from the Linux ARM64 team volunteered to take over this piece of the puzzle and bring FIQ support to Linux.

There are other things that deliver FIQs too: there is actually a FIQ-based “Fast IPI” mechanism, which we aren’t using yet. There are also hardware performance counters that use it. Effectively, FIQs are used by hardware that is built into individual CPU cores or core clusters, and IRQs are used by the single shared AIC peripheral which manages hardware shared among all CPUs. However, as yet another additional pain point, there is no FIQ controller at all. While AIC serves as an IRQ controller, all of these FIQ sources are “mixed together” (ORed) into a single FIQ, with no way to tell them apart in a centralized manner. Instead, the FIQ handling code has to go and check each of these FIQ sources one by one (in a unique way for each one, as it needs to peek into the specific device registers), figure out which one needs attention, and only then deliver the interrupt to the driver for that device. This is very ugly, and we don’t really know why Apple didn’t think to include a trivial “FIQ controller”. Even a single register indicating the status of each FIQ source as one bit would’ve sufficed. We’ve looked for it, even exhaustively searching all CPU registers, but it sadly doesn’t seem to exist.
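Concretely, the FIQ handling code ends up looking something like this sketch (simplified, with hypothetical helpers; the virtual timer check shown is the architectural one, and every other source needs its own equivalent):

```c
#include <stdint.h>

#define CTL_ENABLE	(1UL << 0)
#define CTL_IMASK	(1UL << 1)
#define CTL_ISTATUS	(1UL << 2)

static void handle_timer_fiq(void);	/* hypothetical */

/* Poll-every-source FIQ demux: there is no status register telling
   us which FIQ source fired, so we ask each one in turn. */
static void fiq_handler(void)
{
	uint64_t ctl;

	/* Virtual timer: fired if enabled, unmasked, and ISTATUS set. */
	__asm__ volatile("mrs %0, cntv_ctl_el0" : "=r"(ctl));
	if ((ctl & (CTL_ENABLE | CTL_IMASK | CTL_ISTATUS)) ==
	    (CTL_ENABLE | CTL_ISTATUS))
		handle_timer_fiq();

	/* The physical timer, the Fast IPI mechanism, and the CPU
	   performance counters each need their own, equally ad-hoc
	   check against their own implementation-specific registers. */
}
```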

What the M1 does have are some extra special features for handling the timer interrupts for VM guests (thankfully, as this is practically a requirement to make VMs work sanely at all). We’ve also reverse engineered these, and they’re now used as part of Marc’s work getting KVM up and running.

On top of the core FIQ support patches, we opted to handle distributing FIQs to downstream device drivers in the AIC driver (even though they are strictly speaking not part of AIC), in order to allow for closer coupling between these paths in the future. This may be needed if we switch from AIC IPIs via IRQ to “Fast IPIs” via FIQ.

An Exyting History

Running Linux on a device is great, but what use is it if you can’t actually interact with it? To be able to get dmesg logs and interact with a Linux console, we need a UART driver for the M1. There are quite a few UART variants out there, though the most popular types are based around the PC standard 16550 UART, which is these days often integrated into all kinds of ARM SoCs. Of course, Apple being Apple, they probably rolled their own instead… right?

Nope! But it’s not a 16550, either. The M1 uses a… Samsung UART?

You see, Apple’s first iPhones ran on Samsung SoCs, and even as Apple famously announced that they were switching to their own designs, the underlying reality is that there was a slower transition away from Samsung over multiple chip generations. “Apple Silicon” chips, like any other SoC, contain IP cores licensed from many other companies; for example, the USB controller in the M1 is by Synopsys, and the same exact hardware is also in chips by Rockchip, TI, and NXP. Even as Apple switched their manufacturing from Samsung to TSMC, some Samsung-isms stayed in their chips… and the UART design remains to this day. We don’t know whether this means that Samsung’s intellectual property is in the M1, or whether Apple merely cloned the interface to keep it software-compatible (UARTs aren’t exactly hard to design), but either way this means that today’s Exynos chips and Apple Silicon still have something in common.

And so, Linux already has a driver for Samsung UARTs. But there’s a catch (of course there’s a catch). There isn’t a single “Samsung UART”. Instead, there are several subtly incompatible variants – and the one that Apple uses is not supported in the Linux Samsung UART driver.

Drivers supporting many variants of the same hardware can get quite messy, and even more so for drivers as old as this one. Worse, the serial port subsystem in Linux dates back to the earliest versions, and brings with it yet another dimension of cruft: beware all ye who read on. And so, the challenge is figuring out how to integrate support for this new UART variant, without making the code worse. This means refactoring and cleanup! For example, Linux has an ancient concept of serial port types that is exposed to userspace (which means that these types can only ever be added, not removed, as the userspace API must remain backwards-compatible), but this is completely at odds with how devices are handled on modern Linux. There is absolutely no reason why userspace should care about what type a serial port is, and if it does it certainly shouldn’t use clunky TTY APIs with hardcoded define lists (that is what sysfs is for). Each existing Samsung UART variant had its own port type defined there (and there is even an unused one that was never implemented), but adding yet another one was definitely out of the question… so we refactored the driver to have an internal private notion of the UART variant, unrelated to the port type exposed to userspace. Apple Silicon UARTs just identify themselves as a 16550 to this legacy API, which nobody uses for anything anyway.
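The shape of the refactor looks roughly like this (an illustrative sketch; the real driver’s structures and names differ):

```c
/* Driver-private notion of the UART variant, decoupled from the
   frozen userspace-visible port type. */
enum s3c24xx_uart_variant {
	VARIANT_S3C24XX,
	VARIANT_S3C6400,
	VARIANT_APPLE_S5L,	/* the Apple Silicon flavor */
};

struct s3c24xx_uart_info {
	const char *name;
	enum s3c24xx_uart_variant variant;	/* internal only */
	/* ...per-variant register and IRQ details hang off this
	   struct, while the legacy TTY "port type" reported to
	   userspace for Apple UARTs is simply the 16550's. */
};
```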

Yet another challenge is how this variant handles interrupts. Older Samsung UARTs had two independent interrupt outputs for transmit and receive, handled separately in the system’s interrupt controller. Newer Exynos variants handle this internally, with a little interrupt controller in the UART to handle various interrupt types and deliver them as a single one to the system IRQ controller. The Apple variant also does this, but in an incompatible way with different registers, so separate code paths had to be written for it.

On top of that, this UART variant only supports edge-triggered interrupts. An edge-triggered interrupt is an interrupt that fires when a given event occurs, and only at the instant at which it occurs: for example, when the UART transmit buffer becomes empty. Conversely, a level-triggered interrupt is one that fires as long as a given condition is true, and continues to fire until the condition is cleared: as long as the transmit buffer is empty. For various reasons, level-triggered interrupts are much easier to handle and are preferred by modern systems. While AIC uses level-triggered interrupts, and the interrupt from the UART itself is level-triggered, the internal events that drive it (such as transmit and receive buffers becoming empty or full) work only in an edge-triggered fashion! Other Samsung UART types support both modes, and Linux uses them in level-triggered mode. This turned into a problem for the Linux code that transmits data via the UART: the existing code worked by just turning on the transmitter, and then doing nothing. With everything configured in level-triggered mode, the empty transmit buffer immediately triggers an interrupt, and the interrupt handler in the driver will then fill the buffer with the first data to be transmitted. In edge-triggered mode this doesn’t work, because the buffer is already empty, not becoming empty. Nothing happens, and the driver never sends any data. We had to make the driver “prime” the transmit buffer immediately when data was ready to be sent to the device, as only once that first chunk of data is sent does the interrupt fire to request more.
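The fix looks roughly like this (a sketch with placeholder register and helper names, not the actual driver code):

```c
/* Start transmission on an edge-triggered UART; uart_state and the
   helpers below are placeholders for illustration. */
static void uart_start_tx(struct uart_state *st)
{
	/* Prime the FIFO with the first chunk of data ourselves: an
	   already-empty buffer generates no "became empty" edge, so
	   enabling the interrupt alone would do nothing. */
	while (!tx_fifo_full(st) && have_tx_data(st))
		writeb(next_tx_byte(st), st->regs + UTXH);

	/* Draining below the threshold now produces the edge that
	   requests the next chunk. */
	enable_tx_interrupt(st);
}
```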

Working out these quirks of the UART was doubly confusing because we were using m1n1 to run experiments, which is itself controlled via UART. Trying to figure out how a device works when your debug communications channel is the device itself can get very confusing! Thankfully, this is all done now, and m1n1 is much more pleasant to use to work on any other part of the hardware.

There is another driver that will have to go through the same treatment, though with a completely different lineage. The I²C hardware in the M1 chip comes from… P.A. Semi! It turns out that there is some obvious PowerPC legacy in the M1 after all, and its I²C peripheral is based on the one in PWRficient chips, including the one used in the AmigaOne X1000. Linux supports that platform, but the existing driver is very bare-bones. Fortunately, after contacting the author of the driver, it turns out he still owns a functioning X1000 and can test patches. We were able to get hardware documentation of that chip, to allow us to improve the driver and add missing features that should work on the X1000 (like interrupt support), as well as making any changes required for M1 support. As this driver is a dependency for getting the USB Type-C ports fully up and running, this work will be coming up very soon.

Penguins at Last

To anticlimactically wrap up the Linux bring-up saga, let’s talk about what we needed to do to get the Linux framebuffer console to work on the M1. If you were expecting another 2000 words here, I’m afraid you’ll be disappointed.

On PCs, the UEFI firmware sets up a framebuffer and you can run Linux with no proper display driver at all, using a driver called efifb. Apple Silicon Macs work in much the same way: iBoot sets up a framebuffer that the OS can use. All we need to do is use the generic simplefb Linux driver, and it just works, with no code changes at all. We only had to document some changes to the devicetree binding, because the code already supported some features that we needed but were not documented.

And just like that, after all that work, all it took was a couple lines in the devicetree to turn a blank screen into this:

[Image: 8 penguins]

m1n1 now takes care of doing this properly, taking the framebuffer information (width, height, pixel format, stride, and base address) that iBoot gives us and sticking it into the devicetree for Linux to use.
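Conceptually, that’s just a handful of libfdt calls. Here’s a sketch of the idea (not m1n1’s actual code, and the node path is illustrative, though the property names follow the simple-framebuffer devicetree binding):

```c
#include <libfdt.h>
#include <stdint.h>

/* Patch the framebuffer iBoot gave us into the devicetree. */
static int dt_set_framebuffer(void *dt, uint64_t base, uint32_t width,
			      uint32_t height, uint32_t stride)
{
	int node = fdt_path_offset(dt, "/chosen/framebuffer");
	if (node < 0)
		return node;

	/* reg = <base size>, assuming 64-bit address and size cells. */
	uint64_t reg[2] = { cpu_to_fdt64(base),
			    cpu_to_fdt64((uint64_t)stride * height) };
	fdt_setprop(dt, node, "reg", reg, sizeof(reg));
	fdt_setprop_u32(dt, node, "width", width);
	fdt_setprop_u32(dt, node, "height", height);
	fdt_setprop_u32(dt, node, "stride", stride);
	fdt_setprop_string(dt, node, "format", "x8r8g8b8");

	return 0;
}
```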

Of course, this is just a firmware-provided framebuffer. As it is not a proper display driver, there is no way to change resolutions, handle display hotplug, or even put displays to sleep. It is sufficient for development and demos, but we will have to write a proper display controller driver in due course.

And then, of course, there is the GPU, which is not the display controller and is a logically separate piece of hardware. PC users often conflate these two, because they come packaged up into a single chip on a “graphics card”… but they are logically quite distinct, and on an SoC like the M1 there is about as much relationship between the display controller and the GPU as there is between the USB controller and the GPU. GPU support will be its own odyssey, so look forward to hearing more about it!

There’s Even More!

We could keep talking in depth for another 10000 words, but alas, this post is already too long. However, if you’d like to see more of what has been going on in the community in these two months, here are a few things you shouldn’t miss:

Our current Linux bring-up series is in its third version and being reviewed for upstream inclusion. If you’d like to see how the story of this article maps to code, check out the patches; and if you want to see how the process works, read the threads for versions 1 and 2. If all goes well and we don’t hit any new showstoppers, this should be on track to being merged into Linux 5.13. Stay tuned!

Asahi Linux wouldn’t be possible without the entire community of people who have jumped on to help move the project forward, from people new to embedded development to hardware engineers to seasoned kernel folks. If you are interested in contributing, check out our community page and join our IRC channels!

On a personal note, I’m trying to make Asahi Linux my full time job. If you like what I’m doing and would like to help me spend more of my time on the project, you can support me on Patreon and GitHub. Thanks to everyone who has pledged so far; this wouldn’t have been possible without you either!

Thanks to JMC47, David and Ridley for proofreading this article.

marcan · 2021-03-11