lib: log vCPU diagnostics on triple fault and for some unhandled exit types #795

gjcolombo · 2024-10-17T22:23:55Z

Add a propolis::vcpu::Diagnostics type that captures and pretty-prints the register state of a vCPU. Log this state if a vCPU triple faults or (in propolis-server) if it raises a Paging or InstEmul unhandled exit, which are the failure modes seen in #300, #333, #340, and #755.

To avoid logging guest data outside of development environments, the Diagnostics type only captures data if the dump-guest-state feature is enabled. Enable this feature by default for propolis-server and -standalone, then modify the Omicron image build job to disable it in Omicron zone images for now (so that the feature will be disabled for customer VMs). In the fullness of time we may want a more flexible way to configure this option.

Finally, hand-implement Debug for the InstEmul exit context to redact this exit's instruction bytes if dump-guest-state is disabled.

This helps with #335. In a subsequent change I'd like to add the memory region near guest %rip to the diagnostics, but this requires some additional plumbing to translate from a GVA to a GPA before trying to read memory (there's an ioctl for this already; just needs some track-laying in Propolis).

Tests: added a diagnostics log line to the "reset" suspend exit so that I could force diagnostics to be captured on demand via bhyvectl, then checked both that they're logged with a binary built in the default configuration and elided in a binary built with the Omicron configuration.

pfmooney · 2024-10-17T22:30:25Z

bin/propolis-server/src/lib/vcpu_tasks.rs

+                    VmExitKind::Bogus => {
+                        error!(log,
+                               "lib returned bogus exit from vCPU {}",
+                               vcpu.id);
+
+                        event_handler.unhandled_vm_exit(vcpu.id, exit.kind);
+                        VmEntry::Run
+                    }


Bogus is not an unhandled exit

True, it's unconditionally handled in the lib. It previously fell into the _ match arm, which called unhandled_vm_exit, but as long the lib is always supposed to handle this exit kind it's probably better just to panic directly here (since this is really more of an assertion failure than a truly unhandled exit).

I would use unreachable!, then. I think handling like this is misleading to the reader.

Done in fba85be. I also tried to describe what the bogus exit is for in a comment. LMK if it's not accurate; I'm going mostly off what I could infer from reading around where VM_EXITCODE_BOGUS is used in bhyve.

pfmooney · 2024-10-17T22:30:45Z

lib/propolis/src/exits.rs

+        #[cfg(not(feature = "dump-guest-state"))]
+        let inst_data = "<dump-guest-state disabled>";
+
+        #[cfg(feature = "dump-guest-state")]
+        let inst_data = &self.inst_data;


Rather than gating this output on a feature, perhaps just censor it on the propolis-server side, depending on the configuration there?

I considered that but didn't like the ergonomics of it; it would mean that everyone in server who might want to debug-print a VmExitKind needs to remember to check what variant it is first to avoid printing an InstEmul. We could force that to happen by declining to derive Debug for VmExitKind, but that would put a big match into functions like unhandled_vm_exit to deal with the one variant that doesn't have a Debug impl, which I also don't love the look of. (I don't think there are any other call sites that would be affected, fortunately.)

gjcolombo · 2024-10-17T22:39:20Z

The clippy failure is a new suggestion from 1.82. I'll fix that separately.

iximeow · 2024-10-18T01:23:45Z

.github/buildomat/jobs/image.sh

+# This job produces the propolis-server zone image that the Omicron repository
+# imports to create full-blown Oxide software packages. Disable all default
+# features and opt into the "omicron-build" feature instead to build libraries
+# in their "production" configurations.
 #
 # The 'release' profile is configured for abort-on-panic, so we get an
 # immediate coredump rather than unwinding in the case of an error.
-ptime -m cargo build --release --verbose -p propolis-server --features omicron-build
+ptime -m cargo build --release --verbose -p propolis-server \
+    --no-default-features \
+    --features omicron-build


it took me a second to connect the dots but from #755 and elsewhere we're clearly still collecting cores from a crashed Propolis - all we've done by disabling the logging is prevented ourselves from collecting more timely information (that in the case of #755 helped us tell the guest had gone off to space)

i think a consistent posture here would be either:

A: Propolis stderr/stdout is as sensitive as guest memory; do not collect, if you must collect then keep secret, if you must share, then do so carefully
B: keep Propolis stderr/stdout clean of user data, and when operating Propolis, do not retain user data (cores that could include pages of user data)

i'm inclined towards A on the grounds that if you're looking at or copying a Propolis log around, you probably also can access a core from that same process. and it means we're actually helping a situation like #755 with this change.

i still think the feature is useful, just that we can leave default features enabled in this job. or if we should consider ourselves responsible for keeping Propolis lib/standalone/server's logs free of potentially-sensitive bits, it'd be worth a note to ourselves that we should audit other logs for anything untoward - migration state comes to mind.

i don't want to be excessively black/white on this, but i'm trying to make sure there's a clear line for what is and isn't OK for Propolis to say. what do you think?

I keep finding myself drawn to approach B. At some point we're going to have a situation where we need to debug a Propolis issue using only information from operator-provided support bundles. That's going to be tricky if operators feel they can't ever include Propolis logs in those bundles because Propolis is too cavalier about logging guest data.

i'm inclined towards A on the grounds that if you're looking at or copying a Propolis log around, you probably also can access a core from that same process.

I would put it a bit differently: if you can extract a Propolis log from a sled, you can probably also extract core dumps from that sled. Because Oxide personnel are regularly in this position today (because we access deployed racks over the technician port in order to update/service them), I agree that redacting diagnostic data in logs currently has very little practical upside, at least from the point of view of an operator who wants to limit support personnel access to guest VM state. The long term goal, of course, is to wean ourselves off of the tech port via automated update, support bundles, and the like, and once we've done that I think redaction will be more obviously useful.

#755 is the only case I know of where we've seen one of the diagnostic-generating faults in a production Omicron deployment. All the other cases I know of have been in development environments (ad hoc Propolis server/standalone binaries or a4x2 deployments). So my lean was (and still is, I think) toward disabling this by default in Omicron images, though I think it's a somewhat close call (made even closer by the fact that #755 happened on an Oxide-managed rack and not a customer-operated rack).

With all that said: WWYT of taking some steps now to make this into a runtime-configurable setting? I think that the ideal world is one @pfmooney laid out in chat in yesterday's HV huddle: if this behavior can change dynamically, and Nexus knows how to control it, then an operator who can reproduce a problem in a "clean" VM can enable extra data collection for just that instance (or its project/silo/what-have-you) and provide us with the resulting "full-fat" logs without worry. In the short run (i.e. before Omicron learns about the new API), and in a dire pinch, support personnel accessing a rack through the tech port can access the API, enabling a "user starts clean VM -> support person sets option -> user repros problem" approach if we desperately needed it. I think I can adjust this PR to move in that direction pretty easily (and I suspect it might improve its ergonomics in the process); will probably go ahead and give that a try, at least.

wean ourselves off of the tech port via automated update, support bundles, and the like, and once we've done that I think redaction will be more obviously useful.

ah, yeah, in that context i see what you mean. i was over-indexing on today's diagnostics!

definitely a fan of making it runtime-configurable, then, but i'm even more glad to have piped up on RFD 496 - seems like we need a more nuanced treatment of the relationship between cores and support bundles.

gjcolombo · 2024-10-18T23:32:11Z

I'm going to rebase to pick up #794 and #796 before trying to tweak this PR further (#796 to silence the clippy warning, #794 because I have some Ideas that I think would otherwise conflict with it).

iximeow · 2024-10-20T14:46:19Z

lib/propolis/src/common.rs

@@ -11,13 +11,25 @@ use crate::vmm::SubMapping;

 /// Controls whether items wrapped in a [`GuestData`] are displayed or redacted
 /// when the wrappers are printed via their `Display` or `Debug` impls.
+#[no_mangle]


(definitely worth a note like // The #[no_mangle] here is not for internal correctness, but to ensure that two cohabitant lib-propolis versions produce an error rather than ambiguously work)

gjcolombo requested review from pfmooney and iximeow October 17, 2024 22:23

pfmooney reviewed Oct 17, 2024

View reviewed changes

iximeow reviewed Oct 18, 2024

View reviewed changes

gjcolombo added 7 commits October 18, 2024 23:33

lib: add facility to dump CPU regs on reset/triple fault

48e64a2

server: add diagnostics for unhandled exits

cd759ae

don't dump instr bytes if diag is disabled

f853b1d

put diagnostics behind a default feature

9c0333d

don't bother logging for halt/reset

99e1785

opt standalone into dump-guest-state too

3fb9025

bogus exits should never be returned to the server

e4f891f

gjcolombo force-pushed the gjcolombo/triple-fault-diagnostics branch from fba85be to e4f891f Compare October 18, 2024 23:33

gjcolombo added 3 commits October 18, 2024 23:57

make guest data display in logs nominally runtime-configurable

645545d

make MemCtx reads produce GuestData

941b4cf

add some comments

bf130b8

iximeow reviewed Oct 20, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lib: log vCPU diagnostics on triple fault and for some unhandled exit types #795

lib: log vCPU diagnostics on triple fault and for some unhandled exit types #795

gjcolombo commented Oct 17, 2024

pfmooney Oct 17, 2024

gjcolombo Oct 17, 2024

pfmooney Oct 17, 2024

gjcolombo Oct 17, 2024

pfmooney Oct 17, 2024

gjcolombo Oct 17, 2024

gjcolombo commented Oct 17, 2024

iximeow Oct 18, 2024

gjcolombo Oct 18, 2024

iximeow Oct 18, 2024

gjcolombo commented Oct 18, 2024

iximeow Oct 20, 2024

lib: log vCPU diagnostics on triple fault and for some unhandled exit types #795

Are you sure you want to change the base?

lib: log vCPU diagnostics on triple fault and for some unhandled exit types #795

Conversation

gjcolombo commented Oct 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gjcolombo commented Oct 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gjcolombo commented Oct 18, 2024

Choose a reason for hiding this comment