
Better error reporting mechanism needed #335

Open · pfmooney opened this issue Mar 9, 2023 · 2 comments

@pfmooney (Collaborator) commented Mar 9, 2023

When running a guest, there are certain conditions which may cause one or more of the vCPUs to exit from VM context with an event we are unable to properly handle. Notable examples of this include #333 and #300, where the guests appear to have jumped off into "space", leaving the instruction fetch/decode/emulate machinery unable to do its job.

The course of action we choose in such situations involves trade-offs. The current behavior of aborting the propolis process has the advantage of preserving as much state as possible from both the userspace emulation (saved in the core) and the kernel VMM and instance (residing in the vmm device instance, as long as it is not removed). This may be beneficial during development, but for hosts running in production it is likely less than ideal.

Consumers in production likely expect a VM that encounters an error condition like this to reboot, as if it had tripped over something like a triple fault on a CPU. Rebooting the instance promptly at least allows it to return to service quickly. In such cases, we need to think about which bits of state we would want preserved from the machine and the fault conditions so they can be used for debugging later. In addition to the details about the vm-exit on the faulting vCPU(s), we could export the entire emulated device state (not counting DRAM) as if a migration were occurring. Customer policy could potentially prune that down, or even augment it with additional state from the guest (perhaps the page of memory underlying %rip at the time of exit?).
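
As a rough sketch of the kind of record this might produce (all type and field names below are hypothetical, not existing propolis APIs):

```rust
/// Hypothetical record preserved when a vCPU exit cannot be handled.
/// Purely illustrative; none of these types exist in propolis today.
struct GuestFaultReport {
    /// Which vCPU(s) took the unhandled exit, and the raw exit details
    /// reported by the kernel VMM.
    vcpu_faults: Vec<VcpuFaultInfo>,
    /// Emulated device state, exported as if a migration were occurring
    /// (guest DRAM excluded).
    device_state: Vec<u8>,
    /// Optionally, the page of guest memory underlying %rip at the time
    /// of the exit, to aid later debugging.
    rip_page: Option<Box<[u8; 4096]>>,
}

struct VcpuFaultInfo {
    vcpu_id: u32,
    /// Exit reason as reported by the kernel VMM.
    exit_reason: u64,
    /// Instruction pointer at the time of the exit.
    rip: u64,
    /// General-purpose register snapshot at the time of the exit.
    gprs: [u64; 16],
}
```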

With such a mechanism in place, we could still preserve the abort-on-unhandled-vmexit behavior if it is desired by developer workflows, but default to the more graceful mechanism for all other cases.
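
A minimal sketch of how that dispatch might look, assuming a hypothetical `ExitPolicy` knob and hypothetical helpers for capturing and persisting the fault report (none of these exist in propolis today):

```rust
// Hypothetical stand-ins; real code would capture actual guest/device state.
struct FaultReport;
fn capture_fault_report() -> FaultReport { FaultReport }
fn persist_for_support(_report: FaultReport) { /* write to a support bundle */ }
fn reset_instance() { /* trigger an instance reboot */ }

/// Hypothetical policy for what to do when a vCPU exit can't be handled.
enum ExitPolicy {
    Abort,
    CaptureAndReset,
}

fn on_unhandled_exit(policy: ExitPolicy) {
    match policy {
        // Developer workflow: abort, preserving userspace state in the core
        // and kernel VMM state in the vmm device instance.
        ExitPolicy::Abort => std::process::abort(),
        // Default elsewhere: record the fault, then reboot the instance so it
        // can return to service, as if it had triple-faulted.
        ExitPolicy::CaptureAndReset => {
            let report = capture_fault_report();
            persist_for_support(report);
            reset_instance();
        }
    }
}
```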

@hawkw (Member) commented Sep 1, 2024

There's work currently in progress in the control plane to allow Nexus to automatically restart instances whose propolis-server has crashed (if configured to do so). In particular, oxidecomputer/omicron#6455 moves instances to the Failed state when their VMM has crashed, and oxidecomputer/omicron#6503 will add an RPW for restarting Failed instances if they have an "auto-restart" configuration set.

Potentially, we could leverage that here and just allow propolis-servers that encounter this kind of guest misbehavior to crash and leave behind a core dump, knowing that the control plane will restart the instance if that's what the user wanted. On the other hand, this is potentially less efficient than restarting the guest within the same propolis-server, since it requires the control plane to spin up a whole new VMM and start the instance there. But, I figured it was worth mentioning!

@pfmooney (Collaborator, Author) commented Sep 1, 2024

In the case of #755 (and similar circumstances), I don't think that crashing is at all ideal. If we have a mechanism for surfacing information for support, it's probably more sensible to collect additional information about the state of the guest (registers, etc.), since figuring that out from the propolis core dump alone will be challenging, if not impossible.
