[sled-agent] Unbreak handling of Propolis error codes #6726

Merged
hawkw merged 9 commits into main from eliza/dude-wheres-my-instance on Oct 1, 2024

Conversation

@hawkw (Member) commented Sep 30, 2024

At present, sled-agent's handling of the error codes used by Propolis to
indicate that it has crashed and been restarted is woefully incorrect.
In particular, there are two cases where such an error may be
encountered by a sled-agent:

  1. When attempting to ask the VMM to change state (e.g. reboot or stop
    the instance)
  2. When hitting Propolis' instance-state-monitor API endpoint to
    proactively check the VM's current state

Neither of these is handled correctly today.

In the first case, if a sled-agent operation on behalf of Nexus
encounters a client error from Propolis, it will forward that error code
to Nexus...but, Nexus only moves instances to Failed in the face of
sled-agent's 410 NO_SUCH_INSTANCE, and not if it sees the
NoInstance error code from Propolis, which means that the ghosts
left behind by crashed and restarted Propolii still won't be moved to
Failed. Agh. Furthermore, in that case, sled-agent itself will not
perform the necessary cleanup actions to deal with the now-defunct
Propolis zone (e.g. collecting a zone bundle and then destroying the
zone).

In the second case, where we hit the instance-state-monitor endpoint and
get back a NoInstance error, things are equally dire. The
InstanceMonitorRunner task, which is responsible for polling the state
monitor endpoint, will just bail out on receipt of any error from
Propolis:

```rust
impl InstanceMonitorRunner {
    async fn run(self) -> Result<(), anyhow::Error> {
        let mut gen = 0;
        loop {
            // State monitoring always returns the most recent state/gen pair
            // known to Propolis.
            let response = self
                .client
                .instance_state_monitor()
                .body(propolis_client::types::InstanceStateMonitorRequest {
                    gen,
                })
                .send()
                .await?
                .into_inner();
            let observed_gen = response.gen;

            // Now that we have the response from Propolis' HTTP server, we
            // forward that to the InstanceRunner.
            //
            // It will decide the new state, provide that info to Nexus,
            // and possibly identify if we should terminate.
            let (tx, rx) = oneshot::channel();
            self.tx_monitor
                .send(InstanceMonitorRequest::Update { state: response, tx })
                .await?;

            if let Reaction::Terminate = rx.await? {
                return Ok(());
            }

            // Update the generation number we're asking for, to ensure the
            // Propolis will only return more recent values.
            gen = observed_gen + 1;
        }
    }
}
```

We would expect this to drop the send-side of the channel it uses to
communicate with the InstanceRunner, closing the channel, and hitting
this select, which would correctly mark the VMM as failed and do what we
want, despite eating the actual error message from propolis:

```rust
// NOTE: This case shouldn't really happen, as we keep a copy
// of the sender alive in "self.tx_monitor".
None => {
    warn!(self.log, "Instance 'VMM monitor' channel closed; shutting down");
    let mark_failed = true;
    self.terminate(mark_failed).await;
},
```

HOWEVER, as the comment points out, the InstanceRunner is also
holding its own clone of the channel sender, keeping us from ever
hitting that case:

```rust
// Request channel on which monitor requests are made.
tx_monitor: mpsc::Sender<InstanceMonitorRequest>,
```

AFAICT, this means we'll just totally give up on monitoring the instance
as soon as we hit any error here, which seems...quite bad. I think
that this actually means that when a Propolis process crashes
unexpectedly, we'll get an error when the TCP connection closes, bail
out, and then never try hitting the instance monitor endpoint ever
again[^1]. So, we won't notice that the Propolis is actually out to
lunch until we try to ask it to change state. This Is Not Great!
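
For illustration, here is a small self-contained sketch (not the sled-agent code; the names are illustrative) of the channel mechanics involved: a Tokio mpsc `recv()` only returns `None` once every `Sender`, including clones, has been dropped, so the extra clone held by the `InstanceRunner` keeps that "channel closed" arm from ever firing.

```rust
use std::time::Duration;

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(8);

    // Analogous to the InstanceRunner keeping its own copy of tx_monitor.
    let extra_clone = tx.clone();

    // The "monitor task" sends one update and then bails out, dropping its
    // sender, just as run() does when it returns early on an error.
    tokio::spawn(async move {
        tx.send(1).await.unwrap();
    });

    assert_eq!(rx.recv().await, Some(1));

    // Because `extra_clone` is still alive, recv() parks forever instead of
    // returning None, so a "channel closed" select arm is never taken.
    tokio::select! {
        msg = rx.recv() => println!("channel closed: {msg:?}"),
        _ = tokio::time::sleep(Duration::from_millis(100)) => {
            println!("recv() is still pending; the channel never closed");
        }
    }

    // Only once every sender is gone does recv() finally yield None.
    drop(extra_clone);
    assert_eq!(rx.recv().await, None);
}
```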

This commit fixes both of these cases, by making sled-agent actually
handle Propolis' error codes correctly. I've added a dependency on the
propolis_api_types crate, which is where the error code lives, and
some code in sled-agent to attempt to parse these codes when receiving
an error response from the Propolis client. Now, when we try to PUT a
new state to the instance, we'll look at the error code that comes back,
mark it as Failed if the error indicates that we should do so, and
publish the Failed VMM state to Nexus before tearing down the zone.
The InstanceMonitorTask now behaves similarly, and I've changed it to
retry all other errors with a backoff, rather than just bailing out
immediately on receipt of the first error.
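
As a rough sketch of the shape of this handling (the enum, the parsing helper, and the backoff parameters below are illustrative stand-ins, not the actual sled-agent or `propolis_api_types` definitions; only the `NoInstance` code is named in this PR):

```rust
use std::time::Duration;

/// Stand-in for the error-code enum in `propolis_api_types`; only
/// `NoInstance` is actually named in this PR, the rest is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PropolisErrorCode {
    NoInstance,
    Other,
}

/// Hypothetical helper: try to pull a structured error code out of the body
/// of an error response from the Propolis client.
fn parse_error_code(body: &str) -> Option<PropolisErrorCode> {
    match body.trim() {
        "NoInstance" => Some(PropolisErrorCode::NoInstance),
        "" => None,
        _ => Some(PropolisErrorCode::Other),
    }
}

/// What the caller should do about an error from Propolis.
#[derive(Debug, PartialEq)]
enum ErrorReaction {
    /// The VM is gone: mark the VMM Failed, publish that to Nexus, then
    /// collect a zone bundle and tear down the zone.
    MarkFailed,
    /// Anything else is treated as transient and retried with backoff rather
    /// than abandoning the monitor loop.
    Retry { after: Duration },
}

fn react_to_error(body: &str, consecutive_failures: u32) -> ErrorReaction {
    match parse_error_code(body) {
        Some(PropolisErrorCode::NoInstance) => ErrorReaction::MarkFailed,
        _ => {
            // Exponential backoff, capped so we keep polling reasonably often.
            let after = Duration::from_millis(50)
                .saturating_mul(2u32.saturating_pow(consecutive_failures))
                .min(Duration::from_secs(5));
            ErrorReaction::Retry { after }
        }
    }
}

fn main() {
    assert_eq!(react_to_error("NoInstance", 0), ErrorReaction::MarkFailed);
    assert!(matches!(
        react_to_error("connection refused", 3),
        ErrorReaction::Retry { .. }
    ));
}
```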

I've manually tested this on london as discussed here:
#6726 (comment)

Unfortunately, it's hard to do any kind of automated testing of this
with the current test situation, as none of this code is exercised by
the simulated sled-agent.

Fixes #3209
Fixes #3238

Footnotes

[^1]: Although I've not actually verified that this is what happens.

@hawkw (Member, Author) commented Sep 30, 2024

Okay, it looks like this does work. If I have a running instance, and I zlogin into the oxz_propolis-server zone and restart the Propolis service, I can see the instance go to Failed more or less immediately:

eliza@theseus ~/Code/oxide/oxide.rs $ ssh londonswitch
The illumos Project     helios-2.0.22923        September 2024
root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (106.0.0)
ID                                   NAME   STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
2433197b-fd8f-4655-9f49-19339b8a327a labrat running 7d64e906-a0ab-4662-9330-6de0eb691831 30a2fe41-4a2c-44cd-a1f4-7b17c5181371 BRM42220036
root@oxz_switch1:~# pilot host login BRM42220036

    #####
   ##   ##
  ##   # ##  ##   ##
  ##  #  ##   ## ##     Oxide Computer Company
  ## #   ##    ###      Engineering
   ##   ##    ## ##
    #####    ##   ##    Gimlet

BRM42220036 # /opt/oxide/oxlog/oxlog zones
global
oxz_cockroachdb_7f8b3453-e625-424b-8306-69507a55e1e4
oxz_crucible_07d6e3ee-b489-4698-b7c7-b2dfc456cd84
oxz_crucible_0974390c-b98c-4116-bca0-f716cda87587
oxz_crucible_1d3b3b88-feaa-4cf3-af31-627fcd1dfb06
oxz_crucible_31d9ddca-164d-49a2-80ee-f047cf7874e6
oxz_crucible_7471f78f-7b3d-4aaf-86ff-779c8644b335
oxz_crucible_8fac3aae-227a-4f62-a06d-e777a3fdbe54
oxz_crucible_c1aa0788-68cb-4083-95d6-cca9c15a8e12
oxz_crucible_f7d1b214-55ef-4db6-91ad-cdec7a367169
oxz_crucible_f8c20512-e9da-4329-8569-be0733a70217
oxz_crucible_pantry_99df66d0-bccc-4d44-b793-164036844faa
oxz_external_dns_a73b9c9c-fa12-453c-9546-0ddc96411674
oxz_internal_dns_63cb81fd-47fd-4065-9fc8-e0df2f85f396
oxz_ntp_a2995263-07d9-48d5-945d-ef5dbd58fcda
oxz_oximeter_2ab67cd8-2917-4191-8fcf-3e80de09328b
oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831
oxz_switch
BRM42220036 # zlogin oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831
[Connected to zone 'oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831' pts/5]
The illumos Project     helios-2.0.22923        September 2024
root@oxz_propolis:~# svcs
STATE          STIME    FMRI
legacy_run     19:23:58 lrc:/etc/rc2_d/S20sysetup
legacy_run     19:23:58 lrc:/etc/rc2_d/S89PRESERVE
online         19:23:52 svc:/system/svc/restarter:default
online         19:23:53 svc:/system/early-manifest-import:default
online         19:23:53 svc:/system/filesystem/root:default
online         19:23:53 svc:/network/ip-interface-management:default
online         19:23:53 svc:/network/netcfg:default
online         19:23:53 svc:/network/datalink-management:default
online         19:23:53 svc:/system/boot-archive:default
online         19:23:54 svc:/network/loopback:default
online         19:23:54 svc:/system/filesystem/usr:default
online         19:23:54 svc:/system/device/local:default
online         19:23:54 svc:/network/physical:default
online         19:23:54 svc:/milestone/devices:default
online         19:23:54 svc:/network/ipsec/ipsecalgs:default
online         19:23:54 svc:/system/identity:node
online         19:23:54 svc:/system/device/audio:default
online         19:23:54 svc:/system/filesystem/minimal:default
online         19:23:54 svc:/network/ipsec/policy:default
online         19:23:54 svc:/system/rmtmpfiles:default
online         19:23:54 svc:/milestone/network:default
online         19:23:54 svc:/network/initial:default
online         19:23:54 svc:/network/iptun:default
online         19:23:54 svc:/system/hostid:default
online         19:23:54 svc:/network/netmask:default
online         19:23:54 svc:/system/auditset:default
online         19:23:54 svc:/system/logadm-upgrade:default
online         19:23:54 svc:/system/pfexec:default
online         19:23:54 svc:/system/pkgserv:default
online         19:23:54 svc:/system/fmd:default
online         19:23:56 svc:/system/rbac:default
online         19:23:57 svc:/network/service:default
online         19:23:57 svc:/system/cryptosvc:default
online         19:23:57 svc:/system/identity:domain
online         19:23:57 svc:/system/keymap:default
online         19:23:57 svc:/system/name-service-cache:default
online         19:23:57 svc:/milestone/name-services:default
online         19:23:57 svc:/system/manifest-import:default
online         19:23:57 svc:/milestone/single-user:default
online         19:23:57 svc:/system/coreadm:default
online         19:23:57 svc:/milestone/sysconfig:default
online         19:23:57 svc:/system/filesystem/local:default
online         19:23:57 svc:/system/utmp:default
online         19:23:58 svc:/system/system-log:default
online         19:23:58 svc:/system/cron:default
online         19:23:58 svc:/system/console-login:default
online         19:23:58 svc:/system/boot-archive-update:default
online         19:23:58 svc:/network/shares/group:default
online         19:23:58 svc:/milestone/multi-user:default
online         19:23:58 svc:/system/boot-config:default
online         19:23:58 svc:/network/shares/group:zfs
online         19:23:58 svc:/network/routing-setup:default
online         19:23:58 svc:/system/update-man-index:default
online         19:23:58 svc:/network/ssh:default
online         19:23:58 svc:/milestone/multi-user-server:default
online         19:23:58 svc:/system/illumos/propolis-server:default
online         19:24:08 svc:/network/routing/ndp:default
root@oxz_propolis:~# svcadm restart svc:/system/illumos/propolis-server:default
root@oxz_propolis:~#
[Connection to zone 'oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831' pts/5 closed]
BRM42220036 # exit
logout
root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (106.0.0)
ID                                   NAME   STATE  PROPOLIS_ID          SLED_ID           HOST_SERIAL
2433197b-fd8f-4655-9f49-19339b8a327a labrat failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#

Meanwhile, in the sled-agent logs, we can see it observe the Propolis go away, and recognize the error code from the restarted Propolis as indicating VM-gone-ness:

19:29:43.487Z WARN SledAgent (InstanceManager): Failed to poll Propolis state
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:267
    generation = 3
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831
    retry_in = 59.516814ms
19:29:43.548Z WARN SledAgent (InstanceManager): Failed to poll Propolis state
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:267
    generation = 3
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831
    retry_in = 143.190326ms
19:29:45.050Z WARN SledAgent (InstanceManager): Failed to poll Propolis state
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:267
    generation = 3
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831
    retry_in = 177.540776ms
19:29:45.231Z INFO SledAgent (InstanceManager): Error code from Propolis after instance initialized, assuming it's gone
    error = NoInstance
    file = sled-agent/src/instance.rs:446
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831
19:29:45.231Z INFO SledAgent (StorageManager): Received GetLatestResources(...)
    file = sled-storage/src/manager.rs:451
19:29:45.231Z WARN SledAgent: target removed with queued samples
    file = oximeter/instruments/src/kstat/sampler.rs:602
    id = 7542715260913182298
    n_samples = 18
    sled_id = 30a2fe41-4a2c-44cd-a1f4-7b17c5181371
19:29:45.231Z INFO SledAgent (StorageManager): Received GetLatestResources(...)
    file = sled-storage/src/manager.rs:451
19:29:45.231Z INFO SledAgent (ZoneBundler): creating zone bundle
    context = ZoneBundleContext { storage_dirs: ["/pool/int/88e574a7-1a37-47ae-a29c-34fe1d60353a/debug/bundle/zone", "/pool/int/a15c936d-bf01-4411-aab2-d401b66910c6/debug/bundle/zone"], cause: TerminatedInstance, extra_log_dirs: ["/pool/ext/35078395-5274-410d-ac65-4b941df7d371/crypt/debug", "/pool/ext/9c13a5e9-e5e0-4b55-b973-66ae886fc8f5/crypt/debug", "/pool/ext/613d06ff-248e-4720-b91a-fa0802bab36d/crypt/debug", "/pool/ext/85e1166a-5217-4a23-b10a-b280d063cdab/crypt/debug", "/pool/ext/0987bf84-86d8-4c01-ab83-908e27a89cc3/crypt/debug", "/pool/ext/ef785d36-c3f3-47ec-ad4d-97915e4769b9/crypt/debug", "/pool/ext/fa041119-3b83-4998-bc02-dfe4ad392887/crypt/debug", "/pool/ext/bd4a215e-f7aa-4b38-94d6-0e3e823bb6e2/crypt/debug", "/pool/ext/fcf8ee81-3fc7-4ba6-987c-bbc4b1b0aa07/crypt/debug"] }
    file = sled-agent/src/zone_bundle.rs:339
    zone_name = oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831
19:29:45.232Z WARN SledAgent: target removed with queued samples
    file = oximeter/instruments/src/kstat/sampler.rs:602
    id = 7721952280714464990
    n_samples = 18
    sled_id = 30a2fe41-4a2c-44cd-a1f4-7b17c5181371
19:29:45.583Z INFO SledAgent: request completed
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 6724
    local_addr = [fd00:1122:3344:101::1]:54111
    method = GET
    remote_addr = [fd00:1122:3344:101::5]:38069
    req_id = b660d4d8-46a9-44b0-93b7-7bd36a81a549
    response_code = 200
    sled_id = 30a2fe41-4a2c-44cd-a1f4-7b17c5181371
    uri = /30a2fe41-4a2c-44cd-a1f4-7b17c5181371
19:29:46.187Z INFO SledAgent (ZoneBundler): finished zone bundle
    file = sled-agent/src/zone_bundle.rs:1142
    metadata = ZoneBundleMetadata { id: ZoneBundleId { zone_name: "oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831", bundle_id: aefd1ca0-c095-4245-989e-cb5f2fad1584 }, time_created: 2024-09-30T19:29:45.232807722Z, version: 0, cause: TerminatedInstance }
19:29:46.187Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831
    file = sled-agent/src/instance.rs:922
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831
19:29:49.039Z INFO SledAgent (InstanceManager): halt_and_remove_logged: Previous zone state: Running
    file = illumos-utils/src/zone.rs:270
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831
19:29:49.059Z INFO SledAgent (PortManager): Removed route set for subnet
    file = illumos-utils/src/opte/port_manager.rs:904
    id = RouterId { vni: Vni(14782550), kind: System }
19:29:49.059Z INFO SledAgent (PortManager): Removed route set for subnet
    file = illumos-utils/src/opte/port_manager.rs:904
    id = RouterId { vni: Vni(14782550), kind: Custom(V4(Ipv4Net { addr: 172.30.0.0, width: 22 })) }
19:29:49.060Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:593
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831
    state = SledVmmState { vmm_state: VmmRuntimeState { state: Failed, gen: Generation(5), time_updated: 2024-09-30T19:29:49.060460162Z }, migration_in: None, migration_out: None }
19:29:49.144Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:593
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831
    state = SledVmmState { vmm_state: VmmRuntimeState { state: Failed, gen: Generation(5), time_updated: 2024-09-30T19:29:49.060460162Z }, migration_in: None, migration_out: None }
19:29:49.150Z INFO SledAgent (InstanceManager): State monitoring task complete
    file = sled-agent/src/instance.rs:859
    instance_id = 2433197b-fd8f-4655-9f49-19339b8a327a
    propolis_id = 7d64e906-a0ab-4662-9330-6de0eb691831

And, if we pop back over to BRM42220036, we can see the defunct zone has been removed. So that seems good!

root@oxz_switch1:~# pilot host login BRM42220036

    #####
   ##   ##
  ##   # ##  ##   ##
  ##  #  ##   ## ##     Oxide Computer Company
  ## #   ##    ###      Engineering
   ##   ##    ## ##
    #####    ##   ##    Gimlet

BRM42220036 # zoneadm list
global
oxz_switch
oxz_internal_dns_63cb81fd-47fd-4065-9fc8-e0df2f85f396
oxz_ntp_a2995263-07d9-48d5-945d-ef5dbd58fcda
oxz_cockroachdb_7f8b3453-e625-424b-8306-69507a55e1e4
oxz_crucible_f8c20512-e9da-4329-8569-be0733a70217
oxz_crucible_0974390c-b98c-4116-bca0-f716cda87587
oxz_crucible_1d3b3b88-feaa-4cf3-af31-627fcd1dfb06
oxz_crucible_31d9ddca-164d-49a2-80ee-f047cf7874e6
oxz_crucible_7471f78f-7b3d-4aaf-86ff-779c8644b335
oxz_crucible_pantry_99df66d0-bccc-4d44-b793-164036844faa
oxz_crucible_8fac3aae-227a-4f62-a06d-e777a3fdbe54
oxz_crucible_c1aa0788-68cb-4083-95d6-cca9c15a8e12
oxz_crucible_f7d1b214-55ef-4db6-91ad-cdec7a367169
oxz_external_dns_a73b9c9c-fa12-453c-9546-0ddc96411674
oxz_crucible_07d6e3ee-b489-4698-b7c7-b2dfc456cd84
oxz_oximeter_2ab67cd8-2917-4191-8fcf-3e80de09328b
BRM42220036 #

@hawkw (Member, Author) commented Sep 30, 2024

(from the sled-agent logs it does look like it publishes the state generation 5 to Nexus twice, which I ought to figure out...)

Comment on lines 437 to 442
// TODO(eliza): this is kinda jank but Propolis
// currently could return the same error when the
// instance has not been ensured and when it's still
// coming up, which is sad. I'm not sure whether
// that happens IRL or not --- investigate this.
let reaction = if has_ever_seen_a_good_state {
@hawkw (Member, Author):

This extra junk is because I noticed that Propolis' instance-state-monitor endpoint will presently return the NoInstance error on both VmError::NotCreated and VmError::WaitingToInitialize:

https://github.com/oxidecomputer/propolis/blob/11371b0f3743f8df5b047dc0edc2699f4bdf3927/bin/propolis-server/src/lib/server.rs#L397-L399

This gave me some anxiety about sled-agent potentially deciding that a Propolis has "died" while it's actually still waiting to be born, although I'm not sure whether we would actually see the WaitingToInitialize error there in practice or not. I think the proper thing to do would be to change that behavior in Propolis, but I wanted to get something working in sled-agent regardless.
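
Conceptually, the `has_ever_seen_a_good_state` guard in the diff context above amounts to something like the following sketch (illustrative names, not the actual code): only treat `NoInstance` as fatal once the monitor has previously observed a healthy state from this Propolis.

```rust
/// Sketch of the guard: a Propolis that we have already observed in a good
/// state gets marked Failed when it starts reporting NoInstance; one we have
/// never seen healthy may simply still be initializing.
fn should_mark_failed(saw_no_instance: bool, has_ever_seen_a_good_state: bool) -> bool {
    saw_no_instance && has_ever_seen_a_good_state
}

fn main() {
    // Error before any good state: assume it's still coming up, keep polling.
    assert!(!should_mark_failed(true, false));
    // Error after the VM was known-good: the Propolis has gone away.
    assert!(should_mark_failed(true, true));
}
```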

@hawkw (Member, Author):

after actually looking in Propolis, it looks like the Vm::state_watcher() function will never actually return VmError::WaitingToInitialize; it will only return VmError::NotCreated:
https://github.com/oxidecomputer/propolis/blob/11371b0f3743f8df5b047dc0edc2699f4bdf3927/bin/propolis-server/src/lib/vm/mod.rs#L353-L364

So, hitting the state-monitor endpoint should just return a Starting VMM state in that case, which is good. I still don't like that the method can return any of the error types, though, even though it doesn't currently --- I'll go try to clean that up on the Propolis side.

@hawkw (Member, Author):

Okay, oxidecomputer/propolis#771 will make the propolis side way less scary. But I think this code can also just be removed as propolis' current behavior is also correct.

"error" => ?code,
);
self.terminate(true).await;
self.publish_state_to_nexus().await;
@hawkw (Member, Author):

This is probably why we sent the failed state to nexus twice in my manual testing, so I think it can be removed.

hawkw added a commit to oxidecomputer/propolis that referenced this pull request Sep 30, 2024
Currently, the `Vm::get()` and `Vm::state_watcher()` methods return a
`Result<{something}, VmError>`. This means that they can return _any_
`VmError` variant, but the current implementation of these methods will
only ever return `VmError::NotCreated`.

This gave me a bit of anxiety when I noticed that we were presently
handling the `VmError::WaitingForInit` variant identically to
`NotCreated`, which seemed wrong as it implied that we could,
potentially, return an error code indicating that no instance has ever
been ensured when in fact, one _has_ been ensured but is still being
initialized. I freaked out about this a bit here:
oxidecomputer/omicron#6726 (comment)

It turns out the current behavior is actually fine, since if the VM is
still initializing, we still allow accessing it and just return a
`Starting` state, which is correct. But the type signatures of these
functions allow them to return any of these errors, and forces their
callers to handle those errors, and that's the part we were doing
somewhat incorrectly (although, again, in practice it doesn't matter).

This commit changes those functions to return an `Option` rather than a
`Result`. Now, we return `None` if no VM has ever been created, making
that the _only_ error case that callers have to handle. There's no
longer any risk of the HTTP API handlers accidentally returning a "VM
never ensured" error when the VM is ensured-but-initializing. Also, we
can remove the `NotCreated` variant, since it's now no longer returned
anywhere.
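
A minimal sketch of the signature change described above, using simplified stand-ins for the Propolis types (the real `Vm`, `VmObjects`, and `VmError` are richer):

```rust
// Simplified stand-ins for the real Propolis types.
struct VmObjects;

#[derive(Debug)]
#[allow(dead_code)]
enum VmError {
    NotCreated,
    WaitingForInit,
}

struct Vm {
    objects: Option<VmObjects>,
}

impl Vm {
    // Before: the Result signature admits every VmError variant, so callers
    // are forced to handle WaitingForInit even though it is never returned.
    fn get_before(&self) -> Result<&VmObjects, VmError> {
        self.objects.as_ref().ok_or(VmError::NotCreated)
    }

    // After: Option makes "no VM has ever been created" the only case left
    // for callers to handle, and the NotCreated variant can be deleted.
    fn get_after(&self) -> Option<&VmObjects> {
        self.objects.as_ref()
    }
}

fn main() {
    let absent = Vm { objects: None };
    assert!(matches!(absent.get_before(), Err(VmError::NotCreated)));
    assert!(absent.get_after().is_none());

    let present = Vm { objects: Some(VmObjects) };
    assert!(present.get_after().is_some());
}
```
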
@hawkw hawkw changed the title [wip] unbreak propolis monitoring in sled-agent [sled-agent] Unbreak handling of Propolis error codes Sep 30, 2024
@hawkw hawkw requested review from gjcolombo and ahl September 30, 2024 21:16
@hawkw hawkw marked this pull request as ready for review September 30, 2024 21:16
@hawkw hawkw requested review from smklein and removed request for ahl September 30, 2024 21:18
@hawkw (Member, Author) commented Sep 30, 2024

I'm going to redo my manual testing from #6726 (comment) with the latest changes from b59cfc1, but I'm pretty sure that it'll still work. Figured it was worth making sure, though.

@ahl (Contributor) left a comment:

I'd suggest @pfmooney take a look.

Perhaps something like oxidecomputer/dropshot#39 would better allow propolis to communicate actionable errors.

@hawkw hawkw enabled auto-merge (squash) October 1, 2024 17:06
@hawkw hawkw disabled auto-merge October 1, 2024 17:06
@hawkw hawkw enabled auto-merge (squash) October 1, 2024 17:06
@hawkw hawkw self-assigned this Oct 1, 2024
@hawkw hawkw added this to the 11 milestone Oct 1, 2024
@hawkw hawkw merged commit 3093818 into main Oct 1, 2024
18 checks passed
@hawkw hawkw deleted the eliza/dude-wheres-my-instance branch October 1, 2024 19:09
hawkw added a commit that referenced this pull request Oct 2, 2024
At present, sled-agent's handling of the error codes used by Propolis to
indicate that it has crashed and been restarted is woefully incorrect.
In particular, there are two cases where such an error may be
encountered by a sled-agent:

1. When attempting to ask the VMM to change state (e.g. reboot or stop
   the instance)
2. When hitting Propolis' `instance-state-monitor` API endpoint to
   proactively check the VM's current state

Neither of these is handled correctly today.

In the first case, if a sled-agent operation on behalf of Nexus
encounters a client error from Propolis, it will forward that error code
to Nexus...but, _Nexus_ only moves instances to `Failed` in the face of
sled-agent's `410 NO_SUCH_INSTANCE`, and _not_ [if it sees the
`NoInstance` error code from Propolis][1], which means that the ghosts
left behind by crashed and restarted Propolii still won't be moved to
`Failed`. Agh. Furthermore, in that case, sled-agent itself will not
perform the necessary cleanup actions to deal with the now-defunct
Propolis zone (e.g. collecting a zone bundle and then destroying the
zone).

In the second case, where we hit the instance-state-monitor endpoint and
get back a `NoInstance` error, things are equally dire. The
`InstanceMonitorRunner` task, which is responsible for polling the state
monitor endpoint, will just bail out on receipt of any error from
Propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L253-L289

We would _expect_ this to drop the send-side of the channel it uses to
communicate with the `InstanceRunner`, closing the channel, and hitting
this select, which would correctly mark the VMM as failed and do what we
want, despite eating the actual error message from propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L388-L394

HOWEVER, as the comment points out, the `InstanceRunner` is _also_
holding its own clone of the channel sender, keeping us from ever
hitting that case:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L308-L309

AFAICT, this means we'll just totally give up on monitoring the instance
as soon as we hit any error here, which seems...quite bad. I *think*
that this actually means that when a Propolis process crashes
unexpectedly, we'll get an error when the TCP connection closes, bail
out, and then _never try hitting the instance monitor endpoint ever
again_[^1]. So, we won't notice that the Propolis is actually out to
lunch until we try to ask it to change state. **This Is Not Great!**

This commit fixes both of these cases, by making sled-agent actually
handle Propolis' error codes correctly. I've added a dependency on the
`propolis_api_types` crate, which is where the error code lives, and
some code in sled-agent to attempt to parse these codes when receiving
an error response from the Propolis client. Now, when we try to PUT a
new state to the instance, we'll look at the error code that comes back,
mark it as `Failed` if the error indicates that we should do so, and
publish the `Failed` VMM state to Nexus before tearing down the zone.
The `InstanceMonitorTask` now behaves similarly, and I've changed it to
retry all other errors with a backoff, rather than just bailing out
immediately on receipt of the first error.

I've manually tested this on `london` as discussed here:

#6726 (comment)

Unfortunately, it's hard to do any kind of automated testing of this
with the current test situation, as none of this code is exercised by
the simulated sled-agent.

Fixes #3209
Fixes #3238

[^1]: Although I've not actually verified that this is what happens.

[1]: https://github.com/oxidecomputer/propolis/blob/516dabe473cecdc3baea1db98a80441968c144cf/crates/propolis-api-types/src/lib.rs#L434-L441
@hawkw (Member, Author) commented Oct 2, 2024

Okay, I've re-tested this after the Dogfood update to c50cf01, and it definitely still works --- if I restart a propolis-server process, the instance gets reincarnated pretty much immediately.

In the switch zone, inspecting my newly created test instance:

root@oxz_switch0:~# omdb db instance info cb44ae27-befb-424a-b9f1-087ac9d0c898
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (107.0.0)

== INSTANCE ====================================================================
                        ID: cb44ae27-befb-424a-b9f1-087ac9d0c898
                project ID: 50d839b4-2d18-49c0-94cf-7d6883e00178
                      name: test-instance-please-ignore
               description: let's see what happens when propolis crashes
                created at: 2024-10-02 21:10:26.781544 UTC
          last modified at: 2024-10-02 21:10:26.781544 UTC

== CONFIGURATION ===============================================================
                     vCPUs: 2
                    memory: 8 GiB
                  hostname: test-instance-please-ignore
                 boot disk: None
              auto-restart:
                  InstanceAutoRestart {
                      policy: None,
                      cooldown: None,
                  }

== RUNTIME STATE ===============================================================
               nexus state: Vmm
(i)     external API state: Running
           last updated at: 2024-10-02T21:10:26.781544Z (generation 3)
       needs reincarnation: false
             karmic status: bound to saṃsāra
      last reincarnated at: None
             active VMM ID: Some(89437bf4-da6b-441d-bc9c-c9344b6893e3)
             target VMM ID: None
              migration ID: None
              updater lock: UNLOCKED at generation: 1

         active VMM record:
             Vmm {
                 id: 89437bf4-da6b-441d-bc9c-c9344b6893e3,
                 time_created: 2024-10-02T21:10:30.927858Z,
                 time_deleted: None,
                 instance_id: cb44ae27-befb-424a-b9f1-087ac9d0c898,
                 sled_id: 5f6720b8-8a31-45f8-8c94-8e699218f28b,
                 propolis_ip: V6(
                     Ipv6Network {
                         addr: fd00:1122:3344:109::1:4f6,
                         prefix: 128,
                     },
                 ),
                 propolis_port: SqlU16(
                     12400,
                 ),
                 runtime: VmmRuntimeState {
                     time_state_updated: 2024-10-02T21:10:41.851556Z,
                     gen: Generation(
                         Generation(
                             4,
                         ),
                     ),
                     state: Running,
                 },
             }
root@oxz_switch0:~#

log into the instance's sled and restart its propolis-server

root@oxz_switch0:~# pilot host login BRM42220017

    #####
   ##   ##
  ##   # ##  ##   ##
  ##  #  ##   ## ##     Oxide Computer Company
  ## #   ##    ###      Engineering
   ##   ##    ## ##
    #####    ##   ##    Gimlet

BRM42220017 # zlogin oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3
[Connected to zone 'oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3' pts/3]
root@oxz_propolis:~# svcadm restart svc:/system/illumos/propolis-server:default
root@oxz_propolis:~#
[Connection to zone 'oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3' pts/3 closed]
BRM42220017 # exit
logout
root@oxz_switch0:~# 

instance is restarted successfully!

root@oxz_switch0:~# omdb db instance info cb44ae27-befb-424a-b9f1-087ac9d0c898
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (107.0.0)

== INSTANCE ====================================================================
                        ID: cb44ae27-befb-424a-b9f1-087ac9d0c898
                project ID: 50d839b4-2d18-49c0-94cf-7d6883e00178
                      name: test-instance-please-ignore
               description: let's see what happens when propolis crashes
                created at: 2024-10-02 21:10:26.781544 UTC
          last modified at: 2024-10-02 21:10:26.781544 UTC

== CONFIGURATION ===============================================================
                     vCPUs: 2
                    memory: 8 GiB
                  hostname: test-instance-please-ignore
                 boot disk: None
              auto-restart:
                  InstanceAutoRestart {
                      policy: None,
                      cooldown: None,
                  }

== RUNTIME STATE ===============================================================
               nexus state: Vmm
(i)     external API state: Starting
           last updated at: 2024-10-02T21:14:44.002990Z (generation 5)
       needs reincarnation: false
/!\          karmic status: cooling down (TimeDelta { secs: 3589, nanos: 929383990 } remaining)
      last reincarnated at: Some(2024-10-02T21:14:45.996494Z)
             active VMM ID: Some(9111dfe7-0d55-4c6b-a72c-1f35ba9b5fc2)
             target VMM ID: None
              migration ID: None
              updater lock: UNLOCKED at generation: 4

         active VMM record:
             Vmm {
                 id: 9111dfe7-0d55-4c6b-a72c-1f35ba9b5fc2,
                 time_created: 2024-10-02T21:14:45.918847Z,
                 time_deleted: None,
                 instance_id: cb44ae27-befb-424a-b9f1-087ac9d0c898,
                 sled_id: f15774c1-b8e5-434f-a493-ec43f96cba06,
                 propolis_ip: V6(
                     Ipv6Network {
                         addr: fd00:1122:3344:105::1:5d5,
                         prefix: 128,
                     },
                 ),
                 propolis_port: SqlU16(
                     12400,
                 ),
                 runtime: VmmRuntimeState {
                     time_state_updated: 2024-10-02T21:14:46.684417Z,
                     gen: Generation(
                         Generation(
                             2,
                         ),
                     ),
                     state: Starting,
                 },
             }
root@oxz_switch0:~#
