(This is derived from one of the issues described in https://github.com/oxidecomputer/colo/issues/102.)
During the colo R13 upgrade, blueprint execution was failing due to a nonexistent disk.
This disk was physically removed a long time ago (February 2024), but was still marked as in-service and active in the database and the blueprint.
We expunged the disk via omdb, generated a new blueprint, and made it the target. Execution still failed, but now due to a timeout trying to PUT /omicron-physical-disks to the sled where that disk was removed.
I believe the request itself is correct: it no longer includes the disk that was removed (serial number A079E79D). The timeout is the same symptom we saw in #6904, but the logs (and the cause) are different in this case. Grepping for the sequence of physical disks ensure logs inside SledAgent::omicron_physical_disks_ensure() and homing in on the point at which we switched from Generation(2) to Generation(3) (i.e., the new config that omits the removed disk):
22:27:12.822Z INFO SledAgent: physical disks ensure
file = sled-agent/src/sled_agent.rs:895
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.822Z INFO SledAgent: physical disks ensure: Updated storage
file = sled-agent/src/sled_agent.rs:899
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.822Z INFO SledAgent: physical disks ensure: Propagating new generation of disks
file = sled-agent/src/sled_agent.rs:925
generation = Generation(2)
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.822Z INFO SledAgent: physical disks ensure: Updated storage monitor
file = sled-agent/src/sled_agent.rs:930
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.822Z INFO SledAgent: physical disks ensure: Updated zone bundler
file = sled-agent/src/sled_agent.rs:935
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.823Z INFO SledAgent: physical disks ensure: Updated probes
file = sled-agent/src/sled_agent.rs:941
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.823Z INFO SledAgent: physical disks ensure: Updated instances
file = sled-agent/src/sled_agent.rs:946
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:33.211Z INFO SledAgent: physical disks ensure
file = sled-agent/src/sled_agent.rs:895
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:33.213Z INFO SledAgent: physical disks ensure: Updated storage
file = sled-agent/src/sled_agent.rs:899
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:33.213Z INFO SledAgent: physical disks ensure: Propagating new generation of disks
file = sled-agent/src/sled_agent.rs:925
generation = Generation(3)
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:42.589Z INFO SledAgent: physical disks ensure
file = sled-agent/src/sled_agent.rs:895
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:42.590Z INFO SledAgent: physical disks ensure: Updated storage
file = sled-agent/src/sled_agent.rs:899
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:42.590Z INFO SledAgent: physical disks ensure: Propagating new generation of disks
file = sled-agent/src/sled_agent.rs:925
generation = Generation(3)
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
The logs continue like this: in all subsequent requests, we see the Updated storage and Propagating new generation of disks logs for generation 3, but we never see the next expected log, Updated storage monitor. This strongly implies we're getting stuck inside StorageMonitorHandle::await_generation(). That method uses watch::Receiver::wait_for() with no timeout, which will wait indefinitely for the condition to be true. I think this means we're "leaking" a tokio task on every PUT /omicron-physical-disks call from a blueprint executor RPW, which the logs appear to confirm. Prior to the switch to Generation(3), we see many request completed logs; e.g.,
... many more "request completed" logs ...
22:26:12.525Z INFO SledAgent (dropshot (SledAgent)): request completed
file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
latency_us = 1322
local_addr = [fd00:1122:3344:111::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:108::3]:39077
req_id = b0159a61-1cca-4716-bab4-00ebfb4c3bbb
response_code = 200
uri = /omicron-physical-disks
22:27:05.778Z INFO SledAgent (dropshot (SledAgent)): request completed
file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
latency_us = 1160
local_addr = [fd00:1122:3344:111::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:111::3]:61977
req_id = bba3e7eb-c02a-4826-8d43-caca21615199
response_code = 200
uri = /omicron-physical-disks
22:27:12.823Z INFO SledAgent (dropshot (SledAgent)): request completed
file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
latency_us = 1132
local_addr = [fd00:1122:3344:111::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:108::3]:39297
req_id = 5d0697ed-026f-456a-abb9-7e4f03726f52
response_code = 200
uri = /omicron-physical-disks
After we start receiving the Generation(3) requests, we see warnings from dropshot that the client is disconnecting (with a latency_us consistent with the 60-second timeout Nexus uses for its sled-agent clients), but we never see the logs dropshot would emit once the handler actually completes (request completed after handler was already cancelled). That is consistent with all of those requests still being stuck in the wait_for() call, waiting for the storage monitor to realize we're now at generation 3.
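For anyone trying to follow along, here is a minimal, self-contained sketch of that hang. None of this is the actual sled-agent code (MonitorHandle, the Option<u64> payload, and the generation numbers are invented for illustration); it only demonstrates the behavior in question: tokio's watch::Receiver::wait_for() has no deadline, so it parks forever if the watched value never satisfies the predicate, which is exactly where these handlers appear to be stuck.

```rust
use std::time::Duration;
use tokio::sync::watch;

// Hypothetical stand-in for the storage monitor handle: all it knows is the
// latest generation the monitor task has finished handling (None = none yet).
struct MonitorHandle {
    rx: watch::Receiver<Option<u64>>,
}

impl MonitorHandle {
    // Assumed shape of the await-generation pattern: wait, with no deadline,
    // until the monitor reports a generation at least as new as `wanted`.
    async fn await_generation(&self, wanted: u64) {
        let mut rx = self.rx.clone();
        // If nothing upstream ever sends a matching value (the situation
        // described above), this await never resolves.
        rx.wait_for(|current| matches!(current, Some(g) if *g >= wanted))
            .await
            .expect("storage monitor task exited");
    }
}

#[tokio::main]
async fn main() {
    // The monitor last reported generation 2 and will never report 3.
    let (_tx, rx) = watch::channel(Some(2u64));
    let handle = MonitorHandle { rx };

    // The timeout here exists only to demonstrate the hang; the real handler
    // has no timeout, so it stays parked long after Nexus gives up at 60s.
    match tokio::time::timeout(Duration::from_secs(1), handle.await_generation(3)).await {
        Ok(()) => println!("saw generation 3"),
        Err(_) => println!("still waiting (this is the hang visible in the logs)"),
    }
}
```

Running this prints the "still waiting" branch after a second; the real handlers have nothing bounding the wait, so each timed-out PUT just leaves another future parked on the channel.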
It's kinda hard to follow how the storage monitor is supposed to get updates here. Working backwards, I think what is supposed to happen is (a rough sketch follows the list):

- StorageMonitor::handle_resource_update() updates the watch channel with a new generation after updating the dump device
- StorageMonitor::run() calls handle_resource_update() when StorageManager::wait_for_changes() returns
- StorageManager::wait_for_changes() is itself waiting for changes on a different watch channel: self.disk_updates
- The self.disk_updates channel is the receiving side; the sending side comes from StorageResources
- There are a handful of places inside StorageResources where new values are sent into the disk_updates channel, but I believe only one of them is in the PUT /omicron-physical-disks path: the one at the end of synchronize_disk_management() (because synchronize_disk_management() is called by omicron_physical_disks_ensure_internal(), which is ultimately called by the endpoint handler)
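Here is the rough sketch promised above: the same chain modeled with plain tokio watch channels and u64 generations. The names and payloads are simplifications for illustration, not the real sled-storage types; the point is only that the handler at the end of the chain cannot make progress unless something actually sends the new generation into the first channel.

```rust
use tokio::sync::watch;

// Rough model of the chain: StorageResources -> (disk_updates) -> the
// StorageMonitor task -> (monitor channel) -> a handler awaiting a generation.
#[tokio::main]
async fn main() {
    // Channel 1: "the set of managed disks changed" (sent by the
    // StorageResources role, received by the StorageManager/StorageMonitor role).
    let (disk_updates_tx, mut disk_updates_rx) = watch::channel(2u64);
    // Channel 2: "the monitor has finished handling generation N".
    let (monitor_tx, mut monitor_rx) = watch::channel(2u64);

    // Stand-in for the StorageMonitor::run() loop: wait for a disk-resources
    // change, do its work (dump devices, etc.), then publish the generation.
    tokio::spawn(async move {
        while disk_updates_rx.changed().await.is_ok() {
            let generation = *disk_updates_rx.borrow_and_update();
            // ... handle_resource_update() work elided ...
            let _ = monitor_tx.send(generation);
        }
    });

    // The wait below only finishes because something sends the new generation
    // into channel 1. In the bug described here, nothing does, so the handler
    // waits forever.
    disk_updates_tx.send(3).unwrap();
    monitor_rx.wait_for(|g| *g >= 3).await.unwrap();
    println!("generation 3 observed end-to-end");
}
```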
The disk_updates write in synchronize_disk_management() is conditional on the updated boolean, and I think it is not set to true in the "Nexus told us to stop using a disk that is no longer physically present" case. There are only two places where it's set to true: if we start managing a new disk, which is not applicable in this case, or if we stop managing an old disk. If this disk was still present, we would be in the latter case, but it's not present in the disks map at all, which the logs confirm: prior to our move to generation 3, we see the warnings emitted in the "the control plane told us to use a disk we don't have" branch:
22:27:12.822Z WARN SledAgent (StorageResources): Control plane disk requested, but not detected within sled
disk_identity = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079E79D" }
file = sled-storage/src/resources.rs:331
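To make that conditional write concrete, here is a heavily simplified, hypothetical version of the logic (none of these names are the real sled-storage types, and disks are keyed by serial number purely for illustration):

```rust
use std::collections::{BTreeMap, BTreeSet};
use tokio::sync::watch;

// A disk physically present in the sled; `managed` tracks whether we currently
// manage it on behalf of the control plane.
struct ManagedDisk {
    managed: bool,
}

// Hypothetical sketch of the conditional write: `updated` only becomes true if
// a *present* disk starts or stops being managed.
fn synchronize_disk_management(
    generation: u64,
    requested: &BTreeSet<String>,              // serials the control plane wants managed
    disks: &mut BTreeMap<String, ManagedDisk>, // disks actually detected on the sled
    disk_updates: &watch::Sender<u64>,
) {
    let mut updated = false;

    // Start managing newly requested disks; only possible if they're present.
    for serial in requested {
        match disks.get_mut(serial) {
            Some(disk) if !disk.managed => {
                disk.managed = true;
                updated = true;
            }
            Some(_) => {} // already managed
            None => {
                // "Control plane disk requested, but not detected within sled":
                // the warning above, seen while generation 2 still listed the
                // long-gone disk. Note that it does not set `updated`.
            }
        }
    }

    // Stop managing disks that are present but no longer requested.
    for (serial, disk) in disks.iter_mut() {
        if disk.managed && !requested.contains(serial) {
            disk.managed = false;
            updated = true;
        }
    }

    // For generation 3 the expunged disk is neither requested nor present, so
    // neither loop changes anything, `updated` stays false, and the new
    // generation never goes down the channel chain, leaving the storage
    // monitor (and every handler waiting on it) stuck at generation 2.
    if updated {
        let _ = disk_updates.send(generation);
    }
}
```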
I'm not sure what the fix here should be. The generation inside self.disks is written in set_config(). I'm not sure whether it would be correct to send an update to the disk_updates channel there, before any actual disk management changes have taken place. Maybe synchronize_disk_management() should be told whether the generation has changed (somehow?) and seed the updated boolean with that value? Also on the table: we have a couple of issues in this area that are leading toward major rework of the PUT endpoints (both to merge zones+disks+datasets into one request, #7309, and to make the handling of those requests more asynchronous, #5086), and we've had several bugs like this one (some of which are still open, such as #7546) that we could address as part of that rework. (A goal of that rework should be to simplify a lot of this; tracing disk updates through multiple long-running async tasks and chained watch channels makes it pretty hard to figure out what's supposed to be happening.)
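Purely as a sketch of that second idea (seeding updated from the generation change), reusing the hypothetical names from the sketch above rather than the real code; whether it is actually correct to notify the chain before any disk management has happened is exactly the open question:

```rust
use tokio::sync::watch;

// Sketch of the "seed `updated` with the generation change" idea. The disk
// bookkeeping parameters and loops from the previous sketch are elided;
// set_config(), or wherever the new config is stored, would pass in whether
// the generation actually advanced.
fn synchronize_disk_management(
    generation: u64,
    generation_changed: bool, // computed when the new config was stored
    disk_updates: &watch::Sender<u64>,
) {
    // Seed the flag from the config change instead of starting at false; the
    // elided start/stop-managing logic would still be able to force it true.
    let updated = generation_changed;

    if updated {
        // The watch chain now always hears about a new generation, even when
        // the only "change" is dropping a disk that is no longer physically
        // present, which is exactly the case that hung above.
        let _ = disk_updates.send(generation);
    }
}
```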