(This is derived from one of the issues described in https://github.com/oxidecomputer/colo/issues/102.)
During the colo R13 upgrade, blueprint execution was failing due to a nonexistent disk.
This disk was physically removed a long time ago (February 2024), but was still marked as in-service and active in the database and the blueprint.
We expunged the disk via omdb, generated a new blueprint, and made it the target. Execution still failed, but now due to a timeout trying to PUT /omicron-physical-disks to the sled where that disk was removed.
I believe the request itself is correct: it no longer includes the disk that was removed (serial number A079E79D). The timeout is the same symptom we saw in #6904, but the logs (and the cause) are different in this case. Grepping for the sequence of physical disks ensure logs inside SledAgent::omicron_physical_disks_ensure() and homing in on the point at which we switched from Generation(2) to Generation(3) (i.e., the new config that omits the removed disk):
22:27:12.822Z INFO SledAgent: physical disks ensure
file = sled-agent/src/sled_agent.rs:895
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.822Z INFO SledAgent: physical disks ensure: Updated storage
file = sled-agent/src/sled_agent.rs:899
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.822Z INFO SledAgent: physical disks ensure: Propagating new generation of disks
file = sled-agent/src/sled_agent.rs:925
generation = Generation(2)
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.822Z INFO SledAgent: physical disks ensure: Updated storage monitor
file = sled-agent/src/sled_agent.rs:930
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.822Z INFO SledAgent: physical disks ensure: Updated zone bundler
file = sled-agent/src/sled_agent.rs:935
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.823Z INFO SledAgent: physical disks ensure: Updated probes
file = sled-agent/src/sled_agent.rs:941
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:12.823Z INFO SledAgent: physical disks ensure: Updated instances
file = sled-agent/src/sled_agent.rs:946
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:33.211Z INFO SledAgent: physical disks ensure
file = sled-agent/src/sled_agent.rs:895
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:33.213Z INFO SledAgent: physical disks ensure: Updated storage
file = sled-agent/src/sled_agent.rs:899
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:33.213Z INFO SledAgent: physical disks ensure: Propagating new generation of disks
file = sled-agent/src/sled_agent.rs:925
generation = Generation(3)
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:42.589Z INFO SledAgent: physical disks ensure
file = sled-agent/src/sled_agent.rs:895
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:42.590Z INFO SledAgent: physical disks ensure: Updated storage
file = sled-agent/src/sled_agent.rs:899
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
22:27:42.590Z INFO SledAgent: physical disks ensure: Propagating new generation of disks
file = sled-agent/src/sled_agent.rs:925
generation = Generation(3)
sled_id = da89d292-1e33-4c21-8cc2-65b6ef9cbb1b
The logs continue like this: in all subsequent requests, we see the Updated storage and Propagating new generation of disks logs for generation 3, but we never see the next expected log, Updated storage monitor. This strongly implies we're getting stuck inside StorageMonitorHandle::await_generation(). That method uses watch::Receiver::wait_for() with no timeout, which will wait indefinitely for the condition to be true. I think this means we're "leaking" a tokio task on every PUT /omicron-physical-disks call from a blueprint executor RPW, which the logs appear to confirm. Prior to the switch to Generation(3), we see many request completed logs; e.g.,
... many more "request completed" logs ...
22:26:12.525Z INFO SledAgent (dropshot (SledAgent)): request completed
file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
latency_us = 1322
local_addr = [fd00:1122:3344:111::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:108::3]:39077
req_id = b0159a61-1cca-4716-bab4-00ebfb4c3bbb
response_code = 200
uri = /omicron-physical-disks
22:27:05.778Z INFO SledAgent (dropshot (SledAgent)): request completed
file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
latency_us = 1160
local_addr = [fd00:1122:3344:111::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:111::3]:61977
req_id = bba3e7eb-c02a-4826-8d43-caca21615199
response_code = 200
uri = /omicron-physical-disks
22:27:12.823Z INFO SledAgent (dropshot (SledAgent)): request completed
file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
latency_us = 1132
local_addr = [fd00:1122:3344:111::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:108::3]:39297
req_id = 5d0697ed-026f-456a-abb9-7e4f03726f52
response_code = 200
uri = /omicron-physical-disks
After we start receiving the Generation(3) requests, we see warnings from dropshot that the client is disconnecting (with a latency_us consistent with the 60-second timeout Nexus uses for its sled-agent clients), but we never see the logs dropshot would emit once the handler actually completes (request completed after handler was already cancelled). That is consistent with all of those requests still being stuck in the wait_for() call, waiting for the storage monitor to realize we're now at generation 3.
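For anyone trying to follow along, here is a minimal, self-contained sketch of that hang. None of this is the actual sled-agent code (MonitorHandle, the Option<u64> payload, and the generation numbers are invented for illustration); it only demonstrates the behavior in question: tokio's watch::Receiver::wait_for() has no deadline, so it parks forever if the watched value never satisfies the predicate, which is exactly where these handlers appear to be stuck.

```rust
use std::time::Duration;
use tokio::sync::watch;

// Hypothetical stand-in for the storage monitor handle: all it knows is the
// latest generation the monitor task has finished handling (None = none yet).
struct MonitorHandle {
    rx: watch::Receiver<Option<u64>>,
}

impl MonitorHandle {
    // Assumed shape of the await-generation pattern: wait, with no deadline,
    // until the monitor reports a generation at least as new as `wanted`.
    async fn await_generation(&self, wanted: u64) {
        let mut rx = self.rx.clone();
        // If nothing upstream ever sends a matching value (the situation
        // described above), this await never resolves.
        rx.wait_for(|current| matches!(current, Some(g) if *g >= wanted))
            .await
            .expect("storage monitor task exited");
    }
}

#[tokio::main]
async fn main() {
    // The monitor last reported generation 2 and will never report 3.
    let (_tx, rx) = watch::channel(Some(2u64));
    let handle = MonitorHandle { rx };

    // The timeout here exists only to demonstrate the hang; the real handler
    // has no timeout, so it stays parked long after Nexus gives up at 60s.
    match tokio::time::timeout(Duration::from_secs(1), handle.await_generation(3)).await {
        Ok(()) => println!("saw generation 3"),
        Err(_) => println!("still waiting (this is the hang visible in the logs)"),
    }
}
```

Running this prints the "still waiting" branch after a second; the real handlers have nothing bounding the wait, so each timed-out PUT just leaves another future parked on the channel.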
It's kinda hard to follow how the storage monitor is supposed to get updates here. Working backwards, I think what is supposed to happen is (a rough sketch follows the list):

- StorageMonitor::handle_resource_update() updates the watch channel with a new generation after updating the dump device
- StorageMonitor::run() calls handle_resource_update() when StorageManager::wait_for_changes() returns
- StorageManager::wait_for_changes() is itself waiting for changes on a different watch channel: self.disk_updates
- The self.disk_updates channel is the receiving side; the sending side comes from StorageResources
- There are a handful of places inside StorageResources where new values are sent into the disk_updates channel, but I believe only one of them is in the PUT /omicron-physical-disks path: the one at the end of synchronize_disk_management() (because synchronize_disk_management() is called by omicron_physical_disks_ensure_internal(), which is ultimately called by the endpoint handler)
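Here is the rough sketch promised above: the same chain modeled with plain tokio watch channels and u64 generations. The names and payloads are simplifications for illustration, not the real sled-storage types; the point is only that the handler at the end of the chain cannot make progress unless something actually sends the new generation into the first channel.

```rust
use tokio::sync::watch;

// Rough model of the chain: StorageResources -> (disk_updates) -> the
// StorageMonitor task -> (monitor channel) -> a handler awaiting a generation.
#[tokio::main]
async fn main() {
    // Channel 1: "the set of managed disks changed" (sent by the
    // StorageResources role, received by the StorageManager/StorageMonitor role).
    let (disk_updates_tx, mut disk_updates_rx) = watch::channel(2u64);
    // Channel 2: "the monitor has finished handling generation N".
    let (monitor_tx, mut monitor_rx) = watch::channel(2u64);

    // Stand-in for the StorageMonitor::run() loop: wait for a disk-resources
    // change, do its work (dump devices, etc.), then publish the generation.
    tokio::spawn(async move {
        while disk_updates_rx.changed().await.is_ok() {
            let generation = *disk_updates_rx.borrow_and_update();
            // ... handle_resource_update() work elided ...
            let _ = monitor_tx.send(generation);
        }
    });

    // The wait below only finishes because something sends the new generation
    // into channel 1. In the bug described here, nothing does, so the handler
    // waits forever.
    disk_updates_tx.send(3).unwrap();
    monitor_rx.wait_for(|g| *g >= 3).await.unwrap();
    println!("generation 3 observed end-to-end");
}
```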
The disk_updates write in synchronize_disk_management() is conditional on the updated boolean, and I think it is not set to true in the "Nexus told us to stop using a disk that is no longer physically present" case. There are only two places where it's set to true: if we start managing a new disk, which is not applicable in this case, or if we stop managing an old disk. If this disk was still present, we would be in the latter case, but it's not present in the disks map at all, which the logs confirm: prior to our move to generation 3, we see the warnings emitted in the "the control plane told us to use a disk we don't have" branch:
22:27:12.822Z WARN SledAgent (StorageResources): Control plane disk requested, but not detected within sled
disk_identity = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079E79D" }
file = sled-storage/src/resources.rs:331
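To make that conditional write concrete, here is a heavily simplified, hypothetical version of the logic (none of these names are the real sled-storage types, and disks are keyed by serial number purely for illustration):

```rust
use std::collections::{BTreeMap, BTreeSet};
use tokio::sync::watch;

// A disk physically present in the sled; `managed` tracks whether we currently
// manage it on behalf of the control plane.
struct ManagedDisk {
    managed: bool,
}

// Hypothetical sketch of the conditional write: `updated` only becomes true if
// a *present* disk starts or stops being managed.
fn synchronize_disk_management(
    generation: u64,
    requested: &BTreeSet<String>,              // serials the control plane wants managed
    disks: &mut BTreeMap<String, ManagedDisk>, // disks actually detected on the sled
    disk_updates: &watch::Sender<u64>,
) {
    let mut updated = false;

    // Start managing newly requested disks; only possible if they're present.
    for serial in requested {
        match disks.get_mut(serial) {
            Some(disk) if !disk.managed => {
                disk.managed = true;
                updated = true;
            }
            Some(_) => {} // already managed
            None => {
                // "Control plane disk requested, but not detected within sled":
                // the warning above, seen while generation 2 still listed the
                // long-gone disk. Note that it does not set `updated`.
            }
        }
    }

    // Stop managing disks that are present but no longer requested.
    for (serial, disk) in disks.iter_mut() {
        if disk.managed && !requested.contains(serial) {
            disk.managed = false;
            updated = true;
        }
    }

    // For generation 3 the expunged disk is neither requested nor present, so
    // neither loop changes anything, `updated` stays false, and the new
    // generation never goes down the channel chain, leaving the storage
    // monitor (and every handler waiting on it) stuck at generation 2.
    if updated {
        let _ = disk_updates.send(generation);
    }
}
```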
I'm not sure what the fix here should be. The generation inside self.disks is written in set_config(). I'm not sure whether it would be correct to send an update to the disk_updates channel there, before any actual disk management changes have taken place. Maybe synchronize_disk_management() should be told whether the generation has changed (somehow?) and seed the updated boolean with that value? Also on the table: we have a couple of issues in this area that are leading toward major rework of the PUT endpoints (both to merge zones+disks+datasets into one request, #7309, and to make the handling of those requests more asynchronous, #5086), and we've had several bugs like this one (some of which are still open, such as #7546) that we could address as part of that rework. (A goal of that rework should be to simplify a lot of this; tracing disk updates through multiple long-running async tasks and chained watch channels makes it pretty hard to figure out what's supposed to be happening.)
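Purely as a sketch of that second idea (seeding updated from the generation change), reusing the hypothetical names from the sketch above rather than the real code; whether it is actually correct to notify the chain before any disk management has happened is exactly the open question:

```rust
use tokio::sync::watch;

// Sketch of the "seed `updated` with the generation change" idea. The disk
// bookkeeping parameters and loops from the previous sketch are elided;
// set_config(), or wherever the new config is stored, would pass in whether
// the generation actually advanced.
fn synchronize_disk_management(
    generation: u64,
    generation_changed: bool, // computed when the new config was stored
    disk_updates: &watch::Sender<u64>,
) {
    // Seed the flag from the config change instead of starting at false; the
    // elided start/stop-managing logic would still be able to force it true.
    let updated = generation_changed;

    if updated {
        // The watch chain now always hears about a new generation, even when
        // the only "change" is dropping a disk that is no longer physically
        // present, which is exactly the case that hung above.
        let _ = disk_updates.send(generation);
    }
}
```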