
U.2 Physical Disk Deactivation #4719

Description

@smklein

This issue covers "the cleanup that Nexus and other services need to do, in order for a physical disk to be removable from the fleet without causing problems".

Terminology

There are many distinct phases to removing a physical disk from the rack, and terminology can be confusing, so I propose the following terms to help disambiguate:

  • Physical Disk: A U.2 (or M.2) device which physically exists and may or may not be attached to a sled. This is distinct from a virtual disk, which is an abstracted resource provided by the control plane for use by end users.
  • Deactivated Disk: A physical disk is deactivated when the control plane has made a decision, for any reason, to stop using it. Although the rationale may be part of the identification process (e.g., is it non-responsive? is it returning errors? are we re-shuffling our data center?), ultimately, the control plane must cope with deactivation by marking removed resources and re-provisioning them where appropriate.
  • Unplugged Disk: A physical disk is unplugged when it has been pulled out of the rack by a human. Notably, this is only one scenario where a disk may not be "reporting" up to the control plane.
  • Functioning Disk: A physical disk which is capable of servicing requests to zpools, datasets, and services above it without error is considered "functioning".
  • Impaired Disk: If, for any reason, temporary or permanent, the physical disk cannot fulfill requests from the broader control plane, it is referred to as an impaired disk. This is distinguished from the traditional FMA terminology of "error" and "fault" because access to a disk may be limited for reasons other than disk errors -- power cycling a disk or a sled are both valid reasons for a disk to be "impaired". "Impaired" is deliberately a blanket, catch-all term; diagnosis still matters when deciding whether an impaired disk should be deactivated, and how the control plane should cope. (A state-modeling sketch follows this list.)
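
To make the separation concrete, here is a minimal Rust sketch of how "deactivated" (a control-plane policy) and "impaired"/"unplugged" (observations) can be tracked as independent axes. The names (`DiskPolicy`, `DiskObservation`, `eligible_for_provisioning`) are illustrative assumptions, not the actual omicron data model:

```rust
/// Policy axis: has the control plane decided to stop using this disk?
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DiskPolicy {
    Active,
    Deactivated,
}

/// Observation axis: can the disk currently service requests?
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DiskObservation {
    /// Servicing requests to zpools, datasets, and services without error.
    Functioning,
    /// Cannot fulfill requests for *some* reason, temporary or permanent
    /// (power cycling, driver not loaded, errors, ...). Deliberately a catch-all.
    Impaired,
    /// Physically pulled from the sled; one specific way to be non-reporting.
    Unplugged,
}

/// New resources (regions, datasets, zones) should only be placed on a disk
/// that is both Active (policy) and Functioning (observation).
fn eligible_for_provisioning(policy: DiskPolicy, observed: DiskObservation) -> bool {
    policy == DiskPolicy::Active && observed == DiskObservation::Functioning
}

fn main() {
    // An impaired disk is not automatically deactivated; that is a separate decision.
    assert!(!eligible_for_provisioning(DiskPolicy::Active, DiskObservation::Impaired));
    assert!(!eligible_for_provisioning(DiskPolicy::Active, DiskObservation::Unplugged));
    // A deactivated disk may still be plugged in and functioning, but is unusable.
    assert!(!eligible_for_provisioning(DiskPolicy::Deactivated, DiskObservation::Functioning));
    assert!(eligible_for_provisioning(DiskPolicy::Active, DiskObservation::Functioning));
}
```

Keeping the two axes separate means an impaired disk can recover without ever being deactivated, and a deactivated disk can still be physically present and functioning.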

Action Items

Identification

  • Automatically detect and provide signal about disk impairment. Note that this is not a trivial boolean -- disks may not have been parsed yet, disks may be temporarily power cycled, disks may be unplugged/re-attached, disk drivers may not have fully loaded.
    • Note: This would be a great intersection point with a distributed fault management system, if we had one. Conceptually, this overlaps with "filing an EREPORT" as one of the many signals for an operator to interpret.
  • Add an operator API to explicitly "deactivate" a specific physical disk.
  • Provide a mechanism to re-activate disks that have previously been deactivated. Disks contain a variety of datasets that can be interpreted after a disk has been removed and re-attached. Example: "Disk was unplugged and put on a shelf, the system reacted by migrating data off, then it was re-attached into the rack a month later". (See the API sketch after this list.)
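
As a rough illustration of the deactivate/re-activate operations above, here is a hypothetical sketch; the names (`Inventory`, `deactivate_disk`, `reactivate_disk`, `DeactivationReason`) are assumptions for this example and not the real Nexus API:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PhysicalDiskState {
    Active,
    Deactivated,
}

/// Why the disk was deactivated. Recording this matters later, e.g. when
/// deciding whether re-activation should be allowed.
#[derive(Debug, Clone, Copy)]
enum DeactivationReason {
    Unresponsive,
    ReportingErrors,
    PlannedRemoval,
}

struct Inventory {
    /// Keyed by a physical disk id (a stand-in for the real identifier).
    disks: HashMap<u64, PhysicalDiskState>,
}

impl Inventory {
    /// Operator API: explicitly deactivate a specific physical disk. Idempotent.
    fn deactivate_disk(&mut self, disk_id: u64, _reason: DeactivationReason) -> Result<(), String> {
        match self.disks.get_mut(&disk_id) {
            Some(state) => {
                *state = PhysicalDiskState::Deactivated;
                Ok(())
            }
            None => Err(format!("no such physical disk: {disk_id}")),
        }
    }

    /// Operator API: re-activate a previously deactivated disk, e.g. one that was
    /// unplugged, put on a shelf, and re-attached a month later. Re-activation does
    /// not undo any re-provisioning that happened while the disk was deactivated.
    fn reactivate_disk(&mut self, disk_id: u64) -> Result<(), String> {
        match self.disks.get_mut(&disk_id) {
            Some(state) => {
                *state = PhysicalDiskState::Active;
                Ok(())
            }
            None => Err(format!("no such physical disk: {disk_id}")),
        }
    }
}

fn main() {
    let mut inv = Inventory {
        disks: HashMap::from([(1, PhysicalDiskState::Active)]),
    };
    inv.deactivate_disk(1, DeactivationReason::PlannedRemoval).unwrap();
    assert_eq!(inv.disks[&1], PhysicalDiskState::Deactivated);
    inv.reactivate_disk(1).unwrap();
    assert_eq!(inv.disks[&1], PhysicalDiskState::Active);
    // Deactivating an unknown disk is an error rather than a silent no-op.
    assert!(inv.deactivate_disk(42, DeactivationReason::Unresponsive).is_err());
}
```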

Marking and Adapting to Disk Deactivation

We will need to handle failures differently for each of the myriad storage use cases, though Crucible is arguably the most complex.

  • Crucible...
    • Mark all {regions, region snapshots} as deleted -- although a user has not requested that these resources be removed, the "resource provisioned in this specific dataset" no longer exists, and the control plane must respond accordingly.
    • Signal that these regions require re-allocation to restore redundancy (e.g., with one failure, we'd be at 2 / 3 redundancy for Crucible regions, and Nexus should be responsible for allocating and provisioning a backup copy).
    • Communicate with all currently running Crucible services (upstairs? pantry?) that the Crucible downstairs service for the region no longer exists, and should no longer be contacted.
      • We will likely also need to tell these services about the new Crucible service that gets provisioned as a back-fill.
    • Ensure that all sagas relying on provisioning / de-provisioning resources can check for this "physical disk deactivated" scenario. As a concrete example: the disk creation saga, on the unwind path, retries an HTTP DELETE request to the Crucible service forever. If the disk has been unplugged, that request will never succeed and the saga will be stuck -- we should periodically confirm that the physical disk has not been deactivated. (See the retry sketch after this list.)
  • For all zones (this includes zones with durable datasets, as well as zones with transient filesystems)
    • Identify that the physical disk has been deactivated, and update the service inventory system to stop requesting a service using that dataset.
    • Rely on the service provisioning system in the update planner to allocate additional services and restore the redundancy requirements (a re-planning sketch also follows this list).
    • If necessary, communicate with existing services to identify their new peers, and to remove their old peers. I believe this is particularly relevant for CockroachDB and Clickhouse.
  • For all Propolis instances with filesystems backed by physical disks...
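
To illustrate the saga-unwind concern, here is a minimal sketch of "retry the DELETE, but re-check the disk's state between attempts". The helpers (`delete_crucible_region`, `disk_is_deactivated`) are stand-ins for this example, not the real saga or Crucible agent client code:

```rust
use std::time::Duration;

#[derive(Debug)]
enum CleanupOutcome {
    /// The region was deleted on the Crucible side.
    Deleted,
    /// The control plane deactivated the disk; there is nothing left to talk to.
    DiskDeactivated,
    /// Still failing after the attempt budget; surface this instead of spinning silently.
    RetriesExhausted,
}

/// Stand-in for the HTTP DELETE to the Crucible service; always fails here to
/// model an unplugged disk that will never answer.
fn delete_crucible_region(_region_id: u64) -> Result<(), String> {
    Err("connection timed out".to_string())
}

/// Stand-in for a database lookup of the disk's control-plane policy.
fn disk_is_deactivated(disk_id: u64) -> bool {
    disk_id == 7
}

/// Unwind step: retry the DELETE, but re-check the disk's state between attempts
/// instead of retrying forever against a downstairs that no longer exists.
fn cleanup_region(disk_id: u64, region_id: u64, max_attempts: u32) -> CleanupOutcome {
    for _attempt in 0..max_attempts {
        if disk_is_deactivated(disk_id) {
            return CleanupOutcome::DiskDeactivated;
        }
        match delete_crucible_region(region_id) {
            Ok(()) => return CleanupOutcome::Deleted,
            Err(_transient) => std::thread::sleep(Duration::from_millis(10)),
        }
    }
    CleanupOutcome::RetriesExhausted
}

fn main() {
    // Disk 7 is deactivated in this sketch, so the unwind stops instead of spinning.
    println!("{:?}", cleanup_region(7, 100, 5));
    // Disk 3 is still active but unreachable, so we eventually report exhaustion.
    println!("{:?}", cleanup_region(3, 101, 5));
}
```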
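
And as a purely illustrative sketch of the zone re-planning step (none of these types correspond to the real inventory or planner code, and the redundancy target of 3 is just an example), filtering out requests on deactivated disks and computing the redundancy deficit might look like:

```rust
use std::collections::HashSet;

#[derive(Debug)]
struct ZoneRequest {
    zone_name: &'static str,
    /// The physical disk backing this zone's durable dataset or transient filesystem.
    backing_disk: u64,
}

/// Drop requests that reference deactivated disks, and report how many new zones
/// the planner must place elsewhere to restore the redundancy target.
fn replan_zones(
    requested: Vec<ZoneRequest>,
    deactivated_disks: &HashSet<u64>,
    redundancy_target: usize,
) -> (Vec<ZoneRequest>, usize) {
    let surviving: Vec<ZoneRequest> = requested
        .into_iter()
        .filter(|z| !deactivated_disks.contains(&z.backing_disk))
        .collect();
    let deficit = redundancy_target.saturating_sub(surviving.len());
    (surviving, deficit)
}

fn main() {
    let requested = vec![
        ZoneRequest { zone_name: "cockroachdb-a", backing_disk: 1 },
        ZoneRequest { zone_name: "cockroachdb-b", backing_disk: 2 },
        ZoneRequest { zone_name: "cockroachdb-c", backing_disk: 3 },
    ];
    let deactivated: HashSet<u64> = HashSet::from([2]);
    let (surviving, deficit) = replan_zones(requested, &deactivated, 3);
    let names: Vec<_> = surviving.iter().map(|z| z.zone_name).collect();
    println!("surviving zones: {names:?}, replacements needed: {deficit}");
    assert_eq!(deficit, 1); // the planner must provision one replacement zone
}
```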

Debugging

  • Add additional tooling to OMDB to help track this flow. Especially since resources need to be migrated upon disk removal, it may be worthwhile to add a flow that makes it clearer "what was the blast radius of a disk removal, and to where did resources move?" (Will be fixed by [nexus] Expunge disk internal API, omdb commands #5994)

Labels

nexus (Related to nexus), storage (Related to storage)
