
U.2 Physical Disk Deactivation #4719

Description

@smklein

This issue covers "the cleanup that Nexus and other services need to do, in order for a physical disk to be removable from the fleet without causing problems".

Terminology

There are many distinct phases to removing a physical disk from the rack, and terminology can be confusing, so I propose the following terms to help disambiguate:

  • Physical Disk: A U.2 (or M.2) device which physically exists and may or may not be attached to a sled. This is distinct from a virtual disk, which is an abstracted resource provided by the control plane for use by end users.
  • Deactivated Disk: A physical disk is deactivated when the control plane has made a decision, for any reason, to stop using it. Although the rationale may be part of the identification process (e.g., is it non-responsive? is it returning errors? are we re-shuffling our data center?), ultimately, the control plane must cope with deactivation by marking removed resources and re-provisioning them where appropriate.
  • Unplugged Disk: A physical disk is unplugged when it has been pulled out of the rack by a human. Notably, this is only one scenario where a disk may not be "reporting" up to the control plane.
  • Functioning Disk: A physical disk which is capable of servicing requests to zpools, datasets, and services above it without error is considered "functioning".
  • Impaired Disk: If, for any reason, temporary or permanent, the physical disk cannot fulfill requests from the broader control plane, it is referred to as an impaired disk. This is distinguished from the traditional FMA terminology of "error" and "fault" because access to a disk may be limited for reasons other than disk errors -- power cycling a disk or a sled are both valid reasons for a disk to be "impaired". "Impaired" is deliberately a blanket, catch-all term; diagnosis still matters when deciding whether an impaired disk should be deactivated, and how the control plane should cope. (A state-modeling sketch follows this list.)
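
To make the separation concrete, here is a minimal Rust sketch of how "deactivated" (a control-plane policy) and "impaired"/"unplugged" (observations) can be tracked as independent axes. The names (`DiskPolicy`, `DiskObservation`, `eligible_for_provisioning`) are illustrative assumptions, not the actual omicron data model:

```rust
/// Policy axis: has the control plane decided to stop using this disk?
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DiskPolicy {
    Active,
    Deactivated,
}

/// Observation axis: can the disk currently service requests?
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DiskObservation {
    /// Servicing requests to zpools, datasets, and services without error.
    Functioning,
    /// Cannot fulfill requests for *some* reason, temporary or permanent
    /// (power cycling, driver not loaded, errors, ...). Deliberately a catch-all.
    Impaired,
    /// Physically pulled from the sled; one specific way to be non-reporting.
    Unplugged,
}

/// New resources (regions, datasets, zones) should only be placed on a disk
/// that is both Active (policy) and Functioning (observation).
fn eligible_for_provisioning(policy: DiskPolicy, observed: DiskObservation) -> bool {
    policy == DiskPolicy::Active && observed == DiskObservation::Functioning
}

fn main() {
    // An impaired disk is not automatically deactivated; that is a separate decision.
    assert!(!eligible_for_provisioning(DiskPolicy::Active, DiskObservation::Impaired));
    assert!(!eligible_for_provisioning(DiskPolicy::Active, DiskObservation::Unplugged));
    // A deactivated disk may still be plugged in and functioning, but is unusable.
    assert!(!eligible_for_provisioning(DiskPolicy::Deactivated, DiskObservation::Functioning));
    assert!(eligible_for_provisioning(DiskPolicy::Active, DiskObservation::Functioning));
}
```

Keeping the two axes separate means an impaired disk can recover without ever being deactivated, and a deactivated disk can still be physically present and functioning.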

Action Items

Identification

  • Automatically detect and provide signal about disk impairment. Note that this is not a trivial boolean -- disks may not have been parsed yet, disks may be temporarily power cycled, disks may be unplugged/re-attached, disk drivers may not have fully loaded.
    • Note: This would be a great intersection point with a distributed fault management system, if we had one. Conceptually, this overlaps with "filing an EREPORT" as one of the many signals for an operator to interpret.
  • Add an operator API to explicitly "deactivate" a specific physical disk.
  • Provide a mechanism to re-activate disks that have previously been deactivated. Disks contain a variety of datasets that can be interpreted after a disk has been removed and re-attached. Example: "Disk was unplugged and put on a shelf, the system reacted by migrating data off, then it was re-attached into the rack a month later". (See the API sketch after this list.)
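
As a rough illustration of the deactivate/re-activate operations above, here is a hypothetical sketch; the names (`Inventory`, `deactivate_disk`, `reactivate_disk`, `DeactivationReason`) are assumptions for this example and not the real Nexus API:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PhysicalDiskState {
    Active,
    Deactivated,
}

/// Why the disk was deactivated. Recording this matters later, e.g. when
/// deciding whether re-activation should be allowed.
#[derive(Debug, Clone, Copy)]
enum DeactivationReason {
    Unresponsive,
    ReportingErrors,
    PlannedRemoval,
}

struct Inventory {
    /// Keyed by a physical disk id (a stand-in for the real identifier).
    disks: HashMap<u64, PhysicalDiskState>,
}

impl Inventory {
    /// Operator API: explicitly deactivate a specific physical disk. Idempotent.
    fn deactivate_disk(&mut self, disk_id: u64, _reason: DeactivationReason) -> Result<(), String> {
        match self.disks.get_mut(&disk_id) {
            Some(state) => {
                *state = PhysicalDiskState::Deactivated;
                Ok(())
            }
            None => Err(format!("no such physical disk: {disk_id}")),
        }
    }

    /// Operator API: re-activate a previously deactivated disk, e.g. one that was
    /// unplugged, put on a shelf, and re-attached a month later. Re-activation does
    /// not undo any re-provisioning that happened while the disk was deactivated.
    fn reactivate_disk(&mut self, disk_id: u64) -> Result<(), String> {
        match self.disks.get_mut(&disk_id) {
            Some(state) => {
                *state = PhysicalDiskState::Active;
                Ok(())
            }
            None => Err(format!("no such physical disk: {disk_id}")),
        }
    }
}

fn main() {
    let mut inv = Inventory {
        disks: HashMap::from([(1, PhysicalDiskState::Active)]),
    };
    inv.deactivate_disk(1, DeactivationReason::PlannedRemoval).unwrap();
    assert_eq!(inv.disks[&1], PhysicalDiskState::Deactivated);
    inv.reactivate_disk(1).unwrap();
    assert_eq!(inv.disks[&1], PhysicalDiskState::Active);
    // Deactivating an unknown disk is an error rather than a silent no-op.
    assert!(inv.deactivate_disk(42, DeactivationReason::Unresponsive).is_err());
}
```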

Marking and Adapting to Disk Deactivation

We will need to handle failures differently for each of the myriad storage use cases, though Crucible is arguably the most complex.

  • Crucible...
    • Mark all {regions, region snapshots} as deleted -- although a user has not requested that these resources be removed, the "resource provisioned in this specific dataset" no longer exists, and the control plane must respond accordingly.
    • Signal that these regions require re-allocation to restore redundancy (e.g., with one failure, we'd be at 2 / 3 redundancy for Crucible regions, and Nexus should be responsible for allocating and provisioning a backup copy).
    • Communicate with all currently running Crucible services (upstairs? pantry?) that the Crucible downstairs service for the region no longer exists, and should no longer be contacted.
      • We will likely also need to tell these services about the new Crucible service that gets provisioned as a back-fill.
    • Ensure that all sagas relying on provisioning / de-provisioning resources can check for this "physical disk deactivated" scenario. As a concrete example: the disk creation saga, on the unwind path, retries an HTTP DELETE request to the Crucible service forever. If the disk has been unplugged, that request will never succeed and the saga will be stuck -- we should periodically confirm that the physical disk has not been deactivated. (See the retry sketch after this list.)
  • For all zones (this includes zones with durable datasets, as well as zones with transient filesystems)
    • Identify that the physical disk has been deactivated, and update the service inventory system to stop requesting a service using that dataset.
    • Rely on the service provisioning system in the update planner to allocate additional services and restore the redundancy requirements (a re-planning sketch also follows this list).
    • If necessary, communicate with existing services to identify their new peers, and to remove their old peers. I believe this is particularly relevant for CockroachDB and Clickhouse.
  • For all Propolis instances with filesystems backed by physical disks...
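
To illustrate the saga-unwind concern, here is a minimal sketch of "retry the DELETE, but re-check the disk's state between attempts". The helpers (`delete_crucible_region`, `disk_is_deactivated`) are stand-ins for this example, not the real saga or Crucible agent client code:

```rust
use std::time::Duration;

#[derive(Debug)]
enum CleanupOutcome {
    /// The region was deleted on the Crucible side.
    Deleted,
    /// The control plane deactivated the disk; there is nothing left to talk to.
    DiskDeactivated,
    /// Still failing after the attempt budget; surface this instead of spinning silently.
    RetriesExhausted,
}

/// Stand-in for the HTTP DELETE to the Crucible service; always fails here to
/// model an unplugged disk that will never answer.
fn delete_crucible_region(_region_id: u64) -> Result<(), String> {
    Err("connection timed out".to_string())
}

/// Stand-in for a database lookup of the disk's control-plane policy.
fn disk_is_deactivated(disk_id: u64) -> bool {
    disk_id == 7
}

/// Unwind step: retry the DELETE, but re-check the disk's state between attempts
/// instead of retrying forever against a downstairs that no longer exists.
fn cleanup_region(disk_id: u64, region_id: u64, max_attempts: u32) -> CleanupOutcome {
    for _attempt in 0..max_attempts {
        if disk_is_deactivated(disk_id) {
            return CleanupOutcome::DiskDeactivated;
        }
        match delete_crucible_region(region_id) {
            Ok(()) => return CleanupOutcome::Deleted,
            Err(_transient) => std::thread::sleep(Duration::from_millis(10)),
        }
    }
    CleanupOutcome::RetriesExhausted
}

fn main() {
    // Disk 7 is deactivated in this sketch, so the unwind stops instead of spinning.
    println!("{:?}", cleanup_region(7, 100, 5));
    // Disk 3 is still active but unreachable, so we eventually report exhaustion.
    println!("{:?}", cleanup_region(3, 101, 5));
}
```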
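
And as a purely illustrative sketch of the zone re-planning step (none of these types correspond to the real inventory or planner code, and the redundancy target of 3 is just an example), filtering out requests on deactivated disks and computing the redundancy deficit might look like:

```rust
use std::collections::HashSet;

#[derive(Debug)]
struct ZoneRequest {
    zone_name: &'static str,
    /// The physical disk backing this zone's durable dataset or transient filesystem.
    backing_disk: u64,
}

/// Drop requests that reference deactivated disks, and report how many new zones
/// the planner must place elsewhere to restore the redundancy target.
fn replan_zones(
    requested: Vec<ZoneRequest>,
    deactivated_disks: &HashSet<u64>,
    redundancy_target: usize,
) -> (Vec<ZoneRequest>, usize) {
    let surviving: Vec<ZoneRequest> = requested
        .into_iter()
        .filter(|z| !deactivated_disks.contains(&z.backing_disk))
        .collect();
    let deficit = redundancy_target.saturating_sub(surviving.len());
    (surviving, deficit)
}

fn main() {
    let requested = vec![
        ZoneRequest { zone_name: "cockroachdb-a", backing_disk: 1 },
        ZoneRequest { zone_name: "cockroachdb-b", backing_disk: 2 },
        ZoneRequest { zone_name: "cockroachdb-c", backing_disk: 3 },
    ];
    let deactivated: HashSet<u64> = HashSet::from([2]);
    let (surviving, deficit) = replan_zones(requested, &deactivated, 3);
    let names: Vec<_> = surviving.iter().map(|z| z.zone_name).collect();
    println!("surviving zones: {names:?}, replacements needed: {deficit}");
    assert_eq!(deficit, 1); // the planner must provision one replacement zone
}
```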

Debugging

  • Add additional tooling to OMDB to help track this flow. Especially since resources need to be migrated upon disk removal, it may be worthwhile to add a flow that makes it clearer "what was the blast radius of a disk removal, and to where did resources move?" (Will be fixed by [nexus] Expunge disk internal API, omdb commands #5994)

Labels

nexus (Related to nexus), storage (Related to storage)
