@hawkw
Member

@hawkw hawkw commented Oct 30, 2025

RFD 603 proposes the fault management situation report, or sitrep, as the central data structure for the control plane's fault management subsystem. The design, which is discussed in much greater detail in that RFD, draws a lot of inspiration from the blueprint data structure in the Reconfigurator. Sitreps are generated by the planning phase in a plan-execute pattern. At any time, a single sitrep is considered current. To update the control plane's understanding of the state of the system, a new planning step takes the current sitrep along with other inputs and produces a new sitrep with the current sitrep as its parent. A sitrep may then be added to the version history of current sitreps if (and only if) its parent sitrep is still the current sitrep (i.e. the one with the highest version number currently stored in the sitrep history). This ensures that there is a single sequentially consistent history of sitreps. Sitreps generated based on outdated inputs (due to multiple Nexuses generating them concurrently, or a Nexus operating on state that is no longer up to date) may not be made current, and are discarded.
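To make the "only if its parent is still current" rule concrete, here is a minimal in-memory sketch. It is not the code on this branch: the real check is a database query, and every name below (`SitrepId`, `History`, `try_make_current`, `InsertError`), as well as the choice to number versions from 1, is an assumption made for illustration.

```rust
// Illustrative in-memory model only; the real check is a database CTE.
// All names and the 1-based version numbering are hypothetical.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SitrepId(pub u64);

#[derive(Debug, PartialEq, Eq)]
pub enum InsertError {
    /// The proposed sitrep's parent is not the current sitrep.
    ParentNotCurrent { current: Option<SitrepId> },
}

/// Version history: index 0 holds version 1, index 1 holds version 2, ...
#[derive(Default)]
pub struct History {
    versions: Vec<SitrepId>,
}

impl History {
    /// The sitrep with the highest version number, if any.
    pub fn current(&self) -> Option<SitrepId> {
        self.versions.last().copied()
    }

    /// Appends `id` as the next version if and only if `parent` is still
    /// the current sitrep. Returns the new version number on success.
    pub fn try_make_current(
        &mut self,
        id: SitrepId,
        parent: Option<SitrepId>,
    ) -> Result<u32, InsertError> {
        let current = self.current();
        if parent != current {
            // Someone else committed first; this sitrep is stale.
            return Err(InsertError::ParentNotCurrent { current });
        }
        self.versions.push(id);
        Ok(self.versions.len() as u32)
    }
}
```

If two planners both build on sitrep 1, whichever calls `try_make_current` second gets `ParentNotCurrent` and its sitrep is discarded, which is what keeps the history linear.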

This branch adds the foundation of the sitrep subsystem. In particular, it includes the following:

  • Database schemas for the fm_sitrep table, which stores metadata for sitreps, and the fm_sitrep_history table, which stores the version history
  • Models and nexus_types types for the same
  • Database queries for reading the current sitrep version, reading a sitrep by its ID, and for inserting sitreps, including the "compare and swap" CTE that ensures new versions may only be inserted if they descend directly from the current sitrep
  • An fm_sitrep_loader task that loads the latest sitrep version and publishes it over a tokio::sync::watch channel (which is not presently consumed by other code)
  • OMDB commands for looking at sitreps

Right now, a sitrep only contains its top-level metadata. Other tables for storing parts of the sitrep, such as cases and records for updating Problems, will be added later as more of the control plane fault management subsystem is implemented. Currently, no sitreps are ever created outside of tests, so this code won't really do anything yet. But, it's an important foundation for the rest of the FM work, so I wanted to get it up for review as soon as possible.

@hawkw hawkw self-assigned this Oct 30, 2025
@hawkw hawkw added the nexus Related to nexus label Oct 30, 2025
@hawkw
Member Author

hawkw commented Oct 31, 2025

Um, okay...the test failure on the Ubuntu buildomat job is making me kinda uncomfortable: it seems like something in the dropshot API test has overflowed its stack. This is disquieting because there isn't any change to Dropshot APIs on this branch, so I'm not sure if there's anything I've changed that has caused it to behave differently... I'll have to investigate further.

@hawkw
Member Author

hawkw commented Oct 31, 2025

Ah, apparently this was fixed on main yesterday. I've updated this branch.

pub id: DbTypedUuid<SitrepKind>,
pub parent_sitrep_id: Option<DbTypedUuid<SitrepKind>>,
pub inv_collection_id: DbTypedUuid<CollectionKind>,
pub creator_id: DbTypedUuid<OmicronZoneKind>,
Collaborator

Nitpick: Could we order these fields the same way they appear in fm_sitrep? I believe that would be "time_created, creator_id, comment"

(version, sitrep)
}
SitrepIdOrCurrent::Current => {
let Some((version, sitrep)) =
Collaborator

In the case where a caller queries this API - and the current sitrep changes immediately after they make this query - is it possible that:

  • maybe_version + sitrep would become an "old" value of the sitrep
  • Below, on line 293, we'd show this as a historical sitrep?

WDYT about:

always querying fm_current_sitrep_version first, then choosing to load the sitrep afterwards?

Then, if the sitrep changes (which would be fine), we'd show something that appeared current when you started this operation, which would perhaps be less confusing than showing something that looks historical when the user asked for the current sitrep?
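A rough sketch of that ordering, using an in-memory stand-in for the database. Nothing here is taken from the PR: `Store`, `Version`, `Sitrep`, and `load_current` are hypothetical names, and the two lookups stand in for the real queries.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the database; none of these names come from the PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct SitrepId(pub u64);

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Sitrep {
    pub id: SitrepId,
    pub comment: String,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Version {
    pub version: u32,
    pub id: SitrepId,
}

#[derive(Default)]
pub struct Store {
    pub sitreps: HashMap<SitrepId, Sitrep>,
    /// Ordered by ascending version number.
    pub history: Vec<Version>,
}

impl Store {
    /// Step 1: read only the "current version" pointer.
    pub fn current_version(&self) -> Option<Version> {
        self.history.last().copied()
    }

    /// Step 2: load a sitrep by the ID observed in step 1.
    pub fn load(&self, id: SitrepId) -> Option<Sitrep> {
        self.sitreps.get(&id).cloned()
    }
}

/// Resolve "current" by reading the version pointer first, then loading by ID.
/// The returned version is the one that was current when the lookup started.
pub fn load_current(store: &Store) -> Option<(Version, Sitrep)> {
    let version = store.current_version()?;
    let sitrep = store.load(version.id)?;
    Some((version, sitrep))
}
```

With this ordering, if a new sitrep becomes current between the two steps, the caller still gets a sitrep paired with the version number under which it was observed as current, rather than one that has to be re-labelled as historical.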

debug!(log, "current sitrep has not changed");
return Status::Loaded { version, time_loaded };
}
Some(SitrepVersion { version, id, .. })
Collaborator

Arguably, the version also "shouldn't go down", right?

Idk if it's worth having special cases for these - definitely we want to detect "has the UUID changed or not", but idk if it's worth it to include match cases for invariants that should really be impossible to trigger.
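For illustration, the simpler variant could look roughly like the sketch below, which decides purely on whether the ID changed and has no dedicated arms for the "should be impossible" cases. The `SitrepVersion` struct here is a simplified stand-in with assumed fields, and `Action` and `decide` are hypothetical names, not types from this branch.

```rust
// Simplified stand-in; the real SitrepVersion has more fields.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SitrepVersion {
    pub version: u32,
    pub id: u64,
}

/// What the loader should do after reading the current version (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// What is already loaded still matches; do nothing.
    Keep,
    /// The current sitrep differs from what is loaded; replace it.
    Reload,
}

/// Decide based only on whether the sitrep ID changed. Version regressions
/// are not special-cased: any change of ID is treated as a reload.
pub fn decide(loaded: Option<SitrepVersion>, current: Option<SitrepVersion>) -> Action {
    match (loaded, current) {
        (Some(old), Some(new)) if old.id == new.id => Action::Keep,
        (None, None) => Action::Keep,
        _ => Action::Reload,
    }
}
```

The trade-off is that a version number going backwards would be handled like any other change instead of being logged as an invariant violation.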

opctx.authorize(authz::Action::Modify, &authz::FLEET).await?;

// Create the sitrep metadata record.
diesel::insert_into(sitrep_dsl::fm_sitrep)
Collaborator

As written, if we crash after this "insert_into" but before the "InsertSitrepVersionQuery", we'll also leak some rows.

To confirm: Nothing is deleting, nor attempting to delete, orphaned/partially written sitreps in this PR, correct?

Member Author

That's correct, I was going to do that next.
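As a rough idea of what that follow-up has to find, here is a minimal sketch that treats a sitrep as orphaned when its metadata row exists but its ID never made it into the version history. This is an assumption about the shape of the cleanup, not the planned implementation; `orphaned_sitreps` and the plain `u64` IDs are hypothetical.

```rust
use std::collections::HashSet;

/// Returns the IDs that have a metadata row (as in fm_sitrep) but no entry
/// in the version history (as in fm_sitrep_history), preserving input order.
///
/// Caveat: a sitrep whose insert is still in flight looks identical to an
/// orphan here, so a real cleanup would need some way to tell them apart.
pub fn orphaned_sitreps(sitrep_ids: &[u64], history_ids: &[u64]) -> Vec<u64> {
    let committed: HashSet<u64> = history_ids.iter().copied().collect();
    sitrep_ids
        .iter()
        .copied()
        .filter(|id| !committed.contains(id))
        .collect()
}
```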

@hawkw
Member Author

hawkw commented Nov 1, 2025

The test failure in this build-and-test (helios) job looks like a flake. I've opened #9330 to track that.
