[nexus] fault management situation reports #9320
base: main
e951689 to
4108899
Compare
Um, okay... the test failure on the Ubuntu buildomat job is making me kinda uncomfortable: it seems like something in the dropshot API test has overflowed its stack. This is disquieting because there isn't any change to Dropshot APIs on this branch, so I'm not sure if there's anything I've changed that has caused it to behave differently... I'll have to investigate further.
Ah, apparently this was fixed on
pub id: DbTypedUuid<SitrepKind>,
pub parent_sitrep_id: Option<DbTypedUuid<SitrepKind>>,
pub inv_collection_id: DbTypedUuid<CollectionKind>,
pub creator_id: DbTypedUuid<OmicronZoneKind>,
Nitpick: Could we order these fields the same way they appear in fm_sitrep? I believe that would be "time_created, creator_id, comment"
(version, sitrep)
}
SitrepIdOrCurrent::Current => {
let Some((version, sitrep)) =
In the case where a caller queries this API - and the current sitrep changes immediately after they make this query - is it possible that:
- maybe_version + sitrep would become an "old" value of the sitrep
- Below, on line 293, we'd show this as a historical sitrep?
WDYT about:
always querying fm_current_sitrep_version first, then choosing to load the sitrep afterwards?
Then, if the sitrep changes (which would be fine), we'd show something that appeared current when you started this operation, which would perhaps be less confusing than showing something that looks historical when the user asked for the current sitrep?
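To make the suggested ordering concrete, here is a minimal sketch, assuming a simplified in-memory model rather than the real datastore. `Store`, `Sitrep`, `make_current`, and `load_as_of` are hypothetical names for illustration, and integer IDs stand in for the typed UUIDs. The current-version pointer is read first, and the sitrep is then loaded by the ID that pointer named, so the result carries the version that was current when the operation started.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a sitrep ID (the real code uses typed UUIDs).
type SitrepId = u32;

#[derive(Clone, Debug, PartialEq)]
struct Sitrep {
    id: SitrepId,
    comment: String,
}

/// Hypothetical in-memory model of the sitrep tables.
#[derive(Default)]
struct Store {
    sitreps: HashMap<SitrepId, Sitrep>,
    /// Entry at index `i` is the sitrep that became current as version `i + 1`.
    history: Vec<SitrepId>,
}

impl Store {
    /// Step 1: read only the current-version pointer, as `(version, id)`.
    fn current_version(&self) -> Option<(u32, SitrepId)> {
        self.history.last().map(|id| (self.history.len() as u32, *id))
    }

    /// Load a sitrep by ID, whether or not it is still current.
    fn load(&self, id: SitrepId) -> Option<Sitrep> {
        self.sitreps.get(&id).cloned()
    }

    /// Test helper: store a sitrep and make it the current one.
    fn make_current(&mut self, sitrep: Sitrep) {
        self.history.push(sitrep.id);
        self.sitreps.insert(sitrep.id, sitrep);
    }
}

/// Step 2: load the sitrep named by a pointer that was read earlier. The
/// returned version is the one observed in step 1, even if a newer sitrep
/// has become current in the meantime.
fn load_as_of(store: &Store, pointer: (u32, SitrepId)) -> Option<(u32, Sitrep)> {
    let (version, id) = pointer;
    store.load(id).map(|sitrep| (version, sitrep))
}
```

If a new sitrep becomes current between the two steps, the caller still gets the sitrep and version that were current when the pointer was read, so the caller can present it as the current sitrep as of the start of the operation.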
debug!(log, "current sitrep has not changed");
return Status::Loaded { version, time_loaded };
}
Some(SitrepVersion { version, id, .. })
Arguably, the version also "shouldn't go down", right?
Idk if it's worth having special cases for these - definitely we want to detect "has the UUID changed or not", but idk if it's worth it to include match cases for invariants that should really be impossible to trigger.
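A rough sketch of what the simpler check could look like, assuming a reduced `SitrepVersion` with only `version` and `id` and integer IDs in place of UUIDs. `Decision`, `decide`, and `version_regressed` are hypothetical names, not code from this branch. The reload decision depends only on whether the ID changed; the "version went down" invariant is a separate helper that could feed a log message instead of its own match arm.

```rust
/// Hypothetical stand-in for a sitrep ID (the real code uses typed UUIDs).
type SitrepId = u32;

/// Reduced, hypothetical version of the `SitrepVersion` record.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SitrepVersion {
    version: u32,
    id: SitrepId,
}

#[derive(Debug, PartialEq)]
enum Decision {
    /// The loaded sitrep is still the current one.
    Unchanged,
    /// The current sitrep differs from what is loaded, so load it again.
    Reload,
}

/// Decide whether to reload based only on whether the sitrep ID changed.
fn decide(loaded: Option<SitrepVersion>, current: Option<SitrepVersion>) -> Decision {
    match (loaded, current) {
        (Some(old), Some(new)) if old.id == new.id => Decision::Unchanged,
        (None, None) => Decision::Unchanged,
        _ => Decision::Reload,
    }
}

/// Returns true if the version number went down, which should be impossible.
/// This is kept out of `decide` so it can be logged without extra match arms.
fn version_regressed(loaded: SitrepVersion, current: SitrepVersion) -> bool {
    current.version < loaded.version
}
```

This keeps the match small while still leaving a place to surface an invariant violation if one ever shows up.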
opctx.authorize(authz::Action::Modify, &authz::FLEET).await?;

// Create the sitrep metadata record.
diesel::insert_into(sitrep_dsl::fm_sitrep)
As written, if we crash after this "insert_into" but before the "InsertSitrepVersionQuery", we'll also leak some rows.
To confirm: Nothing is deleting, nor attempting to delete, orphaned/partially written sitreps in this PR, correct?
That's correct, I was going to do that next.
The test failure in this build-and-test (helios) job looks like a flake. I've opened #9330 to track that.
RFD 603 proposes the fault management situation report, or sitrep, as the central data structure for the control plane's fault management subsystem. The design, which is discussed in much greater detail in that RFD, draws a lot of inspiration from the blueprint data structure in the Reconfigurator. Sitreps are generated by the planning phase in a plan-execute pattern. At any time, a single sitrep is considered current. To update the control plane's understanding of the state of the system, a new planning step takes the current sitrep along with other inputs and produces a new sitrep with the current sitrep as its parent. A sitrep may then be added to the version history of current sitreps if (and only if) its parent sitrep is still the current sitrep (i.e. the sitrep with the highest version number currently stored in the sitrep history). This ensures that there is a single sequentially consistent history of sitreps. Sitreps generated based on outdated inputs (due to multiple Nexuses generating them concurrently, or a Nexus operating on state that is no longer up to date) may not be made current, and are discarded.
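The "only if its parent is still current" rule can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the query used in this branch: `SitrepHistory`, `try_make_current`, and `InsertError` are hypothetical names, and integer IDs stand in for the typed UUIDs.

```rust
/// Hypothetical stand-in for a sitrep ID (the real code uses typed UUIDs).
type SitrepId = u32;

#[derive(Debug, PartialEq)]
enum InsertError {
    /// The sitrep's parent is not the current sitrep, so the sitrep was
    /// planned from outdated inputs and must be discarded.
    ParentNotCurrent { current: Option<SitrepId> },
}

/// Hypothetical in-memory model of the sitrep version history.
#[derive(Default)]
struct SitrepHistory {
    /// Entry at index `i` is the sitrep that became current as version `i + 1`.
    versions: Vec<SitrepId>,
}

impl SitrepHistory {
    /// The current sitrep is the one with the highest version number.
    fn current(&self) -> Option<SitrepId> {
        self.versions.last().copied()
    }

    /// Make `id` current if and only if `parent` is still the current sitrep.
    /// On success, returns the version number assigned to `id`.
    fn try_make_current(
        &mut self,
        id: SitrepId,
        parent: Option<SitrepId>,
    ) -> Result<u32, InsertError> {
        let current = self.current();
        if parent != current {
            return Err(InsertError::ParentNotCurrent { current });
        }
        self.versions.push(id);
        Ok(self.versions.len() as u32)
    }
}
```

In this model, two planners that both start from the same parent cannot both succeed: the first insert advances the current sitrep, and the second fails with `ParentNotCurrent`. In the real system the check and the insert would have to happen atomically in the database, which this single-threaded sketch does not attempt to show.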
This branch adds the foundation of the sitrep subsystem. In particular, it includes the following:
- fm_sitrep table, which stores metadata for sitreps, and the fm_sitrep_history table, which stores the version history
- nexus_types types for the same
- fm_sitrep_loader task that loads the latest sitrep version and publishes it over a tokio::sync::watch channel (which is not presently consumed by other code)

Right now, a sitrep only contains its top-level metadata. Other tables for storing parts of the sitrep, such as cases and records for updating Problems, will be added later as more of the control plane fault management subsystem is implemented. Currently, no sitreps are ever created outside of tests, so this code won't really do anything yet. But it's an important foundation for the rest of the FM work, so I wanted to get it up for review as soon as possible.