See #6973 for some background context, as well as the ad-hoc meeting from 11/4 with @jgallagher, @davepacheco, and @andrewjstone on this topic.
Background
- The planner and reconfigurator-cli both use the blueprint builder to construct blueprints.
- The planner would be used by Nexus, and likely has a more conservative bias towards constructing valid blueprints.
- The reconfigurator-cli acts as something of a "system override", and wants to construct blueprints that are "valid enough", but which may deviate from the constructions that the planner might create.
- Defining what abnormalities are valid / not valid is somewhat subtle. For example:
  - Assigning the same underlay address to distinct services is probably always invalid. This could be categorized as a hard error.
  - Deploying multiple services which have incompatible versions is invalid, but should it be prohibited from ever being constructed by the reconfigurator-cli?
  - Deploying a blueprint with "no Nexuses" - this could be viewed as a deviation from policy, but on production systems, it'll create an inoperable system. How are we categorizing the validity of a blueprint with this configuration?
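To make the "hard error" case concrete, here is a minimal sketch of a duplicate-underlay-address check. `ZoneSketch` and `duplicate_underlay_addresses` are hypothetical names invented for illustration; they are not the real blueprint types, which carry far more state.

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;

/// Hypothetical, simplified stand-in for a zone entry in a blueprint.
#[derive(Debug, Clone)]
pub struct ZoneSketch {
    pub name: String,
    pub underlay_address: Ipv6Addr,
}

/// Returns every underlay address claimed by more than one zone,
/// along with the names of the zones that claim it.
pub fn duplicate_underlay_addresses(
    zones: &[ZoneSketch],
) -> BTreeMap<Ipv6Addr, Vec<String>> {
    let mut by_addr: BTreeMap<Ipv6Addr, Vec<String>> = BTreeMap::new();
    for zone in zones {
        by_addr
            .entry(zone.underlay_address)
            .or_default()
            .push(zone.name.clone());
    }
    // Keep only the addresses shared by two or more zones.
    by_addr.retain(|_, names| names.len() > 1);
    by_addr
}
```

An empty map means no conflicts were found; any entry would presumably be reported as a hard error regardless of who built the blueprint.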
Categorizing Validity
It will be important for us to define some of these error cases (that is, which deviations from an "okay" blueprint are acceptable and which are not) as we define:
- What is valid for the blueprint builder API to produce?
- What is valid for the reconfigurator-cli to emit?
We've discussed using at least the following categories, though there may be more:
- Blueprint OK, matches policy: The blueprint is valid, and we cannot find any ways in which it deviates from the policy the planner would use.
- Blueprint OK, but deviates from policy: The blueprint could be deployed, but does not match our policy. For example, if our policy is to deploy three Nexus zones, a blueprint in this category might attempt to deploy two or four Nexus zones.
- Blueprint Erroneous: There are many flavors here, but this category includes:
  - The blueprint cannot be deployed (we know ahead of time that a sled agent could or should reject it)
  - The blueprint would render the system inoperable (e.g. it deletes all Nexus zones)
  - The blueprint contains an internal inconsistency (e.g. data modified without changing the generation number)
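As a rough sketch of how these categories could be modeled, the enum below orders them from least to most severe, so the overall verdict for a blueprint is the maximum over all findings. Every name here (`Validity`, `Finding`, `check_nexus_count`, `overall`) is hypothetical, and treating zero Nexus zones as erroneous is just one reading of the categories above.

```rust
/// Hypothetical categories, ordered from least to most severe so that
/// the derived `Ord` lets us take the maximum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Validity {
    OkMatchesPolicy,
    OkDeviatesFromPolicy,
    Erroneous,
}

/// One problem noticed in a blueprint (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Finding {
    pub validity: Validity,
    pub message: String,
}

/// Hypothetical check: compare the Nexus zone count against policy.
/// Returns `None` when the count matches the policy target.
pub fn check_nexus_count(actual: usize, policy_target: usize) -> Option<Finding> {
    if actual == 0 {
        // Assumption: no Nexus zones at all renders the system inoperable.
        Some(Finding {
            validity: Validity::Erroneous,
            message: "blueprint has no Nexus zones".to_string(),
        })
    } else if actual != policy_target {
        Some(Finding {
            validity: Validity::OkDeviatesFromPolicy,
            message: format!(
                "blueprint has {} Nexus zones; policy calls for {}",
                actual, policy_target
            ),
        })
    } else {
        None
    }
}

/// The overall verdict is the most severe finding, or
/// `OkMatchesPolicy` when there are no findings at all.
pub fn overall(findings: &[Finding]) -> Validity {
    findings
        .iter()
        .map(|f| f.validity)
        .max()
        .unwrap_or(Validity::OkMatchesPolicy)
}
```

Keeping the individual findings, rather than only the final verdict, would let a tool explain why a blueprint landed in a given category.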
Identifying Validity
This issue proposes a blueprint checker (perhaps called blippy) which can inspect a blueprint and identify "how valid" the blueprint appears, with categorization of how far the blueprint deviates from the norm.
We could use blippy in the following spots:
- As a standalone tool for inspecting blueprints
- As a part of the blueprint builder, to help the planner validate it has not created a "known erroneous" blueprint
- As a part of the reconfigurator-cli, to help users identify that their changes only deviate from a policy, and are not a violation of correctness guarantees (or, perhaps, we let people do this anyway, but with many warnings)
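One possible way the callers listed above could act on a checker verdict is sketched below. This is an assumption-heavy illustration, not a proposal for the actual API: `Severity`, `Caller`, `Decision`, `decide`, and the `allow_erroneous` override are all hypothetical, and the question of whether the reconfigurator-cli should ever accept an erroneous blueprint is still open.

```rust
/// Hypothetical severity of the worst problem found in a blueprint.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    MatchesPolicy,
    DeviatesFromPolicy,
    Erroneous,
}

/// Who is asking for the blueprint to be accepted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Caller {
    Planner,
    /// `allow_erroneous` models an explicit operator override (assumption).
    ReconfiguratorCli { allow_erroneous: bool },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Accept,
    AcceptWithWarnings,
    Reject,
}

/// Sketch of a gating policy: policy deviations only warn, while
/// erroneous blueprints are rejected unless the CLI user overrides.
pub fn decide(caller: Caller, worst: Severity) -> Decision {
    match worst {
        Severity::MatchesPolicy => Decision::Accept,
        Severity::DeviatesFromPolicy => Decision::AcceptWithWarnings,
        Severity::Erroneous => match caller {
            Caller::ReconfiguratorCli { allow_erroneous: true } => {
                Decision::AcceptWithWarnings
            }
            _ => Decision::Reject,
        },
    }
}
```

Under this sketch the planner can never emit a known-erroneous blueprint, while the reconfigurator-cli keeps its "system override" role behind an explicit opt-in.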