Skip to content

Commit

Permalink
Document design choices behind the API server model (#561)
Browse files Browse the repository at this point in the history
This PR also removes the `stable_resources` field and fixes a proof
broken due to the removal (unpredictability of SMT solver again!).

---------

Signed-off-by: Xudong Sun <[email protected]>
  • Loading branch information
marshtompsxd authored Oct 5, 2024
1 parent 7075b47 commit 829b6a6
Show file tree
Hide file tree
Showing 3 changed files with 111 additions and 42 deletions.
10 changes: 7 additions & 3 deletions src/v2/kubernetes_cluster/proof/controller_runtime_safety.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,14 +2,14 @@
// SPDX-License-Identifier: MIT
#![allow(unused_imports)]
use crate::kubernetes_api_objects::spec::prelude::*;
use crate::temporal_logic::{defs::*, rules::*};
use crate::kubernetes_cluster::spec::{
api_server::{state_machine::transition_by_etcd, types::*},
cluster::*,
controller::types::*,
external::{state_machine::*, types::*},
message::*,
};
use crate::temporal_logic::{defs::*, rules::*};
use vstd::prelude::*;

verus! {
Expand Down Expand Up @@ -91,11 +91,15 @@ pub proof fn lemma_always_triggering_cr_has_lower_uid_than_uid_counter(self, spe
// - If the response is processed by the controller, the controller will create a new pending request in flight which
// allows the invariant to still hold.

pub open spec fn state_comes_with_a_pending_request(self, controller_id: int, state: spec_fn(ReconcileLocalState) -> bool) -> bool {
&&& forall |s| #[trigger] state(s) ==> s != (self.controller_models[controller_id].reconcile_model.init)()
&&& forall |cr, resp_o, pre_state| #[trigger] state((self.controller_models[controller_id].reconcile_model.transition)(cr, resp_o, pre_state).0) ==> (self.controller_models[controller_id].reconcile_model.transition)(cr, resp_o, pre_state).1.is_Some()
}

pub proof fn lemma_always_pending_req_in_flight_or_resp_in_flight_at_reconcile_state(self, spec: TempPred<ClusterState>, controller_id: int, key: ObjectRef, state: spec_fn(ReconcileLocalState) -> bool)
requires
self.controller_models.contains_key(controller_id),
forall |s| (#[trigger] state(s)) ==> s != (self.controller_models[controller_id].reconcile_model.init)(),
forall |cr, resp_o, pre_state| #[trigger] state((self.controller_models[controller_id].reconcile_model.transition)(cr, resp_o, pre_state).0) ==> (self.controller_models[controller_id].reconcile_model.transition)(cr, resp_o, pre_state).1.is_Some(),
self.state_comes_with_a_pending_request(controller_id, state),
spec.entails(lift_state(self.init())),
spec.entails(always(lift_action(self.next()))),
spec.entails(always(lift_state(Self::pending_req_of_key_is_unique_with_unique_id(controller_id, key)))),
Expand Down
142 changes: 104 additions & 38 deletions src/v2/kubernetes_cluster/spec/api_server/state_machine.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,13 +12,69 @@ use vstd::{multiset::*, prelude::*};

verus! {

// TODO:
// + Need more validation checks.
// For example, RoleBinding's roleRef is immutable: https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrolebinding-example
// This state machine describes the behavior of the API server and its back-end datastore (e.g., etcd).
// More concretely, it models how the datastore stores state objects, and how the API server handles
// controllers' requests to read and write state objects. We model the two components in one state
// machine because controllers always interact with state objects via API servers.
//
// + Check kind-specific strategy like AllowCreateOnUpdate()
// The datastore provides a key-value store interface to store all the state objects. For each
// state object, the key is a tuple of its name, namespace and kind, and the value is the entire
// object. Besides name, namespace and kind, each object has other metadata fields including:
// * a resource_version that tracks the revision of this object
// * a uid that uniquely distinguishes an object from its historical occurrences
// * so on...
//
// + Support graceful deletion
// The API server provides a REST API, including Get, List, Create, Update and Delete. The API
// implementation is more complex than a simple key-value store because a modification to the
// cluster state must pass a series of validation checks (e.g., whether the resource_version matches).
// In real world, each REST API operation above is supposed to be atomic and implemented using a etcd
// transaction, so we can model them using one state machine step.
//
// This state machine ues a Map to represent the state object datastore, matching its key-value interface.
// Besides the key-value store, this state machine also needs to maintain a local state that is used
// for managing the metadata of each object. E.g., it needs to allocate a universally unique uid to
// each object when it gets created, and updates the resource version of each object when it's updated.
//
// The model is not always submissive to the implementation. When writing the model, we sometimes choose
// the data representation that is most suitable for formal reasoning without affecting correctness.
// For example, the uid is modeled as an int, instead of a string. Generating unique uid is done by
// incrementing an int counter, so that each newly generated uid is different from any previous ones.
// This simple modeling easily guarantees the uniqueness of uid without tedious reasoning on strings.
// To make sure this inconsistency does not sacrifice soundness, Anvil does not allow controllers to see
// an object's uid. Instead, a controller can check whether two uids are the same, and copying a uid to
// construct an owner_reference or deletion precondition. In this way, modeling the uid as an int or a string
// makes no difference to controller implementations.
//
// Importantly, we do *not* model the API server's watch cache for simplicity. This means that:
// * Watch operation is not modeled, and controllers rely on the schedule_controller_reconcile action of the
// cluster state machine to get invoked (instead of getting triggered by a watch stream).
// * Controllers are forced to *not* read from the API server's watch cache and every read hits the consistent datastore.
// This avoids the time-travel bugs reported by the Sieve project (https://github.com/sieve-project/sieve).
// Note that controllers can still maintain an internal cache of the cluster state if they want.
//
// Certainly, this model is not perfect and have some inconsistency and incompleteness issues. We have a TODO
// list to gradually improve the model.
//
// The TODO list:
// + Support more expressive list operation
//
// + Model Patch (if needed)
//
// + Model uniqueness of generated name using spec ensures (when supported)
//
// + Model all the validation checks of built-in types
//
// + Model kind-specific strategy like AllowCreateOnUpdate
//
// + Model graceful deletion
//
// + Model foreground and orphan deletion options
//
// + Support cluster-wide state objects (the ones that don't belong to a namespace)
//
// + Keep the error code consistent with the real API Server
//
// + Document intended mismatch between the model and the real API server

#[verifier(inline)]
pub open spec fn unmarshallable_spec(obj: DynamicObjectView, installed_types: InstalledTypes) -> bool {
Expand Down Expand Up @@ -155,7 +211,6 @@ pub open spec fn handle_get_request(req: GetRequest, s: APIServerState) -> GetRe

#[verifier(inline)]
pub open spec fn handle_list_request(req: ListRequest, s: APIServerState) -> ListResponse {
// TODO: List should consider other fields
let selector = |o: DynamicObjectView| {
&&& o.object_ref().namespace == req.namespace
&&& o.object_ref().kind == req.kind
Expand Down Expand Up @@ -194,6 +249,10 @@ pub open spec fn created_object_validity_check(created_obj: DynamicObjectView, i
pub closed spec fn generate_name(s: APIServerState) -> StringView;

// TODO: This should be a spec ensures of the generate_name above
// NOTE: In the actual implementation, the API server might fail to generate a unique name
// after a number of attempts. For the sake of liveness, we simplify that behavior and assume
// that the API server is always able to find a unique name by random generation in our model.
// For more details, see the implementation: https://github.com/kubernetes/kubernetes/blob/v1.30.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L432-L443
#[verifier(external_body)]
pub proof fn generated_name_is_unique(s: APIServerState)
ensures forall |key| #[trigger] s.resources.contains_key(key) ==> key.name != generate_name(s)
Expand Down Expand Up @@ -244,8 +303,6 @@ pub open spec fn handle_create_request(installed_types: InstalledTypes, req: Cre
// Creation succeeds.
(APIServerState {
resources: s.resources.insert(created_obj.object_ref(), created_obj),
// The object just gets created so it is not stable yet: built-in controller might update it
stable_resources: s.stable_resources.remove(created_obj.object_ref()),
uid_counter: s.uid_counter + 1,
resource_version_counter: s.resource_version_counter + 1,
..s
Expand Down Expand Up @@ -284,6 +341,11 @@ pub open spec fn delete_request_admission_check(req: DeleteRequest, s: APIServer
// modeling and proof much easier compared to modelling the real clock.
pub closed spec fn deletion_timestamp() -> StringView;

// NOTE: Deletion has three modes including background (default), foreground, orphan.
// For now, we only model background deletion and only allow our controllers to perform background deletion
// (by hiding the mode option).
// This is acceptable for this project because background deletion is enough for all the reconciliation tasks
// performed by our controllers.
pub open spec fn handle_delete_request(req: DeleteRequest, s: APIServerState) -> (APIServerState, DeleteResponse) {
if delete_request_admission_check(req, s).is_Some() {
// Deletion fails.
Expand All @@ -295,7 +357,17 @@ pub open spec fn handle_delete_request(req: DeleteRequest, s: APIServerState) ->
// With the finalizer(s) in the object, we cannot immediately delete it from the key-value store.
// Instead, we set the deletion timestamp of this object.
// If the object already has a deletion timestamp, then skip.
//
// NOTE: Having finalizer(s) is not the only reason that a deletion is postponed. Having a graceful
// period set in the deletion option is another reason. Currently we do not model the graceful period
// option so we don't model how the API server checks whether a deletion should be graceful.
// However, even without a graceful period option, deleting a Pod can still be graceful depending on its
// spec content (see https://github.com/kubernetes/kubernetes/blob/v1.30.0/pkg/apis/core/types.go#L3256)
// because Pod implements CheckGracefulDelete (see https://github.com/kubernetes/kubernetes/blob/v1.30.0/pkg/registry/core/pod/strategy.go#L168).
// This is irrelevant to application controllers that do not manage pods, and acceptable for verifying
// low-level built-in controllers because they are supposed to treat terminating pods as non-existing pods.
if obj.metadata.deletion_timestamp.is_Some() {
// A deletion timestamp is already set so no need to bother it.
(s, DeleteResponse{res: Ok(())})
} else {
let stamped_obj_with_new_rv = obj.set_deletion_timestamp(deletion_timestamp())
Expand All @@ -313,6 +385,13 @@ pub open spec fn handle_delete_request(req: DeleteRequest, s: APIServerState) ->
}
} else {
// The object can be immediately removed from the key-value store.
//
// NOTE: In some very corner case, the API server *seems* to first updates the object (to update its finalizers)
// and then deletes the object immediately, which makes the entire Delete operation not atomic.
// However, this only happens in the orphan or foreground deletion mode, so we do not model this
// seemingly non-atomic behavior for now.
// For more details, see how the API server updates the object in the middle of handling deletion requests:
// https://github.com/kubernetes/kubernetes/blob/v1.30.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L1009
(APIServerState {
resources: s.resources.remove(req.key),
resource_version_counter: s.resource_version_counter + 1,
Expand Down Expand Up @@ -440,25 +519,33 @@ pub open spec fn handle_update_request(installed_types: InstalledTypes, req: Upd
// or has at least one finalizer.
(APIServerState {
resources: s.resources.insert(req.key(), updated_obj_with_new_rv),
// The object just gets updated so it is not stable yet: built-in controller might update it
stable_resources: s.stable_resources.remove(req.key()),
resource_version_counter: s.resource_version_counter + 1, // Advance the rv counter
..s
}, UpdateResponse{res: Ok(updated_obj_with_new_rv)})
} else {
// The delete-during-update case, where the update removes the finalizer from
// The delete-during-update case, where the update removes the finalizers from
// the object that has a deletion timestamp, so the object needs to be deleted now.
//
// NOTE: in the actual implementation, when handling an update request,
// the API server first applies the update to the object in the key-value store,
// then checks whether object can be deleted.
// then checks whether the object can be deleted.
// If so, it continues to delete the object from the key-value store before replying
// to the update request.
// This means that there is a super short window where the object has been updated in the store
// (with a deletion timestamp and without any finalizer), but has not been deleted yet.
// We choose not to model this short window and instead merge the update and delete into one atomic action
// because the controller that issues the update request in an non-async manner will not observe
// this intermediate state within the short window.
// When the update request returns, the object has been deleted anyway.
// This behavior, strictly speaking, makes the entire Update operation not atomic.
// We choose to still model the Update operation as an atomic step for simplicity.
// This design choice does not affect the correctness of the controller that issues Update
// in a blocking manner because the controller will never observe this intermediate state between
// the update and deletion.
// More generally speaking, this modeling won't affect the correctness of any controller that
// treats a terminating object without finalizers as a non-existing object --- a good practice
// controllers should follow.
//
// NOTE: the API server should also check whether the deletion grace period in the metadata
// is set to 0 before deleting the object in case of graceful deletion
// (see https://github.com/kubernetes/kubernetes/blob/v1.30.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L533).
// Here we omit that condition because graceful deletion is not supported.
(APIServerState {
resources: s.resources.remove(updated_obj_with_new_rv.object_ref()),
resource_version_counter: s.resource_version_counter + 1, // Advance the rv counter
Expand Down Expand Up @@ -568,7 +655,6 @@ pub open spec fn handle_update_status_request_msg(installed_types: InstalledType
(s_prime, form_update_status_resp_msg(msg, resp))
}

// etcd is modeled as a centralized map that handles get/list/create/delete/update
pub open spec fn transition_by_etcd(installed_types: InstalledTypes, msg: Message, s: APIServerState) -> (APIServerState, Message)
recommends
msg.content.is_APIRequest(),
Expand All @@ -590,25 +676,6 @@ pub open spec fn handle_request(installed_types: InstalledTypes) -> APIServerAct
&&& input.recv.get_Some_0().content.is_APIRequest()
},
transition: |input: APIServerActionInput, s: APIServerState| {
// This transition describes how Kubernetes API server handles requests,
// which consists of multiple steps in reality:
//
// (1) apiserver receives the request from some controller/client,
// perform some validation (e.g., through webhook)
// and forwards the request to the underlying datastore (e.g., etcd);
// (2) The datastore reads/writes the cluster state and replies to apiserver;
// (3) The datastore also sends a notification of state update caused by the request to apiserver;
// (4) apiserver replies to the controller/client that sent the request;
// (5) apiserver forwards the notification to any controllers/clients that subscribes for the updates.
//
// Note that built-in controllers often subscribes for updates to several kinds of resources.
// For example, the statefulset controller subscribes updates to all statefulset objects.
// When seeing a new statefulset is created,
// it will send requests to create pods and volumes owned by this statefulset.
//
// Here we simplify step (1) ~ (5) by omitting the process that state changes are streamed
// to built-in controllers and activate their reconciliation.
// Built-in controllers will be specified as actions of the top level cluster state machine.
let (s_prime, etcd_resp) = transition_by_etcd(installed_types, input.recv.get_Some_0(), s);
(s_prime, APIServerActionOutput {
send: Multiset::singleton(etcd_resp)
Expand All @@ -620,8 +687,7 @@ pub open spec fn handle_request(installed_types: InstalledTypes) -> APIServerAct
pub open spec fn api_server(installed_types: InstalledTypes) -> APIServerStateMachine {
StateMachine {
init: |s: APIServerState| {
&&& s.resources == Map::<ObjectRef, DynamicObjectView>::empty()
&&& s.stable_resources == Set::<ObjectRef>::empty()
s.resources == Map::<ObjectRef, DynamicObjectView>::empty()
},
actions: set![handle_request(installed_types)],
step_to_action: |step: APIServerStep| {
Expand Down
1 change: 0 additions & 1 deletion src/v2/kubernetes_cluster/spec/api_server/types.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,6 @@ pub struct APIServerState {
pub resources: StoredState,
pub uid_counter: Uid,
pub resource_version_counter: ResourceVersion,
pub stable_resources: Set<ObjectRef>,
}

pub type InstalledTypes = Map<StringView, InstalledType>;
Expand Down

0 comments on commit 829b6a6

Please sign in to comment.