Reason about common behavior among controllers in the reconciliation logic #168

euclidgame · 2023-07-05T23:15:57Z

Context and problem

To reduce developers' proof effort, Anvil needs to provide highly applicable lemmas that encodes representative liveness/safety reasoning for different controllers and can be directly applied to any specific newly-developed controller.

We had several such lemmas in controller_runtime_liveness and controller_runtime_safety. However, they are limited to reasoning about state transitions and invariants for the initial, error, or done step because these properties do not talk about the core reconcile process (i.e., the continue_reconcile action of controller, determined by reconcile_core).

Despite such generic lemmas, we still need to write controller-specific lemmas when reasoning about how reconcile_core transitions the state (i.e., the precondition of each transition and the resulting new state). Unfortunately, we end up writing highly repeated but slightly different lemmas that reason about how reconcile_core starting from a concrete state can lead to another concrete state. This practice is tedious and error-prone and causes unnecessary code bloat for our proof.

Solution

With all the lessons learned along the way of proving liveness of the zookeeper controller, we notice that there are indeed many common patterns among the controller-specific lemmas. In other words, when proving spec |= a ~> b, we may not require the details of what a and b are or what operations give us the ~> relation. Instead, it only requires certain relations between a and b.

If we can formalize such relation correctly in the form of preconditions, there is a chance to significantly simplify the proof by replacing all such controller-specific lemmas with a more general one.

Take the zookeeper controller for example. The following code block demonstrates how the controller advances from different steps:

match step {
......
   ZookeeperReconcileStep::AfterCreateHeadlessService => {
       let client_service = make_client_service(zk);
       let req_o = Option::Some(APIRequest::CreateRequest(CreateRequest{
           namespace: zk.metadata.namespace.get_Some_0(),
           obj: client_service.to_dynamic_object(),
       }));
       let state_prime = ZookeeperReconcileState {
           reconcile_step: ZookeeperReconcileStep::AfterCreateClientService,
           ..state
       };
       (state_prime, req_o)
   },
   ZookeeperReconcileStep::AfterCreateClientService => {
       let admin_server_service = make_admin_server_service(zk);
       let req_o = Option::Some(APIRequest::CreateRequest(CreateRequest{
           namespace: zk.metadata.namespace.get_Some_0(),
           obj: admin_server_service.to_dynamic_object(),
       }));
       let state_prime = ZookeeperReconcileState {
           reconcile_step: ZookeeperReconcileStep::AfterCreateAdminServerService,
           ..state
       };
       (state_prime, req_o)
   },
   ZookeeperReconcileStep::AfterCreateAdminServerService => {
       let config_map = make_config_map(zk);
       let req_o = Option::Some(APIRequest::CreateRequest(CreateRequest{
           namespace: zk.metadata.namespace.get_Some_0(),
           obj: config_map.to_dynamic_object(),
       }));
       let state_prime = ZookeeperReconcileState {
           reconcile_step: ZookeeperReconcileStep::AfterCreateConfigMap,
           ..state
       };
       (state_prime, req_o)
   },
......
}

An important observation is that the three cases are quite similar: we can see that there is only one possible next step for each of the three steps. Therefore, the proof strategies of spec |= at_step(current_step) ~> at_step(next_step) for them resemble each other at a high level. Instead of writing three lemmas that talk about the three transitions (which is what we did), we can come up with one lemma that talks about how the zookeeper controller advances from the current step to the next step, if that step is the only possible step that is directly reachable from the current step (this is what #166 does!). Note that this lemma can be further generalized to apply to other controllers.

Future work

Later we can move on to fold more complicated reasoning patterns into highly usable lemmas, including what we have before the reconciliation and how objects are changed after that. Actually, I think the operations can be generally categorized into several types. Besides, taking the common behavior in the reconcile_core into consideration, we can also prove many general safety invariants that can apply to different states of many different controllers.

The progress so far shows that this approach is promising in reducing proof efforts for newly-developed controllers.

The text was updated successfully, but these errors were encountered:

euclidgame · 2023-07-05T23:18:34Z

I want to record the "common behaviors" in this issue, so let me first formally explain the case in the top comment.

euclidgame · 2023-07-06T00:05:01Z

If for a reconcile state state_a that is not an error or done state, controller being at state_a always implies its pending request is in flight or a corresponding response is in flight, the controller will eventually enter the next state state_b.

We require state_a not to be done or error to let the verifier know that the only possible action to change the current state is reconcile_core (not end_reconcile, and run_scheduled_reconcile is impossible because the reconcile already starts).

Then we have such a lemma:
https://github.com/vmware-research/verifiable-controllers/blob/6438c35cc331dc5e2a4031a0c1a69266fc03b131/src/kubernetes_cluster/proof/controller_runtime_liveness.rs#L521
If spec |= state_b ~> reconcile_idle is already in place, we can then derive spec |= state_a ~> reconcile_idle. So the termination proof of zookeeper can be seen as a series applications of this lemma in reverse order of all the steps.

euclidgame · 2023-07-06T00:37:13Z

This is similar for some safety invariants.

We want to have an invariant that controller being at some state implies there is a pending request which is either in flight or has an in-flight matched response. Previously, we proved this property for a number of states, see:
https://github.com/vmware-research/verifiable-controllers/blob/7621c62f9c00f5403b40bef2de1ad4630b3d6aa6/src/controller_examples/zookeeper_controller/proof/safety.rs#L869

To avoid repetition, think about if we want a state state_x to have such an invariant, what constraints should be imposed on state_x? The answer is, if the controller enters state_x in reconcile_core, it must create a request which is part of the return value. Then we have this lemma:
https://github.com/vmware-research/verifiable-controllers/blob/6438c35cc331dc5e2a4031a0c1a69266fc03b131/src/kubernetes_cluster/proof/controller_runtime_safety.rs#L526

It is now a property of controller runtime that can apply to different controllers.

marshtompsxd assigned tianyin, marshtompsxd and euclidgame Jul 5, 2023

marshtompsxd closed this as completed May 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reason about common behavior among controllers in the reconciliation logic #168

Reason about common behavior among controllers in the reconciliation logic #168

euclidgame commented Jul 5, 2023 •

edited by marshtompsxd

Loading

euclidgame commented Jul 5, 2023

euclidgame commented Jul 6, 2023

euclidgame commented Jul 6, 2023

Reason about common behavior among controllers in the reconciliation logic #168

Reason about common behavior among controllers in the reconciliation logic #168

Comments

euclidgame commented Jul 5, 2023 • edited by marshtompsxd Loading

Context and problem

Solution

Future work

euclidgame commented Jul 5, 2023

euclidgame commented Jul 6, 2023

euclidgame commented Jul 6, 2023

euclidgame commented Jul 5, 2023 •

edited by marshtompsxd

Loading