Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aligned approach for Kyma controllers to revert additions to Kyma-managed resources #953

Open
c-pius opened this issue Nov 29, 2024 · 0 comments
Assignees
Labels
decision Related to all issues that need a decision

Comments

@c-pius
Copy link

c-pius commented Nov 29, 2024

Created on 2024-11-29 by Christoph Schwägerl (@c-pius).

Decision log

Name Description
Title Aligned approach for Kyma controllers to revert additions to Kyma-managed resources
Due date ?
Status Re-proposed on 2024-12-XX (v0.0.2)
Decision type Choice
Affected decisions Extends #941

Changelog

  • v0.0.1
    • initial proposal
  • v0.0.2
    • clarified that SRE changes are also expected to be reverted unless skip-reconcilation=true; this leads to a new challenge
    • tested that also changes applied via client-side apply are reverted
    • added Option 4 to prevent unexpected changes via a ValidatingWebhook
    • added Decision and Consequences for Option 2 - Reverting added fields with SSA

Context

Controllers within Kyma often use Server-Side Apply (SSA) for reconciling the resources they manage.

Reasons: shift load from the controller to the API server; see also benefits described by K8s

Recently, issues with using SSA have been observed when it comes to reverting resources that are managed by controller A, but where another actor B adds fields to the resource that controller A isn't aware of and that should be reverted by controller A. For example:

  • lifecycle-manager reconciles a deployment
    kind: Deployment
    spec:
      replicas: 1
      template:
        spec:
        # ...
  • another actor configures a volume for this deployment; lifecycle-manager isn't aware of this added volume (it is not part of the deployment specified in the module's raw manfiest)
    kind: Deployment
    spec:
      replicas: 1
      template:
        spec:
        # ...
        volumes:
        - configMap:
            defaultMode: 420
            name: my-cfg
          name: config
  • continued reconciles by lifecycle-manager do NOT remove this volume

Real examples where this has been problematic:

The goal of this ADR is to align a uniform approach within Kyma that enables our controllers to revert undesired additions to Kyma-managed objects.

Expectations

The following cases summarize the expected scenarios where Kyma controllers are either expected to revert changes or to keep changes.

Expectation 1 - Revert resources to their Kyma-managed state

In general, it is expected that Kyma controllers revert additions to the resources they manage. This is also expressed by the disclaimer "DO NOT EDIT - This resource is managed by Kyma. Any modifications are discarded and the resource is reverted to the original state." present on resources managed by lifecycle-manager (currently not used uniformly). Any additions to those resources by the customer must be reverted. E.g.,

  • resources from the module's raw manifest
  • internal resources like SKR webhook-related ones

However, there are exceptions where Kyma provides the initial state of a resource, and it is expected that the customer extends this. In those cases, the changes must not be reverted. E.g.,

  • KymaCR where the customer modifies the modules
  • default CRs where the customer may change configuration

Expectation 2 - Don't revert additions by K8s

For some resources, K8s itself adds information to the resources. This is mostly limited to .metadata and .state. Those changes must not be reverted. E.g.,

  • kube-controller-manager extends .metadata.annotations and .status of a Deployment

Expectation 3 - Revert additions by SRE unless skip-reconcilation=true

SRE or other Kyma-internal actors may need to modify resources while troubleshooting. Those changes are also reverted unless the operator.kyma-project.io/skip-reconciliation=true label is present on the resource.

Expectation 4 - Don't revert modifications from other Kyma controllers

Some resources may be managed by multiple Kyma-internal controllers. It must be ensured that those don't end up in an endless "add-revert loop". E.g.,

  • Namespace kyma-system is controlled by lifecycle-manager, but environment-broker adds labels
  • CustomResourceDefinitions (CRDs) are controlled by lifecycle-manager, but module managers may add a conversion webhook

Expectation 5 - Revert resources when managing a previously unmanaged module

When re-managing a previously unmanaged module (see feature request), the Kyma controller takes care of a resource initially provided by the customer. In this case, all extensions by the customer not part of the original config shall be reverted.

Key Assumptions

Key Assumption 1 - Managers don't give up ownership properly

When working with SSA, the clean approach would be to give up ownership. For fields that should remain to exist, this would be done by transferring ownership to another manager. For fields that were added and are not necessary anymore, this should be done by sending another SSA without the field included. Experience from the tickets linked in Context shows that this is not always the case. Also, this is not the case when customers edit the resources manually.

Option 1 - Recreating the object

The situation could be solved by re-creating the object. The problem is however that this may cause temporary disruptions until the object is re-created and all expected managers have added their patches. Therefore, this is considered as a non-option.

Option 2 - Reverting added fields with SSA

💡 It is recommended to first get a rough understanding of how SSA and especially the mechanism of "Field management" works before reading further. See Server-Side Apply in K8s docs.

In summary, SSA allows the caller to modify a resource by providing the subset of fields the caller cares for. The K8s runtime takes care of either creating a new resource if it doesn't exist yet, or patching an existing resource with the provided fields while maintaining the remaining state. To keep track of who is controlling what part of the single resource, K8s stores the caller's identity as "field manager" together with the fields this manager controls in .metadata.managedFields. If multiple managers try to modify the same field, this results in a conflict and the update is rejected. For controllers, it is recommended that they always force conflicts on objects they own meaning that the controller overwrites the other managers value.

Due to the way how SSA is designed to work, a controller A using SSA to reconcile a resource is not able to revert an addition by actor B that controller A is not aware of without further consideration. In the following, a possible way of enabling our controllers to revert such additions is outlined.

Approach - Detect unexpected managers, remove them, and take full ownership of the resource

This option consists of three steps that are executed consecutively, and a fourth step which is the regular reconcile of the resource:

  1. the controller detects that there are unexpected field managers present on the resource
  2. the controller removes the unexpected field managers from the resource
  3. the controller re-applies the full resource including the unknown fields to claim ownership of those fields
  4. the controller re-applies its desired state of the object without the unknown fields (regular reconcile); this removes those fields again as the controller previously claimed their ownership and the consecutive call without those fields is interpreted as desire to remove those fields

In pseudo code, this may look as follows:

func ssaObject(obj info.Object) error {
  // 0. regularly reconcile the obj, the obj argument is the desired state as retrieved from the module's raw manifest
  // after this call, obj is filled with the full state as currently in the cluster
  clnt.Patch(ctx, obj, client.Apply, client.ForceOwnership, OWNER_NAME)

  managedFields := obj.GetManagedFields()
  needToTakeOwnership := false
  for i := 0; i < len(managedFields); i++ {
    // 1. detect unexpected field manager
    if isUnexpectedFieldManager(managedFields[i].Manager) {
      // 2. remove unexpected field manager from managedFields
			patch := []byte(fmt.Sprintf(`[{"op": "remove", "path": "/metadata/managedFields/%d"}]`, i))
			clnt.Patch(ctx, obj, client.RawPatch(types.JSONPatchType, patch))

      needToTakeOwnership = true
    }
  }

  // 3. if there was an unexpected field manager, take ownership by re-applying the full obj
  // as we don't want to change fields, just claim ownership, we also can omit the client.ForceOwnership
  if needToTakeOwnership {
		clnt.Patch(ctx, obj, client.Apply, OWNER_NAME)
  }

  // 4. exit, the next regular recocnile will remove the unexpected fields
  return nil
}

In general, this approach works. It has been tested successfully with both, actor B using SSA and client-side apply. It does NOT work when controller A is using client-side-apply! However, it comes with the following challenges:

Challenge 1 - Detecting what field managers are unexpected

The controller needs to have information what field managers are expected per resource and which ones are not. Maintaining an allow-list per object appears too cumbersome.

As an alternative, it would be possible to create a convention that all controllers in Kyma MUST use a pattern for their field manager name like operator.kyma-project.io/<manager-name>. By this, all field managers not complying to this pattern could be filtered out as unexpected ones. Additionally, this would require SRE and any other Kyma-internal actor making changes to the object via kubectl to specify a field manager with the aligned prefix (kubectl flag --field-manager).

Also, the K8s internal managers like kube-controller-manager MUST be allow-listed.

Challenge 2 - Temporarily taking ownership from other expected managers

When re-sending the full object (step 3.), the controller also claims ownership of the fields owned by another expected controller. This should not be a problem since the controller gives up ownership again with the regular reconcile that doesn't include the other field managers fields (step 4.).

Challenge 3 - Automatic or on-demand action

It needs to be decided whether this reverting approach is executed automatically, or if it is executed on demand. On-demand would mean to actively trigger this approach to be executed. This could be done by adding an annotation to the resource. With this approach, it could also be decided whether the revert is executed on all unexpected field managers as described above, or if only a field manager named in the annotation value is removed.

Challenge 4 - Giving up ownership

If decided to follow an approach where we detect known field managers based on a name prefix, Kyma controller A does not revert additions from Kyma controller B. Consequently, it would not automatically solve the problem described in #941.

The proper approach for this case would be that the module-manager (controller B) still gives up ownership properly by sending an SSA with an empty object.

Option 3 - Not reverting added fields

Additions to the resource are not reverted. Instead, additions made by the user are ignored. If they lead to errors in the cluster, they are considered user errors.

When it comes to multiple Kyma controllers handling the same resource, the controllers must ensure that they properly give up ownership of patches they added when not needed anymore. I.e, they must send an SSA with an empty object. This will remove their additions to the object and the field manager entry.

Option 4 - Prevent updates with a validation webhook

Instead of reverting changes done by unkown actors, they could also be prevented in the first place.
The idea is to configure validating webhoooks that intercept all changes on kyma-managed resources.
If an unkown actor is trying to change the resource, the validating webhook should fail.

Challenge 1 - Detecting what actors are unexpected.

This could be solved in the same way as proposed in Option 2. AdmissionRequests contain an options field listing the FieldManager, e.g., see CreateOptions.

Challenge 2 - Detecting what resources to validate

It must be determined what resources to validate.
This may be achieved by namespaceSelector and objectSelector (see ValidatingWebhookConfiguration) filtering all objects labelled with operator.kyma-project.io/managed-by: kyma.
In the webhook, we would need to be able to configure exceptions for KymaCR and DefaultCRs.

Challenge 3 - Rotating the CA bundle

Webhooks must secure communication via certificates that need to be rotated.
This may be achieved by re-using the existing certificates for the SKR webhook.

Challenge 4 - Handling the load

The validation webhook may need to handle a lot of load and should not add much latency to the change requests.
A central webhook in KCP managing all clusters may be facing too much load.
Instead, there should be a webhook per SKR.

Challenge 5 - Single Point of Failure

If this webhook breaks down for whatever reason, this would prevent all changes to Kyma-managed resources in the SKR.
A mitigation is to use failurePolicy: ignore. This will make the admission request succeed on unknown errors from the webhook.
The downside is that this allows unexpected changes while the webhook is down or unavailable, e.g. when the cluster starts.

Decisions

It is decided to pursue Option 2 - Reverting added fields with SSA. Kyma-controllers shall revert changes from unexpected managers via the approach outlined.

In a first step, lifecycle-manager will implement the detection of unexpected managers. It will not revert those changes but write observability data to get an understanding of the changes that would be reverted. Once verified that this matches the expectations, the approach will be rolled out to revert these changes.

Consequences

  • Jellyfish to:
    • implement the detection of unexpected managers (first with an allow-list, later with the naming pattern)
    • write observability data of what changes would be reverted
    • once verified:
      • rename their field-managers to follow the pattern operator.kyma-project.io/<manager-name>
      • implement the reverting logic
      • check whether the pattern could be implemented as a re-use library
  • Module teams to:
    • rename their field-managers to follow the pattern operator.kyma-project.io/<manager-name>
    • implement the revrerting logic where applicable
    • give up ownership properly where needed (see Option 2, Challenge 4)

Addendum

Notes

  • client-side applying an existing object with a new field creates a field manager entry for this new field but DOES NOT take ownership of the existing field
    • in contrast, SSA DOES take ownership of the existing fields
@ruanxin ruanxin added the decision Related to all issues that need a decision label Nov 29, 2024
@c-pius c-pius self-assigned this Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
decision Related to all issues that need a decision
Projects
None yet
Development

No branches or pull requests

2 participants