Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: AP 17 Fleet shard status #82

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
150 changes: 150 additions & 0 deletions _ap/17/index.adoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,150 @@
---
num: 17
category: "Fleet manager, fleet shard and instance operators"
title: "How to build fleet shard status"
status: "Accepted"
authors:
- "Steven Hawkins"
- "Luca Burgazzoli"
tags:
- "kafka"
---

## Intent

Define a pattern for fleet shard status representation and interpretation. The pattern addresses common problems such as interpreting errors as either temporary or terminal failures.

## Motivation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a small drawing of fleet-manager fleet-shard operator operand could help set the context.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After starting on a drawing it made me wonder if that was making this seem too complicated. So I just updated with some language about custom resources in general - I really am trying to convey this is the kubernetes custom resource paradigm with an intermediary, so it has some additional considerations on top of the base recommendations.


A control plane (or fleet manager) communicates the desired state of resource(s) to a fleet shard. In return the fleet shard must convey the status of the data plane resources it is responsible for back to the control plane. This is typically done using mechanisms built around a Kubernetes https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/[custom resources] and its status sub-resource.

There should be consistency around the design and usage of that status information that conforms to Kubernetes guidelines and so that control planes, SREs, and developers can quickly and appropriately reason over the status.

## Status Guidelines

### Conditions

Kubernetes provides a significant amount of guidance for https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties[status conditions].

Highlights:
Condition types should be single word or short camel-cased names. They should be unique - it’s more of a map of conditions by type, rather than just an array.
The reason field should be a short word or short camel-cased phase.
The message should be human readable.
The status field should be a three-valued boolean.

More specific to fleet shards:
Since the fleet shard will typically send a full status object back to the control plane periodically or with every change, it is best to include only a minimally necessary set of conditions. However that does not mean you can simply omit conditions. If a condition is not present, it should be interpreted as having status Unknown.
It is fine to have a control plane reason over status conditions - in particular the type, reason, and status fields. However the parsing of a condition message should be avoided, and rather additional status fields should be used to provide specialized information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Show examples in the form of yaml with a context of the status if necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there specific examples you had in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things like showing non omited considtions even though not use for that specific effective status report.
And that might not be here but you say that the shard in most case cannot know whether a case is transient or terminal. which makes me wonder how the control plan makes that decision from thge status it receives fromt he shard. HEre some examples of how you achieved it , or how a status is interpreted for feedback tot he end user would be interesting for context on how the global machinery is orchestrated

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things like showing non omited considtions even though not use for that specific effective status report.

The guidance about minimizing is fleetshard specific. The interpretation of missing as unknown comes straight from the kubernetes recommendations. From existing non-fleetshard operators, we have counter examples like:

strimzi uses both a Ready and NotReady conditions. The NotReady condition is only populated when there's a problem. By their contract you have to know that a missing NotReady means to look instead for the Ready condition, not that NotReady is currently unknown.

But I'm honestly not sure how instructive that is.

which makes me wonder how the control plan makes that decision from thge status it receives fromt he shard

That's the gist of the recommendations around the error handling - the control plane nor a user can assume an immediate action is needed given seeing something is currently in error - regardless of whether we're installing.

The fleetshard operator interprets the strimzi status as follows:
https://github.com/bf2fc6cc711aee1a0c2a/kas-fleetshard/blob/main/operator/src/main/java/org/bf2/operator/operands/AbstractKafkaCluster.java#L169 That is then aggregated with the other operands. The fleet manager is further able to aggregate reasoning over the cluster itself and from things like route53.

We don't expect the fleet manager to make specific decisions based upon errors that appear in our status - we are instead calling out very specific service level cases - the wrong profile or version of something is in the cr, the data plane lacks capacity for the given instance - that the control plane may use to do things like select a different strimzi version or use a different cluster. Otherwise about the best that they can infer is whether we think we're still Installing.


### Other Status Fields

Beyond conditions, other status fields provide the ability to convey information via a custom schema.

For example the ManagedKafka status once Ready will include route and other information, which are not part of a condition:

[source,yaml]
----
status:
conditions:
- lastTransitionTime: '2022-08-02T08:24:04.818994Z'
message: ''
status: 'True'
type: Ready
routes:
- name: admin-server
prefix: admin-server
router: ingresscontroller.kas.mk-0419-204008.t6kh.p1.openshiftapps.com
- name: bootstrap
prefix: ''
router: ingresscontroller.kas.mk-0419-204008.t6kh.p1.openshiftapps.com
----

### Interpretation

The meaning of any condition or status field is up to the controller / resource and should be a documented part of the API contract. Given that the fleet manager is the primary consumer of the status, care should be taken to ensure the control plane's understanding of the status - and any changes must be made with its usage in mind.

For example the ManagedKafka status Ready condition refers to the operator’s view of the resource. A ManagedKafka is only ready when all of the dependent resources are observed to be in their desired / ready state. This will not always be the same notion of readiness from an end user’s observation of managed kafka service. In particular, the restart of a broker pods, a temporary networking issue, etc. may not be reflected in ManagedKafka status. It follows that a ManagedKafka status Ready condition alone is insufficient to show the user the state of their service. ManagedKafka canary and other kafka metrics provide a more exact, and up-to-date, representation of cluster functioning.

A concrete example with ManagedKafka is that when the IngressController HAProxy pods restart that is not reflected in the ManagedKafka Ready status, but may be interruptive of the user's experience of the service as all existing connections will need to reconnect.

### Error States

An operator in a fleetshard _usually_ lacks sufficient context to determine whether a valid resource is in a terminal state. The operator’s primary job is to maintain the desired state of its resource. If at a given time additional resources are needed, other necessary system parts aren’t installed, etc., there is little or nothing under the operator's direct control that can be done to address the situation. Conditions may be corrected at a later time as a result of some action outside the context of the operator - such as an auto-scaling node coming up.

Viewing the operator from a service perceptive means that some situations may be seen as terminal problems. For example with ManagedKafka there are pre-configured instance profiles, such as Developer and Standard, known to both the fleet shard and manager. If the ManagedKafka resource from the control plane specifies a unknown profile, then from a service perspective this may be considered a terminal error and may be given a different reason than other errors. From a more general operator / controller perspective, this situation is not terminal in the sense that the user may not have yet updated the ManagedKafkaAgent resource containing the relevant profile information - once updated the ManagedKafka reconciliation would continue as expected.

When status is conveying an error that is not known to be terminal, it follows that the error may resolve on its own with additional time. Control planes should assume such errors are ephemeral and not immediately react as if a hard failure has occurred. If there are further actions that the control plane may take, the condition reason or other status fields should make that easy to determine. This allows the data plane logic to remain straightforward and not have to fully understand every possible error condition.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a state diagram with some context could help setup the context of the conversation here.

  • transient vs terminal
  • data plane reconsiliation vs fleet manager expected state display
    I'm not sure it will work but could help as I feel I have to think at the problem to figure out what is recommended and for what use case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There really isn't a state diagram, only very special exceptions to what qualifies as a terminal error.

Maybe restating more succinctly might help:

  • many, if not most, errors will be ephemeral
  • it's impractical, or impossible, for the data plane to know a priori how to categorize whether an error it's seeing is terminal or not.
  • requiring a data plane fix every time a new ephemeral error is encountered is cumbersome, instead only errors that lead to SLO breach need high priority remediation.


For example with managed kafka and auto-scaling enabled it may take longer for a node to become available than Strimzi’s timeout allows. That error is propagated onto the ManagedKafka status. However there is no further action expected by the control plane - and with additional time the auto-scaled node will be available and the placement will succeed. For example from https://issues.redhat.com/browse/MGDSTRM-9288 the control plane observed:

[source,yaml]
----
status:
conditions:
- lastTransitionTime: '2022-08-02T08:24:04.818994Z'
message: 'Exceeded timeout of 420000ms while waiting for Pods resource ninth-zookeeper-0 in namespace kafka-cbfv5rnfnecdu9rb4gc0 to be ready'
status: 'False'
reason: 'Error'
type: Ready
----

Rather than interpreting this as a hard failure, it should be assumed that with additional time and reconciliations that it may resolve on its own.

How much time to give an error to resolve is up to the control plane and relevant SLOs. During installation this may include waiting until the SLO has been exceeded. For other operations it may mean waiting at least 1 additional resolution time window if the operator or dependent operators uses a time based mechanism.

Enforcement and actions related to SLO timeouts are best for the control plane to consider - for example the control plane may need to try a different cluster for the placement of the resource, which is an action beyond the scope of a single data plane.

### Current vs. Desired State

Especially in instances where spec changes are rolled into dependent resources the desired state will not be fully realized until some point in the future. It is appropriate for the status to provide additional information about this transition. As per the Kubernetes guidelines that could be represented via a condition, or similar to the standard Deployment resource additional status fields such as ready or available replicas, allow for the control plane or a user to infer the progress of the transition.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give a concrete example here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elaborated more based upon a kafka version upgrade.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Show concrete YAML example to anchor the reader

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the yaml that matches the description.


For example the upgrade of ManagedKafka kafka version requires a rolling update of brokers where each broker is taken off-line and replace with one that has an updated version. This is a relatively long-running process. The status ManagedKafka Ready condition during this time will have a reason of KafkaUpdating, rather than simply indicating Ready based upon the Strimzi Kafka resource being updated to the desired state.

[source,yaml]
----
status:
conditions:
- lastTransitionTime: '2022-08-02T08:24:04.818994Z'
message: 'Updating Kafka version'
status: 'True'
reason: 'KafkaUpdating'
type: Ready
----

So while the desired state has already been applied and we're still Ready, the Reason allows consumers of the status to discern that an upgrade is still in progress.

### generation and observedGeneration

In kubernetes, each object should have a `metadata.generation` field which is defined as a sequence number - set by the system and monotonically increased per resource on a spec change. Some resources include a field named `status.observedGeneration`, which is the generation most recently observed by the component responsible for acting upon changes to the desired state of the resource. This can be used, for instance, to ensure that the reported status reflects the most recent desired state. The observedGeneration field is also part of metav1.Conditions and represents the spec generation that the condition was set based upon.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am missing a link into how I would use that generation field, compare it to what? Maybe an example?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example.


Similar concepts with fields that mirror their kubernetes couterparts can be introduced to MAS resources to offer a standard way for clients and control planes to figure out when a specific desired state has been taken into account. Consider the following example of a resource local to the data plane:

[source,yaml]
----
metadata:
annotations:
fleetGeneration: 4
generation: 2
spec:
...
status:
conditions:
- lastTransitionTime: '2022-08-02T08:24:04.818994Z'
message: 'Exceeded timeout of 420000ms while waiting for Pods resource ninth-zookeeper-0 in namespace kafka-cbfv5rnfnecdu9rb4gc0 to be ready'
status: 'False'
reason: 'Error'
type: Ready
fleetObservedGeneration: 3
----

Here the data plane has picked up the 4th generation of the resource from the control plane. The status condition indicates that the fleetObservedGeneration is still at 3. Thus observers, including the control plane, would be able to determine the status condition does not yet reflect the latest desired state. Note that the fleet generation mechansim, shown here as 2, is a separate sequence from the local generation - as it's entirely possible for the fleet shard to miss control plane generations. If desired the local observedGeneration could be in the status as well.

### Frequency Of Status Changes

Rapid changes to the status not due to modifications of the custom resource should be avoided. Each status change is an update that the local Kuberentes instance must process and is generally also relayed to the control plane, which can lead to a substantial amount of overhead for a large number of resources. For controllers that implement fixed interval resolving this is generally not an issue. Event driven controllers though should be designed to minimize unnecessary updates - in particular updates which do not imply a status change should leave the existing status unmodified.

## Participants
* Control Plane -- development team for the KAS Fleet Manager API.
* Kafka Services -- team developing the kafka fleet shard operator.
* MAS Connectors -- team developing the connector fleet shard operator.