Alarm report for Kafka operator #340

Open
zyue110026 opened this issue Mar 5, 2024 · 0 comments

What happened?

I am using Acto to test the Strimzi Kafka operator. Below is my config.json:

{
  "deploy": {
    "steps": [
      {
        "apply": {
          "file": "data/strimzi-kafka-operator/bundle.yaml",
          "operator": true
        }
      }
    ]
  },
  "crd_name": "kafkas.kafka.strimzi.io",
  "seed_custom_resource": "data/strimzi-kafka-operator/cr.yaml",
  "analysis": {
    "github_link": "https://github.com/strimzi/strimzi-kafka-operator.git",
    "commit": "ef60183b123245490900dd103a0cf2e15a4f5d3e",
    "entrypoint": null,
    "type": "Kafka",
    "package": "github.com/kapi/src/main/java/io/strimzi/api/kafka"
  }
}

Alarm 1:
Acto reports an inconsistency for status.observedGeneration: it tried to change the property at path status.observedGeneration from 3 to 0 on the Kafka cluster, but the change was not reflected in the system state.

        "crash": null,
        "health": null,
        "operator_log": null,
        "consistency": {
            "message": "Found no matching fields for input",
            "input_diff": {
                "prev": 3,
                "curr": 0,
                "path": {
                    "path": [
                        "status",
                        "observedGeneration"
                    ]
                }
            },
            "system_state_diff": null
        },
        "differential": null,
        "custom": null

Alarm 2:
This alarm is caused by a misoperation vulnerability in the Kafka operator.

        "crash": {
            "message": "Pod test-cluster-kafka-0 crashed"
        },

This alarm shows that the Kafka cluster crashed. Acto added spec.kafka.authorization.type: custom and spec.kafka.authorization.tokenEndpointUri to the Kafka cluster's CR.

      "dictionary_item_added": {
            "root['spec']['kafka']['authorization'][type]": {
                  "prev": "NotPresent",
                  "curr": "custom",
                  "path": {
                        "path": [
                              "spec",
                              "kafka",
                              "authorization",
                              "type"
                        ]
                  }
            },
            "root['spec']['kafka']['authorization'][tokenEndpointUri]": {
                  "prev": "NotPresent",
                  "curr": "ACTOKEY",
                  "path": {
                        "path": [
                              "spec",
                              "kafka",
                              "authorization",
                              "tokenEndpointUri"
                        ]
                  }
            }
      }
}
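
To make the added fields easier to read, the diff above can be flattened into dotted paths. This is a minimal Python sketch, assuming the diff has the shape shown above (a dictionary_item_added map whose entries carry prev, curr, and path.path). added_fields is a hypothetical helper written for this report, not part of Acto.

```python
def added_fields(diff: dict) -> dict:
    """Map each added field's dotted path to its new value."""
    result = {}
    for entry in diff.get("dictionary_item_added", {}).values():
        # Each entry stores its location as a list of path segments.
        dotted = ".".join(str(segment) for segment in entry["path"]["path"])
        result[dotted] = entry["curr"]
    return result


# The diff from Alarm 2, copied from the report above.
diff = {
    "dictionary_item_added": {
        "root['spec']['kafka']['authorization'][type]": {
            "prev": "NotPresent",
            "curr": "custom",
            "path": {"path": ["spec", "kafka", "authorization", "type"]},
        },
        "root['spec']['kafka']['authorization'][tokenEndpointUri]": {
            "prev": "NotPresent",
            "curr": "ACTOKEY",
            "path": {
                "path": ["spec", "kafka", "authorization", "tokenEndpointUri"]
            },
        },
    }
}

for path, value in added_fields(diff).items():
    print(f"{path} = {value}")
```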

What did you expect to happen?

Alarm 1:
The operator creates or updates status.observedGeneration after each reconciliation of the Kafka cluster: https://github.com/strimzi/strimzi-kafka-operator/blob/ef60183b123245490900dd103a0cf2e15a4f5d3e/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaAssemblyOperator.java#L150
It is a system-managed field, so a manual update by the user does not trigger a reconciliation. No status field is passed into KafkaReconciler: https://github.com/strimzi/strimzi-kafka-operator/blob/ef60183b123245490900dd103a0cf2e15a4f5d3e/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaReconciler.java#L180C5-L192C24

Alarm 2:
The authorization type custom does not have a tokenEndpointUri property: https://github.com/strimzi/strimzi-kafka-operator/blob/ef60183b123245490900dd103a0cf2e15a4f5d3e/api/src/main/java/io/strimzi/api/kafka/model/KafkaAuthorizationCustom.java#L27
Only the keycloak type has a tokenEndpointUri property: https://github.com/strimzi/strimzi-kafka-operator/blob/ef60183b123245490900dd103a0cf2e15a4f5d3e/api/src/main/java/io/strimzi/api/kafka/model/KafkaAuthorizationKeycloak.java#L26
This means the configuration is invalid. The operator should reject this kind of erroneous desired state.
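
The kind of check expected here can be illustrated with a small Python sketch. It is not how Strimzi validates custom resources. It encodes only the one rule stated in this report (tokenEndpointUri belongs to the keycloak type), and both TYPE_SPECIFIC_FIELDS and invalid_authorization_fields are hypothetical names.

```python
# Assumption taken from this report: tokenEndpointUri is only valid when
# the authorization type is "keycloak". Other per-type rules are not modelled.
TYPE_SPECIFIC_FIELDS = {"tokenEndpointUri": "keycloak"}


def invalid_authorization_fields(authorization: dict) -> list:
    """Return fields that are set but do not belong to the declared type."""
    auth_type = authorization.get("type")
    return [
        field
        for field, required_type in TYPE_SPECIFIC_FIELDS.items()
        if field in authorization and auth_type != required_type
    ]


# The spec.kafka.authorization section that Acto generated in Alarm 2.
cr_authorization = {"type": "custom", "tokenEndpointUri": "ACTOKEY"}
print(invalid_authorization_fields(cr_authorization))
```

A non-empty result would be the signal to reject the desired state before any broker pod is rolled.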

Root Cause

Alarm 1:
This is a false alarm. The operator's behavior is correct: it did not update the system state because status.observedGeneration is a system-managed field, and changing it does not trigger a rolling update.
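
One way to triage alarms of this kind is to flag consistency alarms whose changed input lies under the status subtree. The Python sketch below assumes the alarm has the shape shown in the Alarm 1 excerpt. targets_status is a hypothetical helper, not an Acto API, and a match only suggests a false alarm rather than proving one.

```python
def targets_status(alarm: dict) -> bool:
    """True if the consistency alarm's input change is under `status`."""
    consistency = alarm.get("consistency") or {}
    input_diff = consistency.get("input_diff") or {}
    path = (input_diff.get("path") or {}).get("path") or []
    return bool(path) and path[0] == "status"


# Alarm 1 from this report, with JSON null written as Python None.
alarm = {
    "crash": None,
    "health": None,
    "operator_log": None,
    "consistency": {
        "message": "Found no matching fields for input",
        "input_diff": {
            "prev": 3,
            "curr": 0,
            "path": {"path": ["status", "observedGeneration"]},
        },
        "system_state_diff": None,
    },
    "differential": None,
    "custom": None,
}
print(targets_status(alarm))
```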

Alarm 2:
This is a true alarm. Acto applied an invalid configuration for spec.kafka.authorization that does not properly configure custom authorization. The operator accepted it, which made all Kafka broker pods unavailable and left the whole cluster non-functional. Finally, the cluster crashed because it was unable to recover from the error state.
