
Numaplane didn't appear to re-reconcile the new NumaflowControllerRollout after the previous one was broken #377

Open
juliev0 opened this issue Oct 30, 2024 · 6 comments
Labels: bug (Something isn't working)

juliev0 commented Oct 30, 2024

Describe the bug
I was using my test asset in DevPortal. I tried updating my NumaflowControllerRollout from "1.3.3" to "1.3.3-copy1", based on a "1.3.3-copy1" definition I'd added to the ConfigMap. My user namespace config set "pause-and-drain" as the preferred strategy.

The "1.3.3-copy1" version failed due to this error:

2024-10-30T17:56:52Z	ERROR	Reconciler error	{"controller": "numaflow-controller-rollout-controller", "object": {"name":"numaflow-controller","namespace":"oss-analytics-testasyncjulie-e2e"}, "namespace": "oss-analytics-testasyncjulie-e2e", "name": "numaflow-controller", "reconcileID": "6e4a8587-6926-489f-8d15-7f90a6eb0d35", "error": "couldn't find image named \"numaflow\" in Deployment &Deployment{ObjectMeta:{numaflow-controller

This was expected: after the apply was made, the next reconciliation tried to find the numaflow image in the running Deployment by looking for a container named "numaflow" in order to read the image tag. The container was actually named "numaflow-rc", so the lookup failed, the reconciliation errored, and the NumaflowControllerRollout went into a "Failed" state.
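For context, the failing step is essentially a lookup of the Deployment's container list by name to read the image tag. Below is a minimal sketch of that kind of lookup; the function name and tag parsing are illustrative, not the actual Numaplane code.

```go
package sketch

import (
	"fmt"
	"strings"

	appsv1 "k8s.io/api/apps/v1"
)

// getNumaflowImageTag is an illustrative sketch (not the actual Numaplane
// code): it looks up the container named "numaflow" in the given Deployment
// and returns its image tag. If the container has a different name (e.g.
// "numaflow-rc"), the lookup fails and the reconciliation errors out.
func getNumaflowImageTag(d *appsv1.Deployment) (string, error) {
	for _, c := range d.Spec.Template.Spec.Containers {
		if c.Name != "numaflow" {
			continue
		}
		// Image is expected to look like "quay.io/numaproj/numaflow:v1.3.3".
		if i := strings.LastIndex(c.Image, ":"); i >= 0 {
			return c.Image[i+1:], nil
		}
		return "", fmt.Errorf("image %q has no tag", c.Image)
	}
	return "", fmt.Errorf("couldn't find image named %q in Deployment %s", "numaflow", d.Name)
}
```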

The issue was:

  • When I tried to update the NumaflowControllerRollout back to "1.3.3", it remained in a "Failed" state. I should be able to fix it back to the working version and not have it remain in a "Failed" state.

numaplane.log
numaflowcontrollerrollout.yaml.txt

Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

@juliev0 juliev0 added the bug Something isn't working label Oct 30, 2024
@juliev0 juliev0 self-assigned this Oct 30, 2024

juliev0 commented Nov 4, 2024

The issue is that we are comparing the current image version tag with the one specified in the NumaflowControllerRollout, which are the same after the failure. Therefore, we don't notice a difference.

We need to instead be comparing to what we've already attempted to deploy, whether it failed or succeeded. It is the same root cause as for this issue.

This differs from ISBServiceRollout, for which we simply Apply the Spec and Metadata; any failure would be a failure of the isbsvc itself rather than a failure to create the isbsvc in the first place. Therefore, for ISBServiceRollout we can compare the Rollout-defined spec/metadata to the isbsvc's spec/metadata to determine whether there's a difference.

One way to handle this is to introduce a new CRD called NumaflowController: a very simple resource that represents a single Numaflow Controller version, so it would behave similarly to isbsvc. Then we could just compare the NumaflowControllerRollout-specified version to the NumaflowController-specified version to determine whether deployment has occurred.

The other benefit is that it would translate better into Progressive Delivery: one NumaflowControllerRollout could have 2 NumaflowController children, one labeled "promoted" and the other labeled "upgrading".
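To make the idea concrete, here is a hypothetical sketch of such a child CRD; the type, field, and label names are illustrative, not a settled design.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NumaflowController is a hypothetical child resource owned by a
// NumaflowControllerRollout: it pins exactly one controller version, so the
// Rollout can compare its desired version against the child's spec (much like
// ISBServiceRollout compares against the isbsvc it owns).
type NumaflowController struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NumaflowControllerSpec `json:"spec"`
}

type NumaflowControllerSpec struct {
	// Version of the Numaflow controller to deploy, e.g. "1.3.3".
	Version string `json:"version"`
	// InstanceID distinguishes multiple controller instances, if used.
	InstanceID string `json:"instanceID,omitempty"`
}

// For Progressive Delivery, one Rollout could own two of these children,
// distinguished by a label (name is illustrative), e.g.:
//   upgrade-state: promoted   vs.   upgrade-state: upgrading
```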


juliev0 commented Nov 7, 2024

An alternative to having a NumaflowController CRD is to simply store the "previously attempted version" in the Status of the NumaflowControllerRollout CRD: if we've already attempted it, don't do it again.
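A rough sketch of that alternative (field names are hypothetical): record whatever definition we last tried to apply in the Rollout's Status, independent of whether the deploy succeeded.

```go
package v1alpha1

// NumaflowControllerDeployment is a hypothetical record of one controller
// definition that Numaplane attempted to deploy.
type NumaflowControllerDeployment struct {
	Version    string `json:"version"`
	InstanceID string `json:"instanceID,omitempty"`
}

// Hypothetical addition to the Rollout's Status: the last definition we
// attempted, updated at apply time so that even a failed deploy is recorded.
type NumaflowControllerRolloutStatus struct {
	PreviousAttempted *NumaflowControllerDeployment `json:"previousAttempted,omitempty"`
	// ... the existing Status fields (phase, conditions, etc.) would remain.
}
```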

@juliev0 juliev0 assigned afugazzotto and unassigned juliev0 Nov 7, 2024

juliev0 commented Nov 7, 2024

Hey @afugazzotto - just assigned you this issue and this one. They have the same root cause. I am thinking maybe we could handle them this way, since it's simpler than the NumaflowController CRD approach I mentioned above?

@afugazzotto

I'll investigate the issue and try to find some options. One thing to consider may be to include allowed NumaflowController versions in the CRD using https://book.kubebuilder.io/reference/markers/crd-validation enums (this needs more investigation).
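For reference, kubebuilder's CRD-validation markers can restrict a string field to an enumerated set at admission time. The field name and version values below are illustrative only, and (as noted) the exact marker usage would need verification.

```go
package v1alpha1

// Illustrative only: limiting the Rollout's version field to a known set via a
// kubebuilder CRD-validation marker, so an unsupported version is rejected by
// the API server instead of failing later in reconciliation. The allowed
// values are baked into the generated CRD, so they would have to be
// regenerated whenever a new controller version is supported; exact quoting
// of the values may also need checking.
type NumaflowControllerRolloutSpecSketch struct {
	// +kubebuilder:validation:Enum=1.3.3;1.4.0
	Version string `json:"version"`
}
```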


juliev0 commented Nov 7, 2024

> I'll investigate the issue and try to find some options. One thing to consider may be to include allowed NumaflowController versions in the CRD using https://book.kubebuilder.io/reference/markers/crd-validation enums (this needs more investigation).

Sure. One advantage of this solution is that if the CRD isn't allowed to exist in the first place, then we don't have to worry about the case of a NumaflowControllerRollout that's in a "Failed" state for this reason, and then all of a sudden the CRD they wanted is there after all.

The only other thing I can think of is whether there are other ways to get into error scenarios. Are there any other user errors, or are there only Platform Configuration errors? I'm okay if, at least for now, we assume the platform is configured correctly.


juliev0 commented Nov 13, 2024

Let me try to consolidate the description of this issue and the other related issue all in one place.

So, there are 2 different scenarios:

  1. The NumaflowControllerRollout version is found and deployed successfully, but the Container having an unexpected name (not "numaflow") causes all subsequent reconciliations to fail, meaning Pipelines remain paused indefinitely and the NumaflowControllerRollout can't even be fixed back to an older version.
  2. The NumaflowControllerRollout version isn't found, so it isn't deployed successfully. It's marked "Failed", but with the current code we continually see a difference between the desired version and the running version, so we continually try to pause.

I like your idea to update the CRD if possible @afugazzotto, but the problems I mention above are still bad in the case where a platform misconfiguration causes the error.

For this reason, I'm thinking we need to preserve in our Status the "previously attempted" Controller definition (i.e. InstanceID and Version) that we deployed before, remove the existing comparison that looks at the tag of the currently running Deployment Container, and replace it with this (a rough sketch follows the list below):

  1. If we already attempted to deploy the Version/Instance we want, don't try to deploy it again.
  2. If not, and we're in PPND, we can pause the pipelines / keep them paused.
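Putting that together, the decision might look roughly like this; names and signature are hypothetical, a sketch of the idea rather than the actual implementation.

```go
package controller

// controllerDefinition is a hypothetical pair identifying what we deploy.
type controllerDefinition struct {
	Version    string
	InstanceID string
}

// reconcileDecision sketches the proposed check: compare the Rollout's desired
// definition against the one we last *attempted* (stored in Status), not
// against the tag of the running Deployment's container.
func reconcileDecision(desired, previousAttempted controllerDefinition, strategy string) (deploy, pausePipelines bool) {
	if desired == previousAttempted {
		// Already attempted (successfully or not): nothing new to deploy, and
		// no reason to pause pipelines (or keep them paused) for it.
		return false, false
	}
	// A new definition: in pause-and-drain (PPND) mode, pause pipelines first,
	// then deploy.
	return true, strategy == "pause-and-drain"
}
```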
