
Numaplane didn't appear to re-reconcile the new NumaflowControllerRollout after the previous one was broken #377

Open
juliev0 opened this issue Oct 30, 2024 · 6 comments
Labels: bug (Something isn't working)

juliev0 commented Oct 30, 2024

Describe the bug
I was using my test asset in DevPortal. I tried updating my NumaflowControllerRollout from "1.3.3" to "1.3.3-copy1", based on a "1.3.3-copy1" definition I'd added to the ConfigMap. My user namespace config set "pause-and-drain" as the preferred strategy.

The "1.3.3-copy1" version failed due to this error:

2024-10-30T17:56:52Z	ERROR	Reconciler error	{"controller": "numaflow-controller-rollout-controller", "object": {"name":"numaflow-controller","namespace":"oss-analytics-testasyncjulie-e2e"}, "namespace": "oss-analytics-testasyncjulie-e2e", "name": "numaflow-controller", "reconcileID": "6e4a8587-6926-489f-8d15-7f90a6eb0d35", "error": "couldn't find image named \"numaflow\" in Deployment &Deployment{ObjectMeta:{numaflow-controller

This was expected: after the apply was made, the next reconciliation tried to find the numaflow image in the running Deployment by looking for a container named "numaflow" in order to read the image tag. The container was actually named "numaflow-rc", so the lookup failed, the reconciliation errored, and the NumaflowControllerRollout went into a "Failed" state.
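For context, the failing step is essentially a lookup of the Deployment's container list by name to read the image tag. Below is a minimal sketch of that kind of lookup; the function name and tag parsing are illustrative, not the actual Numaplane code.

```go
package sketch

import (
	"fmt"
	"strings"

	appsv1 "k8s.io/api/apps/v1"
)

// getNumaflowImageTag is an illustrative sketch (not the actual Numaplane
// code): it looks up the container named "numaflow" in the given Deployment
// and returns its image tag. If the container has a different name (e.g.
// "numaflow-rc"), the lookup fails and the reconciliation errors out.
func getNumaflowImageTag(d *appsv1.Deployment) (string, error) {
	for _, c := range d.Spec.Template.Spec.Containers {
		if c.Name != "numaflow" {
			continue
		}
		// Image is expected to look like "quay.io/numaproj/numaflow:v1.3.3".
		if i := strings.LastIndex(c.Image, ":"); i >= 0 {
			return c.Image[i+1:], nil
		}
		return "", fmt.Errorf("image %q has no tag", c.Image)
	}
	return "", fmt.Errorf("couldn't find image named %q in Deployment %s", "numaflow", d.Name)
}
```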

The issue was:

  • When I tried to update the NumaflowControllerRollout back to "1.3.3", it remained in a "Failed" state. I should be able to fix it back to the working version and not have it remain in a "Failed" state.

numaplane.log
numaflowcontrollerrollout.yaml.txt

Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

@juliev0 juliev0 added the bug Something isn't working label Oct 30, 2024
@juliev0 juliev0 self-assigned this Oct 30, 2024

juliev0 commented Nov 4, 2024

The issue is that we are comparing the current image version tag with the one specified in the NumaflowControllerRollout, which are the same after the failure. Therefore, we don't notice a difference.

We need to instead be comparing to what we've already attempted to deploy, whether it failed or succeeded. It is the same root cause as for this issue.

This differs from ISBServiceRollout, for which we simply Apply the Spec and Metadata; any failure would be a failure of the isbsvc itself rather than a failure to create the isbsvc in the first place. Therefore, for ISBServiceRollout we can compare the Rollout-defined spec/metadata to the isbsvc's spec/metadata to determine whether there's a difference.

One way to handle this is to introduce a new CRD called NumaflowController: a very simple resource that represents a single Numaflow Controller version, so it would behave similarly to isbsvc. Then we could just compare the NumaflowControllerRollout-specified version to the NumaflowController-specified version to determine whether deployment has occurred.

The other benefit is that it would translate better into Progressive Delivery: one NumaflowControllerRollout could have 2 NumaflowController children, one labeled "promoted" and the other labeled "upgrading".
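To make the idea concrete, here is a hypothetical sketch of such a child CRD; the type, field, and label names are illustrative, not a settled design.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NumaflowController is a hypothetical child resource owned by a
// NumaflowControllerRollout: it pins exactly one controller version, so the
// Rollout can compare its desired version against the child's spec (much like
// ISBServiceRollout compares against the isbsvc it owns).
type NumaflowController struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NumaflowControllerSpec `json:"spec"`
}

type NumaflowControllerSpec struct {
	// Version of the Numaflow controller to deploy, e.g. "1.3.3".
	Version string `json:"version"`
	// InstanceID distinguishes multiple controller instances, if used.
	InstanceID string `json:"instanceID,omitempty"`
}

// For Progressive Delivery, one Rollout could own two of these children,
// distinguished by a label (name is illustrative), e.g.:
//   upgrade-state: promoted   vs.   upgrade-state: upgrading
```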


juliev0 commented Nov 7, 2024

An alternative to having a NumaflowController CRD is to simply store the "previously attempted version" in the Status of the NumaflowControllerRollout CRD: if we've already attempted it, don't do it again.
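A rough sketch of that alternative (field names are hypothetical): record whatever definition we last tried to apply in the Rollout's Status, independent of whether the deploy succeeded.

```go
package v1alpha1

// NumaflowControllerDeployment is a hypothetical record of one controller
// definition that Numaplane attempted to deploy.
type NumaflowControllerDeployment struct {
	Version    string `json:"version"`
	InstanceID string `json:"instanceID,omitempty"`
}

// Hypothetical addition to the Rollout's Status: the last definition we
// attempted, updated at apply time so that even a failed deploy is recorded.
type NumaflowControllerRolloutStatus struct {
	PreviousAttempted *NumaflowControllerDeployment `json:"previousAttempted,omitempty"`
	// ... the existing Status fields (phase, conditions, etc.) would remain.
}
```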

@juliev0 juliev0 assigned afugazzotto and unassigned juliev0 Nov 7, 2024

juliev0 commented Nov 7, 2024

Hey @afugazzotto - just assigned you this issue and this one. They have the same root cause. I am thinking maybe we could handle them this way, since it's simpler than the NumaflowController CRD approach I mentioned above?

@afugazzotto

I'll investigate the issue and try to find some options. One thing to consider may be to include allowed NumaflowController versions in the CRD using https://book.kubebuilder.io/reference/markers/crd-validation enums (this needs more investigation).
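For reference, kubebuilder's CRD-validation markers can restrict a string field to an enumerated set at admission time. The field name and version values below are illustrative only, and (as noted) the exact marker usage would need verification.

```go
package v1alpha1

// Illustrative only: limiting the Rollout's version field to a known set via a
// kubebuilder CRD-validation marker, so an unsupported version is rejected by
// the API server instead of failing later in reconciliation. The allowed
// values are baked into the generated CRD, so they would have to be
// regenerated whenever a new controller version is supported; exact quoting
// of the values may also need checking.
type NumaflowControllerRolloutSpecSketch struct {
	// +kubebuilder:validation:Enum=1.3.3;1.4.0
	Version string `json:"version"`
}
```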


juliev0 commented Nov 7, 2024

> I'll investigate the issue and try to find some options. One thing to consider may be to include allowed NumaflowController versions in the CRD using https://book.kubebuilder.io/reference/markers/crd-validation enums (this needs more investigation).

Sure. One advantage of this solution is that if the CRD isn't allowed to exist in the first place, then we don't have to worry about the case of a NumaflowControllerRollout that's in a "Failed" state for this reason, and then all of a sudden the CRD they wanted is there after all.

The only other thing I can think of is whether there are other ways to get into error scenarios. Are there any other user errors, or are there only Platform Configuration errors? I'm okay if, at least for now, we assume the platform is configured correctly.


juliev0 commented Nov 13, 2024

Let me try to consolidate the description of this issue and the other related issue all in one place.

So, there are 2 different scenarios:

  1. The NumaflowControllerRollout version is found and deployed successfully, but the Container having an unexpected name (not "numaflow") causes all subsequent reconciliations to fail, meaning Pipelines remain paused indefinitely and the NumaflowControllerRollout can't even be fixed back to an older version.
  2. The NumaflowControllerRollout version isn't found, so it isn't deployed successfully. It's marked "Failed", but with the current code we continually see a difference between the desired version and the running version, so we continually try to pause.

I like your idea to update the CRD if possible @afugazzotto, but the problems I mention above are still bad in the case where a platform misconfiguration causes the error.

For this reason, I'm thinking we need to preserve in our Status the "previously attempted" Controller definition (i.e. InstanceID and Version) that we deployed before, remove the existing comparison that looks at the tag of the currently running Deployment Container, and replace it with this (a rough sketch follows the list below):

  1. If we already attempted to deploy the Version/Instance we want, don't try to deploy it again.
  2. If not, and we're in PPND, we can pause the pipelines / keep them paused.
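Putting that together, the decision might look roughly like this; names and signature are hypothetical, a sketch of the idea rather than the actual implementation.

```go
package controller

// controllerDefinition is a hypothetical pair identifying what we deploy.
type controllerDefinition struct {
	Version    string
	InstanceID string
}

// reconcileDecision sketches the proposed check: compare the Rollout's desired
// definition against the one we last *attempted* (stored in Status), not
// against the tag of the running Deployment's container.
func reconcileDecision(desired, previousAttempted controllerDefinition, strategy string) (deploy, pausePipelines bool) {
	if desired == previousAttempted {
		// Already attempted (successfully or not): nothing new to deploy, and
		// no reason to pause pipelines (or keep them paused) for it.
		return false, false
	}
	// A new definition: in pause-and-drain (PPND) mode, pause pipelines first,
	// then deploy.
	return true, strategy == "pause-and-drain"
}
```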
