Numaplane didn't appear to re-reconcile the new NumaflowControllerRollout after the previous one was broken #377
Comments
The issue is that we are comparing the current image version tag with the one specified in the We need to instead be comparing to what we've already attempted to deploy, whether it failed or succeeded. It is the same root cause as for this issue. This differs from One way to handle this is to have a new CRD called The other benefit is that it would translate better into Progressive Delivery: there is one |
An alternative to having a |
Hey @afugazzotto - just assigned you this issue and this one. They have the same root cause. I am thinking maybe we could handle them this way, since it's simpler than the |
I'll investigate the issue and try to find some options. One option to consider is restricting the allowed NumaflowController versions in the CRD itself, using kubebuilder's enum validation markers (https://book.kubebuilder.io/reference/markers/crd-validation). This needs more investigation.
Sure. I see an advantage of this solution is that if the CRD isn't allowed to exist in the first place, then we don't have to worry about the case of a Only other thing I can think of is if there are other ways to get into error scenarios. Are there any other user errors? Are there only Platform Configuration errors? I am okay if at least for now we make the assumption that the platform is configured correctly. |
Let me try to consolidate the description of this issue and the other related issue all in one place. So, there are 2 different scenarios:
I like your idea to update the CRD if possible @afugazzotto, but the problems I mention above are still bad in the case that there is a platform misconfiguration that causes the error. For this reason, I'm thinking that we need to preserve in our
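To make the comparison discussed in the comments above concrete, here is a minimal Go sketch. It assumes the rollout records the last version it attempted to deploy, whether that attempt failed or succeeded. The `rolloutStatus` type, its `lastAttemptedVersion` field, and `needsDeploy` are hypothetical names for illustration only, not part of the Numaplane API.

```go
package main

import "fmt"

// rolloutStatus is a hypothetical status record. The field name is an
// assumption and is not taken from the Numaplane codebase.
type rolloutStatus struct {
	// lastAttemptedVersion is the version most recently handed to the
	// deployer, regardless of whether that attempt succeeded or failed.
	lastAttemptedVersion string
}

// needsDeploy reports whether the desired version differs from the last
// version that was attempted, rather than from the image tag that happens
// to be running in the cluster.
func needsDeploy(desired string, st rolloutStatus) bool {
	return desired != st.lastAttemptedVersion
}

func main() {
	// The attempt to deploy "1.3.3-copy1" failed, but it is still recorded.
	st := rolloutStatus{lastAttemptedVersion: "1.3.3-copy1"}

	// Same spec as the failed attempt: nothing new to try.
	fmt.Println(needsDeploy("1.3.3-copy1", st))

	// The spec is changed back to "1.3.3": this differs from the last
	// attempt, so a new deployment is warranted.
	fmt.Println(needsDeploy("1.3.3", st))
}
```

With this shape, a spec change following a failed attempt always differs from the recorded value, so the decision does not depend on being able to read a tag from the running Deployment.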
Describe the bug
I was using my test asset in DevPortal. I tried updating my NumaflowControllerRollout from "1.3.3" to "1.3.3-copy1" based on a definition for "1.3.3-copy1" ConfigMap that I'd added. My user namespace config set "pause-and-drain" as the preferred strategy.
The "1.3.3-copy1" version failed due to this error:
This was expected: after the apply, the next reconciliation tried to find the numaflow image in the running Deployment by looking up the container named "numaflow" in order to read the tag on the image. The container's name was "numaflow-rc", so the lookup failed and errored, leaving the NumaflowControllerRollout in a "Failed" state.
The issue was:
numaplane.log
numaflowcontrollerrollout.yaml.txt
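For readers without access to the attachments, the failed lookup described in this report can be sketched roughly as follows. This is an illustration under assumptions, not the actual Numaplane code: the `container` type stands in for a Kubernetes container spec, `imageTag` is a hypothetical helper, and the image reference is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// container is a simplified stand-in for a Kubernetes container spec.
type container struct {
	Name  string
	Image string
}

// imageTag returns the tag of the image used by the container with the
// given name. It returns an error if no container has that name or if the
// image reference carries no tag. Digest references (image@sha256:...) are
// not handled in this sketch.
func imageTag(containers []container, name string) (string, error) {
	for _, c := range containers {
		if c.Name != name {
			continue
		}
		i := strings.LastIndex(c.Image, ":")
		// A colon before the last slash belongs to a registry port,
		// not to a tag.
		if i < 0 || i < strings.LastIndex(c.Image, "/") {
			return "", fmt.Errorf("image %q has no tag", c.Image)
		}
		return c.Image[i+1:], nil
	}
	return "", fmt.Errorf("container %q not found", name)
}

func main() {
	// Hypothetical Deployment contents: the container is named
	// "numaflow-rc", as in the report, and the image name is invented.
	containers := []container{
		{Name: "numaflow-rc", Image: "example.com/numaflow:1.3.3-copy1"},
	}

	// Looking up "numaflow" fails because no container has that name.
	_, err := imageTag(containers, "numaflow")
	fmt.Println(err)

	// Looking up the actual container name succeeds.
	tag, _ := imageTag(containers, "numaflow-rc")
	fmt.Println(tag)
}
```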
Message from the maintainers:
Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.