Add ability to detect and cleanup failed deployments #20444
Conversation
Force-pushed from ce8d657 to 163c7dd
Force-pushed from f023847 to b421fba
end

start_pod_monitor
end
copied from similar behavior in the event catcher
As discussed with you and @agrare, maybe not for this PR, but I suggest we move the general pattern of having a WatchThread that can be auto-restarted into a generic form in the core manageiq repo, in the lib dir, for eventual extraction into kubeclient. That way, issues that arise (like the 410 Gone issue) can be fixed in one place.
I recommend it live in core, and then the kubernetes / openshift providers use it directly. We may need to tweak the interface to be less ems-oriented and more generic.
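As a rough illustration of the pattern being proposed, a generic auto-restarting watch thread might look something like this; the class name, interface, and restart policy are assumptions for the sketch, not the existing provider code:

# Hypothetical generic watch thread that could live in core and be shared by
# the kubernetes/openshift providers (and eventually extracted into kubeclient).
class WatchThread
  # start_watch: a callable that opens a kubeclient watch, e.g.
  #   ->(version) { kube_connection.watch_pods(:resource_version => version) }
  def initialize(start_watch, &on_event)
    @start_watch = start_watch
    @on_event    = on_event
  end

  def start!
    @thread = Thread.new do
      resource_version = nil
      loop do
        begin
          @start_watch.call(resource_version).each do |event|
            resource_version = event.object&.metadata&.resourceVersion || resource_version
            @on_event.call(event)
          end
        rescue StandardError
          # Drop the (possibly stale) resourceVersion and reopen the watch.
          resource_version = nil
        end
      end
    end
  end

  def stop!
    @thread&.kill
  end
end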
when "status" | ||
# other times, we can get 'status' type/kind, with a code of 410 | ||
# https://github.com/ManageIQ/manageiq-providers-kubernetes/blob/745ba1332fa43cfb0795644279f3f55b8751f1c8/app/models/manageiq/providers/kubernetes/container_manager/refresh_worker/watch_thread.rb#L48 | ||
break if event.code == 410 |
@agrare I couldn't track down why we're doing one thing in the kubernetes provider here and something different in the fluent-plugin referenced above. Maybe different api versions return a top-level status object that can be used here, while other api versions return an error watch event with the status object inside? I'll need to track this down either way.
I don't know that k8s defines the event type when a 410 Gone is returned; here are the docs on watches: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes. So this might be an implementation detail (whether it returns "status" or "error" as the event type).
I'll need to do some research to see how we should handle it; it would be unfortunate if we had to handle both cases.
Yeah, at worst, if we have to support both formats for 410, we can combine them like:

when "error", "status"
  break if event.code == 410                          # outer watch event has a 410 error code
  break if event.object && event.object.code == 410   # or the outer watch event completed but the inner event object has the 410 error code

It's less than ideal, but whether the outer watch event contains the code or the inner object contains it, we'll catch it.
I suspect that, depending on what you are asking for, you can get "success" but with a failure object, OR the whole request can fail. It's possible there are bugs on the kubernetes side in terms of consistency across the various APIs, while there are possibly legitimate reasons to have a successful failure vs. a failing failure.
Force-pushed from 6effc55 to e0d8b4e
@Fryguy @agrare @brandon I think this is ready for review. In terms of a surgical change, I think this is the basics and I can't really remove any of the functionality. There are still things to do, but it feels like we can have separate discussions on those. Perhaps I can make this logic optional to start, so we can get this in and discuss the future items.

Note, I've tested this on pods by injecting the code into a rails console in the orchestrator, so I'll need to retest or show others how to test once we have an image they can run this with.

Some of the items remaining that we might want to include here or just do later (YAGNI):

Later:
lib/container_orchestrator.rb
private

def pod_options
  @pod_options ||= {:namespace => my_namespace, :label_selector => "app=#{app_name}"}
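For context, a hedged sketch of how these options might feed the kubeclient calls; kube_connection and the wrapper method names are assumptions, not necessarily the PR's exact code:

# Hypothetical wrappers around kubeclient using pod_options for both the
# one-time list and the watch.
def get_pods
  kube_connection.get_pods(pod_options)
end

def watch_pods(resource_version = nil)
  kube_connection.watch_pods(pod_options.merge(:resource_version => resource_version))
end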
Pairing with @Fryguy, we came up with a label we can set for all deployments, {:"#{app_name}-orchestrated-by" => ENV['POD_NAME']}, in the object definition. We can then use a selector here, app=manageiq,manageiq-orchestrated-by=orchestrator-9f99d8cb9-7mprg, so we'll only get pods that are managed by the orchestrator. In the orchestrator code, we can then look for all pods that we're managing...
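A rough sketch of how that could fit together (the exact label key, helper names, and where the label gets applied are assumptions based on the comment above):

# Hypothetical: tag every object definition with the orchestrator pod that
# created it, then select only those pods when listing/watching.
def default_labels
  {:app => app_name, :"#{app_name}-orchestrated-by" => ENV["POD_NAME"]}
end

# Produces e.g. app=manageiq,manageiq-orchestrated-by=orchestrator-9f99d8cb9-7mprg
def pod_options
  @pod_options ||= {
    :namespace      => my_namespace,
    :label_selector => "app=#{app_name},#{app_name}-orchestrated-by=#{ENV["POD_NAME"]}"
  }
end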
Force-pushed from e0d8b4e to beaf75e
…liminate the accessor
Update log message on error since we're not resetting the resource version anymore.
Force-pushed from eec6c8b to da84d0a
def collect_initial_pods
  pods = orchestrator.get_pods
  pods.each { |p| save_pod(p) }
  pods.resourceVersion
Thanks @agrare for the suggestion... returning the resourceVersion here and passing it to the watch is simpler.
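Roughly, the flow being described (a sketch only; monitor_pods and process_pod_event are assumed names for illustration):

# Hypothetical monitor loop: list the pods once, then start the watch from the
# returned resourceVersion so already-processed events aren't replayed.
def monitor_pods
  resource_version = collect_initial_pods
  loop do
    orchestrator.watch_pods(resource_version).each do |event|
      resource_version = event.object.metadata.resourceVersion
      process_pod_event(event)
    end
  end
end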
Great work Joe!
@jrafanie Can you fix the rubocops? Some of them are legit.
Yeah, I'm doing one more full build to make sure the whole end-to-end process works, and then I'll clean those up.
* stale comments
* style
Force-pushed from cdd36ef to 9ba2fc4
Checked commits jrafanie/manageiq@db496aa~...9ba2fc4 with ruby 2.6.3, rubocop 0.69.0, haml-lint 0.28.0, and yamllint

app/models/miq_server/worker_management/monitor/kubernetes.rb
lib/container_orchestrator.rb
Ok, final tests were successful. I added screenshots in the description to hopefully better document what this PR does. The final style issues are meh:
Yeah, I'm 👍 with that.
Sounds good @simaishi. We're both comfortable with bringing back both PRs to jansa.
Add ability to detect and cleanup failed deployments (cherry picked from commit f3c20e8)
Jansa backport details:
What this PR does:
The easiest way to recreate this:
Run oc get pods to show the new amazon-cloud-event-catcher and amazon-cloud-refresh worker pods creating/starting:

TODO:
More nuanced detection of failed deployments (5+ restarts and terminated status), as we might otherwise flag pods that often hit memory/cpu limits over days/months. This is basic and doesn't conflict with liveness check failures, since those pods will be restarted and won't remain in a terminated lastState. Any pod that has 5 or more container restarts and remains in a terminated state will get removed as a deployment (see the sketch below).

Fixes: Worker deployments exist after worker records are removed #20147
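As a rough illustration, the detection rule described in that TODO could look something like this; the threshold constant, deployment_failed?, and the helper structure are assumptions for the sketch, not necessarily the PR's exact code:

# Hypothetical check matching the rule above: a pod counts as a failed
# deployment when any container has restarted 5+ times and its lastState
# is still "terminated".
FAILED_RESTART_THRESHOLD = 5

def deployment_failed?(pod)
  pod.status.containerStatuses.to_a.any? do |container_status|
    container_status.restartCount.to_i >= FAILED_RESTART_THRESHOLD &&
      container_status.lastState&.terminated
  end
end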
Here's an example of the events indicating two failed worker pods and their subsequent automatic removal:
Side effect bonus:
Each pod's labels show which orchestrator manages it:

By filtering by the orchestrator, such as manageiq-orchestrated-by=orchestrator-5f89795bcc-89ztg, we can see all of the deployments managed by that orchestrator (or by any orchestrator pod if you filter by manageiq-orchestrated-by alone), and therefore which ones we're monitoring and which will get killed if they continually fail:

(screenshot)