Add ability to detect and cleanup failed deployments #20444
Conversation
Force-pushed from ce8d657 to 163c7dd
Force-pushed from f023847 to b421fba
end

start_pod_monitor
end
copied from similar behavior in the event catcher
As discussed with you and @agrare, maybe not for this PR, but I suggest we move the general pattern of having a WatchThread that can be auto-restarted into a generic form in the core manageiq repo, in the lib dir, for eventual extraction into kubeclient. That way, issues that arise (like the 410 Gone issue) can be fixed in one place.
I recommend it live in core, and then the kubernetes / openshift providers use it directly. We may need to tweak the interface to be less ems-oriented and more generic.
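As a rough illustration of the pattern being proposed, a generic auto-restarting watch thread might look something like this; the class name, interface, and restart policy are assumptions for the sketch, not the existing provider code:

# Hypothetical generic watch thread that could live in core and be shared by
# the kubernetes/openshift providers (and eventually extracted into kubeclient).
class WatchThread
  # start_watch: a callable that opens a kubeclient watch, e.g.
  #   ->(version) { kube_connection.watch_pods(:resource_version => version) }
  def initialize(start_watch, &on_event)
    @start_watch = start_watch
    @on_event    = on_event
  end

  def start!
    @thread = Thread.new do
      resource_version = nil
      loop do
        begin
          @start_watch.call(resource_version).each do |event|
            resource_version = event.object&.metadata&.resourceVersion || resource_version
            @on_event.call(event)
          end
        rescue StandardError
          # Drop the (possibly stale) resourceVersion and reopen the watch.
          resource_version = nil
        end
      end
    end
  end

  def stop!
    @thread&.kill
  end
end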
when "status" | ||
# other times, we can get 'status' type/kind, with a code of 410 | ||
# https://github.com/ManageIQ/manageiq-providers-kubernetes/blob/745ba1332fa43cfb0795644279f3f55b8751f1c8/app/models/manageiq/providers/kubernetes/container_manager/refresh_worker/watch_thread.rb#L48 | ||
break if event.code == 410 |
@agrare I couldn't track down why we're doing one thing in the kubernetes provider here and something different in the fluent-plugin referenced above. Maybe different api versions return a top-level status object that can be used here, while other api versions return an error watch event with the status object inside? I'll need to track this down either way.
I don't know that k8s defines the event type when a 410 Gone is returned; here are the docs on watches: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes. So this might be an implementation detail (whether it returns "status" or "error" as the event type).
I'll need to do some research to see how we should handle it; it would be unfortunate if we had to handle both cases.
Yeah, at worst, if we have to support both formats for 410, we can combine them like:

when "error", "status"
  break if event.code == 410                          # outer watch event has a 410 error code
  break if event.object && event.object.code == 410   # or the outer watch event completed but the inner event object has the 410 error code

It's less than ideal, but whether the outer watch event contains the code or the inner object contains it, we'll catch it.
I suspect that, depending on what you are asking for, you can get "success" but with a failure object, OR the whole request can fail. It's possible there are bugs on the kubernetes side in terms of consistency across the various APIs, while there are possibly legitimate reasons to have a successful failure vs. a failing failure.
Force-pushed from 6effc55 to e0d8b4e
@Fryguy @agrare @brandon I think this is ready for review. In terms of a surgical change, I think this is the basics and I can't really remove any of the functionality. There are still things to do, but it feels like we can have separate discussions on those. Perhaps I can make this logic optional to start, so we can get this in and discuss the future items.

Note, I've tested this on pods by injecting the code into a rails console in the orchestrator, so I'll need to retest or show others how to test once we have an image they can run this with.

Some of the items remaining that we might want to include here or just do later (YAGNI):

Later:
lib/container_orchestrator.rb
private

def pod_options
  @pod_options ||= {:namespace => my_namespace, :label_selector => "app=#{app_name}"}
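For context, a hedged sketch of how these options might feed the kubeclient calls; kube_connection and the wrapper method names are assumptions, not necessarily the PR's exact code:

# Hypothetical wrappers around kubeclient using pod_options for both the
# one-time list and the watch.
def get_pods
  kube_connection.get_pods(pod_options)
end

def watch_pods(resource_version = nil)
  kube_connection.watch_pods(pod_options.merge(:resource_version => resource_version))
end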
Pairing with @Fryguy, we came up with a label we can set for all deployments, {:"#{app_name}-orchestrated-by" => ENV['POD_NAME']}, in the object definition. We can then use a selector here, app=manageiq,manageiq-orchestrated-by=orchestrator-9f99d8cb9-7mprg, so we'll only get pods that are managed by the orchestrator. In the orchestrator code, we can then look for all pods that we're managing...
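A rough sketch of how that could fit together (the exact label key, helper names, and where the label gets applied are assumptions based on the comment above):

# Hypothetical: tag every object definition with the orchestrator pod that
# created it, then select only those pods when listing/watching.
def default_labels
  {:app => app_name, :"#{app_name}-orchestrated-by" => ENV["POD_NAME"]}
end

# Produces e.g. app=manageiq,manageiq-orchestrated-by=orchestrator-9f99d8cb9-7mprg
def pod_options
  @pod_options ||= {
    :namespace      => my_namespace,
    :label_selector => "app=#{app_name},#{app_name}-orchestrated-by=#{ENV["POD_NAME"]}"
  }
end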
Force-pushed from e0d8b4e to beaf75e
…liminate the accessor
Update log message on error since we're not resetting the resource version anymore.
Force-pushed from eec6c8b to da84d0a
def collect_initial_pods
  pods = orchestrator.get_pods
  pods.each { |p| save_pod(p) }
  pods.resourceVersion
Thanks @agrare for the suggestion... returning the resourceVersion here and passing it to the watch is simpler.
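Roughly, the flow being described (a sketch only; monitor_pods and process_pod_event are assumed names for illustration):

# Hypothetical monitor loop: list the pods once, then start the watch from the
# returned resourceVersion so already-processed events aren't replayed.
def monitor_pods
  resource_version = collect_initial_pods
  loop do
    orchestrator.watch_pods(resource_version).each do |event|
      resource_version = event.object.metadata.resourceVersion
      process_pod_event(event)
    end
  end
end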
Great work Joe!
@jrafanie Can you fix the rubocops? Some of them are legit.
Yeah, I'm doing one more full build to make sure the whole end-to-end process works, and then I'll clean those up.
* stale comments
* style
Force-pushed from cdd36ef to 9ba2fc4
Checked commits jrafanie/manageiq@db496aa~...9ba2fc4 with ruby 2.6.3, rubocop 0.69.0, haml-lint 0.28.0, and yamllint

app/models/miq_server/worker_management/monitor/kubernetes.rb
lib/container_orchestrator.rb
Ok, final tests were successful. I added screenshots in the description to hopefully better document what this PR does. The final style issues are meh:
Yeah, I'm 👍 with that.
Sounds good @simaishi. We're both comfortable with bringing back both PRs to jansa.
Add ability to detect and cleanup failed deployments (cherry picked from commit f3c20e8)
Jansa backport details:
What this PR does:
The easiest way to recreate this:
Run oc get pods to show the new amazon-cloud-event-catcher and amazon-cloud-refresh worker pods creating/starting:

TODO:
More nuanced detection of failed deployments (5+ restarts and terminated status), as we might otherwise flag pods that often hit memory/cpu limits over days/months. This is basic and doesn't conflict with liveness check failures, since those pods will be restarted and won't remain in a terminated lastState. Any pod that has 5 or more container restarts and remains in a terminated state will get removed as a deployment (see the sketch below).

Fixes: Worker deployments exist after worker records are removed #20147
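As a rough illustration, the detection rule described in that TODO could look something like this; the threshold constant, deployment_failed?, and the helper structure are assumptions for the sketch, not necessarily the PR's exact code:

# Hypothetical check matching the rule above: a pod counts as a failed
# deployment when any container has restarted 5+ times and its lastState
# is still "terminated".
FAILED_RESTART_THRESHOLD = 5

def deployment_failed?(pod)
  pod.status.containerStatuses.to_a.any? do |container_status|
    container_status.restartCount.to_i >= FAILED_RESTART_THRESHOLD &&
      container_status.lastState&.terminated
  end
end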
Here's an example of the events indicating two failed worker pods and their subsequent automatic removal:
Side effect bonus:
Each pod's labels show which orchestrator manages it:

By filtering by the orchestrator, such as manageiq-orchestrated-by=orchestrator-5f89795bcc-89ztg, we can see all of the deployments managed by that orchestrator (or by any orchestrator pod if you filter by manageiq-orchestrated-by alone), and therefore which ones we're monitoring and which will get killed if they continually fail:

(screenshot)