Cleanup failed systemd services #20420

agrare · 2020-08-05T19:27:28Z

It is possible for a systemd or container deployment to fail and no longer be run (e.g. CrashLoopBackOff) but the deployment/service will still exist in the runtime environment.

Add the ability for these failed services to be cleaned up during MiqServer's sync_workers loop.

agrare · 2020-08-05T21:14:03Z

app/models/miq_server/worker_management/monitor/quiesce.rb

@@ -36,8 +36,6 @@ def quiesce_workers_loop
    miq_workers.each do |w|
      if w.containerized_worker?
        w.delete_container_objects
-      elsif w.systemd_worker?
-        w.stop_systemd_worker


We call stop_systemd_worker in the stop_worker() method so this is just extra

jrafanie · 2020-08-05T22:17:42Z

app/models/miq_worker/systemd_common.rb

+      end
+
+      def failed_services
+        services.select { |service| service[:active_state] == "failed" }.map { |service| service[:name] }


This is cool. I've started doing something similar with kubernetes and categorizing failed ones as failed and then removing the deployments and allowing new ones to start makes me wonder if we could get into an infinite loop if each new deployment will continue to fail because we're detecting one should be started but for whatever reason, the runner itself can't start properly. I have to think about that some more.

Fryguy · 2020-08-06T20:33:15Z

@agrare I'd like to extend this idea even further. In various discussion with you @jrafanie and others, I think we need to create an abstract, pluggable, interface for different backends for the worker reconciliation. I think Nick summed it up pretty well here - #20147 (comment) . I think it's pretty straightforward, but let's do a quick arch session on it and come up with a solid design.

app/models/miq_worker.rb

agrare · 2020-08-10T14:04:25Z

Love that we might finally get to refactoring the MiqServer WorkerManagement, a little out of scope for a "quick fix" to the orphaned deployments issue so I opened #20424 so we can discuss/design there

agrare · 2020-08-11T13:45:03Z

So I was trying to do two things with this PR, 1. cleanup failed services/deployments and 2. use systemd/k8s for the list of "current" workers rather than our miq_workers table but it was getting a little big so I'm only going to tackle 1 in this PR and I'll open another to change how we look at "current" workers

app/models/miq_server/worker_management/monitor.rb

app/models/miq_server/worker_management/monitor/systemd.rb

app/models/miq_server/worker_management/monitor.rb

It is possible for a systemd or container deployment to fail and no longer be run (e.g. CrashLoopBackOff) but the deployment/service will still exist in the runtime environment. Add the ability for these failed services to be cleaned up during MiqServer's sync_workers loop.

miq-bot · 2020-08-12T22:45:33Z

Checked commit agrare@698d4f6 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
5 files checked, 0 offenses detected
Everything looks fine. ⭐

jrafanie · 2020-08-13T14:04:49Z

app/models/miq_server/worker_management/monitor.rb

+
+    if podified?
+    elsif systemd?
+      cleanup_failed_systemd_services


Maybe when we get podified committed, we can make worker monitor subclasses where one type gets instantiated earlier, and these subclasses implements cleanup_failed_things, etc. so we don't need to have the podified? and system? checks. For now, this is super clean/surgical and fixes just the issue we care about without complicating things prematurely.

Yeah absolutely, I think with proper subclassing we can do the common cleanup in the base class and the specific cleanups in the subclass with super to do the core stuff

jrafanie

LGTM

jrafanie · 2020-08-13T14:58:21Z

Ah, cool, I forgot to merge it. 🤣 Thanks @chessbyte

…t_workers_list Cleanup failed systemd services (cherry picked from commit 8338fdc)

simaishi · 2020-09-11T18:25:28Z

Jansa backport details:

$ git log -1
commit 4049ac6afe5c29e0214a291cea171027927d8ba0
Author: Oleg Barenboim <[email protected]>
Date:   Thu Aug 13 10:57:29 2020 -0400

    Merge pull request #20420 from agrare/use_systemd_services_for_current_workers_list

    Cleanup failed systemd services

    (cherry picked from commit 8338fdcb8429ce60f00b74a684bbdc781fa1a3c9)

agrare requested a review from jrafanie as a code owner August 5, 2020 19:27

agrare added core/workers wip labels Aug 5, 2020

agrare changed the title ~~Use the systemd services instead of miq_workers table~~ [WIP] Use the systemd services instead of miq_workers table Aug 5, 2020

agrare force-pushed the use_systemd_services_for_current_workers_list branch from 7e42ca4 to c8abf5a Compare August 5, 2020 21:13

agrare commented Aug 5, 2020

View reviewed changes

jrafanie reviewed Aug 5, 2020

View reviewed changes

agrare force-pushed the use_systemd_services_for_current_workers_list branch 2 times, most recently from 251f064 to 980f1ce Compare August 6, 2020 14:52

jrafanie reviewed Aug 7, 2020

View reviewed changes

app/models/miq_worker.rb Outdated Show resolved Hide resolved

jrafanie reviewed Aug 7, 2020

View reviewed changes

app/models/miq_worker.rb Outdated Show resolved Hide resolved

agrare force-pushed the use_systemd_services_for_current_workers_list branch 2 times, most recently from 31291f1 to 146df10 Compare August 11, 2020 12:52

agrare changed the title ~~[WIP] Use the systemd services instead of miq_workers table~~ [WIP] Cleanup failed systemd services Aug 11, 2020

agrare force-pushed the use_systemd_services_for_current_workers_list branch from 146df10 to 368eb65 Compare August 11, 2020 14:38

agrare requested a review from gtanzillo as a code owner August 11, 2020 14:38

agrare force-pushed the use_systemd_services_for_current_workers_list branch 4 times, most recently from a235e0a to 27b3c15 Compare August 11, 2020 19:04

agrare changed the title ~~[WIP] Cleanup failed systemd services~~ Cleanup failed systemd services Aug 11, 2020

miq-bot removed the wip label Aug 11, 2020

jrafanie reviewed Aug 12, 2020

View reviewed changes

app/models/miq_server/worker_management/monitor.rb Outdated Show resolved Hide resolved

app/models/miq_server/worker_management/monitor/systemd.rb Outdated Show resolved Hide resolved

app/models/miq_server/worker_management/monitor.rb Outdated Show resolved Hide resolved

agrare force-pushed the use_systemd_services_for_current_workers_list branch 3 times, most recently from e7593e7 to d79ac6c Compare August 12, 2020 19:46

agrare force-pushed the use_systemd_services_for_current_workers_list branch from d79ac6c to 698d4f6 Compare August 12, 2020 22:41

chessbyte assigned gtanzillo Aug 13, 2020

chessbyte added the refactoring label Aug 13, 2020

jrafanie reviewed Aug 13, 2020

View reviewed changes

jrafanie approved these changes Aug 13, 2020

View reviewed changes

chessbyte assigned chessbyte and unassigned gtanzillo Aug 13, 2020

chessbyte merged commit 8338fdc into ManageIQ:master Aug 13, 2020

simaishi mentioned this pull request Sep 10, 2020

Add ability to detect and cleanup failed deployments #20444

Merged

6 tasks

simaishi pushed a commit that referenced this pull request Sep 11, 2020

Merge pull request #20420 from agrare/use_systemd_services_for_curren…

4049ac6

…t_workers_list Cleanup failed systemd services (cherry picked from commit 8338fdc)

simaishi added the jansa/backported label Sep 11, 2020

agrare deleted the use_systemd_services_for_current_workers_list branch October 19, 2020 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup failed systemd services #20420

Cleanup failed systemd services #20420

agrare commented Aug 5, 2020 •

edited

Loading

agrare Aug 5, 2020

jrafanie Aug 5, 2020

Fryguy commented Aug 6, 2020

agrare commented Aug 10, 2020

agrare commented Aug 11, 2020

miq-bot commented Aug 12, 2020

jrafanie Aug 13, 2020 •

edited

Loading

agrare Aug 13, 2020

jrafanie left a comment

jrafanie commented Aug 13, 2020

simaishi commented Sep 11, 2020

Cleanup failed systemd services #20420

Cleanup failed systemd services #20420

Conversation

agrare commented Aug 5, 2020 • edited Loading

agrare Aug 5, 2020

Choose a reason for hiding this comment

jrafanie Aug 5, 2020

Choose a reason for hiding this comment

Fryguy commented Aug 6, 2020

agrare commented Aug 10, 2020

agrare commented Aug 11, 2020

miq-bot commented Aug 12, 2020

jrafanie Aug 13, 2020 • edited Loading

Choose a reason for hiding this comment

agrare Aug 13, 2020

Choose a reason for hiding this comment

jrafanie left a comment

Choose a reason for hiding this comment

jrafanie commented Aug 13, 2020

simaishi commented Sep 11, 2020

agrare commented Aug 5, 2020 •

edited

Loading

jrafanie Aug 13, 2020 •

edited

Loading