Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup failed systemd services #20420

Conversation

agrare
Copy link
Member

@agrare agrare commented Aug 5, 2020

It is possible for a systemd or container deployment to fail and no longer be run (e.g. CrashLoopBackOff) but the deployment/service will still exist in the runtime environment.

Add the ability for these failed services to be cleaned up during MiqServer's sync_workers loop.

@agrare agrare requested a review from jrafanie as a code owner August 5, 2020 19:27
@agrare agrare changed the title Use the systemd services instead of miq_workers table [WIP] Use the systemd services instead of miq_workers table Aug 5, 2020
@agrare agrare force-pushed the use_systemd_services_for_current_workers_list branch from 7e42ca4 to c8abf5a Compare August 5, 2020 21:13
@@ -36,8 +36,6 @@ def quiesce_workers_loop
miq_workers.each do |w|
if w.containerized_worker?
w.delete_container_objects
elsif w.systemd_worker?
w.stop_systemd_worker
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call stop_systemd_worker in the stop_worker() method so this is just extra

end

def failed_services
services.select { |service| service[:active_state] == "failed" }.map { |service| service[:name] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool. I've started doing something similar with kubernetes and categorizing failed ones as failed and then removing the deployments and allowing new ones to start makes me wonder if we could get into an infinite loop if each new deployment will continue to fail because we're detecting one should be started but for whatever reason, the runner itself can't start properly. I have to think about that some more.

@agrare agrare force-pushed the use_systemd_services_for_current_workers_list branch 2 times, most recently from 251f064 to 980f1ce Compare August 6, 2020 14:52
@Fryguy
Copy link
Member

Fryguy commented Aug 6, 2020

@agrare I'd like to extend this idea even further. In various discussion with you @jrafanie and others, I think we need to create an abstract, pluggable, interface for different backends for the worker reconciliation. I think Nick summed it up pretty well here - #20147 (comment) . I think it's pretty straightforward, but let's do a quick arch session on it and come up with a solid design.

app/models/miq_worker.rb Outdated Show resolved Hide resolved
app/models/miq_worker.rb Outdated Show resolved Hide resolved
@agrare
Copy link
Member Author

agrare commented Aug 10, 2020

Love that we might finally get to refactoring the MiqServer WorkerManagement, a little out of scope for a "quick fix" to the orphaned deployments issue so I opened #20424 so we can discuss/design there

@agrare agrare force-pushed the use_systemd_services_for_current_workers_list branch 2 times, most recently from 31291f1 to 146df10 Compare August 11, 2020 12:52
@agrare agrare changed the title [WIP] Use the systemd services instead of miq_workers table [WIP] Cleanup failed systemd services Aug 11, 2020
@agrare
Copy link
Member Author

agrare commented Aug 11, 2020

So I was trying to do two things with this PR, 1. cleanup failed services/deployments and 2. use systemd/k8s for the list of "current" workers rather than our miq_workers table but it was getting a little big so I'm only going to tackle 1 in this PR and I'll open another to change how we look at "current" workers

@agrare agrare force-pushed the use_systemd_services_for_current_workers_list branch from 146df10 to 368eb65 Compare August 11, 2020 14:38
@agrare agrare requested a review from gtanzillo as a code owner August 11, 2020 14:38
@agrare agrare force-pushed the use_systemd_services_for_current_workers_list branch 4 times, most recently from a235e0a to 27b3c15 Compare August 11, 2020 19:04
@agrare agrare changed the title [WIP] Cleanup failed systemd services Cleanup failed systemd services Aug 11, 2020
@miq-bot miq-bot removed the wip label Aug 11, 2020
@agrare agrare force-pushed the use_systemd_services_for_current_workers_list branch 3 times, most recently from e7593e7 to d79ac6c Compare August 12, 2020 19:46
It is possible for a systemd or container deployment to fail and no
longer be run (e.g. CrashLoopBackOff) but the deployment/service will
still exist in the runtime environment.

Add the ability for these failed services to be cleaned up during
MiqServer's sync_workers loop.
@agrare agrare force-pushed the use_systemd_services_for_current_workers_list branch from d79ac6c to 698d4f6 Compare August 12, 2020 22:41
@miq-bot
Copy link
Member

miq-bot commented Aug 12, 2020

Checked commit agrare@698d4f6 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
5 files checked, 0 offenses detected
Everything looks fine. ⭐


if podified?
elsif systemd?
cleanup_failed_systemd_services
Copy link
Member

@jrafanie jrafanie Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe when we get podified committed, we can make worker monitor subclasses where one type gets instantiated earlier, and these subclasses implements cleanup_failed_things, etc. so we don't need to have the podified? and system? checks. For now, this is super clean/surgical and fixes just the issue we care about without complicating things prematurely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah absolutely, I think with proper subclassing we can do the common cleanup in the base class and the specific cleanups in the subclass with super to do the core stuff

Copy link
Member

@jrafanie jrafanie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chessbyte chessbyte assigned chessbyte and unassigned gtanzillo Aug 13, 2020
@chessbyte chessbyte merged commit 8338fdc into ManageIQ:master Aug 13, 2020
@jrafanie
Copy link
Member

Ah, cool, I forgot to merge it. 🤣 Thanks @chessbyte

simaishi pushed a commit that referenced this pull request Sep 11, 2020
…t_workers_list

Cleanup failed systemd services

(cherry picked from commit 8338fdc)
@simaishi
Copy link
Contributor

Jansa backport details:

$ git log -1
commit 4049ac6afe5c29e0214a291cea171027927d8ba0
Author: Oleg Barenboim <[email protected]>
Date:   Thu Aug 13 10:57:29 2020 -0400

    Merge pull request #20420 from agrare/use_systemd_services_for_current_workers_list

    Cleanup failed systemd services

    (cherry picked from commit 8338fdcb8429ce60f00b74a684bbdc781fa1a3c9)

@agrare agrare deleted the use_systemd_services_for_current_workers_list branch October 19, 2020 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants