Cleanup failed systemd services #20420
Conversation
Force-pushed from 7e42ca4 to c8abf5a
@@ -36,8 +36,6 @@ def quiesce_workers_loop
      miq_workers.each do |w|
        if w.containerized_worker?
          w.delete_container_objects
        elsif w.systemd_worker?
          w.stop_systemd_worker
We call stop_systemd_worker in the stop_worker() method, so this is just extra.
        end
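The comment above notes that stop_worker already invokes stop_systemd_worker, which is why the elsif branch could be dropped. A hypothetical sketch of that delegation (the real stop_worker in MiqServer handles more cases; this only illustrates the redundancy):

```ruby
# Hypothetical shape only, based on the review comment above; not the actual
# ManageIQ implementation of stop_worker.
def stop_worker(worker)
  if worker.systemd_worker?
    worker.stop_systemd_worker # systemd units are already stopped here...
  else
    worker.stop                # ...so quiesce_workers_loop needn't repeat the check
  end
end
```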
def failed_services
  services.select { |service| service[:active_state] == "failed" }.map { |service| service[:name] }
This is cool. I've started doing something similar with kubernetes, categorizing failed deployments and then removing them so new ones can start. It makes me wonder if we could get into an infinite loop if each new deployment continues to fail: we keep detecting that one should be started, but for whatever reason the runner itself can't start properly. I have to think about that some more.
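For reference, failed_services above just filters a collection of service hashes on :active_state. The services helper itself isn't part of this excerpt, so here is a self-contained sketch, assuming plain systemctl output as the data source (ManageIQ's actual implementation goes through its systemd interface rather than shelling out):

```ruby
require "open3"

# Illustration only: build service hashes by parsing `systemctl list-units`,
# then filter them the same way failed_services does in the diff above.
def services
  out, _err, _status = Open3.capture3(
    "systemctl", "list-units", "--type=service", "--all", "--plain", "--no-legend"
  )
  out.each_line.map do |line|
    # Columns: UNIT LOAD ACTIVE SUB DESCRIPTION...
    unit, load_state, active_state, sub_state, *description = line.split
    {:name => unit, :load_state => load_state, :active_state => active_state,
     :sub_state => sub_state, :description => description.join(" ")}
  end
end

def failed_services
  services.select { |service| service[:active_state] == "failed" }.map { |service| service[:name] }
end

puts failed_services.inspect
```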
Force-pushed from 251f064 to 980f1ce
@agrare I'd like to extend this idea even further. In various discussions with you, @jrafanie, and others, I think we need to create an abstract, pluggable interface for different worker-reconciliation backends. I think Nick summed it up pretty well here: #20147 (comment). I think it's pretty straightforward, but let's do a quick arch session on it and come up with a solid design.
Love that we might finally get to refactoring the MiqServer WorkerManagement. That's a little out of scope for a "quick fix" to the orphaned deployments issue, so I opened #20424 so we can discuss/design there.
Force-pushed from 31291f1 to 146df10
So I was trying to do two things with this PR: 1. clean up failed services/deployments, and 2. use systemd/k8s for the list of "current" workers rather than our miq_workers table. It was getting a little big, so I'm only going to tackle 1 in this PR and I'll open another to change how we look at "current" workers.
Force-pushed from 146df10 to 368eb65
Force-pushed from a235e0a to 27b3c15
Force-pushed from e7593e7 to d79ac6c
It is possible for a systemd or container deployment to fail and no longer be running (e.g. CrashLoopBackOff), but the deployment/service will still exist in the runtime environment. Add the ability for these failed services to be cleaned up during MiqServer's sync_workers loop.
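A rough sketch of what the cleanup step described here might look like, assuming the failed_services helper shown earlier and plain systemctl commands (disable and reset-failed are standard systemd operations; ManageIQ's actual code uses its own systemd interface):

```ruby
# Illustrative only: clear out failed worker units so sync_workers can start
# fresh replacements on its next pass.
def cleanup_failed_systemd_services
  failed = failed_services
  return if failed.empty?

  failed.each do |unit|
    # `disable --now` stops the unit and prevents systemd from restarting it;
    # `reset-failed` clears the failed state so the unit name can be reused.
    system("systemctl", "disable", "--now", unit)
    system("systemctl", "reset-failed", unit)
  end
end
```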
Force-pushed from d79ac6c to 698d4f6
Checked commit agrare@698d4f6 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
      if podified?
      elsif systemd?
        cleanup_failed_systemd_services
Maybe when we get podified committed, we can make worker monitor subclasses where one type gets instantiated earlier, and these subclasses implement cleanup_failed_things, etc., so we don't need to have the podified? and systemd? checks. For now, this is super clean/surgical and fixes just the issue we care about without complicating things prematurely.
Yeah absolutely, I think with proper subclassing we can do the common cleanup in the base class and the specific cleanups in the subclasses, with super to do the core stuff.
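A rough sketch of the subclassing idea being discussed; all class and method names below are hypothetical placeholders, since the actual design was deferred to #20424:

```ruby
# Hypothetical layout: common reconciliation in the base class, runtime-specific
# cleanup in subclasses that call super for the shared part.
class WorkerManagement
  def cleanup_failed_workers
    # shared cleanup, e.g. pruning orphaned miq_workers rows
  end
end

class WorkerManagement::Systemd < WorkerManagement
  def cleanup_failed_workers
    super
    cleanup_failed_systemd_services
  end
end

class WorkerManagement::Kubernetes < WorkerManagement
  def cleanup_failed_workers
    super
    cleanup_failed_deployments
  end
end
```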
LGTM
Ah, cool, I forgot to merge it. 🤣 Thanks @chessbyte
…t_workers_list Cleanup failed systemd services (cherry picked from commit 8338fdc)
Jansa backport details:
It is possible for a systemd or container deployment to fail and no longer be running (e.g. CrashLoopBackOff), but the deployment/service will still exist in the runtime environment.
Add the ability for these failed services to be cleaned up during MiqServer's sync_workers loop.