Link MiqWorker record to a running pod when not created using run_single_worker.rb #23112

@agrare agrare requested a review from jrafanie as a code owner July 25, 2024 20:46
@miq-bot miq-bot added the wip label Jul 25, 2024
@agrare agrare force-pushed the miq_worker_worker_management_kubernetes_non_rails_system_uid branch 2 times, most recently from 9f162cb to 7879964 Compare July 30, 2024 14:56
@agrare agrare requested a review from Fryguy as a code owner July 30, 2024 14:56
@agrare agrare force-pushed the miq_worker_worker_management_kubernetes_non_rails_system_uid branch 2 times, most recently from 23332aa to d4173ca Compare July 30, 2024 15:58
@agrare
Member Author

agrare commented Jul 30, 2024

I added some debug logging showing when we find a worker without a system_uid and a pod that matches the worker class:

{"@timestamp":"2024-07-30T17:27:45.709442","hostname":"orchestrator-5cf6b5b749-fxkfw","pid":7,"tid":"9808","service":"evm","level":"info","message":"MIQ(MiqWebServiceWorker#start) Worker started: ID [], PID [], GUID [89ac1fbe-81a3-4cc4-9afe-723ac46bb898]"}
{"@timestamp":"2024-07-30T17:27:45.713792","hostname":"orchestrator-5cf6b5b749-fxkfw","pid":7,"tid":"9808","service":"evm","level":"info","message":"MIQ(OpentofuWorker.sync_workers) Workers are being synchronized: Current #: [0], Desired #: [1]"}
{"@timestamp":"2024-07-30T17:27:45.811334","hostname":"orchestrator-5cf6b5b749-fxkfw","pid":7,"tid":"9808","service":"evm","level":"info","message":"MIQ(ContainerOrchestrator#patch_deployment) deployment_name: 1-opentofu-runner, data: {:spec=>{:replicas=>1}}"}
{"@timestamp":"2024-07-30T17:27:46.182359","hostname":"orchestrator-5cf6b5b749-fxkfw","pid":7,"tid":"9808","service":"evm","level":"info","message":"MIQ(MiqQueue.put) Message id: [1293], Zone: [default], Role: [], Server: [], MiqTask id: [], Handler id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [[\"MiqServer\", 1], \"evm_worker_start\", {:event_details=>\"Worker started: ID [14], PID [], GUID [1d7f7921-aba9-4026-b1ea-d197f34d55cd]\", :type=>\"OpentofuWorker\"}]"}
{"@timestamp":"2024-07-30T17:27:46.182432","hostname":"orchestrator-5cf6b5b749-fxkfw","pid":7,"tid":"9808","service":"evm","level":"info","message":"MIQ(OpentofuWorker#start) Worker started: ID [14], PID [], GUID [1d7f7921-aba9-4026-b1ea-d197f34d55cd]"}
{"@timestamp":"2024-07-30T17:27:46.228408","hostname":"orchestrator-5cf6b5b749-fxkfw","pid":7,"tid":"9808","service":"evm","level":"info","message":"MIQ(MiqServer::WorkerManagement::Kubernetes#sync_starting_workers) AG: found a worker without a system_uid: Class [OpentofuWorker] Id [14]"}
{"@timestamp":"2024-07-30T17:27:46.228506","hostname":"orchestrator-5cf6b5b749-fxkfw","pid":7,"tid":"9808","service":"evm","level":"info","message":"MIQ(MiqServer::WorkerManagement::Kubernetes#sync_starting_workers) AG: found a pod 1-opentofu-runner-84b54759f6-hhfs4 assigning to worker: Class [OpentofuWorker] Id [14]"}

@agrare agrare changed the title [WIP] Miq worker worker management kubernetes non rails system uid [WIP] Link MiqWorker record to a running pod when not created using run_single_worker.rb Jul 30, 2024
@agrare agrare force-pushed the miq_worker_worker_management_kubernetes_non_rails_system_uid branch from d4173ca to dc13bc3 Compare August 1, 2024 14:52
@agrare agrare changed the title [WIP] Link MiqWorker record to a running pod when not created using run_single_worker.rb Link MiqWorker record to a running pod when not created using run_single_worker.rb Aug 1, 2024
@miq-bot miq-bot removed the wip label Aug 1, 2024
@Fryguy Fryguy self-assigned this Aug 7, 2024
@jrafanie
Member

jrafanie commented Aug 7, 2024

@agrare are these legit test errors? Are they related to the ansible runner specs that Jason added recently?

@agrare agrare force-pushed the miq_worker_worker_management_kubernetes_non_rails_system_uid branch from dc13bc3 to 3e6c113 Compare August 8, 2024 13:15
@agrare
Member Author

agrare commented Aug 8, 2024

@jrafanie the test failures look unrelated; I'll rebase and kick the tests

@agrare
Member Author

agrare commented Aug 8, 2024

TODO: exclude starting workers from the orphaned worker row cleanup, and confirm that miq_workers are marked failed after the starting timeout on podified. Handle the case where a miq_worker record is created but the pod hasn't started yet.

@Fryguy
Member

Fryguy commented Aug 8, 2024

@agrare and I discussed over video and I am concerned there is a race condition between actually starting the worker and sync_starting_workers. If the worker doesn't actually start before we hit that method, then it's possible that current_pods will not have the record. In that case you'd have a worker record with a blank system_uid that also will not get assigned a system_uid in sync_starting_workers, and then it will be deleted in cleanup_orphaned_workers. Then, on the next pass, current_pods will have the worker, but the database record won't exist, and there won't be anything to sync it to. That is, the pods_without_workers variable will have extra things (which shouldn't happen), but there isn't any code to check that case of extra values in there.
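
The race described above can be sketched in plain Ruby. This is only an illustrative model, not the ManageIQ implementation: the `Worker` struct, the `pods` hash (standing in for current_pods), the `keep_starting:` flag and the `simulate` helper are all hypothetical names introduced for the example. It runs two sync passes, with the pod only becoming visible on the second pass.

```ruby
# Hypothetical, simplified model; names do not mirror the real ManageIQ classes.
Worker = Struct.new(:id, :type, :status, :system_uid, keyword_init: true)

# pods is a Hash of { pod_name => worker_type }, a stand-in for current_pods.
# Links each starting worker that has no system_uid to an unclaimed pod of its type.
def sync_starting_workers(workers, pods)
  claimed = workers.map(&:system_uid).compact
  workers.select { |w| w.status == "starting" && w.system_uid.nil? }.each do |worker|
    pod_name, _type = pods.find { |name, type| type == worker.type && !claimed.include?(name) }
    next if pod_name.nil? # pod not visible yet; try again on the next pass

    worker.system_uid = pod_name
    claimed << pod_name
  end
  workers
end

# Drops worker rows that have no matching pod. With keep_starting: true,
# rows still in the "starting" state are never treated as orphans.
def cleanup_orphaned_workers(workers, pods, keep_starting:)
  workers.reject do |w|
    next false if keep_starting && w.status == "starting"

    !pods.key?(w.system_uid)
  end
end

# Runs two passes: the worker row exists on pass 1, the pod only appears on pass 2.
def simulate(keep_starting:)
  workers = [Worker.new(id: 14, type: "OpentofuWorker", status: "starting")]

  pods = {} # pass 1: the pod has not been created yet
  workers = cleanup_orphaned_workers(sync_starting_workers(workers, pods), pods, keep_starting: keep_starting)

  pods = {"1-opentofu-runner-84b54759f6-hhfs4" => "OpentofuWorker"} # pass 2
  workers = cleanup_orphaned_workers(sync_starting_workers(workers, pods), pods, keep_starting: keep_starting)

  {workers: workers, pods_without_workers: pods.keys - workers.map(&:system_uid)}
end
```

In this model, `simulate(keep_starting: false)` ends with no worker rows and one entry in `pods_without_workers`, which is the failure mode described. `simulate(keep_starting: true)` ends with the worker linked to the pod.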

@agrare agrare changed the title Link MiqWorker record to a running pod when not created using run_single_worker.rb [WIP] Link MiqWorker record to a running pod when not created using run_single_worker.rb Aug 8, 2024
@miq-bot miq-bot added the wip label Aug 8, 2024
@agrare agrare force-pushed the miq_worker_worker_management_kubernetes_non_rails_system_uid branch 3 times, most recently from 712bbed to b9996fb Compare August 19, 2024 17:56
@agrare agrare force-pushed the miq_worker_worker_management_kubernetes_non_rails_system_uid branch from b9996fb to 99fc10a Compare August 21, 2024 18:33
@agrare
Member Author

agrare commented Aug 21, 2024

Okay I've done a live test where I manually skip any opentofu-runner pods to force the case where the record has been created but the pod is not created yet. Confirmed that we are not deleting the worker record during this period, pending the check on deleting the worker record after the 10 minute startup time.

@agrare
Member Author

agrare commented Aug 21, 2024

Okay confirmed that the "worker starting but no pod is ever created" case behaves correctly:
{"@timestamp":"2024-08-21T20:16:46.619501","hostname":"orchestrator-5754f779b8-xkf42","pid":7,"tid":"83cc","service":"evm","level":"err","message":"MIQ(MiqServer::WorkerManagement::Kubernetes#exceeded_heartbeat_threshold?) Worker [OpentofuWorker] with ID: [40], PID: [], GUID: [2822a83c-7b5a-49d4-a67f-53cc36ff7f09] has not responded in 603.585520507 seconds, restarting worker"}
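
As a rough illustration of the timeout check behind that log line, here is a minimal sketch. It assumes a 600 second starting timeout (the "10 minute startup time" mentioned earlier). `StartingWorker`, `STARTING_TIMEOUT` and `starting_timeout_exceeded?` are hypothetical names for the example, not the real ManageIQ methods or settings.

```ruby
# Assumed value: the thread mentions a 10 minute startup time.
STARTING_TIMEOUT = 600 # seconds

# Hypothetical stand-in for a miq_workers row that is still starting.
StartingWorker = Struct.new(:id, :started_on, :system_uid, keyword_init: true)

# True when the worker still has no pod linked and has been starting
# for longer than the timeout.
def starting_timeout_exceeded?(worker, now: Time.now, timeout: STARTING_TIMEOUT)
  worker.system_uid.nil? && (now - worker.started_on) > timeout
end
```

With a worker started at time `t0` and never linked to a pod, the check is false at `t0 + 300` and true at `t0 + 603.585520507`, the elapsed time reported in the log.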

@agrare agrare force-pushed the miq_worker_worker_management_kubernetes_non_rails_system_uid branch from 99fc10a to 2bfa69f Compare August 21, 2024 20:30
@agrare agrare force-pushed the miq_worker_worker_management_kubernetes_non_rails_system_uid branch from 2bfa69f to d7a2b57 Compare August 21, 2024 20:37
it "marks the worker as not responding" do
  # Make sure that #find_worker returns our instance of worker
  # that stubs the #stop_container method.
  expect(server.worker_manager).to receive(:find_worker).with(worker).and_return(worker)
Member Author

NOTE: it looks a little weird to have .with(worker).and_return(worker) here, but that is just the way the matcher works. Without this line, the worker object passed in to stop_worker does not have the mock expect(worker).to receive(:stop_container) applied, and it actually drops into ContainerOrchestrator to try to delete the container. This ensures that stop_worker has the right object with the stubbed methods.
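
The underlying issue is object identity: a stub applied to one Ruby instance does not carry over to another instance representing the same record. The sketch below shows this in plain Ruby without RSpec. It assumes the lookup returns a different object than the one the spec stubbed. `SketchWorker` and `SketchManager` are hypothetical stand-ins, not the real classes.

```ruby
# Hypothetical stand-ins; not the real ManageIQ classes.
class SketchWorker
  attr_reader :id

  def initialize(id)
    @id = id
  end

  # Represents the real call that would try to delete the container.
  def stop_container
    :real_delete_attempted
  end
end

class SketchManager
  # Assumption for the sketch: the lookup builds a fresh object for the same record.
  def find_worker(worker)
    SketchWorker.new(worker.id)
  end

  def stop_worker(worker)
    find_worker(worker).stop_container
  end
end

worker  = SketchWorker.new(40)
manager = SketchManager.new

# Stub only this one instance, roughly what expect(worker).to receive(:stop_container) does.
worker.define_singleton_method(:stop_container) { :stubbed }

# The lookup returns a different object, so the stub is bypassed.
unstubbed_result = manager.stop_worker(worker)

# Make the lookup hand back the stubbed instance, roughly what .and_return(worker) does.
manager.define_singleton_method(:find_worker) { |_worker| worker }
stubbed_result = manager.stop_worker(worker)
```

Here `unstubbed_result` is `:real_delete_attempted` and `stubbed_result` is `:stubbed`, which mirrors why the spec has to return its own `worker` from `find_worker`.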

@agrare agrare changed the title [WIP] Link MiqWorker record to a running pod when not created using run_single_worker.rb Link MiqWorker record to a running pod when not created using run_single_worker.rb Aug 21, 2024
@agrare
Member Author

agrare commented Aug 21, 2024

Okay, the live test forcing the race is complete and specs should be green, taking out of WIP

@agrare agrare added radjabov/yes? and removed wip labels Aug 21, 2024
@miq-bot
Member

miq-bot commented Aug 21, 2024

Checked commits agrare/manageiq@4f070f7~...d7a2b57 with ruby 3.1.5, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
2 files checked, 0 offenses detected
Everything looks fine. 🏆

@Fryguy Fryguy merged commit a04c02d into ManageIQ:master Aug 22, 2024
8 checks passed
@agrare agrare deleted the miq_worker_worker_management_kubernetes_non_rails_system_uid branch August 22, 2024 13:41
@Fryguy
Member

Fryguy commented Aug 22, 2024

Backported to radjabov in commit 71e76a6.

commit 71e76a6c1b0173bdf6fcb7e594cc0535fa309ebe
Author: Jason Frey <[email protected]>
Date:   Thu Aug 22 09:41:35 2024 -0400

    Merge pull request #23112 from agrare/miq_worker_worker_management_kubernetes_non_rails_system_uid
    
    Link MiqWorker record to a running pod when not created using run_single_worker.rb
    
    (cherry picked from commit a04c02ddf3452b083fe4e8205f4d41fbfa907927)

Fryguy added a commit that referenced this pull request Aug 22, 2024
…bernetes_non_rails_system_uid

Link MiqWorker record to a running pod when not created using run_single_worker.rb

(cherry picked from commit a04c02d)
Development

Successfully merging this pull request may close these issues.

OpentofuWorker not restarted after changing settings on podified
5 participants