Memory constrained GitHub Runners stuck in idle #30

Open
tcuthbert opened this issue Feb 6, 2023 · 0 comments

Bug Description

We've had reports that runners deployed by github-runner-operator occasionally stop receiving jobs.
I investigated and found that when oom-killer is invoked and kills the runner VMs, the service does not recover.
Rebooting the system brought the runners back; restarting the charm and LXD did not fix the issue.

To Reproduce

N/A. From my investigation, this appears to be caused by the runners running out of memory and being killed by oom-killer.

Environment

late/edge

"charm": "github-runner",
"series": "focal",
"os": "ubuntu",
"charm-origin": "charmhub",
"charm-name": "github-runner",
"charm-rev": 4,
"charm-channel": "edge",
"charm-version": "45cfce1",
"exposed": false,

Relevant log output

https://pastebin.ubuntu.com/p/vQ887s42J8/

Additional context

From my investigation notes:

This turned out to be a memory issue.
oom-killer killed a bunch of the VMs on xlarge/2*
xlarge/2 was stuck in the following state:

[Thu Feb  2 15:07:36 2023] Out of memory: Killed process 3018113 (qemu-system-x86) total-vm:17906732kB, anon-rss:44264kB, file-rss:0kB, shmem-rss:6236532kB,
[Thu Feb  2 18:36:52 2023] Out of memory: Killed process 3016510 (qemu-system-x86) total-vm:17915952kB, anon-rss:43820kB, file-rss:0kB, shmem-rss:5447228kB,
[Thu Feb  2 22:25:03 2023] Out of memory: Killed process 548735 (qemu-system-x86) total-vm:18131036kB, anon-rss:45532kB, file-rss:0kB, shmem-rss:4885312kB,
[Thu Feb  2 22:41:01 2023] Out of memory: Killed process 547841 (qemu-system-x86) total-vm:18166784kB, anon-rss:48156kB, file-rss:0kB, shmem-rss:4816036kB,
[Thu Feb  2 23:05:39 2023] Out of memory: Killed process 2953011 (qemu-system-x86) total-vm:18049308kB, anon-rss:43928kB, file-rss:0kB, shmem-rss:2984476kB,
[Thu Feb  2 23:59:43 2023] Out of memory: Killed process 2951372 (qemu-system-x86) total-vm:18185036kB, anon-rss:44996kB, file-rss:0kB, shmem-rss:2964956kB,
03 Feb 2023 00:07:41Z  workload   maintenance  Reconciling runners
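
For anyone triaging a similar host, here is a minimal sketch of how the kernel log lines above could be summarized. It is not part of the charm; `parse_oom_kills` and `SAMPLE` are hypothetical names introduced for illustration, and the regular expression assumes the exact line format quoted above. In these lines `anon-rss` is only tens of MB while `shmem-rss` is several GB, so the sketch pulls that field out explicitly.

```python
import re

# Pattern for the "Out of memory: Killed process" lines quoted in this issue.
# Assumption: other kernel versions may format these lines differently.
OOM_RE = re.compile(
    r"\[(?P<when>[^\]]+)\]\s+Out of memory: Killed process (?P<pid>\d+) "
    r"\((?P<comm>[^)]+)\) total-vm:(?P<total_vm>\d+)kB, "
    r"anon-rss:(?P<anon_rss>\d+)kB, file-rss:(?P<file_rss>\d+)kB, "
    r"shmem-rss:(?P<shmem_rss>\d+)kB"
)

# Two kernel lines and the status line, copied from the log excerpt above.
SAMPLE = """\
[Thu Feb  2 15:07:36 2023] Out of memory: Killed process 3018113 (qemu-system-x86) total-vm:17906732kB, anon-rss:44264kB, file-rss:0kB, shmem-rss:6236532kB,
[Thu Feb  2 18:36:52 2023] Out of memory: Killed process 3016510 (qemu-system-x86) total-vm:17915952kB, anon-rss:43820kB, file-rss:0kB, shmem-rss:5447228kB,
03 Feb 2023 00:07:41Z  workload   maintenance  Reconciling runners
"""


def parse_oom_kills(text):
    """Return one dict per OOM kill line found in `text`; other lines are skipped."""
    kills = []
    for line in text.splitlines():
        match = OOM_RE.search(line)
        if match is None:
            continue
        kills.append(
            {
                "when": match.group("when"),
                "pid": int(match.group("pid")),
                "comm": match.group("comm"),
                "total_vm_kb": int(match.group("total_vm")),
                "anon_rss_kb": int(match.group("anon_rss")),
                "file_rss_kb": int(match.group("file_rss")),
                "shmem_rss_kb": int(match.group("shmem_rss")),
            }
        )
    return kills


if __name__ == "__main__":
    for kill in parse_oom_kills(SAMPLE):
        shmem_gib = kill["shmem_rss_kb"] / (1024 * 1024)
        print(f'{kill["when"]}  pid={kill["pid"]}  {kill["comm"]}  shmem-rss={shmem_gib:.2f} GiB')
```

Feeding it the output of `dmesg -T` from an affected unit should give a quick per-VM view of how much shared memory each killed `qemu-system-x86` process held at the time.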

Bizarre how it affected all the units.
I rebooted xlarge/2 first, then noticed there was an active runner and that a test got scheduled.
None of the other units recovered on their own, so I rebooted those too, and they also started registering as online.
