Please address the design failures and customers' concerns in Runner Scale Sets #3340

tmehlinger · 2024-03-12T03:18:09Z

tmehlinger
Mar 12, 2024

I'm both a GHEC customer and spare-time user staring down an "upgrade" from Runner Deployments to Runner Scale Sets and couldn't possibly be more disappointed with this feature.

While trying to figure out how they're supposed to work, I found this discussion where customers operating at scale have expressed reservations about using scale sets because, to be completely frank, they don't work. Selecting on multiple labels is essential for scaling out an Actions deployment, and the newer ARC can't do it. Think of it this way--how utterly useless would Kubernetes be if you couldn't manage resources using label selection? Whether you like it or not, existing large Actions deployments work the same way.

I think the rationale for the design is pretty weak and it needs revisiting.

The reason we opted for single labels on ARC (and are doing the same thing on the larger hosted runners as well) is to get us ready to improve the Dev UX about 'why isn't my job starting'.

Having experienced the same frustration myself, my suggestion would be to just give me visibility into the job queue, showing the labels that didn't match any runners and thus prevented scheduling. I suggest looking at how Karpenter presents diagnostic output; it will very nicely explain how/why pods can't be scheduled in a Kubernetes cluster based on the combinations of labels and taints that apply to the capacity it manages. It's a problem remarkably similar to this one.

But....

When we had multiple labels, doing deterministic assignment of jobs to a 'scale set' wasn't possible. The ven diagram of runners that matched labels vs. the labels I was looking for meant that jobs always just waited until 'something was free' and 'where' something would start was less clear. This conflicted with a reasonably high request we had to 'see the queue of jobs and when mine will start'.

What do "deterministic" and "where" mean? Do you mean specifically which runner would pick up my job? How could you possibly know ahead of time when you can't predict when a runner would become available? You don't need to tell me which runner a job will start on, I need to know which runners (plural) with matching capabilities can host a job, and then I don't care which one picks it up because they should all have identical features. None of this prevents you from just showing us the job queue.

This all works fine today with Runner Deployments. The only trouble my team has with Runner Deployments is the odd "it won't start" issue, which (again) would be mitigated by having detailed visibility into the job queue. If I could see which labels a job wants, and what permissions it has (also sorely lacking in diagnostic output), this problem is much easier to solve. You chose to make an entirely new and fundamentally flawed paradigm for ARC, then spent the time to implement it, instead of going for the simpler solution of showing information you already have.

I note that some people mitigated this by adding 'lots and lots' of labels to make a run on target very specific – which also isn't great for new users.

New users? Of an advanced feature that requires hosting a CI platform in Kubernetes? How about we favor competency instead?

On using an excess of labels, documentation to the effect of "don't do that, do this instead" would have been much more useful for a lot less effort.

I do appreciate that this results in a change in workflow files :(

Literally thousands of workflows. How could this possibly be the best solution?

Then there's this, an actual security issue, flippantly disregarded by one of the maintainers. Being able to distinguish self-hosted runners from others is critical to avoiding issues like this one.

Finally, I really don't understand the decision to force multiple ARC deployments, one for each RSS. A single controller is capable of doing all the work, and the earlier controller design had a nice separation of responsibilities, fulfilling custom resources that could each describe exactly the capacity I need, labeled accordingly. Now you're just wasting capacity with a controller for every flavor of runner, and I have to manage N Helm installs. This is an inferior design by all measurements.

It's not my intention to be rude. I believe everyone involved has good intentions. Unfortunately, in this case, it isn't even remotely good enough, and the amount of work we, your customers, have to do to handle it is immensely frustrating.

Please reconsider the feedback and deal with it accordingly. To be perfectly transparent, my desire is that you will scrap RSS entirely and continue investing in the previous method.

As it stands now, I fully intend to stay on the older controller deployment and sincerely hope it gets continued support.

danmanners · 2024-03-12T15:44:28Z

danmanners
Mar 12, 2024

I'll second 100% of this.

1 reply

8BallDuVal Mar 21, 2024

I 100% agree with this post. Github needs to be made aware that this is unacceptable and will cause dozens of hours of unnecessary work for developers to modify their workflows, which have been using an array of runner labels (as recommended by github's own documentation) for 3+ years now.

nebuk89 · 2024-04-02T16:26:23Z

nebuk89
Apr 2, 2024
Maintainer

Hey!

I have been speaking with @juliandunn about this and the engineering team, and wanted to give you an update along with play back where we are.

We are still evaluating what to do next as there is no clear right answer that will satisfy the folks discussing here and the things we need to consider below.

Thinking/history

The first is to reiterate how we got here and why we launched scale sets with only a single label, and resulting in the new ARC mode with a single label. The headline reason for this was a decision we made based on customer feedback to increase the determinism of 'where' a job will run. Having only labels that map to a single object (i.e. a name) means that jobs cannot be 'stolen' by simply adding the wrong label to a runner, which was and is a major concern for many customers still. This is still a benefit to customers and one we need to keep in mind.

I will also apologize for a lack of a clear change log when we released this. We did a really bare-bones changelog for the beta of ARC and scale sets but did not make this clear enough at all at the time.

What it would take to change

This is not something we can 'flip the bit' to go back on, the scale sets were built without multi-label support, so this would be an involved change for us to move over to.

Mitigation

For those who are blocked, the older scaling modes of ARC have not been removed and can continue to be used.

Concerns raised

Just to ensure I capture everyone's concerns raised, I want to compile a list here that I will edit/add to as people respond in this thread.

Security issues: The concern is that "I name a runner X and GitHub introduces that as a label", then the job can be taken by the hosted runner job. There is also a concern that adding 'self-hosted' can act as a 'blanket mitigation' against a job landing on a GitHub runner with a 'custom label'.

Targeting jobs gets more complex: The concern is that this requires knowing the ordering of the components like ubuntu-2core while ubuntu, 2-core doesn't.

Low determinism job targeting: Some customers want to be able to just match ubuntu and don't care if I get a 2-core or a 64 core machine. There is a desire for 'soft matching' that must span different runner groups.

Runner groups are not in free teams: Though some people could use runner groups to mitigate some concerns (not all), this isn't feasible for all due to the limitation in creating more runner groups. Note creating more runner groups is now available in paid teams and I will take an Action to see what we can do here.

High availability: We have docs on how to do this here, but if there are other forms of this concern that this does not address then I would <3 to know

Migration: There is no easy path to move from multi to single label today.

8 replies

stasostrovskyi Apr 4, 2024

@nebuk89 Thank you for a clear communication! This clarifies a lot and "allows" us to do the work for migration to Scale Sets not being "afraid" all the changes in workflows will be for nothing. Now it all sounds more clear to me with runner groups and such and I only have one last question/request to you - please clarify these docs with regards to how it works with scale sets compare to all other usages.

nebuk89 Apr 4, 2024
Maintainer

@stasostrovskyi the docs are great feedback, I will start getting a PR up (probably early next week now) and ensure we do a tidy through for these. <3

jb-2020 Apr 5, 2024

@nebuk89 Actions Usage Metrics public beta looks great, any timeline for this feature coming to Github Enterprise Server?

nebuk89 Apr 23, 2024
Maintainer

@jb-2020 no timeline yet, we are currently doing roadmap planning for this year and working out where we spend our time :) we should have more certainty in a few weeks when the dust settles

Nuru May 9, 2024

@nebuk89 Chiming in on just 2 points for now:

Codifying bits of information - Could you give me some examples of the maximum number of bits you are looking to codify for runners please? The example you have used is repos, which looking at the data tend to have significantly more labels than runners do.

Soft matching- By this I meant that if you just use a label 'linux' this is less clear where this will land, will it land on a 2vCPU machine on a scale set in runner group A or will it land in runner group B. This was a concern for customers I spoke to at the time who were asking for support in reducing the risk that jobs 'didn't go where they expected'.

If users want to target a job to a specific set of runners, and runners support multiple labels, they can give runners arbitrarily named unique labels (e.g. "cat-phoenix") and target them with that label. Jobs ending up on the wrong runner set is trivial to solve in this way.

Here are some bits of information relevant to me in targeting a runner:

which OS is it running
which version of which OS is it running
how much memory is available per runner
how much CPU is available per runner
how much ephemeral disk space is available
are they running on "spot" or "on demand" instances
is the runner set configured to handle long-running jobs (some runner pools rapidly scale up and down to handle variable workloads, while some runner pools maintain static capacity or very different scaling rules to support workloads that can run for hours)
which CPU architecture is it
what special hardware (e.g. GPU) is available
what special performance features (e.g. high performance compute cluster)
does the runner have access to internal resources (eg. databases, email servers) or is it walled off from internal resources
does the runner have access to the public internet
which region is the runner in
which availability zone is the runner in
which VPC is the runner in (and other ways to determine its network isolation and connectivity)
does the runner have access to secrets not available via GitHub (e.g. permission to use a KMS key to decrypt a SOPS file)

I'm sure there are more, but hopefully you get the picture. I want to have labels that correspond to features the runners support, and then target runners by listing the features I need.

If a job can run on multiple runners then it would be great if I could set priorities for the runner sets so the scheduler can automatically set which runner set to use (of the set of runner sets with matching labels, use the one with the highest priority). I think that would be the best solution. Alternately, if, say, I wanted to prevent jobs that do not need a GPU from ending up on runners that have a GPU, I would have to create and use negative labels for features, like "no-GPU", and that could get tedious and difficult to manage in the long run. If I could set priorities, I'd set them based on runner price per minute, so the job runs on the cheapest available matching runner group.

pragmaticivan · 2024-04-16T03:28:29Z

pragmaticivan
Apr 16, 2024

👋🏽 Any chance you all can share some ETA on that "flexibility"? I've been holding a production migration because asking teams to make this change is quite noisy when many of the changes will "require" review due to the separation of responsibilities and processes a team might have.

I am an Enterprise user, and it feels like the group of users you all asked for feedback on doesn't really represent the reality of enterprise-paying users at all. I'm kinda holding the release until this is sorted, but this has been going on for months with no sense of hope for platform-engineering teams providing the Runner Scale Sets as a product.

1 reply

nebuk89 Apr 23, 2024
Maintainer

Not right now, in a few weeks we will have our funding plan for what we do for the rest of this calendar year agreed and I can get some updates out :)

alexgaganashvili · 2024-04-22T20:15:20Z

alexgaganashvili
Apr 22, 2024

@tmehlinger , regarding your comment

Finally, I really don't understand the decision to force multiple ARC deployments, one for each RSS.

could you point me to where in the docs this design decision is mentioned? I'd like to reference it during the evaluation of ARC and RSS.
Thx.

2 replies

tmehlinger May 17, 2024
Author

Totally missed your ping here, didn't mean to ignore you! Relevant docs

The key point is this:

Runner scale sets is a group of homogeneous runners that can be assigned jobs from GitHub Actions.

So if you want different flavors of runners, that's another RSS deployment. However, there's also an AutoscalingRunnerSet CRD that the the controller fulfills which will "fetch or create a runner scale set", but there is no comprehensive documentation that I can find after half an hour of searching this morning.

It's all a bit confusing because Scale Sets themselves are distinct from the RSS controller. See here.

There was one tool for this job in "classic" ARC, and there was a nice Kubernetes-esque separation of responsibilities between components.

evilhamsterman Oct 4, 2024

@tmehlinger I know I'm a few months late here, but I think you misunderstand how the helm charts work, I actually came to discussions to figure this out myself as it seemed weird that I would have to deploy a main controller then a sub-controller for each RSS. That's when I discovered the whole issue with labels. Part of that is on GH @nebuk89 perhaps more documentation to update.

There still is only a single controller, the gh-runner-scale-set-controller. The gh-runner-scale-set chart is just a chart to make it easier to deploy RunnerScaleSets. It appears that each RSS also comes with some associated roles and rolebindings so it's not as easy as just creating a new resource for the RSS CRD. Whether or not that was a good decision is a whole 'nother discussion. But if all those have to be deployed together bundling them as a chart to ease that deployment seems reasonable.

You can see here in the gha-runner-scale-set-controller chart templates https://github.com/actions/actions-runner-controller/tree/master/charts/gha-runner-scale-set-controller/templates the deployment.yaml template which is what actually creates a deployment for the controller. But the gha-runner-scale-set https://github.com/actions/actions-runner-controller/tree/master/charts/gha-runner-scale-set only has templates for the RSS CRD resource, a secret, a service account, some roles and role bindings. There are no deployments, statefulsets, or even just pods being deployed by that chart.

BowlesCR · 2024-05-16T15:52:36Z

BowlesCR
May 16, 2024

Stumbling onto this thread a month later -- Any update on the funding plan and implications for this discussion?

What brought me here was actually looking to switch from our current single-label (on legacy ARC) setup to multiple labels with what is being called "soft matching" and described at length above. We've got an eye on switching to RSS in the future, so we're now at a decision point between implementing multiple labels or implementing RSS.

I, too, would like to see multiple labels supported (again). For many of my jobs I don't care where it runs as long as its Linux on x64... other jobs I want to additionally specify that it must run in prod, or on a "build" runner with particular tools available as opposed to a generic runner without. Echoing the tangentially related ask for some form of priorities/weighting would be great too -- the GPU example is a good one... a generic job might be okay with running there if its convenient, but I'd rather not tie up the resource and block a job that might require it. The previously mentioned static/on-demand or permanent/burst use-cases are also ones I would love to see -- ie, prioritize filling permanent nodes first, using bursted nodes only if necessary allowing them to spin back down as soon as the load spike drops off.

Since ARC is effectively a k8s-exclusive thing, I'm not sure I agree that simplifying was necessary to solve confusion -- k8s has a very robust (and even complex) system of nodeselectors/affinities/taints/tolerances and an understanding of those is a prerequisite to running ARC in any form. While I think implementing the entire k8s scheduling featureset is overkill, its a great source of inspiration.
I also echo the reference to Karpenter's debug output -- it very clearly exposes the decision tree used by the scheduling algorithm so you can figure out why a pod went to a particular node, or why it can't find a node at all.

Also, I'll say I very much appreciate the open discussion tone in this thread in contrast to the "No. Closed" approach in the original one. Thank you for taking the time to explain the rationale and constraints.

2 replies

nebuk89 May 17, 2024
Maintainer

@BowlesCR /others sorry for quietness here, I missed the other couple of tags last here where I was out at an event.

I have no updates though sadly :(, I appologise I don't have anything more currently. The rest of May-June we are very focused on the core avaliability of Actions itself and are making changes there still to improve that. As we go into June we will start our planning for July to August, which is when I will have an opportunity to weigh this against the other things we have to work on :)

I would say, we are trying to also build a roadmap based on the other feedback items for ARC as well currently as I am aware of other challenges as well ( supporting things like Vualt for secrets, logging on failure to create runners, build logs ot standard out, annotation improvents etc).

Thank you all for your patience, I appreciate this makes migrations/hard impossible for many.

On your particular feedback on the machine selection or generic 🤔 out of interest, how do you consider machines as 'locked up value' or how do you draw that line. You call out GPU as being 'too valuable' to tie up, but a large multicore machine for a job only really needing 1 core wouldn't be? Or is it more that its about load balancing all up that matters given the job volume?

BowlesCR May 17, 2024

Thanks for getting back even if its still TBD.

As for machine selection, I'd handwave my exact determination as a configuration detail in favor of configurable weights. To your point there are many variables someone might choose to prioritize on, even on otherwise identical hardware -- ie, this job can run on any runner, but maybe I prefer to avoid running it in prod.
I'll admit that I even have some conflict in my mind as to whether weights should consider idle runners ("Avoid starting a GPU runner, but go ahead an use it if one is already running and idle"). This is probably also a matter of org preference.

Drawing inspiration from Karpenter again -- if there are multiple viable nodepools available after evaluating all the selector/affinity/taint logic, the weight parameter of the candidate nodepools is evaluated as a tiebreaker. By default all nodepools have a weight of 0, so if you don't override that you get same "random" non-deterministic behavior we see in (legacy) ARC today, but of course if higher weights are configured you can then predict which will be selected (or perhaps its still random among a subset of nodepools tied for the highest weight).
Back to the consideration of whether idle vs cold start should be considered viewed through a Karpenter (and generic k8s) lens -- it will always prefer to schedule on an existing node before starting a new one, so that might be a reasonable assumption to make here too.

billimek · 2024-06-14T14:04:13Z

billimek
Jun 14, 2024

The dependabot documentation for running dependabot on self-hosted runners indicates that you must assign the dependabot label to the runners where dependabot should run.

With the intent from GitHub to limit runners to one-and-only-one label, how does this reconcile with dependabot and self-hosted runners? Are we meant to have a pool of self-hosted runners that are dedicated to only dependabot?

0 replies

enricojonas · 2024-08-01T10:31:20Z

enricojonas
Aug 1, 2024

We also looked into switching to the new scale sets. It seemed to be an easy task until we hit this label issue. It would mean a lot of unnecessary work for us to remove the self-hosted label from all workflows plus it indeed indicates that those runners are self-hosted. Keeping this on hold and sticking to the old setup for now.

0 replies

pchernikov-sxm · 2024-08-13T13:51:43Z

pchernikov-sxm
Aug 13, 2024

This is a blocker for us as well. The prospect of updating hundreds of repositories to switch to ARC runner scale sets is a major deterrent. The current multi-label system in runner deployments is essential for our scaling needs, and without it, the migration process is not only cumbersome but also introduces risks and inefficiencies. We'll be looking to go with the older scaling mode until runner scale sets label system improves.

1 reply

crohr Sep 26, 2024

You might be interested in RunsOn (I'm the author), which does support multiple labels and flexible runner assignment without the need for predefined runner scale sets.

allenb-epilog · 2024-10-09T22:13:26Z

allenb-epilog
Oct 9, 2024

The lack of labels, particularly the self-hosted label is proving to be problematic for us as well.

0 replies

maeghan-porter · 2024-10-30T17:59:21Z

maeghan-porter
Oct 30, 2024

I will throw my hat in on this as well, it's causing myself and my team significant issues. We had to update all our workflows, which was a huge pain, and since we use a canary release process for the cluster that hosts these runners, we're now faced with a ton of issues where the listeners on the new cluster coming in to replace the old one CANNOT register with github because the name is exactly the same and until the listeners are deleted from the old cluster, the new one can't register and start serving runners. This makes the canary deployment useless because we are no longer avoiding a situation where we just straight up don't have runners. If the listener names were unique and the runners were determined by labels, this would be a non-issue. It's incredibly frustrating and I don't get it. This was a bad decision and I hope they will reconsider it.

I'm struggling to understand the reasoning given when outlining why this path was chosen to begin with, to satisfy customers that need a single label. Why not then get rid of the default self-hosted label and just allow us to specify what labels we want to use? Then people who want only one label can specify only one label.

7 replies

Flasheh Oct 30, 2024

In the helm chart you can specify a runner group here. You need to create these groups beforehand in your github organisation settings. see docs.

If all clusters in your canary pool use a different group they can register with the same name and thus use the same label for your workflows. Jobs will be picked up at random by either listener with available runners. So you can scale up/down any cluster without downtime.

8BallDuVal Oct 31, 2024

If all clusters in your canary pool use a different group they can register with the same name and thus use the same label for your workflows.

We have gotten this to work, but we have been unable to figure out how to achieve a true zero downtime upgrade to the runners when the runner software gets out of date (and randomly stops picking up workflow jobs because of it). We could deploy to a different runner group with the same label when an update is needed to the runner software or runner image, but when the controller software needs updating that requires all the listeners to come down, along with the runner pods.

Jobs will be picked up at random by either listener with available runners. So you can scale up/down any cluster without downtime.

You can't do zero downtime/canary releases and have two controllers in the same cluster in my experience, unless i am doing something wrong.

I have found that you would likely need a completely separate cluster to have two controllers available at the same time, unless i am doing something wrong. If so please point me to an example so I can learn!!!

The controller is required to come down when there's an update for it, and this requires all listeners/runner pods to come down with it. If we could have two controllers up simultaneously, or bring one up (with the updates applied) as the other (without the updates) is coming down, that would allow us to eliminate downtime during updates and essentially replace the out of date one.

We only have a budget for one cluster unfortunately, so have never been able to eliminate downtime. We just do it during off hours over the weekend and workflow jobs just have to sit and wait during the process. Even when the controller is deployed in different namespaces, I have not been able to have two controllers running (successfully) at the same time in one cluster. One always conflicts with the other's resources and it doesn't work.

Are you saying you need a completely separate k8s cluster to do this? The controller software needs updating periodically as well so it's a pain when we have to take an outage do it.

Flasheh Oct 31, 2024

@8BallDuVal We're running multi-region for DR purposes, so 2 seperate clusters. And haven't noticed any issues with that setup. Nor have we had listeners stop picking up jobs. I was commenting on OP's scenario which also seemed to be using multiple clusters.

I haven't tried deploying multiple controllers in a single cluster so I can't comment o whether that's possible.

Maybe you're running at a much larger scale than us, but are updates that disruptive? I don't think it takes us much more than a minute for all the components to be upgraded and have the listeners registered again. Even on a single cluster I doubt people would notice or care.

maeghan-porter Oct 31, 2024

You can use duplicate listeners as long as they are placed in unique runnergroups.

We're using the same method for a multi-region setup and it's been running fine.

Thanks for the insight, yes that would work, but still wouldn't be ideal for our scenario. We are hosting runners not just for ourselves but for other teams who have their own runnergroups, so we'd effectively have to have two runnergroups per team. Not impossible, just a bit more complexity/management.

maeghan-porter Oct 31, 2024

I have found that you would likely need a completely separate cluster to have two controllers available at the same time, unless i am doing something wrong. If so please point me to an example so I can learn!!!

@8BallDuVal You are correct, however you can have more than one replica and by default nature of a deployment, any update to the deployment will do a rollout deployment where it will update one pod at a time and shouldn't require bringing down all the runners and listeners (though I suppose it would if they themselves were being updated as well). I will admit this is presumptuous on my part since we do canary cluster releases rather than updating the cluster in-situ, but just being familiar with K8s that would be my educated assumption on how it would work.

What we do might help you in your situation. We have clusters that use FluxCD for deploying all the apps in the cluster and when we have any updates to make to the cluster, we spin up a whole other cluster, check that everything looks good (via flux and kuberhealthy synthetic checks) and then spin down the old one. This used to work with the old scaling method that allowed us to have multiple labels but with the new method the listeners will obviously have the exact same name and so they are unable to register to github until the one on the cluster being replaced is deleted. This is not ideal since if there is an issue on the new cluster with the runners we won't catch it until it's too late and the old cluster is in the process of being destroyed.

Anyway, all that said, we still only have one cluster in the long run, just two when doing a deployment. No need to run the controller twice (not that you can anyway). @Flasheh's method with two runnergroups would allow for it to work though, and is an option.

lifeofmoo · 2024-11-07T10:40:08Z

lifeofmoo
Nov 7, 2024

Gah - only 2 weeks ago I finally got over the hurdle that the git cli was finally added to the runner container and now this 🤯

Not being able to even use the self-hosted label is painful

0 replies

Please address the design failures and customers' concerns in Runner Scale Sets #3340

Replies: 11 comments · 22 replies

nebuk89 Apr 2, 2024 Maintainer

Thinking/history

What it would take to change

Mitigation

Concerns raised

nebuk89 Apr 4, 2024 Maintainer

nebuk89 Apr 23, 2024 Maintainer

nebuk89 Apr 23, 2024 Maintainer

tmehlinger May 17, 2024 Author

nebuk89 May 17, 2024 Maintainer

Replies: 11 comments 22 replies

nebuk89
Apr 2, 2024
Maintainer

nebuk89 Apr 4, 2024
Maintainer

nebuk89 Apr 23, 2024
Maintainer

nebuk89 Apr 23, 2024
Maintainer

tmehlinger May 17, 2024
Author

nebuk89 May 17, 2024
Maintainer