Support Autoscalers actively querying for queued jobs and provisioning instances accordingly #5798

Closed
ZainRizvi opened this issue Oct 22, 2024 · 1 comment

ZainRizvi commented Oct 22, 2024

The goal is to reduce queueing by ensuring we provision instances even when their scale-up request was somehow dropped.

@ZainRizvi ZainRizvi converted this from a draft issue Oct 22, 2024
@ZainRizvi ZainRizvi self-assigned this Oct 22, 2024
@ZainRizvi ZainRizvi moved this from Prioritized to In Progress in PyTorch OSS Dev Infra Dec 2, 2024
ZainRizvi added a commit that referenced this issue Dec 10, 2024
… scale up code to enable larger scale ups (#6015)

Part of #5798

### Changes included
- Fix a significant bug in how we provisioned instances for ephemeral
runners. In the old code, the fewer ephemeral runners we had
available, the _less_ likely we were to overprovision them, which is
the opposite of what we actually want. Changed that logic to
explicitly give ephemeral runners a 50% chance of overprovisioning
the incoming request by two instances, which is what I think [this
PR](#5501) was intending to do
(we can tweak these numbers as desired). This fix might resolve most of
our ephemeral runner woes by itself.
- Extend the allRunnersBusy function so it can handle scale-up
requests of more than a single instance (which will be leveraged in a
future PR)
- Rename allRunnersBusy to something more accurate
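
For illustration only, the overprovisioning rule described in the first bullet could be sketched roughly as below. This is not the code from the commit: `instancesToProvision`, its parameters, the two constants, and the injectable `random` source are all hypothetical names chosen for this sketch, and returning 0 for an empty request is an assumption.

```typescript
// Hypothetical sketch of the rule in the commit message; not the actual autoscaler code.
type RandomSource = () => number; // returns a value in [0, 1), like Math.random

// Both numbers come from the commit message, which notes they can be tweaked.
const OVERPROVISION_PROBABILITY = 0.5; // "50% chance of overprovisioning"
const OVERPROVISION_EXTRA = 2; // "by two instances"

function instancesToProvision(
  requested: number,
  isEphemeral: boolean,
  random: RandomSource = Math.random,
): number {
  if (!Number.isInteger(requested) || requested < 0) {
    throw new RangeError(`requested must be a non-negative integer, got ${requested}`);
  }
  // Assumption: only ephemeral runners are overprovisioned, and an empty
  // request never triggers provisioning.
  if (requested === 0 || !isEphemeral) {
    return requested;
  }
  // With probability OVERPROVISION_PROBABILITY, add the extra instances.
  return random() < OVERPROVISION_PROBABILITY
    ? requested + OVERPROVISION_EXTRA
    : requested;
}
```

Passing the random source as a parameter keeps the decision deterministic in tests, for example `instancesToProvision(1, true, () => 0.1)` takes the overprovisioning branch. The key property is that the chance of overprovisioning no longer depends on how many ephemeral runners are currently available.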
@ZainRizvi ZainRizvi moved this from In Progress to Done in PyTorch OSS Dev Infra Jan 21, 2025
@ZainRizvi
Contributor Author

Fixing a different bug in the autoscalers resolved the queuing for ephemeral jobs.
