More robust runner autoscaling #5840

ZainRizvi · 2024-10-29T16:25:35Z

Goal: Reduce job queuing by making the self-hosted runner fleet's autoscaling more reliable in the face of failed or dropped scale-up requests.

The approach

Create a new scheduled lambda function that:

Runs every 15 minutes
Queries ClickHouse for jobs that have been queued for more than 30 minutes
Checks the runner types for those jobs to see which ones are self-hosted
Invokes the scale-up function to request enough runners of each type to handle the outstanding jobs
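
The steps above could be sketched roughly as follows. This is only an illustration of the proposed flow, not the actual implementation: `QueuedJob`, `runners_needed`, `handle_schedule`, and the two injected callables (`fetch_queued_jobs` standing in for the ClickHouse query, `request_scale_up` standing in for the call to the existing scale-up function) are all hypothetical names, and the runner labels in the usage example are placeholders. The 15-minute schedule itself would live in the trigger configuration rather than in this code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set

# "Queued for over half an hour", expressed in seconds.
QUEUE_THRESHOLD_SECONDS = 30 * 60


@dataclass(frozen=True)
class QueuedJob:
    """Hypothetical shape of one row returned by the queued-jobs query."""
    job_id: int
    runner_label: str    # runner type the job is waiting for
    queued_seconds: int  # how long the job has been queued


def runners_needed(
    jobs: Iterable[QueuedJob],
    self_hosted_labels: Set[str],
    threshold_seconds: int = QUEUE_THRESHOLD_SECONDS,
) -> Dict[str, int]:
    """Count long-queued jobs per self-hosted runner type."""
    counts = Counter()
    for job in jobs:
        is_stuck = job.queued_seconds > threshold_seconds
        if is_stuck and job.runner_label in self_hosted_labels:
            counts[job.runner_label] += 1
    return dict(counts)


def handle_schedule(
    fetch_queued_jobs: Callable[[], Iterable[QueuedJob]],
    self_hosted_labels: Set[str],
    request_scale_up: Callable[[str, int], None],
) -> Dict[str, int]:
    """One scheduled run: query, filter, then request runners per type."""
    needed = runners_needed(fetch_queued_jobs(), self_hosted_labels)
    for label, count in sorted(needed.items()):
        request_scale_up(label, count)
    return needed


if __name__ == "__main__":
    # Placeholder data; a real run would query ClickHouse instead.
    sample = [
        QueuedJob(1, "example.self-hosted.large", 3600),
        QueuedJob(2, "example.github-hosted", 4000),
    ]
    handle_schedule(
        lambda: sample,
        {"example.self-hosted.large"},
        lambda label, count: print(f"scale up {count} x {label}"),
    )
```

Passing the query and the scale-up call in as callables keeps the counting logic testable without AWS or ClickHouse access. Requesting one runner per stuck job is an assumption; the real function might cap or batch these requests.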

huydhn · 2025-01-14T21:34:45Z

After chatting with @ZainRizvi: this is probably not needed anymore given the ephemeral queuing fixes from the end of last half. We will observe a bit longer and then close this.

ZainRizvi added this to PyTorch OSS Dev Infra Oct 29, 2024

ZainRizvi converted this from a draft issue Oct 29, 2024
