GPU fragmentation across nodes and Job/Pod rescheduling strategy request #3948

Antsypc · 2024-12-31T10:09:20Z

Background

In the Volcano Scheduler, the binpack plugin can be configured to maximize the resource usage of individual nodes (i.e., assigning jobs to fully utilize a node before allocating to empty nodes). However, idle GPUs may be scattered across different nodes due to inconsistent job finish time, resulting in insufficient resources for subsequent GPU jobs to be scheduled. For example:

Scenario 1:

A user has three 8-GPU nodes, with the following load distribution:
- node1: 7 GPUs used (4+2+1)
- node2: 6 GPUs used (4+2)
- node3: 7 GPUs used (4+2+1)
If the user wants to submit a job requiring 4 GPUs, it cannot run due to fragmented GPUs.
Scenario 2:

A user has eight 8-GPU nodes and schedules seven deployments, each with nine Pods, where each Pod uses one GPU. This results in instances of each deployment being distributed across different nodes. If the user deletes some deployments, GPU fragmentation occurs.

Expectation

When a Job is Pending, determine whether reallocating running jobs/pods can provide enough resources to execute the pending job. If feasible, restart the jobs or Pods and migrate them to new nodes.

Current Limitations

Job-level Scheduling Limitation:
Simply restarting jobs does not guarantee resource allocation for the pending job. Because the volcano scheduler schedules jobs one by one, multiple jobs cannot be scheduled as a whole.
Descheduler Limitations:
Both volcano descheuler and k8s descheduler lacks strategies for handling such scenarios. In current implementation, descheduler does not schedule replacement of evicted pods but relies on the default scheduler for that. However, when considering which pods to evict for defragmentation, it's better to combine with how to schedule later.

My request is similar to the issue described in GPU碎片资源整理. I would like to know if there are any solutions or plans to address this problem. I am truly eager to collaborate with you to solve it.

The text was updated successfully, but these errors were encountered:

hwdef · 2025-01-01T05:11:41Z

Is the defragmentation method in the document you provided descheduler?

Antsypc · 2025-01-02T01:38:06Z

Is the defragmentation method in the document you provided descheduler?

They didn't mention any implementation details of defragmentation, but the current descheduler cannot solve this problem.

JesseStutler · 2025-01-02T11:35:41Z

cc @Monokaix, actually the overall structure of volcano descheduler is not very different from k8s descheduler. We currently do not have the feature of GPU defragmentation, but I think it is indeed a requirement that can improve GPU clusters resource utilization and can be one of descheduler key features.

JesseStutler · 2025-01-02T11:37:38Z

/kind feature
/area scheduling

JesseStutler · 2025-01-03T06:18:05Z

Hi @Antsypc , I think it's a good feature, I have sent you an email, are you interested in collaborating with community? We can discuss the feature further in detail

Antsypc · 2025-01-03T07:11:42Z

@JesseStutler Sure, I'm glad to collaborate with you.

RehanFazal77 · 2025-02-05T17:48:35Z

Hello @Antsypc , I’m really interested in contributing to this project as I have enough understanding of Kubernetes, which I believe aligns well with the requirements of this project.

Could you share more details on how I can get involved, and if there are any prerequisites or tests to be considered for the upcoming mentorship? I’m eager to collaborate and contribute to solving this issue.

Looking forward to your response!

Monokaix · 2025-02-06T10:39:12Z

This is a LFX project of CNCF, anyone who is interested can apply it here:
https://mentorship.lfx.linuxfoundation.org/project/607246c3-f48b-446c-a7cc-10c0068c553f

Antsypc added the kind/question Categorizes issue related to a new question label Dec 31, 2024

Antsypc mentioned this issue Jan 2, 2025

[Umbrella] Proposed to add a descheduler component to enhance descheduling capabilities #3889

Closed

volcano-sh-bot added kind/feature Categorizes issue or PR as related to a new feature. area/scheduling labels Jan 2, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU fragmentation across nodes and Job/Pod rescheduling strategy request #3948

GPU fragmentation across nodes and Job/Pod rescheduling strategy request #3948

Antsypc commented Dec 31, 2024 •

edited

Loading

hwdef commented Jan 1, 2025

Antsypc commented Jan 2, 2025

JesseStutler commented Jan 2, 2025 •

edited

Loading

JesseStutler commented Jan 2, 2025

JesseStutler commented Jan 3, 2025 •

edited

Loading

Antsypc commented Jan 3, 2025

RehanFazal77 commented Feb 5, 2025

Monokaix commented Feb 6, 2025

GPU fragmentation across nodes and Job/Pod rescheduling strategy request #3948

GPU fragmentation across nodes and Job/Pod rescheduling strategy request #3948

Comments

Antsypc commented Dec 31, 2024 • edited Loading

Background

Expectation

Current Limitations

hwdef commented Jan 1, 2025

Antsypc commented Jan 2, 2025

JesseStutler commented Jan 2, 2025 • edited Loading

JesseStutler commented Jan 2, 2025

JesseStutler commented Jan 3, 2025 • edited Loading

Antsypc commented Jan 3, 2025

RehanFazal77 commented Feb 5, 2025

Monokaix commented Feb 6, 2025

Antsypc commented Dec 31, 2024 •

edited

Loading

JesseStutler commented Jan 2, 2025 •

edited

Loading

JesseStutler commented Jan 3, 2025 •

edited

Loading