feat: add NodeResourcesFitPlus and ScarceResourceAvoidance plugin #2302

LY-today · 2024-12-20T09:11:25Z

What is your proposal:
The native Kubernetes NodeResourcesFit plugin can apply only one strategy to all resources, such as MostRequestedPriority or LeastRequestedPriority. In industrial practice, this design does not fit some scenarios. For example, in AI scenarios, workloads that request GPUs should fill an entire GPU machine first to prevent GPU fragmentation, while workloads that request only CPU and memory should be spread across non-GPU machines first. Otherwise, CPU and memory on GPU machines are consumed excessively, and tasks that actually request GPUs end up Pending because of insufficient non-GPU resources. It is therefore hoped that both strategies can be extended to address this business need.

Why is this needed:
See the proposal above.

Is there a suggested solution, if so, please add it:

plugin-one (NodeResourcesFitPlus)

config:

resources: 
  nvidia.com/gpu:
    type: MostAllocated
    weight: 2
  cpu:
    type: LeastAllocated
    weight: 1
  memory:
    type: LeastAllocated
    weight: 1

config description:

node score:

finalScoreNode = [(weight1 * resource1) + (weight2 * resource2) + … + (weightN * resourceN)] / (weight1 + weight2 + … + weightN)
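
The weighted average above can be sketched in Go as follows. This is a minimal illustration, not the code in this PR: the types `resourcePolicy` and `usage` and the functions `resourceScore` and `finalNodeScore` are hypothetical names. It assumes that each `resourceN` term is a per-resource score in the range 0 to 100, computed with the usual MostAllocated / LeastAllocated formulas, and that resources missing from the node are left out of both the numerator and the denominator.

```go
package main

// maxNodeScore mirrors framework.MaxNodeScore (100) in the Kubernetes scheduler framework.
const maxNodeScore int64 = 100

type strategy string

const (
	mostAllocated  strategy = "MostAllocated"
	leastAllocated strategy = "LeastAllocated"
)

// resourcePolicy is a hypothetical in-memory form of one entry under `resources:`.
type resourcePolicy struct {
	Type   strategy
	Weight int64
}

// usage is hypothetical: Requested is assumed to already include the incoming pod.
type usage struct {
	Requested   int64
	Allocatable int64
}

// resourceScore returns a 0..100 score for a single resource.
func resourceScore(t strategy, u usage) int64 {
	if u.Allocatable <= 0 || u.Requested > u.Allocatable {
		return 0
	}
	switch t {
	case mostAllocated:
		return u.Requested * maxNodeScore / u.Allocatable
	case leastAllocated:
		return (u.Allocatable - u.Requested) * maxNodeScore / u.Allocatable
	default:
		return 0
	}
}

// finalNodeScore computes sum(weight_i * score_i) / sum(weight_i).
func finalNodeScore(policies map[string]resourcePolicy, usages map[string]usage) int64 {
	var weighted, weightSum int64
	for name, p := range policies {
		u, ok := usages[name]
		if !ok {
			continue // assumption: skip resources the node does not report
		}
		weighted += p.Weight * resourceScore(p.Type, u)
		weightSum += p.Weight
	}
	if weightSum == 0 {
		return 0
	}
	return weighted / weightSum
}
```

With the config shown earlier and a node at 4/8 GPUs, 16/64 CPUs, and 64/256 GiB memory, the per-resource scores are 50, 75, and 75, so the sketch yields (2*50 + 75 + 75) / 4 = 62 after integer division.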

plugin-two (ScarceResourceAvoidance)

config:

resources: 
- nvidia.com/gpu

config description:

node score:

finalScoreNode = (allocatablesResourcesNum - requestsResourcesNum) * framework.MaxNodeScore / allocatablesResourcesNum
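
The formula above translates directly into integer arithmetic. The sketch below is an illustration only: the function name `scarceAvoidanceScore` is hypothetical, and the proposal does not define `allocatablesResourcesNum` or `requestsResourcesNum`, so they are treated here simply as non-negative integer counts. The guard clauses for a zero divisor and for requests exceeding allocatable are added assumptions.

```go
package main

// maxNodeScore mirrors framework.MaxNodeScore (100) in the Kubernetes scheduler framework.
const maxNodeScore int64 = 100

// scarceAvoidanceScore evaluates
// (allocatableNum - requestsNum) * maxNodeScore / allocatableNum
// using integer division, as the formula in the proposal suggests.
func scarceAvoidanceScore(allocatableNum, requestsNum int64) int64 {
	if allocatableNum <= 0 {
		return 0 // assumption: avoid division by zero
	}
	if requestsNum < 0 || requestsNum > allocatableNum {
		return 0 // assumption: out-of-range input scores lowest
	}
	return (allocatableNum - requestsNum) * maxNodeScore / allocatableNum
}
```

For example, `scarceAvoidanceScore(4, 2)` evaluates to 50, and `scarceAvoidanceScore(3, 2)` evaluates to 33 because of integer division.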

codecov · 2024-12-20T09:18:06Z

Codecov Report

Attention: Patch coverage is 61.01083% with 108 lines in your changes missing coverage. Please review.

Project coverage is 66.06%. Comparing base (a62dd49) to head (a3f12fa).
Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
...oderesourcefitplus/node_resource_fit_plus_utils.go	56.34%	44 Missing and 11 partials ⚠️
...arceresourceavoidance/scarce_resource_avoidance.go	65.90%	22 Missing and 8 partials ⚠️
...ins/noderesourcefitplus/node_resources_fit_plus.go	63.49%	17 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2302      +/-   ##
==========================================
+ Coverage   66.02%   66.06%   +0.04%     
==========================================
  Files         458      461       +3     
  Lines       53868    54459     +591     
==========================================
+ Hits        35564    35977     +413     
- Misses      15749    15885     +136     
- Partials     2555     2597      +42

Flag	Coverage Δ
unittests	`66.06% <61.01%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown.


pkg/scheduler/plugins/scarceresourceavoidance/scarce_resource_avoidance.go

…lugis Signed-off-by: LY-today <[email protected]>

songchh · 2024-12-23T07:51:36Z

This would be beneficial. NodeResourcesFit is indeed very limited in some scenarios. Especially in clusters that are not traditional CPU and memory clusters, expensive resources like GPUs do require special treatment.