reclaim action uses `HasPendingTasks` instead of `JobStarving`, which may cause the two jobs to reclaim with each other and lead to deadlock. #3869

JesseStutler · 2024-12-11T07:38:51Z

Description

Consider this scenario: two jobs each have 5 replicas and minAvailable set to 1. The cluster cannot run all 5 replicas at the same time, so at least one pod is always pending. That pending pod triggers reclaim, and the two jobs then keep reclaiming from each other, leading to deadlock.

Steps to reproduce the issue

1. Create a new queue:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: a
spec:
  reclaimable: true
  weight: 1

2. Create a new job in the default queue with 5 replicas and minAvailable 1. The cluster must not be able to schedule all of its pods simultaneously:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: testjoba1
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 5
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.27-alpine3.19-slim
              imagePullPolicy: Never
              name: nginx
              resources:
                requests:
                  cpu: 7
          restartPolicy: OnFailure

3. Create a new job in queue a with 5 replicas and minAvailable 1:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: testjobb1
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: a
  tasks:
    - replicas: 5
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.27-alpine3.19-slim
              imagePullPolicy: Never
              name: nginx
              resources:
                requests:
                  cpu: 7
          restartPolicy: OnFailure

Describe the results you received and expected

Deploy job-a first so that it occupies the cluster's resources, then deploy job-b. This triggers reclaim, which evicts job-a's pods. Once a job-a pod becomes pending, reclaim evicts job-b's pods on behalf of job-a, and the two jobs keep evicting each other, resulting in a deadlock.

We could change reclaim to use `JobStarving`: if a job already has enough pods running (running >= minAvailable), we don't trigger the reclaim action for it.
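To make the difference between the two checks concrete, here is a minimal Go sketch. It is not Volcano's code: `jobState`, `hasPendingTasks`, and `starvingByMinAvailable` are hypothetical stand-ins written for this issue, and the running/pending counts are made-up numbers, since the exact split depends on cluster size. It only illustrates why a pending-based check lets both jobs act as reclaimers while a minAvailable-based check lets neither.

```go
package main

import "fmt"

// jobState is a hypothetical, simplified view of a job; it is not
// Volcano's api.JobInfo.
type jobState struct {
	name         string
	minAvailable int
	running      int
	pending      int
}

// hasPendingTasks models the behaviour described in this issue:
// any job with at least one pending pod may trigger reclaim.
func hasPendingTasks(j jobState) bool {
	return j.pending > 0
}

// starvingByMinAvailable models the proposed rule (an assumption about
// its shape): a job may trigger reclaim only while it has fewer running
// pods than minAvailable.
func starvingByMinAvailable(j jobState) bool {
	return j.running < j.minAvailable
}

func main() {
	// Illustrative counts: each job wants 5 pods, minAvailable is 1,
	// and the cluster cannot hold all pods at once.
	jobs := []jobState{
		{name: "testjoba1", minAvailable: 1, running: 3, pending: 2},
		{name: "testjobb1", minAvailable: 1, running: 2, pending: 3},
	}
	for _, j := range jobs {
		fmt.Printf("%s: pending-based=%t, minAvailable-based=%t\n",
			j.name, hasPendingTasks(j), starvingByMinAvailable(j))
	}
}
```

With the pending-based check both jobs qualify as reclaimers in every scheduling cycle, so each can evict the other's pods. With the minAvailable-based check neither qualifies once it has one pod running, which is the behaviour proposed above.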

What version of Volcano are you using?

v1.10

Any other relevant information

No response

JesseStutler · 2024-12-11T07:41:52Z

@Monokaix We may need to fix this issue in v2.0 and add it to v2.0 milestone.

JesseStutler added the kind/bug label Dec 11, 2024

JesseStutler mentioned this issue Dec 11, 2024

How to configure resource reclaim between queues? #3861

Open

hwdef mentioned this issue Dec 11, 2024

reclaim: When choosing a preemptor, choose a starving one rather than one with pending tasks. #3870

Closed

JesseStutler linked a pull request Jan 2, 2025 that will close this issue

reclaim: When choosing a preemptor, choose a starving one rather than one with pending tasks. #3951

Open
