reclaim action uses HasPendingTasks instead of JobStarving, which may cause two jobs to reclaim from each other and lead to deadlock #3869

Open
JesseStutler opened this issue Dec 11, 2024 · 1 comment · May be fixed by #3951
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@JesseStutler
Member

Description

Consider this scenario: two jobs each have 5 replicas and minAvailable set to 1, and the cluster cannot run all of these replicas at the same time. At least one pod is therefore always pending, which triggers reclaim, and the two jobs then keep reclaiming from each other, leading to a deadlock.

Steps to reproduce the issue

  1. Create a new queue:
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: a
spec:
  reclaimable: true
  weight: 1
  2. Create a new job in the default queue with 5 replicas and minAvailable 1, sized so that not all pods can be scheduled on the cluster simultaneously:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: testjoba1
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 5
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.27-alpine3.19-slim
              imagePullPolicy: Never
              name: nginx
              resources:
                requests:
                  cpu: 7
          restartPolicy: OnFailure    

  3. Create a new job in queue a with 5 replicas and minAvailable 1:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: testjobb1
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: a
  tasks:
    - replicas: 5
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.27-alpine3.19-slim
              imagePullPolicy: Never
              name: nginx
              resources:
                requests:
                  cpu: 7
          restartPolicy: OnFailure   

Describe the results you received and expected

Deploying job-a first occupies the cluster's resources. Deploying job-b then triggers reclaim, which evicts job-a's pods. Once a job-a pod becomes pending, reclaim evicts job-b's pods on behalf of job-a, resulting in a deadlock.

We could change reclaim to use JobStarving: if a job already has enough running pods (running >= minAvailable), the reclaim action is not triggered for it.

What version of Volcano are you using?

v1.10

Any other relevant information

No response

@JesseStutler JesseStutler added the kind/bug Categorizes issue or PR as related to a bug. label Dec 11, 2024
@JesseStutler
Member Author

@Monokaix We may need to fix this issue in v2.0 and add it to the v2.0 milestone.
