This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

Multi-gpu in a single pod #362

Open
wallarug opened this issue Nov 19, 2021 · 2 comments

@wallarug

Hi Team,

I am trying to run a single Kubernetes pod with multiple GPUs, but I can't find any resources on how to do this. Everything I find assumes 1 pod = 1 GPU, which is not what I want. I want to be able to spin up, for example, 2 pods with 4 GPUs each (8 GPUs in total), or other combinations.
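
As a rough illustration of such a layout (a sketch, not taken from the operator's documentation): with one training process per GPU, the global rank of each process can be derived from the pod index and the GPU index inside the pod. The names `Slot` and `plan_ranks` are hypothetical and exist only for this example.

```python
from typing import List, NamedTuple


class Slot(NamedTuple):
    """One training process: which pod it runs in and which GPU it drives."""
    pod_rank: int     # index of the pod (node) in the job
    local_rank: int   # index of the GPU inside that pod
    global_rank: int  # unique rank of the process across the whole job


def plan_ranks(num_pods: int, gpus_per_pod: int) -> List[Slot]:
    """Enumerate one process per GPU for `num_pods` pods of `gpus_per_pod` GPUs."""
    if num_pods < 1 or gpus_per_pod < 1:
        raise ValueError("num_pods and gpus_per_pod must both be >= 1")
    return [
        Slot(pod, local, pod * gpus_per_pod + local)
        for pod in range(num_pods)
        for local in range(gpus_per_pod)
    ]


if __name__ == "__main__":
    slots = plan_ranks(num_pods=2, gpus_per_pod=4)
    print("world size:", len(slots))  # total number of processes in the job
    for slot in slots:
        print(slot)
```

For the 2 x 4 case this gives a world size of 8, with global ranks 0 to 3 in the first pod and 4 to 7 in the second.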

This seems to have been asked before in #219 and #331, but neither thread has a solid answer.

The YAML file I have based my testing on is from this tutorial: https://towardsdatascience.com/pytorch-distributed-on-kubernetes-71ed8b50a7ee

I have changed the Worker section to request 2 GPUs in 1 pod:

    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          volumes:
            - name: pv-k8s-storage
              persistentVolumeClaim:
                claimName: pvc-k8s-storage
          containers:
            - name: pytorch
              command: ["/bin/sh"]
              args: ["-c", "/usr/bin/python3 -m pip install --upgrade pip; pip install tensorboardX pandas scikit-learn; python3 ranzrc.py --epochs 5 --ba$
              image: pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
              resources:
                requests:
                  nvidia.com/gpu: 2
                limits:
                  nvidia.com/gpu: 2

I am seeing behaviour similar to #219: when I spin this up, the test code uses only 1 GPU, even though I told it to use 2.
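
One possible cause is that the container starts a single Python process, and a single process drives only one CUDA device unless the script itself spreads work across devices. A common pattern is to start one process per GPU with PyTorch's launcher module `torch.distributed.run`. The sketch below only builds that command line. It assumes the operator injects `WORLD_SIZE`, `RANK`, `MASTER_ADDR` and `MASTER_PORT` into each pod (treated here as the pod count and pod index), and that the training script reads `LOCAL_RANK` and uses `DistributedDataParallel`. The helper name `build_launch_command` is hypothetical, and the script arguments shown are only the ones visible in the truncated `args` line above.

```python
from typing import List, Mapping


def build_launch_command(
    env: Mapping[str, str],
    gpus_per_pod: int,
    script_args: List[str],
) -> List[str]:
    """Build a per-pod launcher command that starts one process per GPU.

    Assumption: `env` holds the per-pod variables set by the operator, where
    WORLD_SIZE is the number of pods and RANK is this pod's index.
    """
    if gpus_per_pod < 1:
        raise ValueError("gpus_per_pod must be >= 1")
    return [
        "python3", "-m", "torch.distributed.run",
        f"--nnodes={env.get('WORLD_SIZE', '1')}",       # number of pods
        f"--node_rank={env.get('RANK', '0')}",          # index of this pod
        f"--nproc_per_node={gpus_per_pod}",             # one process per GPU
        f"--master_addr={env.get('MASTER_ADDR', '127.0.0.1')}",
        f"--master_port={env.get('MASTER_PORT', '29500')}",
        *script_args,
    ]


if __name__ == "__main__":
    import os

    # Inside a pod, os.environ would carry the injected variables (assumed).
    cmd = build_launch_command(os.environ, 2, ["ranzrc.py", "--epochs", "5"])
    print(" ".join(cmd))
```

The printed command could replace the plain `python3 ranzrc.py ...` call in the container `args`. Whether both GPUs are then used still depends on the script initialising `torch.distributed` and pinning each process to its own device.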

Any assistance or pointing in the right direction on this would be great. Thanks!

@Shuai-Xie

Maybe you can have a look at what I did in issue #354 (comment).

Best wishes.

@gaocegege
Member

This repository will be deprecated soon. Please open an issue at github.com/kubeflow/training-operator instead.
