This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

hived don't aware gpu topology #35

Closed
olderTaoist opened this issue Feb 2, 2021 · 11 comments

@olderTaoist

I run an MPIJob on a P4 node in Kubernetes 1.11, with one GPU per pod.
The P4 GPU topology is as follows:
image

The worker-0 pod sees the GPUs as follows:
image

The worker-1 pod sees the GPUs as follows:
image

The hived config is as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hivedscheduler-config
  namespace: kube-system
data:
  policy.cfg : |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "http://10.220.187.143:30096/v1/extender",
          "filterVerb": "filter",
          "preemptVerb": "preempt",
          "bindVerb": "bind",
          "enableHttps": false,
          "httpTimeout": 5000000000,
          "nodeCacheCapable": true,
          "ignorable": false,
          "managedResources": [
            {
              "name": "hivedscheduler.microsoft.com/pod-scheduling-enable",
              "ignoredByScheduler": true
            }
          ]
        }
      ]
    }
  hivedscheduler.yaml: |
    webServerAddress: ":30096"
    waitingPodSchedulingBlockMilliSec: 50
    physicalCluster:
      skuTypes:
        V100:
          gpu: 1
          cpu: 6
          memory: 6Gi
        P4:
          gpu: 1
          cpu: 1
          memory: 2Gi
      cellTypes:
        V100-PCIE:
          childCellType: V100
          childCellNumber: 4
        P4-CPU:
          childCellType: P4
          childCellNumber: 2
        V100-NODE:
          childCellType: V100-PCIE
          childCellNumber: 2
          isNodeLevel: true
        P4-NODE:
          childCellType: P4-CPU
          childCellNumber: 2
          isNodeLevel: true
        V100-NODE-POOL:
          childCellType: V100-NODE
          childCellNumber: 1
        P4-NODE-POOL:
          childCellType: P4-NODE
          childCellNumber: 2
      physicalCells:
      - cellType: V100-NODE-POOL
        cellChildren:
        - cellAddress: tx-220-189-58.h.chinabank.com.cn
      - cellType: P4-NODE-POOL
        cellChildren:
        - cellAddress: tx-220-189-26.h.chinabank.com.cn
        - cellAddress: tx-220-189-33.h.chinabank.com.cn
    virtualClusters:
      vc1:
        virtualCells:
        - cellType: P4-NODE-POOL.P4-NODE
          cellNumber: 2
      vc2:
        virtualCells:
        - cellType: V100-NODE-POOL.V100-NODE
          cellNumber: 1

The MPIJob YAML is as follows:
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: mpi-hived-cpu
  namespace: kubeflow
spec:
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            hivedscheduler.microsoft.com/pod-scheduling-spec: |-
              virtualCluster: vc1
              priority: 1
              leafCellType: P4
              leafCellNumber: 1
              affinityGroup:
                name: mpi-hived-cpu
                members:
                - podNumber: 1
                  leafCellNumber: 1
                - podNumber: 2
                  leafCellNumber: 1
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - horovodrun -np 2 python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model resnet50 --batch_size 32 --variable_update horovod --num_epochs=1
            image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
            imagePullPolicy: Always
            name: mpi-hived
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
          nodeSelector:
            nvidia.com/accelerator: nvidia-tesla-p4
          schedulerName: hivedscheduler
          tolerations:
          - effect: NoSchedule
            key: dedicated
            value: lambda-training
          - effect: NoSchedule
            key: nvidia.com/gpu
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            hivedscheduler.microsoft.com/pod-scheduling-spec: |-
              virtualCluster: vc1
              priority: 1
              leafCellType: P4
              leafCellNumber: 1
              affinityGroup:
                name: mpi-hived-cpu
                members:
                - podNumber: 2
                  leafCellNumber: 1
        spec:
          containers:
          - image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
            imagePullPolicy: Always
            name: mpi-hived
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
                nvidia.com/gpu: "1"
                hivedscheduler.microsoft.com/pod-scheduling-enable: 1
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
          nodeSelector:
            nvidia.com/accelerator: nvidia-tesla-p4
          schedulerName: hivedscheduler
          serviceAccountName: mpi-operator
          tolerations:
          - effect: NoSchedule
            key: dedicated
            value: lambda-training
          - effect: NoSchedule
            key: nvidia.com/gpu

The pods are not allocated GPU0 and GPU1 on the P4 node.

@yqwang-ms
Member

Note that Topology-Aware Intra-VC Scheduling is best-effort only; see https://github.com/microsoft/hivedscheduler/blob/master/example/feature/README.md#topology-aware-intra-vc-scheduling

Other GPUs may already be allocated to other pods.

@fanyangCS

fanyangCS commented Feb 2, 2021

We are working on an update that adds an option to force a job to honor the GPU topology (if the job chooses to); see #33.

@olderTaoist
Author

> Note that Topology-Aware Intra-VC Scheduling is best-effort only; see https://github.com/microsoft/hivedscheduler/blob/master/example/feature/README.md#topology-aware-intra-vc-scheduling
>
> Other GPUs may already be allocated to other pods.

My node is running only two pods:
image

@yqwang-ms
Member

When you submitted these 2 pods, were any earlier pods still running on other GPUs (they may have completed by now)?

BTW, could you kill all the pods on the machine and try again, submitting just the 2 pods?

@olderTaoist
Author

current

Sorry, my mistake! I didn't understand how hived implements GPU awareness. Hived maps the cell index on the physical node to the GPU index, so I just need to add the following environment variable to the pod template:
image

@yqwang-ms
Member

The env is added by the PAI rest server rather than by hived; see https://github.com/microsoft/pai/blob/b8fa58782addfc835ba813ad4dc261fff400ee4a/src/rest-server/src/models/v2/job/k8s.js#L653

Hived only generates the annotations.

BTW, NVIDIA_VISIBLE_DEVICES should generally match the GPU index shown by nvidia-smi.

@olderTaoist
Author

> The env is added by the PAI rest server rather than by hived; see https://github.com/microsoft/pai/blob/b8fa58782addfc835ba813ad4dc261fff400ee4a/src/rest-server/src/models/v2/job/k8s.js#L653
>
> Hived only generates the annotations.
>
> BTW, NVIDIA_VISIBLE_DEVICES should generally match the GPU index shown by nvidia-smi.

I don't use PAI, so I need to add the NVIDIA_VISIBLE_DEVICES env to the pod templates myself. And yes, the value of hivedscheduler.microsoft.com/pod-leaf-cell-isolation in the annotations corresponds to the GPU index shown by nvidia-smi.
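
For clusters that do not run PAI, one way to do this is the Kubernetes downward API, which can expose a pod annotation as an environment variable. The fragment below is a minimal sketch, not a confirmed setup from this thread. It assumes hived has written the hivedscheduler.microsoft.com/pod-leaf-cell-isolation annotation before the container starts, that the cluster's Kubernetes version supports fieldRef on a single annotation key, and that the NVIDIA container runtime reads NVIDIA_VISIBLE_DEVICES. The container name and image are copied from the MPIJob above.

```yaml
# Hypothetical fragment of the Worker pod template from the MPIJob above.
# Assumption: the annotation value is a GPU index list such as "0" or "0,1".
spec:
  containers:
  - name: mpi-hived
    image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
    env:
    # Downward API: copy the annotation value set by hived into the
    # environment variable read by the NVIDIA container runtime.
    - name: NVIDIA_VISIBLE_DEVICES
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```

With this in place, comparing the annotation value on the running pod with the GPU list reported by nvidia-smi inside the container is a quick way to check that the two agree.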

@yqwang-ms
Member

@fanyangCS maybe we should also add a hived user doc for users who do not use PAI, e.g. telling them to set the
image

@fanyangCS

> @fanyangCS maybe we should also add a hived user doc for users who do not use PAI, e.g. telling them to set the
> image

Sure. Can you update the document?

fanyangCS reopened this on Feb 5, 2021
@fanyangCS

> > The env is added by the PAI rest server rather than by hived; see https://github.com/microsoft/pai/blob/b8fa58782addfc835ba813ad4dc261fff400ee4a/src/rest-server/src/models/v2/job/k8s.js#L653
> > Hived only generates the annotations.
> > BTW, NVIDIA_VISIBLE_DEVICES should generally match the GPU index shown by nvidia-smi.
>
> I don't use PAI, so I need to add the NVIDIA_VISIBLE_DEVICES env to the pod templates myself. And yes, the value of hivedscheduler.microsoft.com/pod-leaf-cell-isolation in the annotations corresponds to the GPU index shown by nvidia-smi.

May I know which solution you use?

@olderTaoist
Author

My mistake.
