hived don't aware gpu topology #35

olderTaoist · 2021-02-02T10:15:58Z

I run an MPIJob on a P4 node in Kubernetes 1.11, with one GPU per pod.
The P4 GPU topology is as follows:

The worker-0 pod sees the GPUs as follows:

The worker-1 pod sees the GPUs as follows:

The hived config is as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hivedscheduler-config
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "http://10.220.187.143:30096/v1/extender",
          "filterVerb": "filter",
          "preemptVerb": "preempt",
          "bindVerb": "bind",
          "enableHttps": false,
          "httpTimeout": 5000000000,
          "nodeCacheCapable": true,
          "ignorable": false,
          "managedResources": [
            {
              "name": "hivedscheduler.microsoft.com/pod-scheduling-enable",
              "ignoredByScheduler": true
            }
          ]
        }
      ]
    }
  hivedscheduler.yaml: |
    webServerAddress: ":30096"
    waitingPodSchedulingBlockMilliSec: 50
    physicalCluster:
      skuTypes:
        V100:
          gpu: 1
          cpu: 6
          memory: 6Gi
        P4:
          gpu: 1
          cpu: 1
          memory: 2Gi
      cellTypes:
        V100-PCIE:
          childCellType: V100
          childCellNumber: 4
        P4-CPU:
          childCellType: P4
          childCellNumber: 2
        V100-NODE:
          childCellType: V100-PCIE
          childCellNumber: 2
          isNodeLevel: true
        P4-NODE:
          childCellType: P4-CPU
          childCellNumber: 2
          isNodeLevel: true
        V100-NODE-POOL:
          childCellType: V100-NODE
          childCellNumber: 1
        P4-NODE-POOL:
          childCellType: P4-NODE
          childCellNumber: 2
      physicalCells:
      - cellType: V100-NODE-POOL
        cellChildren:
        - cellAddress: tx-220-189-58.h.chinabank.com.cn
      - cellType: P4-NODE-POOL
        cellChildren:
        - cellAddress: tx-220-189-26.h.chinabank.com.cn
        - cellAddress: tx-220-189-33.h.chinabank.com.cn

    virtualClusters:
      vc1:
        virtualCells:
        - cellType: P4-NODE-POOL.P4-NODE
          cellNumber: 2
      vc2:
        virtualCells:
        - cellType: V100-NODE-POOL.V100-NODE
          cellNumber: 1

The MPIJob YAML is as follows:
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: mpi-hived-cpu
  namespace: kubeflow
spec:
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            hivedscheduler.microsoft.com/pod-scheduling-spec: |-
              virtualCluster: vc1
              priority: 1
              leafCellType: P4
              leafCellNumber: 1
              affinityGroup:
                name: mpi-hived-cpu
                members:
                - podNumber: 1
                  leafCellNumber: 1
                - podNumber: 2
                  leafCellNumber: 1
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - horovodrun -np 2 python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model resnet50 --batch_size 32 --variable_update horovod --num_epochs=1
            image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
            imagePullPolicy: Always
            name: mpi-hived
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
          nodeSelector:
            nvidia.com/accelerator: nvidia-tesla-p4
          schedulerName: hivedscheduler
          tolerations:
          - effect: NoSchedule
            key: dedicated
            value: lambda-training
          - effect: NoSchedule
            key: nvidia.com/gpu
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            hivedscheduler.microsoft.com/pod-scheduling-spec: |-
              virtualCluster: vc1
              priority: 1
              leafCellType: P4
              leafCellNumber: 1
              affinityGroup:
                name: mpi-hived-cpu
                members:
                - podNumber: 2
                  leafCellNumber: 1
        spec:
          containers:
          - image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
            imagePullPolicy: Always
            name: mpi-hived
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
                nvidia.com/gpu: "1"
                hivedscheduler.microsoft.com/pod-scheduling-enable: 1
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
          nodeSelector:
            nvidia.com/accelerator: nvidia-tesla-p4
          schedulerName: hivedscheduler
          serviceAccountName: mpi-operator
          tolerations:
          - effect: NoSchedule
            key: dedicated
            value: lambda-training
          - effect: NoSchedule
            key: nvidia.com/gpu

The pods cannot be allocated GPU0 and GPU1 on the P4 node.

yqwang-ms · 2021-02-02T11:06:11Z

Note that Topology-Aware Intra-VC Scheduling is only best effort; see https://github.com/microsoft/hivedscheduler/blob/master/example/feature/README.md#topology-aware-intra-vc-scheduling

The other GPUs may already be allocated to other pods.

fanyangCS · 2021-02-02T11:15:59Z

We are working on an update that provides an option to force a job to honor the GPU topology (if the job chooses to).
#33

olderTaoist · 2021-02-02T11:48:55Z

> Note that Topology-Aware Intra-VC Scheduling is only best effort; see https://github.com/microsoft/hivedscheduler/blob/master/example/feature/README.md#topology-aware-intra-vc-scheduling
>
> The other GPUs may already be allocated to other pods.

My node is running only two pods:

yqwang-ms · 2021-02-02T12:57:11Z

When you submitted these 2 pods, were any earlier pods running on the other GPUs (they may have completed by now)?

BTW, could you kill all the pods on the machine and try again, submitting just the 2 pods?

olderTaoist · 2021-02-05T09:29:32Z

current

Sorry, my mistake! I didn't understand how hived's GPU awareness is implemented. Hived maps the cell numbers of a physical node to GPU indices, so I just needed to add the following environment variable to the pod template:

yqwang-ms · 2021-02-05T09:41:30Z

The env is added by the PAI rest server, not by hived; see https://github.com/microsoft/pai/blob/b8fa58782addfc835ba813ad4dc261fff400ee4a/src/rest-server/src/models/v2/job/k8s.js#L653

Hived only generates the annotations.

BTW, NVIDIA_VISIBLE_DEVICES should generally match the GPU index shown by nvidia-smi.

olderTaoist · 2021-02-05T10:04:44Z

> The env is added by the PAI rest server, not by hived; see https://github.com/microsoft/pai/blob/b8fa58782addfc835ba813ad4dc261fff400ee4a/src/rest-server/src/models/v2/job/k8s.js#L653
>
> Hived only generates the annotations.
>
> BTW, NVIDIA_VISIBLE_DEVICES should generally match the GPU index shown by nvidia-smi.

I don't use PAI, so I need to add the NVIDIA_VISIBLE_DEVICES env to the pod templates myself. Yes, the value of the hivedscheduler.microsoft.com/pod-leaf-cell-isolation annotation matches the GPU index shown by nvidia-smi.
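
For readers who also run hived without PAI, the approach described here can be sketched with the Kubernetes Downward API, which can expose a pod annotation as an environment variable. This is only a minimal sketch, not the exact snippet used in this thread: the container name and image are placeholders, and it assumes the NVIDIA container runtime on the node honors NVIDIA_VISIBLE_DEVICES.

```yaml
# Hypothetical pod template fragment for a pod scheduled by hived.
spec:
  schedulerName: hivedscheduler
  containers:
  - name: worker              # placeholder name
    image: example/training   # placeholder image
    env:
    # Pass the GPU indices that hived wrote into the pod annotation
    # to the NVIDIA container runtime.
    - name: NVIDIA_VISIBLE_DEVICES
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```

With a fragment like this in each worker template, the container should only see the GPUs that hived selected for it, which is what PAI's rest server otherwise sets up for its own jobs.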

yqwang-ms · 2021-02-05T10:26:08Z

@fanyangCS maybe we should also add a hived user doc for users who do not use PAI, such as telling them to set the

fanyangCS · 2021-02-05T12:22:22Z

> @fanyangCS maybe we should also add a hived user doc for users who do not use PAI, such as telling them to set the

Sure. Can you update the document?

fanyangCS · 2021-02-05T12:23:03Z

> > The env is added by the PAI rest server, not by hived; see https://github.com/microsoft/pai/blob/b8fa58782addfc835ba813ad4dc261fff400ee4a/src/rest-server/src/models/v2/job/k8s.js#L653
> > Hived only generates the annotations.
> > BTW, NVIDIA_VISIBLE_DEVICES should generally match the GPU index shown by nvidia-smi.
>
> I don't use PAI, so I need to add the NVIDIA_VISIBLE_DEVICES env to the pod templates myself. Yes, the value of the hivedscheduler.microsoft.com/pod-leaf-cell-isolation annotation matches the GPU index shown by nvidia-smi.

May I know which solution you use?

olderTaoist · 2021-05-21T01:18:11Z

It was my mistake.

fanyangCS assigned yqwang-ms Feb 2, 2021

yqwang-ms closed this as completed Feb 5, 2021

fanyangCS reopened this Feb 5, 2021

olderTaoist closed this as completed May 21, 2021
