address
: Address to listen on for web interface and telemetry.

kubeconfig
: Absolute path to the kubeconfig file. If not set, the in-cluster configuration from the Pod's bound ServiceAccount is used.
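For example, a local run of the exporter might look like the sketch below. The flag syntax is an assumption based on the flag names above, and :9445 is the default port shown in the exporter's log output further down:

$ ./k8s-gpu-exporter --address=:9445 --kubeconfig=$HOME/.kube/config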
Tips:
By default, after the nvidia-docker container starts, there is a symbolic link /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 inside the container, and it points to a correct version of libnvidia-ml.so. In that case you can jump straight to step 4.
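To confirm the container already has a usable library, a quick check inside the container (the paths assume the standard nvidia runtime layout; adjust them if yours differs):

$ ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
$ ldconfig -p | grep libnvidia-ml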
If the container does not have a correct version of libnvidia-ml.so, you can fix it with the following steps.
1. Find the libnvidia-ml.so on the host where you want to run k8s-gpu-exporter: find /usr/ -name libnvidia-ml.so
2. Copy that libnvidia-ml.so into the project-dir/lib directory.
3. Add the line COPY lib/libnvidia-ml.so /usr/lib/x86_64-linux-gnu/libnvidia-ml.so to docker/dockerfile, for example:
   ...
   COPY --from=build-env /build/k8s-gpu-exporter /app/k8s-gpu-exporter
   COPY lib/libnvidia-ml.so /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
   ...
4. Run the Makefile: VERSION={YOUR_VERSION} make docker (a condensed sketch of the whole sequence follows this list).
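Putting the steps together, a condensed sketch. The source path of libnvidia-ml.so is only an example; use whatever the find command returns on your host:

$ find /usr/ -name libnvidia-ml.so
$ cp /usr/lib/x86_64-linux-gnu/libnvidia-ml.so ./lib/
# edit docker/dockerfile to add the COPY line shown above, then build:
$ VERSION={YOUR_VERSION} make docker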
If you already have Arena, use it to submit a training task.
# Preparation
# Label the GPU Node
$ kubectl label node {YOUR_NODE} k8s-node/nvidia_count={GPU_NUM}
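# (Optional) verify the label; -L prints the label value as an extra column
$ kubectl get nodes -L k8s-node/nvidia_count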
# First
$ kubectl apply -f k8s-gpu-exporter.yaml
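# (Optional) confirm the exporter pods are running; the pod name prefix is assumed
# to match the manifest in k8s-gpu-exporter.yaml
$ kubectl get pods -o wide | grep k8s-gpu-exporter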
# Second
# Submit a deeplearn job
$ arena submit tf --name=style-transfer \
--gpus=1 \
--workers=1 \
--workerImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/neural-style:gpu \
--workingDir=/neural-style \
--ps=1 \
--psImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/style-transfer:ps \
"python neural_style.py --styles /neural-style/examples/1-style.jpg --iterations 1000"
# Third
$ curl {HOST_IP}:{PORT}/metrics
... (output omitted) ...
# HELP nvidia_gpu_used_memory Graphics used memory
# TYPE nvidia_gpu_used_memory gauge
nvidia_gpu_used_memory{gpu_node="dev-ms-7c22",gpu_pod_name="",minor_number="0",name="GeForce GTX 1660 SUPER",namepace_name="",uuid="GPU-a1460327-d919-1478-a68f-ef4cbb8515ac"} 3.0769152e+08
nvidia_gpu_used_memory{gpu_node="dev-ms-7c22",gpu_pod_name="style-transfer-worker-0",minor_number="0",name="GeForce GTX 1660 SUPER",namepace_name="default",uuid="GPU-a1460327-d919-1478-a68f-ef4cbb8515ac"} 8.912896e+07
... (output omitted) ...
# Fourth
$ kubectl logs {YOUR_K8S_GPU_EXPORTER_POD}
SystemGetDriverVersion: 450.36.06
Not specify a config ,use default svc
:9445
We have 1 cards
GPU-0 DeviceGetComputeRunningProcesses: 1
pid: 3598, usedMemory: 89128960
node: dev-ms-7c22 pod: style-transfer-worker-0, pid: 3598 usedMemory: 89128960
Add the annotation prometheus.io/scrape: 'true' to the k8s-gpu-exporter pods so that Prometheus can automatically discover the metrics endpoint.
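A quick way to try this on an already-running pod is shown below; for a durable setup, add the annotation to the pod template in k8s-gpu-exporter.yaml instead. The pod name is a placeholder:

$ kubectl annotate pod {YOUR_K8S_GPU_EXPORTER_POD} prometheus.io/scrape='true'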
You can then use the PromQL query nvidia_gpu_used_memory / nvidia_gpu_total_memory to see GPU memory usage.
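The same query can also be run against the Prometheus HTTP API; the Prometheus host and port below are placeholders for your own deployment:

$ curl -G --data-urlencode 'query=nvidia_gpu_used_memory / nvidia_gpu_total_memory' http://{PROMETHEUS_HOST}:9090/api/v1/query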