address
: Address to listen on for web interface and telemetry.

kubeconfig
: Absolute path to the kubeconfig file. If not set, the in-cluster configuration from the Pod's bound ServiceAccount is used.
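For example, a local run of the exporter might look like the sketch below. The flag syntax is an assumption based on the flag names above, and :9445 is the default port shown in the exporter's log output further down:

$ ./k8s-gpu-exporter --address=:9445 --kubeconfig=$HOME/.kube/config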
Tips:
By default, after the nvidia-docker container starts, there is a symbolic link /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 inside the container, and it points to a correct version of libnvidia-ml.so. In that case you can jump straight to step 4.
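To confirm the container already has a usable library, a quick check inside the container (the paths assume the standard nvidia runtime layout; adjust them if yours differs):

$ ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
$ ldconfig -p | grep libnvidia-ml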
If the container does not have a correct version of libnvidia-ml.so, you can fix it with the following steps.
1. Find the libnvidia-ml.so on the host where you want to run k8s-gpu-exporter: find /usr/ -name libnvidia-ml.so
2. Copy that libnvidia-ml.so into the project-dir/lib directory.
3. Add the line COPY lib/libnvidia-ml.so /usr/lib/x86_64-linux-gnu/libnvidia-ml.so to docker/dockerfile, for example:
   ...
   COPY --from=build-env /build/k8s-gpu-exporter /app/k8s-gpu-exporter
   COPY lib/libnvidia-ml.so /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
   ...
4. Run the Makefile: VERSION={YOUR_VERSION} make docker (a condensed sketch of the whole sequence follows this list).
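Putting the steps together, a condensed sketch. The source path of libnvidia-ml.so is only an example; use whatever the find command returns on your host:

$ find /usr/ -name libnvidia-ml.so
$ cp /usr/lib/x86_64-linux-gnu/libnvidia-ml.so ./lib/
# edit docker/dockerfile to add the COPY line shown above, then build:
$ VERSION={YOUR_VERSION} make docker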
If you already have Arena, use it to submit a training task.
# Preparation
# Label the GPU Node
$ kubectl label node {YOUR_NODE} k8s-node/nvidia_count={GPU_NUM}
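# (Optional) verify the label; -L prints the label value as an extra column
$ kubectl get nodes -L k8s-node/nvidia_count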
# First
$ kubectl apply -f k8s-gpu-exporter.yaml
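# (Optional) confirm the exporter pods are running; the pod name prefix is assumed
# to match the manifest in k8s-gpu-exporter.yaml
$ kubectl get pods -o wide | grep k8s-gpu-exporter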
# Second
# Submit a deeplearn job
$ arena submit tf --name=style-transfer \
--gpus=1 \
--workers=1 \
--workerImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/neural-style:gpu \
--workingDir=/neural-style \
--ps=1 \
--psImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/style-transfer:ps \
"python neural_style.py --styles /neural-style/examples/1-style.jpg --iterations 1000"
# Third
$ curl {HOST_IP}:{PORT}/metrics
... (output omitted) ...
# HELP nvidia_gpu_used_memory Graphics used memory
# TYPE nvidia_gpu_used_memory gauge
nvidia_gpu_used_memory{gpu_node="dev-ms-7c22",gpu_pod_name="",minor_number="0",name="GeForce GTX 1660 SUPER",namepace_name="",uuid="GPU-a1460327-d919-1478-a68f-ef4cbb8515ac"} 3.0769152e+08
nvidia_gpu_used_memory{gpu_node="dev-ms-7c22",gpu_pod_name="style-transfer-worker-0",minor_number="0",name="GeForce GTX 1660 SUPER",namepace_name="default",uuid="GPU-a1460327-d919-1478-a68f-ef4cbb8515ac"} 8.912896e+07
... (output omitted) ...
# Fourth
$ kubectl logs {YOUR_K8S_GPU_EXPORTER_POD}
SystemGetDriverVersion: 450.36.06
Not specify a config ,use default svc
:9445
We have 1 cards
GPU-0 DeviceGetComputeRunningProcesses: 1
pid: 3598, usedMemory: 89128960
node: dev-ms-7c22 pod: style-transfer-worker-0, pid: 3598 usedMemory: 89128960
Add the annotation prometheus.io/scrape: 'true' to the k8s-gpu-exporter pods so that Prometheus can automatically discover the metrics endpoint.
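A quick way to try this on an already-running pod is shown below; for a durable setup, add the annotation to the pod template in k8s-gpu-exporter.yaml instead. The pod name is a placeholder:

$ kubectl annotate pod {YOUR_K8S_GPU_EXPORTER_POD} prometheus.io/scrape='true'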
You can then use the PromQL query nvidia_gpu_used_memory / nvidia_gpu_total_memory to see GPU memory usage.
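The same query can also be run against the Prometheus HTTP API; the Prometheus host and port below are placeholders for your own deployment:

$ curl -G --data-urlencode 'query=nvidia_gpu_used_memory / nvidia_gpu_total_memory' http://{PROMETHEUS_HOST}:9090/api/v1/query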