iss-lab/gpu-metrics-collector

GPU Metrics Collector

A Python-based service that runs on an NVIDIA GPU-enabled machine, collects GPU metrics, and pushes them to InfluxDB.

[Screenshot: Grafana GPU Dashboard]

The following metrics are collected:

  - memory.used [MiB]
  - utilization.gpu [%]
  - temperature.gpu [°C]
  - power.draw [W]
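The metric names above match the field names accepted by `nvidia-smi --query-gpu`, so one way to collect them is to call nvidia-smi and parse its CSV output. The sketch below assumes that approach; it is not necessarily how collector/main.py works, and `QUERY_FIELDS`, `parse_nvidia_smi_csv` and `query_gpus` are illustrative names. The extra `index` field is also an assumption, added so readings from multi-GPU machines can be told apart.

```python
import csv
import io
import subprocess
from typing import Dict, List

# Fields from the list above, plus "index" (an assumption) to identify each GPU.
QUERY_FIELDS = ["index", "memory.used", "utilization.gpu", "temperature.gpu", "power.draw"]


def parse_nvidia_smi_csv(text: str) -> List[Dict[str, float]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    readings = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        reading = {}
        for field, cell in zip(QUERY_FIELDS, row):
            try:
                reading[field] = float(cell.strip())
            except ValueError:
                # Non-numeric values such as "[N/A]" are dropped rather than guessed.
                continue
        readings.append(reading)
    return readings


def query_gpus() -> List[Dict[str, float]]:
    """Run nvidia-smi once and return parsed readings (needs the NVIDIA driver)."""
    output = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_nvidia_smi_csv(output)
```

Keeping the parsing separate from the subprocess call means the parser can be exercised on sample text on a machine without a GPU.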

Prerequisites

The host needs additional configuration so that nvidia-smi is accessible inside the container; Docker's --gpus flag relies on the NVIDIA Container Toolkit being installed on the host. Set this up first if you haven't already.

Usage

  1. Clone this repository.
  2. Configure the .env file in the root of this repository.
  3. Build and run the docker container:
    docker build -t gpu-metrics-collector-img .
    
    docker run --name gpu-metrics-collector-cont -h=`hostname` --env-file=./.env --gpus=all gpu-metrics-collector-img

Alternatively, spin up the entire stack using docker-compose:

    docker-compose up -d

For Development

  1. Create a Python virtual environment
    python -m venv .venv
    source .venv/bin/activate
  2. Install dependencies
    pip install -r requirements.txt
  3. Configure the .env file.
  4. Run the script
    cd collector
    python main.py
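When main.py is run directly, Docker's --env-file is not involved, so the variables in .env have to reach the process some other way. This README does not say whether the script loads .env itself; if it does not, a minimal loader for the KEY=VALUE format could look like the sketch below (the python-dotenv package is the usual off-the-shelf alternative). The function names and the `INFLUX_*` variable names in the test data are hypothetical.

```python
import os
from typing import Dict, Iterable


def parse_env_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines, # comments and lines without '='."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split once so values may contain '='
        values[key.strip()] = value
    return values


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Export .env entries into os.environ without overriding variables already set."""
    with open(path, encoding="utf-8") as fh:
        values = parse_env_lines(fh)
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Using `setdefault` means variables already exported in the shell take precedence over the file, which mirrors how most dotenv loaders behave by default.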

Grafana

All collected metrics can be visualized in Grafana.

The Grafana dashboard template is in grafana-template/template.json; importing this JSON sets up the dashboard. Note: the variable definitions used by the dashboard buttons must be added manually.

Steps:

  1. Add InfluxDB as a data source in Grafana.
  2. Select InfluxDB as the type.
  3. Select Flux as the query language.
  4. Set the URL (use http://influxdb:8086 if deployed with docker-compose).
  5. Set User/Password/Organization/Token according to your configuration.
  6. Set the default bucket to gpu-usage.
  7. Import grafana-template/template.json as a template to set up the dashboard.
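With the data source set to Flux and the default bucket set to gpu-usage, a panel query for a single metric might look like the Flux text produced below, which can help when adding a panel by hand or debugging the imported template. This is a hedged sketch: the measurement name `gpu_metrics` is an assumption, so check the queries in grafana-template/template.json for the names the project actually writes. `v.timeRangeStart`, `v.timeRangeStop` and `v.windowPeriod` are variables Grafana supplies to Flux queries.

```python
def build_flux_query(field: str, bucket: str = "gpu-usage",
                     measurement: str = "gpu_metrics") -> str:
    """Return a Grafana-style Flux query for one metric (measurement name is hypothetical)."""
    return (
        f'from(bucket: "{bucket}")\n'
        # Restrict to the time range currently selected in the Grafana dashboard.
        '  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n'
        f'  |> filter(fn: (r) => r._measurement == "{measurement}" and r._field == "{field}")\n'
        # Downsample to roughly one point per pixel, as Grafana's default queries do.
        '  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)'
    )
```

For example, `print(build_flux_query("utilization.gpu"))` prints a query that can be pasted into a panel's query editor.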

License

This project is licensed under the MIT License; see the LICENSE file for details.
