SMHP: slurm exporter to report gpu metrics #181

Open: 1 commit to be merged into base: main
@@ -10,7 +10,7 @@ if sudo systemctl is-active --quiet slurmctld; then
echo "Go is already installed."
fi
echo "This was identified as the controller node because Slurmctld is running. Beginning SLURM Exporter Installation"
-git clone -b 0.20 https://github.com/vpenso/prometheus-slurm-exporter.git
+git clone -b development https://github.com/vpenso/prometheus-slurm-exporter.git
Contributor


Pin to a release tag, not the development branch.

Contributor Author


-gpus-acct throws an error with the 0.20 tag, even though the documentation says the flag is supported from 0.19 onward. A few upstream issues link this to the Slurm version. The development branch works, and I can pin it to a specific commit.

If the development branch (pinned or not) is not preferred, I need to test whether the main branch works. Otherwise, the options are either to drop -gpus-acct or to pin to the head of the development branch (its latest commit was two years ago anyway).
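
For reference, pinning to a commit rather than a branch could look roughly like the sketch below. It is not part of this PR. The function names is_full_sha and clone_pinned and the EXPORTER_COMMIT variable are hypothetical, and the actual commit hash is deliberately left out because it still has to be chosen and tested.

```shell
# Sketch only: build the exporter from a pinned commit instead of a moving branch.
REPO_URL="https://github.com/vpenso/prometheus-slurm-exporter.git"

# Succeeds only for a full 40-character lowercase hex commit hash,
# so a branch name such as "development" or a tag such as "0.20" is rejected.
is_full_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 40 ]
}

# Clone the repository into $2 (default: prometheus-slurm-exporter)
# and check out exactly the commit given in $1.
clone_pinned() {
  pinned_sha="$1"
  pinned_dest="${2:-prometheus-slurm-exporter}"
  if ! is_full_sha "$pinned_sha"; then
    echo "refusing to build from unpinned ref: $pinned_sha" >&2
    return 1
  fi
  git clone "$REPO_URL" "$pinned_dest" || return 1
  git -C "$pinned_dest" checkout --detach "$pinned_sha"
}

# Usage (EXPORTER_COMMIT is a placeholder for the tested commit hash):
#   clone_pinned "$EXPORTER_COMMIT"
```

Rejecting anything that is not a full hash keeps the install script from silently tracking whatever the branch head happens to be at install time.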

Contributor


Can you raise an issue on https://github.com/vpenso/prometheus-slurm-exporter to get another release cut?

cd prometheus-slurm-exporter
sudo make && sudo cp bin/prometheus-slurm-exporter /usr/bin/
sudo tee /etc/systemd/system/prometheus-slurm-exporter.service > /dev/null <<EOF
@@ -19,7 +19,7 @@ Description=Prometheus SLURM Exporter

[Service]
Environment=PATH=/opt/slurm/bin:\$PATH
-ExecStart=/usr/bin/prometheus-slurm-exporter
+ExecStart=/usr/bin/prometheus-slurm-exporter -gpus-acct
Restart=on-failure
RestartSec=15
Type=simple
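
One way to sanity-check the ExecStart change after restarting the service is to look for GPU series in the exporter output. The sketch below is not part of this PR. The helper name has_gpu_metrics is hypothetical, and the slurm_gpus_ metric prefix and the :8080 listen address are assumptions about the exporter's defaults that should be confirmed against the build actually installed.

```shell
# Sketch only: reads Prometheus text-format metrics on stdin and succeeds
# if at least one series name starts with slurm_gpus_ (assumed prefix).
has_gpu_metrics() {
  grep -q '^slurm_gpus_'
}

# Usage on the controller node (port 8080 is an assumed default):
#   sudo systemctl daemon-reload
#   sudo systemctl restart prometheus-slurm-exporter
#   curl -s http://localhost:8080/metrics | has_gpu_metrics && echo "GPU metrics present"
```

If the check fails, the service journal (journalctl -u prometheus-slurm-exporter) is the first place to look, since the -gpus-acct error reported against 0.20 would show up there.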