puppet-slurm_stats

This module generates summary statistics for users and prints them when they initiate a session. This is done by installing a script in /etc/profile.d/ on the login (or login-like) host.

How it works

This module has two classes. The first, data, is for the host that pulls the data from Slurm and stores it. This class sets up a cronjob that runs a script which pulls the data from Slurm and then writes it to a location that is available on the network.

The second, login, is for hosts where users log in. It sets up a cronjob that copies the data from the network location to a local directory on the node, /usr/local/share/slurm_stats. It then installs a script at /etc/profile.d/zzz-slurm_stats.sh which prints the data when the user logs in. Since the profile script only reads data from the local node, it will not block or delay logins even if Slurm is busy or down.
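The copy step on the login host can be sketched as follows. This is a minimal illustration, not the module's actual script: the function name sync_stats and the write-then-rename approach are assumptions, chosen so that a login happening mid-copy never reads a half-written file.

```python
import os
import shutil
import tempfile
from pathlib import Path


def sync_stats(network_dir: Path, local_dir: Path) -> int:
    """Copy every regular file from network_dir into local_dir.

    Each file is first written under a temporary name in local_dir and
    then renamed into place, so readers see either the old copy or the
    new one, never a partial file. Returns the number of files copied.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in sorted(network_dir.iterdir()):
        if not src.is_file():
            continue
        fd, tmp_name = tempfile.mkstemp(dir=local_dir, prefix=".tmp-")
        os.close(fd)
        try:
            shutil.copyfile(src, tmp_name)
            # mkstemp creates the file with mode 0600; users need read access.
            os.chmod(tmp_name, 0o644)
            os.replace(tmp_name, local_dir / src.name)
            copied += 1
        except OSError:
            # On failure, keep the previous local copy and drop the temp file.
            if os.path.exists(tmp_name):
                os.remove(tmp_name)
            raise
    return copied
```

A cronjob could call this with the network location as the first argument and /usr/local/share/slurm_stats as the second. If the network location is unreachable, the function raises and the previous local copy stays in place, which is what keeps logins unaffected.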

The statistics

+---------------- Slurm Stats for Aug 20 -----------------------+
|                  End of Day Fairshare                         |
|                    test_lab: 0.003943                         |
+-------------------- Jobs By State ----------------------------+
|       Total | Completed | Canceled | Failed | Out of |  Timed |
|             |           |          |        | Memory |    Out |
| CPU:     25 |         4 |        1 |     20 |      0 |      0 |
| GPU:     98 |        96 |        1 |      1 |      0 |      0 |
+---------------------- Job Stats ------------------------------+
|        | Average | Average   | Average    | Total Usage /     |
|        | Used    | Allocated | Efficiency | Ave. Wait Time    |
| Cores  |     4.3 |       5.5 |      69.4% |    133.00 CPU Hrs |
| Memory |   22.2G |     27.2G |      68.3% |                   |
| GPUS   |     N/A |       1.0 |        N/A |    100.20 GPU Hrs |
| Time   |  14.57h |    45.38h |      45.9% |             0.00h |
+---------------------------------------------------------------+
| https://docs.rc.fas.harvard.edu/kb/slurm-stats                |
+---------------------------------------------------------------+

The data is generated by two scripts which run nightly and pull the data for the previous day. The data is tagged with the date of that previous day, so in case it gets stale you know when the process broke. Users will also know exactly which day the data was gathered for.
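The previous-day window and its date tag can be computed along these lines. This is a sketch under assumptions: the function name previous_day_window is hypothetical, and the window is taken to run from midnight to midnight, which the module's scripts may or may not do.

```python
from datetime import date, datetime, time, timedelta


def previous_day_window(today: date) -> tuple[datetime, datetime, str]:
    """Return (start, end, label) for the day before `today`.

    start is midnight at the beginning of the previous day, end is
    midnight at the beginning of `today`, and label is the short date
    used in the banner (month abbreviation and zero-padded day).
    """
    day = today - timedelta(days=1)
    start = datetime.combine(day, time.min)
    end = datetime.combine(today, time.min)
    label = day.strftime("%b %d")
    return start, end, label
```

Because the label is derived from the queried day rather than from the time the banner is printed, a stalled nightly run shows up as an old date in the header.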

The script first prints out the fairshare score for each lab the user is a part of. Next it shows counts for the jobs that completed the previous day, split by whether or not the job requested a GPU.

The Job Stats section covers all jobs that ended in any of the listed states other than Canceled. Average Used is the average of how much was actually used per job. Average Allocated is the average of how much the user asked to allocate per job. Average Efficiency is Used/Allocated computed per job and then averaged across all jobs. Because it is calculated on a per-job basis, Average Used/Average Allocated will not match Average Efficiency. Total Usage is how many CPU/GPU hours were used; this is raw usage and is not weighted by the TRES of the resource in question. Finally, Ave. Wait Time is the average amount of time the jobs waited in the queue, on a per-job basis.

At the end the script prints out a URL in your documentation which describes the results of the query for users and how they can improve their results.
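The sample output above shows the difference between the two ways of averaging: for Cores, Average Used/Average Allocated is 4.3/5.5, about 78.2%, while the reported Average Efficiency is 69.4%. The sketch below reproduces the effect with made-up job records. The record layout and the function name job_stats are assumptions for illustration, not the module's data format.

```python
from statistics import mean

# Hypothetical per-job records: cores actually used and cores allocated.
jobs = [
    {"state": "Completed", "used": 1.0, "allocated": 1.0},
    {"state": "Failed", "used": 2.0, "allocated": 8.0},
    {"state": "Canceled", "used": 0.0, "allocated": 16.0},
]


def job_stats(jobs):
    """Compute the Job Stats averages, skipping Canceled jobs."""
    counted = [j for j in jobs if j["state"] != "Canceled"]
    avg_used = mean([j["used"] for j in counted])
    avg_allocated = mean([j["allocated"] for j in counted])
    # Efficiency is computed per job first, then averaged.
    avg_efficiency = mean([j["used"] / j["allocated"] for j in counted])
    return avg_used, avg_allocated, avg_efficiency


avg_used, avg_allocated, avg_efficiency = job_stats(jobs)
ratio_of_averages = avg_used / avg_allocated
```

With these records the per-job efficiencies are 100% and 25%, so Average Efficiency is 62.5%, whereas Average Used/Average Allocated is 1.5/4.5, about 33.3%. The ratio of averages is dominated by the large job, while the per-job average weights every job equally.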

If the user has no jobs the script simply prints:

+---------------- Slurm Stats for Aug 20 -----------------------+
|                  End of Day Fairshare                         |
|                    test_lab: 0.834594                         |
+---------------No jobs completed on Aug 20 --------------------+
| https://docs.rc.fas.harvard.edu/kb/slurm-stats                |
+---------------------------------------------------------------+

If the user has no fairshare then the script prints nothing.
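The three cases (full report, no jobs, no fairshare) can be sketched as a single rendering function. Everything here is an illustrative assumption: the names rule, row and render are hypothetical, job_lines stands in for the pre-rendered job tables including their closing rule, and the simple centering used here will not reproduce the exact dash placement of the samples above.

```python
WIDTH = 63  # inner width of the sample banners shown in this README


def rule(title: str = "") -> str:
    """Horizontal rule, optionally with a centered title."""
    text = f" {title} " if title else ""
    return "+" + text.center(WIDTH, "-") + "+"


def row(text: str) -> str:
    """Single boxed row with centered text."""
    return "|" + text.center(WIDTH) + "|"


def render(day: str, fairshare: dict, job_lines: list, url: str) -> str:
    """Build the login banner, or an empty string if there is no fairshare."""
    if not fairshare:
        return ""  # no fairshare: print nothing
    lines = [rule(f"Slurm Stats for {day}"), row("End of Day Fairshare")]
    lines += [row(f"{lab}: {score:.6f}") for lab, score in fairshare.items()]
    if job_lines:
        lines += job_lines  # pre-rendered job tables (hypothetical input)
    else:
        lines.append(rule(f"No jobs completed on {day}"))
    lines.append("| " + url.ljust(WIDTH - 1) + "|")
    lines.append(rule())
    return "\n".join(lines)
```

Checking for fairshare first means that accounts with no Slurm association get a clean login with no banner at all.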

Usage

While this is set up as a traditional Puppet module, you can still use the scripts and logic therein to construct your own stats setup. Simply take the scripts and the template scripts, replace the variables you need, set up the relevant cronjobs, and you should be good to go. Just make sure that the storage you put the data on is available to all the login nodes you intend to use.