URGENT: Please reparse your LSF or SLURM log files with the latest code #29
cartalla announced in Announcements
- If you are analyzing LSF logs, you must reparse the log files to get correct results. Your current results will grossly overstate costs if you are running multi-core or multi-node jobs.
- If you are analyzing Slurm logs and run multi-node jobs, you should reparse the logs to get correct results.
Details
A critical bug was found in the way LSF log file data was being interpreted for multi-core jobs, leading to the estimated number of instances being exponentially higher than it should be. Affected jobs appear in the jobs.csv file with a num_hosts value greater than 1.
Slurm had a different bug that affects multi-node jobs. The number of requested nodes was not correct, potentially leading to undercounting the number of required instances. Also, the number of cores and the amount of memory were being interpreted as per-node values instead of per-job values, which would lead to estimating too many cores and too much memory for multi-node jobs. Finally, there was a bug in the handling of the requested memory (ReqMem) field. It can carry two undocumented suffixes, 'n' and 'c', meaning the request is per node or per core; in those cases the requested memory must be multiplied by the number of nodes or the number of cores in the job to get the total amount of memory requested.
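As a rough illustration of the corrected ReqMem handling (not the project's actual code), here is a minimal Python sketch. It assumes sacct-style values such as `4Gn` (per node) or `2000Mc` (per core), and it assumes an unsuffixed value is already a per-job total in MB; the function name and field names are hypothetical.

```python
import re

def total_requested_mem_mb(req_mem: str, num_nodes: int, num_cores: int) -> float:
    """Return the total memory requested by the job, in MB (illustrative sketch)."""
    match = re.fullmatch(r'(?P<value>\d+(?:\.\d+)?)(?P<unit>[KMGT]?)(?P<scope>[nc]?)', req_mem)
    if not match:
        raise ValueError(f"Unrecognized ReqMem value: {req_mem!r}")
    # Assumed convention: no unit letter means MB.
    unit_factor = {'K': 1 / 1024, '': 1, 'M': 1, 'G': 1024, 'T': 1024 * 1024}
    mem_mb = float(match.group('value')) * unit_factor[match.group('unit')]
    scope = match.group('scope')
    if scope == 'n':      # 'n' suffix: request is per node, so scale by node count
        return mem_mb * num_nodes
    if scope == 'c':      # 'c' suffix: request is per core, so scale by core count
        return mem_mb * num_cores
    return mem_mb         # no suffix: treat as a per-job total

print(total_requested_mem_mb("4Gn", num_nodes=2, num_cores=16))     # 8192.0 MB
print(total_requested_mem_mb("2000Mc", num_nodes=2, num_cores=16))  # 32000.0 MB
```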
Again, for multi-node jobs, the number of cores and the amount of memory are per job, not per node. When choosing an instance type for a multi-node job, the cores and memory must be divided by the number of hosts/nodes to get the requirement for each instance. Before this fix, instance types for multi-node jobs were being overprovisioned.
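A small sketch of that per-instance sizing, with hypothetical names that only loosely mirror the jobs.csv columns:

```python
def per_instance_requirements(num_hosts: int, num_cores: int, mem_mb: float) -> tuple[int, float]:
    """Cores and memory each instance must provide for a multi-node job (sketch)."""
    cores_per_host = -(-num_cores // num_hosts)   # ceiling division
    mem_per_host_mb = mem_mb / num_hosts
    return cores_per_host, mem_per_host_mb

# A 4-node job requesting 192 cores and 768 GB in total needs instances with
# at least 48 cores and 192 GB each, not 192 cores and 768 GB.
print(per_instance_requirements(num_hosts=4, num_cores=192, mem_mb=768 * 1024))
```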