gather and plot data about Slurm scheduling and job statistics
Slurmmon is a system for gaining insight into Slurm and the jobs it runs.
It’s meant for cluster administrators looking to measure the effects of configuration changes and raise cluster utilization.
Features include plots of the scheduler diagnostics reported by `sdiag`, job statistics, and daily whitespace (CPU-waste) reports.
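Much of the plotted data comes from Slurm's `sdiag` command, whose output is mostly `Name: value` lines. As a rough illustration of collecting it (a minimal sketch, not slurmmon's actual collector — the function and metric names here are only examples):

```python
import re

def parse_sdiag(text):
    """Pull integer `Name: value` lines out of sdiag output.

    A minimal sketch: real sdiag output also has section headers and
    timestamps, which a production collector would handle explicitly.
    """
    stats = {}
    for line in text.splitlines():
        m = re.match(r'\s*([A-Za-z][^:]*?)\s*:\s+(\d+)\s*$', line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

# Example fragment in the style of sdiag output:
sample = """\
Server thread count: 3
Agent queue size:    127
Jobs submitted: 1234
Last cycle:   1500
"""
print(parse_sdiag(sample))
```

Each parsed value could then be forwarded to Ganglia (e.g. via `gmetric`) for plotting.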
Slurmmon is meant to run on a RHEL/CentOS/SL 6-based system and currently uses Ganglia for data collection and Apache/mod_python for reporting.
The components are:
See the `doc` directory for more information, specifically:
Here is a screenshot of the basic diagnostic report from the production cluster at @fasrc:
It shows that something interesting happened on the 31st: a spike in job turnaround time and in the slurmctld agent queue size.
Here is an example daily whitespace (CPU waste) report:
Of the jobs that completed that day, the top CPU-waster was sophia's: a case of mismatched Slurm `-n` (128) and mpirun `-np` (16) values (the latter flag is unnecessary, and a user-education opportunity).
Many other jobs show the pattern of requesting many CPU cores but using only one.
The job IDs are links to full details.
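The metric behind such a report can be stated simply: the core-seconds a job was allocated minus the CPU-seconds it actually used. A minimal sketch, assuming sacct-style inputs (AllocCPUS, Elapsed, TotalCPU) and using made-up helper names:

```python
def cpu_waste_seconds(alloc_cpus, elapsed_sec, total_cpu_sec):
    """Core-seconds allocated but left idle ("whitespace").

    alloc_cpus    -- number of CPU cores allocated (sacct AllocCPUS)
    elapsed_sec   -- job wall time in seconds (sacct Elapsed)
    total_cpu_sec -- CPU time actually consumed (sacct TotalCPU)
    """
    return alloc_cpus * elapsed_sec - total_cpu_sec

# A job like the example above: 128 cores held for an hour,
# but only 16 ranks actually running.
waste = cpu_waste_seconds(128, 3600, 16 * 3600)
print(waste)  # 403200 core-seconds of whitespace
```

Sorting completed jobs by this number is what puts a mismatched `-n`/`-np` job at the top of the list.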
Here is a stack of plots from our Slurm upgrade from 2.6.9 to 14.03.4 around 10:00 a.m.:
It shows much faster backfill scheduler runs (top plot), deeper backfill runs (middle plot), and higher job throughput (the steeper slope of completed jobs in the bottom plot).