-
Notifications
You must be signed in to change notification settings - Fork 79
/
RELEASE_NOTES.txt
89 lines (82 loc) · 4.15 KB
/
RELEASE_NOTES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
1.4.3
=====
Please note: As with future releases in the 1.4.x series, this
release has largely been limited to bugfixes and incremental
improvements of existing code. Active feature development has already
started (and has been going on for awhile now, truthfully) on what
will eventually be the development branch for a future 1.5 release.
There will, however, be a 1.4.4 release coming up in the very near
future (hopefully) to address some of the still-outstanding
bugs/issues in the 1.4.x tree. 1.4.3 is being released largely
"as-is" due to the massive volume of real-world production testing the
current codebase has received, making it as "rock-solid" reliable as
one could hope!
New Features:
- Toggle BASH tracing or NHC debugging via SIGUSR1/SIGUSR2,
respectively
- check_nvsmi_healthmon(): New check from CSC for GPU health
monitoring via "nvidia-smi"
Fixes/Improvements:
- Corrections/cleanups of SGE integration support
- Provide added detail to tracing info ("-x" mode)
- Based on feedback from Moe Jette of SchedMD, pull node job data
directly from Slurm via "squeue" instead of the previous method
that only worked for single-node jobs.
- Support for recent additions to the Slurm node states (e.g.,
"planned")
- Pathname expansion has been disabled on startup, and re-enabled
only when being actively used, to avoid "unintended" expansions of
wildcards at random points throughout the code.
- Correct clobbering of BASH built-in variables and add tests to
prevent future recurrence
- Switch "system UID" boundary handling to a more accurate source of
truth, and ensure that the code matches the math, naming, and
intent.
- Reorder resource manager detection to improve accurate detection,
especially with respect to Slurm vs. PBS (all variants)
1.4.2
=====
New features:
- Support for negating *any* match string anywhere (configs and check
parameters)
- check_net_ping(): New check for monitoring of network connectivity
- check_ps_*(): Process owner parameters now accept match strings,
allowing them to take full advantage of wildcards, regular
expressions, ranges, and/or external dynamic matching (e.g., UNIX
group membership-based matching)
- check_ps_service(): Additional options to tell NHC to execute
stop/start/restart/cycle and/or user-defined actions synchronously
(instead of backgrounding them) so that NHC can verify they worked
correctly (or fail if not).
- -v (Verify Sync) will only fail check if action fails
- -V (Verify Check) additionally will make sure action appears to
have worked (e.g., killed process is gone, started daemon is
running, etc.)
- check_cmd_dmesg(): New check to validate/verify or catch/flag
certain important content in the output of the `dmesg` command.
This check is also a working example to facilitate user wrapping of
check_cmd_output() to generate custom/specific check failure
messages.
- New command-line flag: "-e <check>" will override config file,
execute just the specified <check>, and return success/failure
based on that single check. Useful for testing/troubleshooting new
checks. Essentially analogous to having a 1-line config file
consisting of "* || <check>" and nothing else.
- check_fs_mount(): Create missing mount points as necessary
Fixes/improvements:
- Silence warnings about watchdog timer and RM detection that were
confusing to users and/or no longer useful
- Refactored match string range checking to be comparative rather
than iterative. Large ranges will see a huge speed-up!
- Refactored generation and handling of numerical quantities to avoid
excessive rounding/truncation errors. For example, truncation of
1.5TB of RAM to 1TB is a significant error! Now uses 1536GB
instead.
- Fixed issue with leading zeros in node range expressions causing
numbers to be interpreted as octal.
- Fixed typo in getent fallback handling that kept it from working in
certain cases.
- Simplified the killing of the watchdog timer to avoid apparent BASH
bug causing aborts.
- Fixed list of SLURM node states supported by node-mark-{on,off}line
- Fix trimming of whitespace in config files to include tabs