-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added logging of Slurm subprocess failures and testing for all parsing functions #43
base: master
Are you sure you want to change the base?
Conversation
The system tests require a working Slurm cluster to be present while running, or at least a working Slurm controller. The unit tests operate within the confines of the Go code and are independent of the rest of the system. `make test` still does the same thing as before, it runs all available tests. The commands `make unittest` and `make systemtest` were added for when only running tests within these categories are of interest. These changes were motivated by me trying to package prometheus-slurm-exporter for Nix. Nix wants to run the available tests after a build is complete but is unable to provide services, such as a Slurm cluster, during the check phase of a build. Categorizing the tests in this way allows for some tests to run during the build of the package rather than none. The categories themselves are declared at the top of the Go source files through the `//build $xyz` header.
I found there were inconsistencies in tab characters within the files. So I ran `go fmt` and this is the result.
It is a compiled binary, should not be tracked by git.
... with logging functions that print from where the call was made. This significantly improves the ability to figure out where in the code the error was caused.
Instead of merely printing "exit status 1" (which could reasonably be understood as "the main process, prometheus-slurm-exporter, exited with status 1") when a subprocces fails to execute. It looks like: ``` INFO[0000] Starting Server: :8080 source="main.go:48" 2021/02/22 13:17:50 exit status 1 ``` We now instead print that a subprocess has failed to run along with the error message given by that subprocess: ``` INFO[0000] Starting Server: :8080 source="main.go:48" 2021/02/22 13:16:31 main.AccountsData: The subprocess squeue failed with the following error: 2021/02/22 13:16:31 main.AccountsData: squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: No error squeue: error: fetch_config: DNS SRV lookup failed squeue: error: _establish_config_source: failed to fetch config squeue: fatal: Could not establish a configuration source 2021/02/22 13:16:31 main.AccountsData: The subprocess squeue terminated with: exit status 1 ```
Also added docstrings to all logging functions.
Not sure if the module behaves correctly though.
The existing "test" which only printed the parsed data had been broken since august 2020 and commit 1277b79.
Is there anything that I can do to help this PR get merged @vpenso? |
This branch is starting to bitrot. As the author I obviously think it brings with it significant improvements and would suggest I rebase it on master and you merge it in before it rots even more. The value of code deduplication and having actual tests for the codebase should outweigh the risk of introduced bugs. |
@Rovanion : yes, if you want to rebase your PR on the current master branch then now is probably a good time since in August it is unlikely that we will add/modify the code base. Note that the next overhaul may be in the near future when we will integrate our new GPU nodes. One advice: if possible, try to remove entirely from your PR any kind of formatting adjustments of the source code,
Well, having more tests will help but keep in mind that this exporter is targeting HPC clusters of different size and complexity, |
bb768e7
to
ccf39b6
Compare
To decrease the size of the pull request
ccf39b6
to
611df5a
Compare
I, erm, did not end up rebasing but merging in the changes from We're hoping to put this into production on our HPC cluster this fall after upgrading to Slurm 20.11. I think we both want the same thing here, reliable monitoring of our clusters. I just value testing and code reuse more than absence of code changes. Is there anything more that I can help with to get these changes merged? |
Hi,
Over the last two weeks I've been working on improving the error messages and testing of the Prometheus Slurm Exporter. I found that if any of the subprocesses failed the main go process terminated, only printing
exit 1
before doing so; implying that the go process itself exited with status 1 when it was in fact the subprocess doing so.I think that I have gone over all parts of the project that execute a subprocess and parses the text outputted by that subprocess. I have refactored these to use a common function/procedure for starting these subprocesses and reporting errors should they fail. I have also refactored them to have a pure function for parsing the output of the subprocess and have added unittests for all of these pure functions. I have also added system tests for the impure functions that execute
sinfo
,squeue
and so on.Over all the result is that there are now unit tests for all parser functions and system tests for all the subprocess launching functions. The subprocess launching functions now report the stderr of the subprocess should they fail.
This PR also encompasses the two previous PR's for fixing the formatting and splitting the old tests into system and unit tests. It was too much work rebasing it on master and going through every single commit to undo the autoformatting.
Why did I do all this? Because it was impossible to figure out why the system tests failed without this work and I needed to know this in order to write system tests for Nix. It was also the case that the unit tests would never fail since they never asserted anything.