Nodes without errors are not reported #48
Comments
I've just confirmed this behavior by recompiling a slightly modified ibqueryerrors.c. There is a function to check thresholds:
By changing |
I've pinged Mellanox/NVIDIA support about this, and their only answer was: use ibdiagnet and one of its output files.
In hindsight using Like Mellanox said, (Pull requests are always welcome 😄)
That's more or less what support told me ;) Anyway, it could be valuable to know from ibqueryerrors that some nodes of the fabric have no errors.
This would be ideal code-wise for your project. I have a hard time wrapping my head around all the IB tools and what they can do, and I really don't know whether ibdiagnet can produce the same output as ibqueryerrors.
I've looked at the code and can only imagine both the pain and the mental courage it took to parse the output of the command. I think I'll play with ibdiagnet and its database CSV output file, which looks like multiple concatenated CSV files, and see what I can do from there. Thank you for your answer! Feel free to close the issue.
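For anyone following the same route, here is a minimal sketch of how a file made of several concatenated CSV tables could be split into named sections. It assumes, without confirmation from this thread, that each table is wrapped in `START_<NAME>` and `END_<NAME>` marker lines and begins with a header row. The function name `split_sections` is hypothetical, so check the markers against a real ibdiagnet output file before relying on it.

```python
import csv
import io


def split_sections(text):
    """Split a multi-section CSV dump into {section_name: [row dicts]}.

    Assumption: every section is delimited by START_<NAME> / END_<NAME>
    marker lines and its first line is a CSV header row.
    """
    sections = {}
    name = None   # name of the section currently being read, if any
    buf = []      # raw CSV lines collected for the current section
    for line in text.splitlines():
        stripped = line.strip()
        if name is None:
            # Outside any section: only a START_ marker is meaningful.
            if stripped.startswith("START_"):
                name = stripped[len("START_"):]
                buf = []
        elif stripped == "END_" + name:
            # Section closed: parse the buffered lines as one CSV table.
            reader = csv.DictReader(io.StringIO("\n".join(buf)))
            sections[name] = list(reader)
            name = None
        elif stripped:
            buf.append(stripped)
    return sections
```

Each section comes back as a list of dictionaries keyed by that section's header row, so a counter table could be iterated row by row regardless of whether the port has errors.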
Hello,
I found that some nodes were missing from my Grafana panels. I've traced this to the behavior of ibqueryerrors, which does not report node information unless the node is a "bad" node (a node with errors).
For example, here is the report for a node without errors:
And the report for a 'bad' node:
Indeed, the 'good' node does not report any errors at the moment:
In that case, I guess infiniband-exporter.py cannot extract information for this node. I can see the equivalent information from the other side of the link, using remote_name, so I can work around it if I really need to retrieve the values. But it breaks the global view of the fabric I've built in Grafana, since nodes without errors can be missing.
Maybe I've missed something? If not, do you have a suggestion?
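To illustrate the remote_name workaround, here is a minimal sketch that lists nodes known only as the far end of a reported link. The record shape (dictionaries with `name` and `remote_name` keys) and the function name are hypothetical and are not taken from infiniband-exporter.py.

```python
def nodes_seen_only_remotely(link_records):
    """Return names that appear as a remote end but never as a local node.

    Assumption: each record is a dict with a "name" key (the reporting
    node) and an optional "remote_name" key (the node on the far end).
    """
    local = {r["name"] for r in link_records}
    remote = {r["remote_name"] for r in link_records if r.get("remote_name")}
    return sorted(remote - local)
```

This only recovers a node when its link partner is itself reported, so a link where both ends are error-free would still be invisible.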