Nodes without errors are not reported #48

Open
jbd opened this issue Jan 25, 2022 · 4 comments

jbd commented Jan 25, 2022

Hello,

I found that some nodes were missing from my Grafana panels. I traced this to the behavior of ibqueryerrors, which does not report node information if it's not a "bad" node (a node with errors).

For example, here is the report for a node without errors:

# ibqueryerrors --verbose --details --data --report-port --switch --ca --threshold-file ./error_thresholds -G 0xb8cef60300a1d92a

## Summary: 1 nodes checked, 0 bad nodes found
##          1 ports checked, 0 ports have errors beyond threshold
## Thresholds: [SymbolErrorCounter = 0][LinkErrorRecoveryCounter = 0][LinkDownedCounter = 0][PortRcvErrors = 0][PortRcvRemotePhysicalErrors = 0][PortRcvSwitchRelayErrors = 0][PortXmitDiscards = 0][PortXmitConstraintErrors = 0][PortRcvConstraintErrors = 0][LocalLinkIntegrityErrors = 0][ExcessiveBufferOverrunErrors = 0][VL15Dropped = 0][PortXmitWait = 0]
## Suppressed:

And the report for a 'bad' node:

# ibqueryerrors --verbose --details --data --report-port --switch --ca --threshold-file ./error_thresholds -G 0x0c42a1030079989c
Errors for "maestro-3002 HCA-1"
   GUID 0xc42a1030079989c port 1: [PortXmitWait == 2544] [PortXmitData == 6399401 (24.412MB)] [PortRcvData == 1758872 (6.710MB)] [PortXmitPkts == 13959 (13.632K)] [PortRcvPkts == 13514 (13.197K)] [PortUnicastXmitPkts == 13959 (13.632K)] [PortUnicastRcvPkts == 13514 (13.197K)]
       Link info:    155   1[  ] ==( 4X        53.125 Gbps Active/  LinkUp)==>             [  ] "" ( )

## Summary: 1 nodes checked, 1 bad nodes found
##          1 ports checked, 1 ports have errors beyond threshold
## Thresholds: [SymbolErrorCounter = 0][LinkErrorRecoveryCounter = 0][LinkDownedCounter = 0][PortRcvErrors = 0][PortRcvRemotePhysicalErrors = 0][PortRcvSwitchRelayErrors = 0][PortXmitDiscards = 0][PortXmitConstraintErrors = 0][PortRcvConstraintErrors = 0][LocalLinkIntegrityErrors = 0][ExcessiveBufferOverrunErrors = 0][VL15Dropped = 0][PortXmitWait = 0]
## Suppressed:
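
The bracketed counters in the 'bad' node report above follow a regular `[Name == value]` pattern, optionally followed by a human-readable size in parentheses. As a minimal sketch (not the exporter's actual parser; `COUNTER_RE` and `parse_counters` are names made up for this illustration), such a line could be turned into a dictionary like this:

```python
import re
from typing import Dict

# Matches "[Name == 123]" and "[Name == 123 (24.412MB)]".
# Only the counter name and the raw integer are captured; the
# human-readable suffix in parentheses is skipped.
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)(?: \([^)]*\))?\]")


def parse_counters(line: str) -> Dict[str, int]:
    """Return {counter_name: value} for every bracketed counter on the line."""
    return {name: int(value) for name, value in COUNTER_RE.findall(line)}


# Shortened copy of the port line from the report above.
line = (
    "   GUID 0xc42a1030079989c port 1: [PortXmitWait == 2544] "
    "[PortXmitData == 6399401 (24.412MB)] [PortRcvData == 1758872 (6.710MB)]"
)
counters = parse_counters(line)
```

For a node without errors there is no such line at all, so a parser of this kind has nothing to work with, which is the gap described in this issue.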

Indeed, the 'good' node does not report any errors at the moment:

# perfquery -G 0xb8cef60300a1d92a 1
# Port counters: Lid 160 port 1 (CapMask: 0x5A00)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................0
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
QP1Dropped:......................0
VL15Dropped:.....................0
PortXmitData:....................14804777
PortRcvData:.....................4168543
PortXmitPkts:....................32281
PortRcvPkts:.....................31220
PortXmitWait:....................0
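
Unlike ibqueryerrors, perfquery prints every counter even when all error counters are zero, in a `Name:....value` layout. A minimal sketch of reading that layout into a dictionary, assuming the format shown above (`LINE_RE` and `parse_perfquery` are hypothetical names, and this is not something the exporter does today):

```python
import re
from typing import Dict

# One counter per line: a name, a colon, a run of dots, then the value.
LINE_RE = re.compile(r"^(\w+):\.+(\S+)$")


def parse_perfquery(text: str) -> Dict[str, int]:
    """Parse perfquery-style 'Name:....value' lines into {name: int}."""
    counters: Dict[str, int] = {}
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            # Base 0 lets int() accept both decimal and 0x-prefixed values.
            counters[match.group(1)] = int(match.group(2), 0)
    return counters


# A few lines copied from the output above.
sample = """\
# Port counters: Lid 160 port 1 (CapMask: 0x5A00)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0
PortXmitData:....................14804777
PortXmitWait:....................0
"""
parsed = parse_perfquery(sample)
```

The header line starting with `#` does not match the pattern and is skipped.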

In that case, I guess infiniband-exporter.py cannot extract information for this node. I can see the equivalent information from the other side of the link, using remote_name, so I can work around it if I really need to retrieve the values. But it somewhat breaks the global view of the fabric I've built in Grafana, since nodes without errors can go missing.

Maybe I've missed something? If not, do you have a suggestion?

jbd commented Jan 25, 2022

I've just confirmed this behavior by recompiling a slightly modified ibqueryerrors.c. There is a function to check thresholds:

static int exceeds_threshold(int field, uint64_t val)
{
        uint64_t thres = 0;
        mad_decode_field(thresholds, field, &thres);
        return (val > thres);
}

By replacing val > thres with val >= 0 (always true for an unsigned value), I can see the nodes without errors in the output.
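
To illustrate why the comparison matters, here is a simplified Python model of the check. It assumes a port is reported only when at least one counter passes the threshold test, which is a simplification of the real ibqueryerrors logic; `is_reported` and the two checker functions are hypothetical names used only for this sketch.

```python
from typing import Callable, Dict


def exceeds_threshold(val: int, thres: int = 0) -> bool:
    """Model of the original check: strictly greater than the threshold."""
    return val > thres


def exceeds_threshold_patched(val: int, thres: int = 0) -> bool:
    """Model of the modified check: true for any non-negative counter."""
    return val >= 0


def is_reported(counters: Dict[str, int], check: Callable[[int], bool]) -> bool:
    """Assumed rule: report the port if any counter passes the check."""
    return any(check(value) for value in counters.values())


# Counter values taken from the two nodes in this issue.
good = {"SymbolErrorCounter": 0, "PortXmitWait": 0}
bad = {"SymbolErrorCounter": 0, "PortXmitWait": 2544}

good_original = is_reported(good, exceeds_threshold)
bad_original = is_reported(bad, exceeds_threshold)
good_patched = is_reported(good, exceeds_threshold_patched)
```

With thresholds of 0, the clean node never satisfies the strict comparison, so it is silently dropped; with the modified comparison every port passes.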

jbd commented Jan 25, 2022

I've pinged Mellanox/NVIDIA support about this, and their only answer is: use ibdiagnet and one of its output files.

@guilbaults
Owner

In hindsight using ibqueryerrors is probably the wrong tool to get measurements of healthy nodes simply based on the name of the executable.

Like Mellanox said, ibdiagnet is probably the right tool to use. It should be possible to convert this exporter to use ibdiagnet while keeping the same output format. I don't really have time to fix this issue for now, but maybe I will get around to fixing it in a few weeks or months.

(Pull requests are always welcome 😄 )

jbd commented Jan 25, 2022

> In hindsight using ibqueryerrors is probably the wrong tool to get measurements of healthy nodes simply based on the name of the executable.

That's more or less what the support told me ;) Anyway, it could still be valuable to learn from ibqueryerrors that some nodes of the fabric have no errors.

> It should be possible to convert this exporter to use ibdiagnet while keeping the same output format

This would be ideal code-wise for your project. I have a hard time wrapping my head around all the IB tools and what they can do, and I really don't know whether ibdiagnet can produce the same output as ibqueryerrors.

> (Pull requests are always welcome 😄 )

I've looked at the code and can only imagine both the pain and the mental courage it took to parse the output of the command. I think I'll play with ibdiagnet and its database CSV output file, which looks like multiple concatenated CSV files, and see what I can do from there.
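
As a rough sketch of how a file made of several concatenated CSV tables could be split, the code below assumes each table is wrapped in `START_<name>` and `END_<name>` marker lines. That delimiter convention, the section names and the column names in the sample are all assumptions for illustration and should be checked against a real ibdiagnet output file; `split_sections` is a hypothetical helper.

```python
import csv
from typing import Dict, List


def split_sections(text: str) -> Dict[str, List[Dict[str, str]]]:
    """Split concatenated CSV tables into {section_name: list of row dicts}.

    Assumes every table is enclosed by START_<name> / END_<name> lines and
    that the first line inside a section is the CSV header.
    """
    sections: Dict[str, List[Dict[str, str]]] = {}
    name = None
    buffer: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("START_"):
            name, buffer = line[len("START_"):], []
        elif line.startswith("END_") and name is not None:
            sections[name] = list(csv.DictReader(buffer))
            name = None
        elif name is not None and line:
            buffer.append(line)
    return sections


# Invented sample data; real section and column names may differ.
sample = """\
START_NODES
NodeGUID,NumPorts
0xb8cef60300a1d92a,1
END_NODES

START_PM_INFO
NodeGUID,PortNumber,SymbolErrorCounter
0xb8cef60300a1d92a,1,0
END_PM_INFO
"""
tables = split_sections(sample)
```

Each section then becomes an ordinary list of dictionaries, so a node with all-zero counters would still appear as a row.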

Thank you for your answer!

Feel free to close the issue.
