metric parse error || build config #12
Comments
Hey, I completely forgot about the fact that you need the node_name_map in the docker container (#13). 😬 The 500 error comes from an empty |
Hello Carsten,
The fabric is an EDR Mellanox fabric: 2x SB7800 and 45 MT4119 HCAs. Unfortunately, the IB stack
is the one shipped with SLES 12.4 and not the MLNX OFED (at the moment).
The caDescription is indeed empty:
[image attachment]
Also, the port section isn't displayed:
[image attachment]
The topology is displayed correctly:
[image attachment]
Cheers,
-Frank
|
Does the host |
Sorry for the typos in my last post: all references to the node name pluto should be read as plato. I hope this didn't cause any confusion. |
Oh, it's interesting that |
Yes, sure. Here you go:
(just in case something else is missing) |
Thanks for the fast response, does |
By the way, you don't actually need to add normal hosts to the node_map_file; they are normally detected automatically. |
This means the string in node_desc - file:
and verified/falsified the following 'test scenarios':
I also checked in vain whether this action:
could help. |
I'm sorry that this issue is taking so long to resolve. I guess you already did this, but I am pretty sure that the empty caDescription comes from the old double-spaced file. Is the error on the web server still the same? |
No worries, I really appreciate your fast responses. I'm sure the modified file is inside the image:
but the nodes plato[7-9] and bee[1-4] are still displayed in the web GUI without a caDescription. |
The data is stored in the |
Hello Carsten, it's strange: after purging the DB, restarting radar-web and radar-daemon (with the corrected node_name_map), and also explicitly setting the node_desc file on bee[1-4] to
|
The timestamps of the log messages are 3 hours behind the actual time in the time zone assigned to the node. |
It's interesting that the erroneous nodes are still the same ones with the double space. Could you try to modify the following lines to:
private static getCaDescription(type: HostType, rawCa: RawTopologyCa) {
    console.log('getCaDescriptionDebug: ', rawCa.description.replace(/\s/g, '[SPACE]'));
    return type === HostType.Switch ? '(Switch)' : rawCa.description.split(' ')[1];
}
You should see some lines with the
The caDescription is parsed when the topology is initially sent. After modifying, you still need to |
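The effect of a doubled space on the `split(' ')[1]` call quoted above can be reproduced in isolation. The following is only a sketch, not the project's code: `secondField` and `secondFieldTolerant` are hypothetical helper names, the sample description strings are made up for illustration, and splitting on whitespace runs is just one possible way to tolerate the extra space.

```typescript
// Mirrors the parsing step quoted above: take the second
// space-separated field of the node description.
function secondField(description: string): string | undefined {
    return description.split(' ')[1];
}

// Hypothetical tolerant variant (an assumption, not the project's fix):
// trim the string, then split on runs of whitespace instead of single spaces.
function secondFieldTolerant(description: string): string | undefined {
    return description.trim().split(/\s+/)[1];
}

// Illustrative node descriptions, not taken from the real fabric.
const singleSpaced = 'plato8 HCA-1';
const doubleSpaced = 'plato8  HCA-1';

console.log(JSON.stringify(secondField(singleSpaced)));         // "HCA-1"
console.log(JSON.stringify(secondField(doubleSpaced)));         // ""
console.log(JSON.stringify(secondFieldTolerant(doubleSpaced))); // "HCA-1"
```

With two consecutive spaces, `split(' ')` yields an empty string between them, so index 1 is `''`, which would match the empty `caDescription=` tag seen in the upload error.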
Hi Carsten, I'll try to apply your changes asap, I'm currently busy troubleshooting a MPI Problem. Sorry for the latency. Cheers, -Frank |
Hi Carsten, I patched the TopologyTreeBuilder.ts file and did the restart as you described above
didn't work, with or without env-variable set. |
Oh, my command was wrong; the right command would be with
To hotfix this issue, you can patch this line here:
|
Here you go: |
Your last log already contained these changes. Did you try to apply the hotfix (and did it work)?
line again. |
You can clear the containers and database by using:
and
|
Many thanks for the hot-fix. I think you can close this issue. |
Hello Carsten,
I followed the instructions for infiniband-radar-{daemon,web} and installed the two components
on separate nodes: the ib-radar-daemon runs on a node attached to the fabric, and the web/DB stuff runs on a second one. Both use the community Docker distribution provided by SuSE (Docker version 19.03.5, build 633a0ea838f1). I got things working for a small fabric (2 MLNX EDR managed switches and 45 nodes attached).
For the integration I had to 'patch' ApiClient.cpp (see diff below) and the Dockerbuild file to
make use of the node-name-map file. Maybe I overlooked something, but using the default location
for certs on SLES (/etc/ssl/certs + c_rehash) led to startup problems of the daemon with SSL certificate errors.
All processes are running now, the topology has been transferred and the fabric can be browsed, but the ib-radar-daemon can't upload the perf counters, leading to the following errors on the web client:
and web server side:
[Fri, 17 Apr 2020 06:43:52 GMT][Debug][ApiServer] Took 8 ms to process (PUT) '/api/v2/metrics/edr-fabric' StatusCode: 500
Error: A 400 Bad Request error occurred: {"error":"partial write: unable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato8,caDescription=,portNumber=1,connectionId=7376746d1ff6725de3984e3a527f5673,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato9,caDescription=,portNumber=1,connectionId=7eaf49d2c1f846be39fd3a3cf610e267,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato7,caDescription=,portNumber=1,connectionId=2e74a05d1dc8e9bd381b7ae595814894,connectionSide=B rcv=57i,xmit=57i': missing tag value dropped=0"}
for each upload.
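The rejected lines all contain `caDescription=` with nothing after the equals sign, which is what the database reports as "missing tag value". As a rough sketch of how a writer could avoid emitting such lines, the hypothetical helper below (`toLineProtocol` is not part of infiniband-radar) builds a line in the same shape as the log and skips tags whose value is empty. It assumes integer fields and does no escaping of spaces, commas, or equals signs.

```typescript
type Tags = Record<string, string>;
type IntFields = Record<string, number>;

// Hypothetical helper: builds "measurement,tag=value,... field=1i,..."
// and omits any tag with an empty value, since the server rejects "tag=".
function toLineProtocol(measurement: string, tags: Tags, fields: IntFields): string {
    const tagPart = Object.entries(tags)
        .filter(([, value]) => value !== '')
        .map(([key, value]) => `,${key}=${value}`)
        .join('');
    const fieldPart = Object.entries(fields)
        .map(([key, value]) => `${key}=${Math.trunc(value)}i`)
        .join(',');
    return `${measurement}${tagPart} ${fieldPart}`;
}

// Tag values modelled on the error message above (connectionId left out for brevity).
const line = toLineProtocol(
    'port_bandwidth',
    { fabric: 'edr-fabric', hostname: 'plato8', caDescription: '', portNumber: '1', connectionSide: 'B' },
    { rcv: 57, xmit: 57 },
);

console.log(line);
// port_bandwidth,fabric=edr-fabric,hostname=plato8,portNumber=1,connectionSide=B rcv=57i,xmit=57i
```

Dropping a tag changes which series the point is written to, so this only hides the symptom; the empty description itself still has to be fixed where it is parsed.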
I tried to dig into the JS code, but I don't have a clue yet what could cause the problem.
The nodes mentioned in the error message have actually been added successfully to the topology on the web server side.
There's one HCA for which I couldn't find a node name yet, which causes a strange 'loop' between the two managed switches that form the core of the fabric. I added these changes to this issue because I'm not 100% sure whether my changes caused the error when transferring the perf counters.
Many thanks in advance for your help
Cheers,
-Frank
----------------- My code changes -------------