
metric parse error || build config #12

Open
wrussian opened this issue Apr 17, 2020 · 26 comments

@wrussian

wrussian commented Apr 17, 2020

Hello Carsten,
I followed the instructions for infiniband-radar-{daemon,web} and installed the two components on separate nodes: the ib-radar-daemon on a node attached to the fabric, and the web/DB stack on another, both running the community Docker distro provided by SuSE (Docker version 19.03.5, build 633a0ea838f1). I could get things working for a small fabric (2 MLNX EDR managed switches and 45 attached nodes).

For the integration I had to 'patch' ApiClient.cpp (see diff below) and the Dockerfile to
make use of a node-name-map file. Maybe I overlooked something, but using the default
certificate location for SLES (/etc/ssl/certs + c_rehash) led to startup problems of the daemon with SSL certificate errors.

All processes are running now, the topology has been transferred and the fabric can be browsed, but the ib-radar-daemon can't upload the perf counters, leading to the following errors on the web client:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Internal Server Error</pre>
</body>
</html>
HTTP Request: '/v2/metrics/edr-fabric' (j/p/total)0/18/18 ms

and on the web server side:

at IncomingMessage.Create.res.on (/home/node/server/node_modules/influx/lib/src/pool.js:49:38)
at IncomingMessage.emit (events.js:187:15)
at IncomingMessage.EventEmitter.emit (domain.js:442:20)
at endReadableNT (_stream_readable.js:1086:12)
at process._tickCallback (internal/process/next_tick.js:63:19)

[Fri, 17 Apr 2020 06:43:52 GMT][Debug][ApiServer] Took 8 ms to process (PUT) '/api/v2/metrics/edr-fabric' StatusCode: 500
Error: A 400 Bad Request error occurred: {"error":"partial write: unable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato8,caDescription=,portNumber=1,connectionId=7376746d1ff6725de3984e3a527f5673,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato9,caDescription=,portNumber=1,connectionId=7eaf49d2c1f846be39fd3a3cf610e267,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato7,caDescription=,portNumber=1,connectionId=2e74a05d1dc8e9bd381b7ae595814894,connectionSide=B rcv=57i,xmit=57i': missing tag value dropped=0"}

for each upload.
I tried to dig into the JS code, but have no clue yet what could cause the problem.
The nodes mentioned in the error message have actually been added successfully to the topology on the web server side.
There's one HCA for which I couldn't find a node name yet, which causes a strange 'loop' between the two managed switches forming the core of the fabric. I added my changes to this issue because I'm not 100% sure whether they caused the error in transferring the perf counters.

Many thanks in advance for your help
Cheers,
-Frank

----------------- My code changes -------------

diff --git a/Dockerfile b/Dockerfile
index a23f735..f471edb 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -8,6 +8,8 @@ WORKDIR /project/code
 COPY CMakeLists.txt ./
 COPY src/ ./src
 COPY 3d_party  ./3d_party
+COPY vlxtest.pem  ./etc/ssl/certs/vlxtest.pem
+COPY node_name_map ./etc/node_name_map

 WORKDIR /project/code/build
 RUN cmake .. && cmake --build . --target infiniband_radar_daemon -- -j 2
@@ -19,6 +21,8 @@ RUN apt update && \
     rm -rf /var/lib/apt/lists/*

 COPY --from=builder /project/code/build/infiniband_radar_daemon /usr/sbin/infiniband_radar_daemon
+COPY --from=builder /project/code/etc/ssl/certs/vlxtest.pem /etc/ssl/certs/vlxtest.pem
+COPY --from=builder /project/code/etc/node_name_map /etc/node_name_map

 VOLUME /config
 WORKDIR /config
diff --git a/src/ApiClient.cpp b/src/ApiClient.cpp
index 0045bed..afec8e2 100644
--- a/src/ApiClient.cpp
+++ b/src/ApiClient.cpp
@@ -66,6 +66,8 @@ void infiniband_radar::ApiClient::make_api_request(const std::string &path, cons
     curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, method.c_str());
     curl_easy_setopt(curl, CURLOPT_URL, (_api_url + path).c_str());
     curl_easy_setopt(curl, CURLOPT_POSTFIELDS, payload_string.c_str());
+//    curl_easy_setopt(curl, CURLOPT_CAPATH, "/etc/ssl/certs");  // FHe
+    curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0); // FHe
     //curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, &my_dummy_write);

     auto perform_start = std::chrono::high_resolution_clock::now();
@carstenpatzke
Member

carstenpatzke commented Apr 17, 2020

Hey,
thanks for your in-depth details.

I completely forgot about the fact that you need the node_name_map in the docker container (#13). 😬
For the SSL error, I might add an environment variable to allow non-valid certificates.

The 500 error comes from an empty caDescription.
What infiniband cards are you using?
You said that you are able to browse the fabric; are you able to see the description there?

[screenshot]
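(For context, a minimal sketch, not project code, of how an empty caDescription ends up as an invalid write. The metrics are serialized into InfluxDB line protocol, measurement,tag=value,... field=value,..., and a tag whose value is the empty string produces caDescription= with nothing after the =, which InfluxDB rejects with "missing tag value".)

    // Illustrative TypeScript sketch only; toLineProtocol is a made-up helper, not part of infiniband-radar.
    function toLineProtocol(measurement: string,
                            tags: Record<string, string>,
                            fields: Record<string, string>): string {
        const tagPart = Object.entries(tags).map(([k, v]) => `${k}=${v}`).join(',');
        const fieldPart = Object.entries(fields).map(([k, v]) => `${k}=${v}`).join(',');
        return `${measurement},${tagPart} ${fieldPart}`;
    }

    // With an empty caDescription the generated line matches the one in the error log:
    console.log(toLineProtocol('port_bandwidth',
        { fabric: 'edr-fabric', hostname: 'plato8', caDescription: '', portNumber: '1' },
        { rcv: '57i', xmit: '57i' }));
    // -> port_bandwidth,fabric=edr-fabric,hostname=plato8,caDescription=,portNumber=1 rcv=57i,xmit=57i
    // InfluxDB refuses this point, hence the 400/500 on the /v2/metrics upload.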

@wrussian
Author

wrussian commented Apr 18, 2020 via email

Hello Carsten,
sorry, I misunderstood your question about the caDescription. As the screenshot below shows, all nodes are mapped to the same key MT4119 (the EDR model name of the HCA).
[screenshot]
The /sys file from which I guess this information is retrieved looks like this for all nodes on the fabric:
atlas30:~ # cat /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/infiniband/mlx5_0/node_desc
MT4119 ConnectX5 Mellanox Technologies
Cheers,
-Frank

@carstenpatzke
Member

carstenpatzke commented Apr 20, 2020

Does the host plato9 also show up in the list?
(And does it have a CA description?)

@wrussian
Author

It's exactly as you suspected. The caDescription is missing for pluto[7-9]:
[screenshot]
although I explicitly added entries to node_name_map:

...
0x506b4b0300db0838     "plato7  MT4119 ConnectX5"
0x506b4b0300db0930     "plato8  MT4119 ConnectX5"
0x98039b030009b8ba     "plato9  MT4119 ConnectX5"
...

and the description entries in the /sys FS for pluto7,8,9 look like:

plato7:~ # find /sys -name node_des\* -exec ls -1 {} \; -exec cat {} \;
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/infiniband/mlx5_0/node_desc
MT4119 ConnectX5 Mellanox Technologies

@wrussian
Author

Sorry for the typos in my last post: all references to node name pluto should be read as plato. I hope this didn't cause any confusion.

@carstenpatzke
Copy link
Member

Oh, it's interesting that plato12 shows up as intended. Can you give me the node_name_map entry and the output of find /sys -name node_des\* -exec ls -1 {} \; -exec cat {} \; for this host too?

@wrussian
Author

wrussian commented Apr 20, 2020

Yes, sure. Here you go:

plato12:~ # find /sys -name node_des\* -exec ls -1 {} \; -exec cat {} \;
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/infiniband/mlx5_0/node_desc
MT4119 ConnectX5   Mellanox Technologies
cat node_name_map                                                                                                          
0x506b4b03005e9460     "infini2-atlas:MSB7800"                                                                                                                           
0x98039b0300d72610     "infiniband-s6:MSB7800/U1"                                                                                                                        
0x506b4b0300cbffde     "atlas20 MT4119 ConnectX5"                                                                                                                        
0x506b4b0300cbffce     "atlas21 MT4119 ConnectX5"                                                                                                                        
0x506b4b0300cc0002     "atlas22 MT4119 ConnectX5"                                                                                                                        
0x506b4b0300cbff66     "atlas23 MT4119 ConnectX5"                                                                                                                        
0x506b4b0300db07d4     "atlas24 MT4119 ConnectX5"                                                                                                                        
0x506b4b0300db07f8     "atlas25 MT4119 ConnectX5"                                                                                                                        
0x506b4b0300db0c7c     "atlas26 MT4119 ConnectX5"                                                                                                                        
0x506b4b0300db06e4     "atlas27 MT4119 ConnectX5"                                                                                                                        
0x506b4b0300db0aec     "atlas28 MT4119 ConnectX5"
0x506b4b0300db0b0c     "atlas29 MT4119 ConnectX5"
0x506b4b0300db0984     "atlas30 MT4119 ConnectX5"
0x506b4b0300db07d8     "atlas31 MT4119 ConnectX5"
0x506b4b0300db0a44     "atlas32 MT4119 ConnectX5"
0x506b4b0300db07d0     "atlas33 MT4119 ConnectX5"
0x506b4b0300db07dc     "atlas34 MT4119 ConnectX5"
0x506b4b0300db0c78     "atlas35 MT4119 ConnectX5"
0x506b4b0300db08b0     "atlas36 MT4119 ConnectX5"
0x506b4b0300db098c     "atlas37 MT4119 ConnectX5"
0x98039b030009b722     "atlas38 MT4119 ConnectX5"
0x98039b030009b6fe     "atlas39 MT4119 ConnectX5"
0x98039b030009b726     "atlas40 MT4119 ConnectX5"
0x98039b030009b70a     "atlas41 MT4119 ConnectX5"
0x98039b03002f2bae     "atlas42 MT4119 ConnectX5"
0x98039b03002f0a1e     "atlas43 MT4119 ConnectX5"
0x98039b03002f0d8e     "atlas44 MT4119 ConnectX5"
0x98039b03002f0d92     "atlas45 MT4119 ConnectX5"
0x98039b03002f0a32     "atlas46 MT4119 ConnectX5"
0x98039b03002f0d9a     "atlas47 MT4119 ConnectX5"
0x98039b03002f28ae     "atlas48 MT4119 ConnectX5"
0x98039b03002f0d96     "atlas49 MT4119 ConnectX5"
0x98039b03002f0a26     "atlas50 MT4119 ConnectX5"
0x98039b03002f285a     "atlas51 MT4119 ConnectX5"
0xb8599f0300c30d8a     "atlas52 MT4119 ConnectX5"
0xb8599f0300c30d6e     "atlas53 MT4119 ConnectX5"
0xb8599f0300c30ce2     "atlas54 MT4119 ConnectX5"
0xb8599f0300c30aae     "atlas55 MT4119 ConnectX5"
0x506b4b0300db0838     "plato7  MT4119 ConnectX5"
0x506b4b0300db0930     "plato8  MT4119 ConnectX5"
0x98039b030009b8ba     "plato9  MT4119 ConnectX5"
0x506b4b0300db0610     "plato10 MT4119 ConnectX5"
0x506b4b0300db0834     "plato11 MT4119 ConnectX5"
**0x98039b030009b696     "plato12 MT4119 ConnectX5"**
0x506b4b03001571e6     "bee1    MT4119 ConnectX5"
0x506b4b03001571fa     "bee2    MT4119 ConnectX5"
0x98039b030084992a     "bee3    MT4119 ConnectX5"
0x98039b0300849992     "bee4    MT4119 ConnectX5"
0x506b4b03001571ae     "UNKNOWN MT4119 ConnectX5"

(just in case something else is missing)

@carstenpatzke
Member

Thanks for the fast response. Do bee1 and co. show up? I am a little bit worried about the double spaces. I've checked our config, and we only use a single space after the hostname.
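(A quick sketch of why the double space would matter, assuming the server derives the caDescription by splitting the node description on single spaces, as the TopologyTreeBuilder.ts snippet later in this thread suggests; illustrative only.)

    // 'plato7 MT4119 ConnectX5'.split(' ')  -> ['plato7', 'MT4119', 'ConnectX5']     => index 1 is 'MT4119'
    // 'plato7  MT4119 ConnectX5'.split(' ') -> ['plato7', '', 'MT4119', 'ConnectX5'] => index 1 is ''
    const description = 'plato7  MT4119 ConnectX5';   // note the double space after the hostname
    const caDescription = description.split(' ')[1];  // -> '' and therefore an empty caDescription tag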

@carstenpatzke
Member

By the way, you don't actually need to add normal hosts to the node_name_map file; they are normally detected automatically.
We use this file for unmanaged/unnamed switches only, but it should also work with normal hosts of course.

@wrussian
Author

This means the string in the node_desc file,
'MT4119 ConnectX5  Mellanox Technologies',
causes the problem, as it contains two spaces between 'ConnectX5' and 'Mellanox'. I fixed node_name_map to:

0x506b4b03005e9460 "infini2-atlas:MSB7800"
0x98039b0300d72610 "infiniband-s6:MSB7800/U1"
0x506b4b0300cbffde "atlas20 MT4119 ConnectX5"
0x506b4b0300cbffce "atlas21 MT4119 ConnectX5"
0x506b4b0300cc0002 "atlas22 MT4119 ConnectX5"
0x506b4b0300cbff66 "atlas23 MT4119 ConnectX5"
0x506b4b0300db07d4 "atlas24 MT4119 ConnectX5"
0x506b4b0300db07f8 "atlas25 MT4119 ConnectX5"
0x506b4b0300db0c7c "atlas26 MT4119 ConnectX5"
0x506b4b0300db06e4 "atlas27 MT4119 ConnectX5"
0x506b4b0300db0aec "atlas28 MT4119 ConnectX5"
0x506b4b0300db0b0c "atlas29 MT4119 ConnectX5"
0x506b4b0300db0984 "atlas30 MT4119 ConnectX5"
0x506b4b0300db07d8 "atlas31 MT4119 ConnectX5"
0x506b4b0300db0a44 "atlas32 MT4119 ConnectX5"
0x506b4b0300db07d0 "atlas33 MT4119 ConnectX5"
0x506b4b0300db07dc "atlas34 MT4119 ConnectX5"
0x506b4b0300db0c78 "atlas35 MT4119 ConnectX5"
0x506b4b0300db08b0 "atlas36 MT4119 ConnectX5"
0x506b4b0300db098c "atlas37 MT4119 ConnectX5"
0x98039b030009b722 "atlas38 MT4119 ConnectX5"
0x98039b030009b6fe "atlas39 MT4119 ConnectX5"
0x98039b030009b726 "atlas40 MT4119 ConnectX5"
0x98039b030009b70a "atlas41 MT4119 ConnectX5"
0x98039b03002f2bae "atlas42 MT4119 ConnectX5"
0x98039b03002f0a1e "atlas43 MT4119 ConnectX5"
0x98039b03002f0d8e "atlas44 MT4119 ConnectX5"
0x98039b03002f0d92 "atlas45 MT4119 ConnectX5"
0x98039b03002f0a32 "atlas46 MT4119 ConnectX5"
0x98039b03002f0d9a "atlas47 MT4119 ConnectX5"
0x98039b03002f28ae "atlas48 MT4119 ConnectX5"
0x98039b03002f0d96 "atlas49 MT4119 ConnectX5"
0x98039b03002f0a26 "atlas50 MT4119 ConnectX5"
0x98039b03002f285a "atlas51 MT4119 ConnectX5"
0xb8599f0300c30d8a "atlas52 MT4119 ConnectX5"
0xb8599f0300c30d6e "atlas53 MT4119 ConnectX5"
0xb8599f0300c30ce2 "atlas54 MT4119 ConnectX5"
0xb8599f0300c30aae "atlas55 MT4119 ConnectX5"
0x506b4b0300db0838 "plato7 MT4119 ConnectX5"
0x506b4b0300db0930 "plato8 MT4119 ConnectX5"
0x98039b030009b8ba "plato9 MT4119 ConnectX5"
0x506b4b0300db0610 "plato10 MT4119 ConnectX5"
0x506b4b0300db0834 "plato11 MT4119 ConnectX5"
0x98039b030009b696 "plato12 MT4119 ConnectX5"
0x506b4b03001571e6 "bee1 MT4119 ConnectX5"
0x506b4b03001571fa "bee2 MT4119 ConnectX5"
0x98039b030084992a "bee3 MT4119 ConnectX5"
0x98039b0300849992 "bee4 MT4119 ConnectX5"
0x506b4b03001571ae "UNKNOWN 4119 ConnectX5"

and ran the following 'test scenarios':

  • rebuilt and restarted the radar-daemon only ---> error messages and state in GUI unchanged
  • restarted the radar-web apps ---> error messages and state in GUI unchanged
  • stopped web-gui and radar-daemon; started web-gui and radar-daemon ---> error messages and state in GUI unchanged

I also checked in vain whether this action:

plato7:~ # echo -e "MT4119 ConnectX5 Mellanox Technologies\c" > /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/infiniband/mlx5_0/node_desc
plato7:~ # find /sys -name node_des\* -exec ls -1 {} \; -exec cat {} \;
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/infiniband/mlx5_0/node_desc
MT4119 ConnectX5 Mellanox Technologies

could help.

@carstenpatzke
Member

I'm sorry that this issue is taking so long to resolve.

I guess you already did this, but I am pretty sure that the empty caDescription comes from the old double-spaced file.
Since I have not had the time to update the Dockerfile yet, are you sure that the new daemon container has the updated node map?

Is the error on the web server still the same?
Error: A 400 Bad Request error occurred: {"error":"partial write: unable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato8,caDescription=,portNumber=1,connectionId=7376746d1ff6725de3984e3a527f5673,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato9,caDescription=,portNumber=1,connectionId=7eaf49d2c1f846be39fd3a3cf610e267,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato7,caDescription=,portNumber=1,connectionId=2e74a05d1dc8e9bd381b7ae595814894,connectionSide=B rcv=57i,xmit=57i': missing tag value dropped=0"}

@wrussian
Author

No worries, I really appreciate your fast responses. I'm sure the modified file is inside the image:

> atlas52:~/software/infiniband-radar-daemon # docker run -it --entrypoint /bin/bash 54bbb6315351 -s
> 
> root@16109ac85bae:/config# grep plato /etc/node_name_map 
> 0x506b4b0300db0838 "plato7 MT4119 ConnectX5"
> 0x506b4b0300db0930 "plato8 MT4119 ConnectX5"
> 0x98039b030009b8ba "plato9 MT4119 ConnectX5"
> 0x506b4b0300db0610 "plato10 MT4119 ConnectX5"
> 0x506b4b0300db0834 "plato11 MT4119 ConnectX5"
> 0x98039b030009b696 "plato12 MT4119 ConnectX5"

but the nodes plato[7-9] and bee[1-4] are still displayed in the web GUI without a caDescription.
What I don't understand yet is that the timeline displays events starting from April 16th, although I also rm-ed all radar-web Docker images. It is not clear to me where the persistent DB info is stored. (I thought a DB clean-up might help, too, but actually the application should handle all (topo) changes, right?)

@carstenpatzke
Member

The data is stored in the data directory.
https://github.com/infiniband-radar/infiniband-radar-web/blob/master/docker-compose.yml#L92-L93
To clear all the containers you can use docker-compose down

@wrussian
Author

Hello Carsten, it's strange: after purging the DB, restarting radar-web and radar-daemon (with the corrected node_name_map), and also explicitly setting the node_desc file on bee[1-4] to
'MT4119 ConnectX5 Mellanox Technologies' (via the echo command), the gateway still reports
the following error for each upload:

[Tue, 21 Apr 2020 06:09:16 GMT][Debug][ApiServer] Took 6 ms to process (PUT) '/api/v2/metrics/edr-fabric' StatusCode: 500
Error: A 400 Bad Request error occurred: {"error":"partial write: unable to parse 'port_bandwidth,fabric=edr-fabric,hostname=bee4,caDescription=,portNumber=1,connectionId=d35b7151dafa46f6cfc9bca2bf9f50f2,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=bee3,caDescription=,portNumber=1,connectionId=031487f1d974f0d8d6791be27d600d8f,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=bee2,caDescription=,portNumber=1,connectionId=d443f49ca9380426a20f3010fb44733e,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=bee1,caDescription=,portNumber=1,connectionId=ede6f69f317e9b56a239d64a4cb08e5f,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato8,caDescription=,portNumber=1,connectionId=7376746d1ff6725de3984e3a527f5673,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato9,caDescription=,portNumber=1,connectionId=7eaf49d2c1f846be39fd3a3cf610e267,connectionSide=B rcv=57i,xmit=57i': missing tag value\nunable to parse 'port_bandwidth,fabric=edr-fabric,hostname=plato7,caDescription=,portNumber=1,connectionId=2e74a05d1dc8e9bd381b7ae595814894,connectionSide=B rcv=57i,xmit=57i': missing tag value dropped=0"}

at IncomingMessage.Create.res.on (/home/node/server/node_modules/influx/lib/src/pool.js:49:38)
at IncomingMessage.emit (events.js:187:15)
at IncomingMessage.EventEmitter.emit (domain.js:442:20)
at endReadableNT (_stream_readable.js:1086:12)
at process._tickCallback (internal/process/next_tick.js:63:19)

@wrussian
Author

The date of the log messages is 3 hours behind the actual time of the timezone assigned to the node.

@carstenpatzke
Copy link
Member

carstenpatzke commented Apr 21, 2020

It's interesting that the erroneous nodes are still the same ones that had the double space.

Could you try to modify the following lines:
https://github.com/infiniband-radar/infiniband-radar-web/blob/c3a66bda20f55a0e974ed3dc4792a33fe4d2b5c6/server/src/lib/TopologyTreeBuilder.ts#L101-L103

to

    private static getCaDescription(type: HostType, rawCa: RawTopologyCa) {
        console.log('getCaDescriptionDebug: ', rawCa.description.replace(/\s/g, '[SPACE]'));
        return type === HostType.Switch ? '(Switch)' : rawCa.description.split(' ')[1];
    }

You should see some lines with the getCaDescriptionDebug: prefix.
With this information I might be able to see what the system is receiving.

The caDescription is parsed when the topology is initially sent.

After modifying, you still need to run
docker-compose up --build server
or
docker-compose up --build -d server and docker-compose logs -f server

@wrussian
Author

Hi Carsten, I'll try to apply your changes asap; I'm currently busy troubleshooting an MPI problem. Sorry for the latency. Cheers, -Frank

@wrussian
Author

Hi Carsten, I patched the TopologyTreeBuilder.ts file and did the restart as you described above.
The command

vlxtest01:~/software/infiniband-radar-web # GATEWAY_CERTS="$PWD/ssl_certs" docker-compose logs -f server
ERROR: No such service: server

didn't work, with or without the env variable set.
I captured the output of:
vlxtest01:~/software/infiniband-radar-web # GATEWAY_CERTS="$PWD/ssl_certs" docker-compose logs > /tmp/ib-radar-web-server-20200424-1014.log 2>&1
and attached the file. It contains the desired output, especially for the problematic nodes bee*, plato*.
ib-radar-web-server-20200424-1014.log
I'm sorry for the delay. Cheers, -Frank

@carstenpatzke
Member

Oh, my command was wrong; the right command would use api instead of server, but the log contained the changes anyway.
So I currently don't know why the description is empty. I assume that somehow the database survived despite the deletion and that I have a caching issue in the api-server.

To hotfix this issue, you can patch this line here:

https://github.com/infiniband-radar/infiniband-radar-web/blob/703fd57c0115952bdcbe730d9966d48ac4b121e0/server/src/services/MetricDatabase.ts#L246

                tags: {
                    fabric: fabricId,
                    hostname: port.ca.host.hostname,
                    caDescription: port.ca.description || 'Unknown',
                    portNumber: String(port.portNumber),
                    connectionId: port.connection.connectionId,
                    connectionSide: port.connection.portA === port ?
                        MetricDatabase.CONNECTION_SIDE_A : MetricDatabase.CONNECTION_SIDE_B,
                }
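(A small note on the || 'Unknown' fallback above: an empty string is falsy in TypeScript, so the fallback kicks in exactly when the description is missing and the tag is never written empty; a minimal check:)

    const empty: string = '';
    const normal: string = 'MT4119';
    console.log(empty || 'Unknown');   // -> 'Unknown'  (an empty description becomes a valid tag value)
    console.log(normal || 'Unknown');  // -> 'MT4119'   (non-empty descriptions pass through unchanged)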

@wrussian
Author

Here you go:
ib-radar-web-server-20200424-1014.log

@carstenpatzke
Member

carstenpatzke commented Apr 27, 2020

Your last log already contained these changes. Did you try to apply the hotfix (and did it work)?
If so, you can delete the

console.log('getCaDescriptionDebug: ', rawCa.description.replace(/\s/g, '[SPACE]'));

line again.

@wrussian
Author

Sorry, I overlooked your hotfix. I applied it and removed the debug code snippet. It seems the hotfix solved the upload problem. On ib-radar-web (docker-compose logs for the api service):

api_1       | [Mon, 27 Apr 2020 11:30:26 GMT][Debug][ApiServer] Took 6 ms to process (PUT) '/api/v2/metrics/edr-fabric' StatusCode: 204
api_1       | [Mon, 27 Apr 2020 11:30:31 GMT][Debug][ApiServer] Took 5 ms to process (PUT) '/api/v2/metrics/edr-fabric' StatusCode: 204
api_1       | [Mon, 27 Apr 2020 11:30:36 GMT][Debug][ApiServer] Took 6 ms to process (PUT) '/api/v2/metrics/edr-fabric' StatusCode: 204
api_1       | [Mon, 27 Apr 2020 11:30:41 GMT][Debug][ApiServer] Took 5 ms to process (PUT) '/api/v2/metrics/edr-fabric' StatusCode: 204
api_1       | [Mon, 27 Apr 2020 11:30:46 GMT][Debug][ApiServer] Took 6 ms to process (PUT) '/api/v2/metrics/edr-fabric' StatusCode: 204
...

and on ib-radar-daemon node:

...
HTTP Request: '/v2/metrics/edr-fabric' (j/p/total)0/21/21 ms
Send port stats: 57ms
HTTP Request: '/v2/metrics/edr-fabric' (j/p/total)0/18/19 ms
Send port stats: 56ms
HTTP Request: '/v2/metrics/edr-fabric' (j/p/total)0/19/19 ms
Send port stats: 56ms
...

the upload and parser errors are gone now, but the nodes are still displayed without 'caDescription' and ports aren't displayed on node level:
[screenshots]

Also, the stats are not displayed for any interval other than 'last 6 hours', although I executed a load test on atlas40 (<-> atlas39):
[screenshots]

Is there a way to clean up the database via docker-compose, to start with a freshly uploaded topology?

@carstenpatzke
Member

You can clear the containers and database by using:

docker-compose down

and

rm -rf data

@wrussian
Author

Many thanks for the hot-fix. I think you can close this issue.
