
CI/CD e2e-tier-0 tests failures in bluechi-controller #462

Closed
Yarboa opened this issue Jun 14, 2024 · 3 comments · Fixed by #460
Labels: bug (Something isn't working), testing

Comments

Yarboa (Collaborator) commented Jun 14, 2024

Tier e2e tests started to fail on qm-node:

Jun 14 15:48:57 control bluechi-controller[158]: Node 'qm-node1' disconnected
Jun 14 15:49:53 control bluechi-controller[158]: Registered managed node from fd 11 as 'qm-node1'

This happens even though the bluechi-agent inside qm reports connectivity. The bluechi-controller status on control:

podman exec  control bash -c "systemctl status bluechi-controller"
● bluechi-controller.service - BlueChi Controller systemd service
     Loaded: loaded (/usr/lib/systemd/system/bluechi-controller.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-06-14 14:12:45 UTC; 1h 40min ago

Jun 14 15:52:16 control bluechi-controller[158]: Node 'qm-node1' disconnected
Jun 14 15:52:45 control bluechi-controller[158]: Registered managed node from fd 11 as 'qm-node1'

And the agent status on node1:

podman exec node1 bash -c "systemctl status bluechi-agent"
● bluechi-agent.service - BlueChi systemd service controller agent daemon
     Loaded: loaded (/usr/lib/systemd/system/bluechi-agent.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-06-14 14:14:00 UTC; 1h 40min ago


Jun 14 14:14:00 node1 systemd[1]: Started BlueChi systemd service controller agent daemon.
Jun 14 14:14:00 node1 bluechi-agent[1960]: Starting bluechi-agent 0.9.0-0.202405230627.git23191d3
Jun 14 14:14:00 node1 bluechi-agent[1960]: Connecting to controller on tcp:host=10.90.0.2,port=842
Jun 14 14:14:00 node1 bluechi-agent[1960]: Connected to controller as 'node1'
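
To time how often 'qm-node1' flaps, one option (a sketch; unit and container names taken from the setup above) is to follow the controller journal and keep only the connect/disconnect lines:

# Follow the bluechi-controller journal inside the control container and
# print only the disconnect/registration events, so the flap interval is visible.
podman exec control bash -c "journalctl -u bluechi-controller -f --no-pager | grep --line-buffered -E 'disconnected|Registered managed node'"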

After more testing, it seems the network is down every 60-90 seconds.
Installed in node1:
dnf -y install --releasever 9 --installroot /usr/lib/qm/rootfs python iputils

ControllerHost=10.90.0.2
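
For reference, this is the address the agent inside qm should be using; a quick way to double-check (the config path is the standard bluechi agent location, assumed here):

# Check which ControllerHost the agent inside the qm container actually reads
# (standard bluechi config directory; adjust if the drop-in lives elsewhere).
podman exec qm bash -c "grep -ri ControllerHost /etc/bluechi/"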

[root@node1 ~]# podman exec qm bash -c "ping 10.90.0.2"

bash-5.1# ping  10.90.0.2
PING 10.90.0.2 (10.90.0.2) 56(84) bytes of data.
64 bytes from 10.90.0.2: icmp_seq=1 ttl=63 time=0.255 ms
64 bytes from 10.90.0.2: icmp_seq=14 ttl=63 time=0.213 ms
64 bytes from 10.90.0.2: icmp_seq=48 ttl=63 time=2368 ms
64 bytes from 10.90.0.2: icmp_seq=49 ttl=63 time=1344 ms
64 bytes from 10.90.0.2: icmp_seq=50 ttl=63 time=320 ms
64 bytes from 10.90.0.2: icmp_seq=51 ttl=63 time=0.282 ms
64 bytes from 10.90.0.2: icmp_seq=52 ttl=63 time=0.136 ms
64 bytes from 10.90.0.2: icmp_seq=53 ttl=63 time=0.135 ms
64 bytes from 10.90.0.2: icmp_seq=54 ttl=63 time=0.216 ms

It also happens from the namespace itself

[root@node1 ~]# ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 ping 10.90.0.2
PING 10.90.0.2 (10.90.0.2) 56(84) bytes of data.

While ping from node1 itself to the controller address is not interrupted.
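
To line the latency spikes up against the 'disconnected' timestamps in the controller journal, a timestamped ping could help (a sketch using the same address as above; -D prints an epoch timestamp per reply and -O reports missed replies):

# Timestamp every reply and report missed ones, so the ~2s spikes can be
# correlated with the disconnect times logged by bluechi-controller.
podman exec qm bash -c "ping -D -O 10.90.0.2"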

Yarboa added the bug and testing labels on Jun 14, 2024
Yarboa (Collaborator, Author) commented Jun 16, 2024

I also see this from ip netns

[root@node1 ~]# ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 netstat -st
IcmpMsg:
    InType0: 2594
    OutType3: 1
    OutType8: 9218
Tcp:
    1756 active connection openings
    0 passive connection openings
    0 failed connection attempts
    1717 connection resets received
    1 connections established
    136362 segments received
    141673 segments sent out
    14798 segments retransmitted
    0 bad segments received
    16 resets sent
UdpLite:
TcpExt:
    3 TCP sockets finished time wait in fast timer
    9 packets rejected in established connections because of timestamp
    1747 delayed acks sent
    Quick ack mode was activated 763 times
    1759 packet headers predicted
    61560 acknowledgments not containing data payload received
    32411 predicted acknowledgments
    TCPLostRetransmit: 11314
    TCPTimeouts: 13056
    TCPLossProbes: 1742
    TCPBacklogCoalesce: 5
    TCPDSACKOldSent: 763
    TCPRcvCoalesce: 70
    TCPOrigDataSent: 39469
    TCPKeepAlive: 62338
    TCPDelivered: 37735
    TcpTimeoutRehash: 13056
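
The TCPLostRetransmit/TCPTimeouts counters look high; a rough way to see whether they keep climbing during the flaps (a sketch, reusing the namespace name from the output above):

# Sample the retransmit/timeout counters every 10 seconds; steadily growing
# values point at ongoing packet loss rather than a one-off event.
while true; do
  date
  ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 netstat -st \
    | grep -E 'retransmitted|TCPTimeouts|TCPLostRetransmit'
  sleep 10
done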

@dougsland Maybe we need to sync all containers with NTP.
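
Before pulling in NTP, a quick check of the clock state inside each container might be enough (a sketch; timedatectl should be available in these systemd-based containers):

# Compare clock and sync status across the containers.
for c in control node1; do
  echo "== $c =="
  podman exec "$c" timedatectl
done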

Yarboa (Collaborator, Author) commented Jun 16, 2024

I also see this

[root@default-0 ~]# date
Sun Jun 16 04:37:05 AM EDT 2024
[root@default-0 ~]# podman exec -it node1 bash
[root@node1 ~]# date
Sun Jun 16 08:37:17 UTC 2024
[root@node1 ~]# 
[root@node1 ~]# exit
exit
[root@default-0 ~]# podman exec -it control bash
[root@control ~]# date
Sun Jun 16 08:37:33 UTC 2024

Need to check adding --tz=local to the control and node1 containers.

Followed this blog https://www.redhat.com/sysadmin/tick-tock-container-time
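
If the containers are recreated, the flag from the blog would look roughly like this (a sketch; the image names are placeholders, not the ones used by the e2e setup):

# Hypothetical re-creation with the host's timezone so 'date' inside the
# containers matches the host; podman accepts --tz on create/run.
podman run -d --name control --tz=local <control-image>
podman run -d --name node1   --tz=local <node1-image>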

dougsland changed the title from "CI/CD e2e-tier-0 tests failures in bluchi-controller" to "CI/CD e2e-tier-0 tests failures in bluechi-controller" on Jul 3, 2024
dougsland (Collaborator) commented:
Quoting the comment above:

> Need to check adding --tz=local to control and node1
>
> Followed this blog https://www.redhat.com/sysadmin/tick-tock-container-time

I remember this one: #394

