
CI/CD e2e-tier-0 tests failures in bluechi-controller #462

Closed
Yarboa opened this issue Jun 14, 2024 · 3 comments · Fixed by #460
Labels: bug (Something isn't working), testing

Comments

Yarboa (Collaborator) commented Jun 14, 2024

Tier e2e tests started to fail on qm-node:

Jun 14 15:48:57 control bluechi-controller[158]: Node 'qm-node1' disconnected
Jun 14 15:49:53 control bluechi-controller[158]: Registered managed node from fd 11 as 'qm-node1'

This happens even though the bluechi-agent inside qm reports connectivity. The bluechi-controller status on control:

podman exec  control bash -c "systemctl status bluechi-controller"
● bluechi-controller.service - BlueChi Controller systemd service
     Loaded: loaded (/usr/lib/systemd/system/bluechi-controller.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-06-14 14:12:45 UTC; 1h 40min ago

Jun 14 15:52:16 control bluechi-controller[158]: Node 'qm-node1' disconnected
Jun 14 15:52:45 control bluechi-controller[158]: Registered managed node from fd 11 as 'qm-node1'

And the agent status on node1:

podman exec node1 bash -c "systemctl status bluechi-agent"
● bluechi-agent.service - BlueChi systemd service controller agent daemon
     Loaded: loaded (/usr/lib/systemd/system/bluechi-agent.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-06-14 14:14:00 UTC; 1h 40min ago


Jun 14 14:14:00 node1 systemd[1]: Started BlueChi systemd service controller agent daemon.
Jun 14 14:14:00 node1 bluechi-agent[1960]: Starting bluechi-agent 0.9.0-0.202405230627.git23191d3
Jun 14 14:14:00 node1 bluechi-agent[1960]: Connecting to controller on tcp:host=10.90.0.2,port=842
Jun 14 14:14:00 node1 bluechi-agent[1960]: Connected to controller as 'node1'
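
To time how often 'qm-node1' flaps, one option (a sketch; unit and container names taken from the setup above) is to follow the controller journal and keep only the connect/disconnect lines:

# Follow the bluechi-controller journal inside the control container and
# print only the disconnect/registration events, so the flap interval is visible.
podman exec control bash -c "journalctl -u bluechi-controller -f --no-pager | grep --line-buffered -E 'disconnected|Registered managed node'"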

After more testing, it seems the network is down every 60-90 seconds.
Installed in node1:
dnf -y install --releasever 9 --installroot /usr/lib/qm/rootfs python iputils

ControllerHost=10.90.0.2
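
For reference, this is the address the agent inside qm should be using; a quick way to double-check (the config path is the standard bluechi agent location, assumed here):

# Check which ControllerHost the agent inside the qm container actually reads
# (standard bluechi config directory; adjust if the drop-in lives elsewhere).
podman exec qm bash -c "grep -ri ControllerHost /etc/bluechi/"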

[root@node1 ~]# podman exec qm bash -c "ping 10.90.0.2"

bash-5.1# ping  10.90.0.2
PING 10.90.0.2 (10.90.0.2) 56(84) bytes of data.
64 bytes from 10.90.0.2: icmp_seq=1 ttl=63 time=0.255 ms
64 bytes from 10.90.0.2: icmp_seq=14 ttl=63 time=0.213 ms
64 bytes from 10.90.0.2: icmp_seq=48 ttl=63 time=2368 ms
64 bytes from 10.90.0.2: icmp_seq=49 ttl=63 time=1344 ms
64 bytes from 10.90.0.2: icmp_seq=50 ttl=63 time=320 ms
64 bytes from 10.90.0.2: icmp_seq=51 ttl=63 time=0.282 ms
64 bytes from 10.90.0.2: icmp_seq=52 ttl=63 time=0.136 ms
64 bytes from 10.90.0.2: icmp_seq=53 ttl=63 time=0.135 ms
64 bytes from 10.90.0.2: icmp_seq=54 ttl=63 time=0.216 ms

It also happens from the namespace itself

[root@node1 ~]# ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 ping 10.90.0.2
PING 10.90.0.2 (10.90.0.2) 56(84) bytes of data.

While ping from node1 itself to the controller address is not interrupted.
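
To line the latency spikes up against the 'disconnected' timestamps in the controller journal, a timestamped ping could help (a sketch using the same address as above; -D prints an epoch timestamp per reply and -O reports missed replies):

# Timestamp every reply and report missed ones, so the ~2s spikes can be
# correlated with the disconnect times logged by bluechi-controller.
podman exec qm bash -c "ping -D -O 10.90.0.2"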

Yarboa added the bug and testing labels on Jun 14, 2024
Yarboa (Collaborator, Author) commented Jun 16, 2024

I also see this from ip netns

[root@node1 ~]# ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 netstat -st
IcmpMsg:
    InType0: 2594
    OutType3: 1
    OutType8: 9218
Tcp:
    1756 active connection openings
    0 passive connection openings
    0 failed connection attempts
    1717 connection resets received
    1 connections established
    136362 segments received
    141673 segments sent out
    14798 segments retransmitted
    0 bad segments received
    16 resets sent
UdpLite:
TcpExt:
    3 TCP sockets finished time wait in fast timer
    9 packets rejected in established connections because of timestamp
    1747 delayed acks sent
    Quick ack mode was activated 763 times
    1759 packet headers predicted
    61560 acknowledgments not containing data payload received
    32411 predicted acknowledgments
    TCPLostRetransmit: 11314
    TCPTimeouts: 13056
    TCPLossProbes: 1742
    TCPBacklogCoalesce: 5
    TCPDSACKOldSent: 763
    TCPRcvCoalesce: 70
    TCPOrigDataSent: 39469
    TCPKeepAlive: 62338
    TCPDelivered: 37735
    TcpTimeoutRehash: 13056
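
The TCPLostRetransmit/TCPTimeouts counters look high; a rough way to see whether they keep climbing during the flaps (a sketch, reusing the namespace name from the output above):

# Sample the retransmit/timeout counters every 10 seconds; steadily growing
# values point at ongoing packet loss rather than a one-off event.
while true; do
  date
  ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 netstat -st \
    | grep -E 'retransmitted|TCPTimeouts|TCPLostRetransmit'
  sleep 10
done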

@dougsland Maybe we need to sync all containers with NTP.
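
Before pulling in NTP, a quick check of the clock state inside each container might be enough (a sketch; timedatectl should be available in these systemd-based containers):

# Compare clock and sync status across the containers.
for c in control node1; do
  echo "== $c =="
  podman exec "$c" timedatectl
done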

Yarboa (Collaborator, Author) commented Jun 16, 2024

I also see this

[root@default-0 ~]# date
Sun Jun 16 04:37:05 AM EDT 2024
[root@default-0 ~]# podman exec -it node1 bash
[root@node1 ~]# date
Sun Jun 16 08:37:17 UTC 2024
[root@node1 ~]# 
[root@node1 ~]# exit
exit
[root@default-0 ~]# podman exec -it control bash
[root@control ~]# date
Sun Jun 16 08:37:33 UTC 2024

Need to check adding --tz=local to the control and node1 containers.

Followed this blog https://www.redhat.com/sysadmin/tick-tock-container-time
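
If the containers are recreated, the flag from the blog would look roughly like this (a sketch; the image names are placeholders, not the ones used by the e2e setup):

# Hypothetical re-creation with the host's timezone so 'date' inside the
# containers matches the host; podman accepts --tz on create/run.
podman run -d --name control --tz=local <control-image>
podman run -d --name node1   --tz=local <node1-image>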

dougsland changed the title from "CI/CD e2e-tier-0 tests failures in bluchi-controller" to "CI/CD e2e-tier-0 tests failures in bluechi-controller" on Jul 3, 2024
dougsland (Collaborator) commented:
Quoting the comment above:

> Need to check adding --tz=local to control and node1
>
> Followed this blog https://www.redhat.com/sysadmin/tick-tock-container-time

I remember this one: #394

