[mizar][local][Scale-out][2TPx2RPx2worker] Intermittently failed to connect to tenant's service: Connection timed out #1410
Comments
In the [mizar][local][Scale-out][2TPx2RPx2worker] AWS Ubuntu 20.04 environment, two engineers reproduced the issue: running curl against 10.0.12.172 (the service IP) in a 20-iteration for loop produced 4 failures. The service has 4 pods.
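For reference, a minimal sketch of that kind of for-loop check, assuming the service IP from the report and an arbitrary 5-second connect timeout:

```bash
# Sketch of the 20-iteration connectivity check described above.
# 10.0.12.172 is the service IP from the report; the timeout value is an assumption.
for i in $(seq 1 20); do
  if curl -s -o /dev/null --connect-timeout 5 http://10.0.12.172; then
    echo "attempt $i: ok"
  else
    echo "attempt $i: FAILED"
  fi
done
```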
The same symptom was reproduced on a kube-up environment (2 TPs, 2 RPs, each RP with 2 workers).
@phudtran If you have the time, can you take a look?
I believe it is related to this issue that @Hong-Chang looked into a while back.
Hong and I narrowed the issue down in a local 2x2x2 scale-out development environment on AWS Ubuntu 20.04. With a deployment of 1 replica (1 pod) and a service mapping to that single pod, we found that logging into the pod and connecting to the service that maps back to the pod itself hangs 100% of the time.
Service 10.0.217.56 (service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local) maps to pod/ying-nginx-1bbb-7564746584-lbkl8. When we logged into that pod (namespace ns-1bbb, 1/1 Running, pod IP 11.1.0.16, node ip-172-31-9-41), it failed to connect to its own service IP 10.0.217.56 (service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local). But when connecting to the other two service IPs, 10.0.104.1 and 10.0.228.105, 20 times each, all connections were successful.
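A minimal sketch of that check, using the pod, namespace, and service IPs quoted above (the connect timeout is an assumption, curl is assumed to be available in the container image as in the report, and on Arktos a tenant flag may also be needed for kubectl):

```bash
# Exec into the pod behind service 10.0.217.56 and curl its own service --
# per the report this hangs every time.
kubectl exec -n ns-1bbb ying-nginx-1bbb-7564746584-lbkl8 -- \
  curl -v --connect-timeout 10 http://10.0.217.56   # hangs / times out

# The same pod reaches the other two services without failures.
kubectl exec -n ns-1bbb ying-nginx-1bbb-7564746584-lbkl8 -- \
  curl -s -o /dev/null --connect-timeout 10 http://10.0.104.1 && echo ok
kubectl exec -n ns-1bbb ying-nginx-1bbb-7564746584-lbkl8 -- \
  curl -s -o /dev/null --connect-timeout 10 http://10.0.228.105 && echo ok
```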
I did some further investigation. This issue is not related to checksum calculation; I don't see any log mentioning a checksum error. I later figured out it is a scenario Mizar missed: a pod connects to a service, and that service points back to the exact same pod. Furthermore, this issue has likely been in Mizar all along; we just hadn't noticed it. It is not related to Arktos: I reproduced the exact same issue with Kubernetes + Mizar, so this issue does not belong in the arktos repo. I created a new issue in the mizar repo and am closing this one. The new issue is CentaurusInfra/mizar#648, "Connectivity issue when pod A sends traffic to a service which points back to pod A".
What happened:
Per the test plan at https://github.com/CentaurusInfra/arktos/wiki/Mizar-Arktos-Integration-Release-2022-0130-Test-Plan, when testing connectivity to tenant bbb's services, the error "failed to connect to <service-name.ns.svc.cluster.local> port 80: Connection timed out" intermittently occurs during 11 rounds of for-loop tests connecting to tenant bbb's three services.
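A minimal sketch of such a for-loop test, run from a client that can resolve cluster DNS; the three service DNS names below are placeholders following the naming pattern quoted elsewhere in this report, not the exact names:

```bash
# Sketch of the 11-round connectivity test described above.
# Service DNS names are placeholders based on this report's naming pattern.
for round in $(seq 1 11); do
  for svc in \
      service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local \
      service-2bbb-ns-1bbb.ns-1bbb.svc.cluster.local \
      service-3bbb-ns-1bbb.ns-1bbb.svc.cluster.local; do
    if curl -s -o /dev/null --connect-timeout 10 "http://${svc}:80"; then
      echo "round ${round} ${svc}: ok"
    else
      echo "round ${round} ${svc}: connection timed out"
    fi
  done
done
```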
Error:
What you expected to happen:
Connectivity to the tenant's services should always succeed during for-loop tests.
How to reproduce it (as minimally and precisely as possible):
The environment is local scale-out, 2 TP x 2 RP x 2 worker, with cniplugin=mizar:
TP1: ip-172-31-4-135
TP2: ip-172-31-1-189
RP1: ip-172-31-2-248
RP1-Worker1: ip-172-31-8-138
RP1-Worker2: ip-172-31-3-8
RP2: ip-172-31-9-41
RP2-Worker1: ip-172-31-10-32
RP2-Worker2: ip-172-31-3-16
For example:
Reference: failure summary across the 12 nginx containers:
RP1: 3 containers
RP2: 1 container
RP1-Worker1: 1 container
RP1-Worker2: 2 containers
RP2-Worker1: 3 containers
RP2-Worker2: 2 containers
Anything else we need to know?:
The error "failed to connect to <service-name.ns.svc.cluster.local> port 80: Connection timed out" intermittently happens in [mizar][local][Scale-out][2TPx2RP] clusters as well.
Environment:
Kubernetes version (kubectl version):
OS (cat /etc/os-release): Ubuntu 20.04
Kernel (uname -a): 5.13.0-1017-aws