
Repeated CI instability on minikube (travis) #240

Closed
1 of 4 tasks
mmazur opened this issue May 31, 2019 · 16 comments
Labels
kind/bug, lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@mmazur
Contributor

mmazur commented May 31, 2019

/kind bug

We've been intermittently hitting an issue when running our PVC creation tests, which fails with this error message:

Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Internal error occurred: resource quota evaluates timeout\",\"reason\":\"InternalError\",\"details\":{\"causes\":[{\"message\":\"resource quota evaluates timeout\"}]},\"code\":500}\\n'"

Since it wasn't consistent, we chalked it up to TravisCI being weird. Now we're on Jenkins, with completely different hardware running the cluster and tests, and we're seeing the same issue.

Google says that this is rare, but does happen on occasion:

  1. Few PVCs are failed to create while creating 100 pvcs using script gluster/gluster-csi-driver#92
  2. kubectl failed to create resource due to "resource quota evaluates timeout" kubernetes/kubernetes#67531

Two scenarios come to my mind at this time:

  1. K8S has performance knobs that can be tuned, the https://github.com/kubevirt/kubevirtci cluster we're using has them turned down too far, and on occasion some part of the cluster (etcd?) chokes on the workload.
    • Possible action: ask kubevirtci people about their take on this
  2. This is simply an obscure bug and we're unlucky.
    • Action 1: reorder the tasks in our test playbooks to maybe avoid hitting the issue. Tried; it didn't help, the failure just happens on a different task now. (See comments.)
    • Action 2: get our Jenkins mock-runner to endlessly loop the tests on my own hardware and hope to hit the bug, at which point I can pull debug info out of it.
    • Action 3: connect with our CI people and have them freeze a CI machine for further debugging the next time we hit the issue.

Suggestions welcome.
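
To make the failure mode concrete, here is a minimal sketch of the kind of PVC-creation step involved, with a retry loop bolted on to ride out the transient 500. This is not the project's actual test code: the generic k8s Ansible module stands in for whatever the test playbooks really use (e.g. kubevirt_pvc), and the names, sizes, and retry counts are invented for illustration.

# Hedged sketch only; module choice, names, and retry values are assumptions.
- name: Create a test PVC, retrying on transient apiserver errors
  k8s:
    state: present
    definition:
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: ci-test-pvc          # hypothetical name
        namespace: default
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 1Gi
  register: pvc_result
  retries: 5                       # ride out "resource quota evaluates timeout" 500s
  delay: 10
  until: pvc_result is succeeded

A retry like this would only paper over the symptom, of course; the sketch is just to show where the 500 surfaces during the run.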

@mmazur
Contributor Author

mmazur commented May 31, 2019

@pkliczewski

@mmazur can you check how the kubevirt CI is set up? I think they should have solved these issues already. Let's reuse what they did so far.

@mmazur
Contributor Author

mmazur commented May 31, 2019

@davidvossel has just started working on a consumable version of the cluster setup/teardown scripts. As soon as they're ready, we'll switch. If that fixes this issue, great. If not, then at least it won't be just our problem. :)

@mmazur
Contributor Author

mmazur commented May 31, 2019

It just occurred to me that Travis was running minikube while Jenkins runs kubevirtci, so it must be a weird bug we're hitting rather than something specific to one environment.

pkliczewski pushed a commit that referenced this issue Jun 3, 2019
* Run PVC tests first to maybe avoid hitting #240

* Show useful info when finishing tests

* Test both stable and devel ansible
@mmazur changed the title from "Repeated CI instability when running PVC tests" to "Repeated CI instability on Travis" on Jun 25, 2019
@mmazur
Contributor Author

mmazur commented Jun 25, 2019

Seems that ever since the kubevirt_pvc tests were moved to run first, the same issue gets hit in a different playbook. This build log hits the same issue in the preset playbook, which now runs near the end of the tests, where kubevirt_pvc.yml used to be.

This would suggest we are occasionally hitting some kind of bug when we overburden the cluster with all of our testing.

It's unfortunate I didn't start documenting this earlier, as I'm not sure now whether this is Travis-only or whether we've seen it on Jenkins as well. I'll assume it's Travis-only until I get a matching log from Jenkins to link here, though if I'm to trust the younger me from a month ago, it did happen on Jenkins as well.

@mmazur
Contributor Author

mmazur commented Jun 25, 2019

Yup, the failures happen close to the end of the Ansible run, but not always at the same point:

@mmazur
Contributor Author

mmazur commented Jun 26, 2019

Just checked and minikube does not set any memory limits on the containers (except the coredns ones), so with 7+ gigs at its disposal, the cluster is probably not running out of memory.
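
For reference, a check along these lines is easy to script from a playbook. This is only a sketch and assumes the k8s_facts module (renamed k8s_info in later Ansible releases); it's not part of the actual test suite.

# Sketch: list every container in the cluster that has no memory limit set.
# Assumes k8s_facts (k8s_info in newer Ansible); not taken from the real tests.
- name: Gather all pods in the cluster
  k8s_facts:
    kind: Pod
  register: cluster_pods

- name: Report containers without a memory limit
  debug:
    msg: "{{ item.0.metadata.namespace }}/{{ item.0.metadata.name }}: {{ item.1.name }}"
  loop: "{{ cluster_pods.resources | subelements('spec.containers') }}"
  when: (item.1.resources.limits | default({})).memory is not defined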

@mmazur
Contributor Author

mmazur commented Jun 26, 2019

NAMESPACE     NAME                               READY   STATUS             RESTARTS   AGE
cdi           cdi-apiserver-599d6fd49-kdzzb      1/1     Running            0          11m
cdi           cdi-deployment-5d4764ff96-gvsx5    0/1     CrashLoopBackOff   4          11m
cdi           cdi-operator-7cd554798c-xpx7k      0/1     CrashLoopBackOff   4          12m
cdi           cdi-uploadproxy-5c57b4db65-mwnxn   1/1     Running            0          11m
kube-system   coredns-fb8b8dccf-pg527            1/1     Running            0          13m
kube-system   coredns-fb8b8dccf-qx6r2            1/1     Running            0          13m
kube-system   etcd-minikube                      1/1     Running            0          12m
kube-system   kube-addon-manager-minikube        1/1     Running            0          12m
kube-system   kube-apiserver-minikube            1/1     Running            0          12m
kube-system   kube-controller-manager-minikube   0/1     CrashLoopBackOff   5          12m
kube-system   kube-proxy-bkmx6                   1/1     Running            0          13m
kube-system   kube-scheduler-minikube            0/1     CrashLoopBackOff   5          12m
kube-system   storage-provisioner                1/1     Running            0          13m
kubevirt      virt-api-854b6cbbb8-95hn6          1/1     Running            0          11m
kubevirt      virt-api-854b6cbbb8-nqvpz          1/1     Running            1          11m
kubevirt      virt-controller-546799f76-5jqwk    1/1     Running            2          11m
kubevirt      virt-controller-546799f76-t5k8d    0/1     CrashLoopBackOff   2          11m
kubevirt      virt-handler-fxfgc                 1/1     Running            0          11m
kubevirt      virt-operator-9646fdf49-5z7z2      0/1     CrashLoopBackOff   3          12m
kubevirt      virt-operator-9646fdf49-78wtb      1/1     Running            2          12m

Source: https://api.travis-ci.org/v3/job/550713626/log.txt
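
Since the control-plane pods themselves (kube-controller-manager, kube-scheduler) are crash-looping in that listing, one cheap way to get the debug info mentioned in Action 2 out of a CI run would be to dump the previous container logs before the job ends. A rough sketch; the pod name comes from the listing above, everything else is assumed:

# Rough sketch: capture logs from a crash-looping control-plane pod so the CI
# output contains something to debug. ignore_errors keeps the run going.
- name: Capture previous logs from the crash-looping controller-manager
  command: kubectl -n kube-system logs --previous kube-controller-manager-minikube
  register: kcm_logs
  ignore_errors: true

- name: Show the captured logs in the CI output
  debug:
    var: kcm_logs.stdout_lines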

@mmazur changed the title from "Repeated CI instability on Travis" to "Repeated CI instability on minikube (travis)" on Jun 27, 2019
@mmazur
Contributor Author

mmazur commented Jun 27, 2019

Added a workaround for this to our tests that kicks in when we're running on Travis (here). It's basically a bunch of sleeps between playbooks to let the cluster recover from errors.
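
Something along these lines would do it; the travis_ci flag and the 60-second pause below are placeholders, not the actual code behind the link.

# Illustrative only: the travis_ci variable and the pause length are made up.
- name: Give the cluster time to recover between playbooks when running on Travis
  pause:
    seconds: 60
  when: travis_ci | default(false) | bool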

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@pkliczewski

@mmazur Is it something we still want to work on?

@mmazur
Contributor Author

mmazur commented Sep 25, 2019

Yes, very much so. Next time I'm working on something in the vicinity, I'll definitely try to get this debugged, as we rely on Travis CI very heavily and it'd be great if it didn't fail every once in a while for no good reason.

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Dec 24, 2019
@kubevirt-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@kubevirt-bot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 23, 2020
@kubevirt-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@kubevirt-bot

@kubevirt-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
