
Repeated CI instability on minikube (travis) #240

Closed
1 of 4 tasks
mmazur opened this issue May 31, 2019 · 16 comments
Labels
kind/bug, lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@mmazur
Contributor

mmazur commented May 31, 2019

/kind bug

We've been intermittently hitting an issue when running our PVC creation tests, which fails with this error message:

Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Internal error occurred: resource quota evaluates timeout\",\"reason\":\"InternalError\",\"details\":{\"causes\":[{\"message\":\"resource quota evaluates timeout\"}]},\"code\":500}\\n'"

Since it wasn't consistent, we chalked it up to TravisCI being weird. Now we're on Jenkins, with completely different hardware running the cluster and tests, and we're seeing the same issue.

Google says that this is rare, but does happen on occasion:

  1. Few PVCs are failed to create while creating 100 pvcs using script gluster/gluster-csi-driver#92
  2. kubectl failed to create resource due to "resource quota evaluates timeout" kubernetes/kubernetes#67531

Two scenarios come to my mind at this time:

  1. K8S has performance knobs that can be tuned, the https://github.com/kubevirt/kubevirtci cluster we're using has them turned down too far, and on occasion some part of the cluster (etcd?) chokes on the workload.
    • Possible action: ask kubevirtci people about their take on this
  2. This is simply an obscure bug and we're unlucky.
    • Action 1: reorder the tasks in our test playbooks to maybe avoid hitting the issue. Tried; it didn't help, the failure just happens on a different task now. (See comments.)
    • Action 2: get our Jenkins mock-runner to endlessly loop the tests on my own hardware and hope to hit the bug, at which point I can pull debug info out of it.
    • Action 3: connect with our CI people and have them freeze a CI machine for further debugging the next time we hit the issue.

Suggestions welcome.
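
To make the failure mode concrete, here is a minimal sketch of the kind of PVC-creation step involved, with a retry loop bolted on to ride out the transient 500. This is not the project's actual test code: the generic k8s Ansible module stands in for whatever the test playbooks really use (e.g. kubevirt_pvc), and the names, sizes, and retry counts are invented for illustration.

# Hedged sketch only; module choice, names, and retry values are assumptions.
- name: Create a test PVC, retrying on transient apiserver errors
  k8s:
    state: present
    definition:
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: ci-test-pvc          # hypothetical name
        namespace: default
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 1Gi
  register: pvc_result
  retries: 5                       # ride out "resource quota evaluates timeout" 500s
  delay: 10
  until: pvc_result is succeeded

A retry like this would only paper over the symptom, of course; the sketch is just to show where the 500 surfaces during the run.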

@mmazur
Contributor Author

mmazur commented May 31, 2019

@pkliczewski

@mmazur can you check how the kubevirt CI is set up? I think they should have solved these issues already. Let's reuse what they did so far.

@mmazur
Contributor Author

mmazur commented May 31, 2019

@davidvossel has just started working on a consumable version of the cluster setup/teardown scripts. As soon as they're ready, we'll switch. If that fixes this issue, great. If not, then at least it won't be just our problem. :)

@mmazur
Contributor Author

mmazur commented May 31, 2019

It just occurred to me that Travis was running minikube while Jenkins runs kubevirtci, so it must be a weird bug we're hitting rather than something specific to one environment.

pkliczewski pushed a commit that referenced this issue Jun 3, 2019
* Run PVC tests first to maybe avoid hitting #240

* Show useful info when finishing tests

* Test both stable and devel ansible
@mmazur changed the title from "Repeated CI instability when running PVC tests" to "Repeated CI instability on Travis" on Jun 25, 2019
@mmazur
Contributor Author

mmazur commented Jun 25, 2019

Seems that ever since the kubevirt_pvc tests were moved to run first, the same issue gets hit in a different playbook. This build log hits the same issue in the preset playbook, which now runs near the end of the tests, where kubevirt_pvc.yml used to be.

This would suggest we are occasionally hitting some kind of bug when we overburden the cluster with all of our testing.

It's unfortunate I didn't start documenting this earlier, as I'm not sure now whether this is Travis-only or whether we've seen it on Jenkins as well. I'll assume it's Travis-only until I get a matching log from Jenkins to link here, though if I'm to trust the younger me from a month ago, it did happen on Jenkins as well.

@mmazur
Contributor Author

mmazur commented Jun 25, 2019

Yup, the failures happen close to the end of the Ansible run, but not always at the same point:

@mmazur
Contributor Author

mmazur commented Jun 26, 2019

Just checked and minikube does not set any memory limits on the containers (except the coredns ones), so with 7+ gigs at its disposal, the cluster is probably not running out of memory.
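
For reference, a check along these lines is easy to script from a playbook. This is only a sketch and assumes the k8s_facts module (renamed k8s_info in later Ansible releases); it's not part of the actual test suite.

# Sketch: list every container in the cluster that has no memory limit set.
# Assumes k8s_facts (k8s_info in newer Ansible); not taken from the real tests.
- name: Gather all pods in the cluster
  k8s_facts:
    kind: Pod
  register: cluster_pods

- name: Report containers without a memory limit
  debug:
    msg: "{{ item.0.metadata.namespace }}/{{ item.0.metadata.name }}: {{ item.1.name }}"
  loop: "{{ cluster_pods.resources | subelements('spec.containers') }}"
  when: (item.1.resources.limits | default({})).memory is not defined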

@mmazur
Contributor Author

mmazur commented Jun 26, 2019

NAMESPACE     NAME                               READY   STATUS             RESTARTS   AGE
cdi           cdi-apiserver-599d6fd49-kdzzb      1/1     Running            0          11m
cdi           cdi-deployment-5d4764ff96-gvsx5    0/1     CrashLoopBackOff   4          11m
cdi           cdi-operator-7cd554798c-xpx7k      0/1     CrashLoopBackOff   4          12m
cdi           cdi-uploadproxy-5c57b4db65-mwnxn   1/1     Running            0          11m
kube-system   coredns-fb8b8dccf-pg527            1/1     Running            0          13m
kube-system   coredns-fb8b8dccf-qx6r2            1/1     Running            0          13m
kube-system   etcd-minikube                      1/1     Running            0          12m
kube-system   kube-addon-manager-minikube        1/1     Running            0          12m
kube-system   kube-apiserver-minikube            1/1     Running            0          12m
kube-system   kube-controller-manager-minikube   0/1     CrashLoopBackOff   5          12m
kube-system   kube-proxy-bkmx6                   1/1     Running            0          13m
kube-system   kube-scheduler-minikube            0/1     CrashLoopBackOff   5          12m
kube-system   storage-provisioner                1/1     Running            0          13m
kubevirt      virt-api-854b6cbbb8-95hn6          1/1     Running            0          11m
kubevirt      virt-api-854b6cbbb8-nqvpz          1/1     Running            1          11m
kubevirt      virt-controller-546799f76-5jqwk    1/1     Running            2          11m
kubevirt      virt-controller-546799f76-t5k8d    0/1     CrashLoopBackOff   2          11m
kubevirt      virt-handler-fxfgc                 1/1     Running            0          11m
kubevirt      virt-operator-9646fdf49-5z7z2      0/1     CrashLoopBackOff   3          12m
kubevirt      virt-operator-9646fdf49-78wtb      1/1     Running            2          12m

Source: https://api.travis-ci.org/v3/job/550713626/log.txt
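
Since the control-plane pods themselves (kube-controller-manager, kube-scheduler) are crash-looping in that listing, one cheap way to get the debug info mentioned in Action 2 out of a CI run would be to dump the previous container logs before the job ends. A rough sketch; the pod name comes from the listing above, everything else is assumed:

# Rough sketch: capture logs from a crash-looping control-plane pod so the CI
# output contains something to debug. ignore_errors keeps the run going.
- name: Capture previous logs from the crash-looping controller-manager
  command: kubectl -n kube-system logs --previous kube-controller-manager-minikube
  register: kcm_logs
  ignore_errors: true

- name: Show the captured logs in the CI output
  debug:
    var: kcm_logs.stdout_lines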

@mmazur changed the title from "Repeated CI instability on Travis" to "Repeated CI instability on minikube (travis)" on Jun 27, 2019
@mmazur
Contributor Author

mmazur commented Jun 27, 2019

Added a workaround for this to our tests that kicks in when we're running on Travis (here). It's basically a bunch of sleeps between playbooks to let the cluster recover from errors.
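
Something along these lines would do it; the travis_ci flag and the 60-second pause below are placeholders, not the actual code behind the link.

# Illustrative only: the travis_ci variable and the pause length are made up.
- name: Give the cluster time to recover between playbooks when running on Travis
  pause:
    seconds: 60
  when: travis_ci | default(false) | bool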

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@pkliczewski

@mmazur Is it something we still want to work on?

@mmazur
Contributor Author

mmazur commented Sep 25, 2019

Yes, very much so. Next time I'm working on something in the vicinity, I'll definitely try to get this debugged, as we rely on Travis CI very heavily and it'd be great if it didn't fail every once in a while for no good reason.

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Dec 24, 2019
@kubevirt-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@kubevirt-bot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 23, 2020
@kubevirt-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@kubevirt-bot

@kubevirt-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
