Repeated CI instability on minikube (travis) #240
Btw, the timeout on this is only 10 seconds: https://github.com/kubernetes/kubernetes/blob/cc67ccfd7f4f0bc96d7f1c8e5fe8577821757d03/plugin/pkg/admission/resourcequota/controller.go#L620
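Given that the admission-side wait is that short, one generic client-side mitigation is to retry the creation. Below is a minimal bash sketch, not taken from this repo's scripts; the manifest path, retry count, and sleep length are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch only (not from this repo): retry a resource creation a few times so a
# transient resource-quota admission timeout doesn't fail the whole run.
# PVC_MANIFEST, MAX_RETRIES, and the 15s sleep are illustrative assumptions.
set -u

PVC_MANIFEST="${1:-pvc.yml}"   # hypothetical manifest path
MAX_RETRIES=5

for attempt in $(seq 1 "$MAX_RETRIES"); do
    if kubectl apply -f "$PVC_MANIFEST"; then
        echo "created on attempt ${attempt}"
        exit 0
    fi
    echo "attempt ${attempt} failed (possibly the quota admission timeout), retrying in 15s..." >&2
    sleep 15
done

echo "still failing after ${MAX_RETRIES} attempts" >&2
exit 1
```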
@mmazur can you check how kubevirt CI is working? I think they should have solved the issues already. Let's reuse what they did so far.
@davidvossel has started working on a consumable version of the cluster setup/teardown scripts just now. As soon as they're ready, we'll switch. If it fixes this issue, great. If not, then it won't be just our problem. :)
Just occurred to me that Travis was running minikube and Jenkins is running kubevirtci, so it must be a weird bug we're hitting.
* Run PVC tests first to maybe avoid hitting #240
* Show useful info when finishing tests
* Test both stable and devel ansible
Seems that ever since […]. This would suggest we are occasionally hitting some kind of bug when we overburden the cluster with all of our testing. It's unfortunate I didn't start documenting this earlier, as I'm not sure now whether this is Travis-only or whether we've seen this on Jenkins as well. I'll go with Travis-only until I get a similar log from Jenkins to link here, though if I am to trust the younger me from a month ago, this did happen on Jenkins as well.
Yup, the failures happen close to the end of the ansible run, but not always at the same point:
Just checked and minikube does not set any memory limits on the containers (except the coredns ones), so with 7+ gigs at its disposal, the cluster is probably not running out of memory.
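For reference, a quick way to double-check that observation on a live cluster; this is purely illustrative and not part of the project's tooling:

```bash
# List memory limits per container across all namespaces; empty fields mean
# no limit is set.
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{": "}{range .spec.containers[*]}{.resources.limits.memory}{" "}{end}{"\n"}{end}'

# And what the node actually has requested/allocated:
kubectl describe nodes | grep -A 8 "Allocated resources"
```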
Added a workaround for this to our tests that kicks in when we're running on travis (here). Basically a bunch of sleeps to let the cluster recover from errors between playbooks.
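For anyone reading along, the rough shape of that kind of workaround is sketched below. This is a hedged illustration with made-up playbook names and sleep length, not the actual test runner code; the only real detail assumed is that Travis exports TRAVIS=true.

```bash
#!/usr/bin/env bash
# Sketch only: pause between playbook runs so the cluster can settle,
# but only when running on Travis. Playbook names and the 60s pause
# are illustrative assumptions.
set -e

run_playbook() {
    ansible-playbook "$1"
    if [ "${TRAVIS:-false}" = "true" ]; then
        echo "On Travis, sleeping to let the cluster recover before the next playbook..."
        sleep 60
    fi
}

run_playbook tests/storage.yml   # hypothetical playbook
run_playbook tests/network.yml   # hypothetical playbook
```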
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
@mmazur Is it something we still want to work on?
Yes, very much so. Next time I'm working on something in the vicinity, I'll definitely try to get this debugged, as we rely on the Travis CI very heavily and it'd be great if it didn't fail every once in a while for no good reason.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@kubevirt-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind bug
We've been hitting a random issue when running our pvc creation tests, which ends up with this error message: […]
Since it wasn't consistent, we've chalked it up to TravisCI being weird. Now we're on Jenkins with completely different hardware running the cluster+tests and are seeing the same issue.
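To narrow this down outside the full Ansible run, a standalone loop like the one below could be used to try to reproduce the intermittent failure. It is a sketch under assumptions: the manifest path and iteration count are illustrative and not part of the test suite.

```bash
#!/usr/bin/env bash
# Sketch only: repeatedly create and delete a PVC to see whether the
# intermittent failure reproduces outside the Ansible tests.
# PVC_MANIFEST and ITERATIONS are illustrative assumptions.
set -u

PVC_MANIFEST="${1:-pvc.yml}"   # hypothetical manifest defining one PVC
ITERATIONS=50

for i in $(seq 1 "$ITERATIONS"); do
    echo "--- iteration ${i} ---"
    if ! kubectl apply -f "$PVC_MANIFEST"; then
        echo "PVC creation failed on iteration ${i}" >&2
        exit 1
    fi
    kubectl delete -f "$PVC_MANIFEST" --wait=true
done

echo "no failure reproduced after ${ITERATIONS} iterations"
```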
Google says that this is rare, but does happen on occasion:
Two scenarios come to my mind at this time:
Action 1: mess around with task ordering in our test playbooks to maybe help avoid hitting the issue. Tried, didn't help; it just happens on a different task now. (See comments.) Suggestions welcome.