flaky cross build #974

Closed
BenTheElder opened this issue Oct 20, 2019 · 10 comments · Fixed by #975
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@BenTheElder
Member

BenTheElder commented Oct 20, 2019

/assign
/priority important-soon
/lifecycle active

docker: Error response from daemon: failed to mkdir /docker-graph/volumes/kind-build-cache/_data/bin: mkdir /docker-graph/volumes/kind-build-cache/_data/bin: file exists.

see:
#961 (comment)

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kind/961/pull-kind-build/1185113192976617474/build-log.txt

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kind-build/1185845621353877504

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kind/648/pull-kind-build/1185680916383535105

@BenTheElder BenTheElder added the kind/bug Categorizes issue or PR as related to a bug. label Oct 20, 2019
@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 20, 2019
@BenTheElder
Member Author

This is likely because we're launching many containers against this volume in parallel, which is mounted over a directory structure already containing sub-directories (and thus unioned).

Without any mounts:

$ docker run --rm -it golang:1.13.3 ls /go
bin  src

Locally this is typically not an issue because the volume has already been used. It might happen on clean builds, but it's going to be racy.

The straightforward answer is to ensure that the directory / volume is fully set up before we start launching many parallel containers.
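
To make the "set up first, then fan out" ordering concrete, here is a minimal sketch. It is not the change from #975. The volume name comes from the error message and the image from the example above; the `/go` mount point, the helper names `docker_run` and `build_all`, and the placeholder build command are all assumptions made for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

VOLUME = "kind-build-cache"  # volume name taken from the error message
IMAGE = "golang:1.13.3"      # image used in the example above
MOUNT = "/go"                # assumed mount point (the image ships /go/bin and /go/src)


def docker_run(command, volume=VOLUME, image=IMAGE, mount=MOUNT):
    """Build the argv for one throwaway container with the volume mounted."""
    return ["docker", "run", "--rm", "-v", f"{volume}:{mount}", image, *command]


def build_all(targets, run=subprocess.check_call, max_workers=4):
    """Warm the volume with a single container, then start the parallel ones."""
    # Step 1: one serial container, so the very first mount is never contended.
    run(docker_run(["true"]))

    def build(target):
        # Hypothetical placeholder for the real per-target build command.
        run(docker_run(["sh", "-c", f"echo building {target}"]))

    # Step 2: only now fan out; every container sees an already-populated volume.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(build, targets))  # list() surfaces any worker exception
```

Passing `run` as a parameter keeps the ordering easy to check without a docker daemon: `build_all(["linux/amd64"], run=print)` prints the warm-up command first and the build command second.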

@BenTheElder
Member Author

#975 should fix this.

@tao12345666333
Member

I did some quick checks when I found this error.

Under normal circumstances, there is no problem with creating a volume repeatedly.

I quickly searched the docker source code and found three places where the same error might be output, but I haven't had time to verify it yet.

@BenTheElder
Member Author

Right.
The problem isn't that we create the volume repeatedly; the problem is that we concurrently mount it for the first time over existing directories.

The bin subdir that is already in the container image must be created in the volume. When we start many containers concurrently for the first time against the volume, they racily try to make this directory.
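
The race described here can be modeled without docker. The sketch below is only a model of a check-then-create pattern, not docker's actual code (which nobody in this thread has verified): a temporary directory stands in for the volume, threads stand in for containers, and a barrier forces the worst-case interleaving so the outcome is deterministic. The names `first_mount` and `simulate` are hypothetical.

```python
import os
import tempfile
import threading


def first_mount(volume_dir, barrier, errors):
    """Stand-in for one container's first mount: check for bin/, then create it."""
    target = os.path.join(volume_dir, "bin")
    missing = not os.path.exists(target)
    barrier.wait()  # every worker finishes its check before any of them creates
    if missing:
        try:
            os.mkdir(target)  # not idempotent: fails if a sibling got there first
        except FileExistsError as exc:
            errors.append(exc)


def simulate(workers, prewarm):
    """Return how many workers hit 'file exists' against a fresh volume."""
    errors = []
    with tempfile.TemporaryDirectory() as volume_dir:
        if prewarm:
            # One serial setup step before anything runs in parallel.
            os.mkdir(os.path.join(volume_dir, "bin"))
        barrier = threading.Barrier(workers)
        threads = [
            threading.Thread(target=first_mount, args=(volume_dir, barrier, errors))
            for _ in range(workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return len(errors)
```

With eight workers on a fresh directory, `simulate(8, prewarm=False)` returns 7: one `mkdir` wins and the other seven fail. With `prewarm=True` it returns 0, which is why populating the volume once before the parallel launch avoids the error.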

@tao12345666333
Member

If we can continue to reproduce this problem, I'd be very happy to verify and fix it (on the docker side) after the holiday.

It seems that the most recent CIs have passed. 😂

@BenTheElder
Member Author

I already sent a fix :-) #974 (comment) -- working on verification.

CI failed in two of the last three runs: https://prow.k8s.io/?job=ci-kind-build

@BenTheElder
Member Author

BenTheElder commented Oct 20, 2019

oh: do you mean in docker proper?
I'm not sure this is actually considered a bug on docker's end? We're certainly abusing docker a bit in CI here 🤔

@tao12345666333
Member

Yes, on the docker side. I think this use case (concurrent mounts) is probably quite common.

If I remember correctly, the last time I saw it was probably in the code related to file copying. (I don't have a computer right now, so I can't verify it in more detail.)

@BenTheElder
Member Author

so far I've not been able to hit the bug again (testing in #648), but I'm reasonably confident in the solution.

@tao12345666333
Member

Then we can keep your solution 👍. My holiday ends in three days; after that I can do more testing and verification.


3 participants