network infra flakes for quay.io cdn DNS #852

Open
dustymabe opened this issue Apr 11, 2023 · 10 comments
Labels
jira For syncing to JIRA

Comments

@dustymabe
Member

We occasionally see a DNS flake when using our aarch64 multi-arch builder.

[2023-04-11T21:46:41.208Z] + cosa remote-session create --image quay.io/coreos-assembler/coreos-assembler:main --expiration 4h --workdir /home/jenkins/agent/workspace/kola-upgrade
[2023-04-11T21:46:41.208Z] notice: failed to look up uid in /etc/passwd; enabling workaround
[2023-04-11T21:46:41.463Z] Trying to pull quay.io/coreos-assembler/coreos-assembler:main...
[2023-04-11T21:46:41.720Z] Error: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/sha256/ff/ff59ae06a00f4d7543304a98dc73e8673786327b2dec2e853547b98c762c354b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230411%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230411T214641Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=b6996d8ba1615daa726fd54fa1c0b3bf07f1b53c7413f1bc2c8c19be7b9e86ba&cf_sign=c14ihmYA50IPx0pEFD5QKHb9lWxFjkCHqeHsSHAboHM3edzLcFyxdLso5XVbxvk9QQlU3k1%2B03axO8emqmmh6sdm7gfaO4LbyYPUg0S7lKiaNEp5E6QhxUO2gCot3m0qHUtIgEz3KNX6wWwPFIHIsUbMjR5VUuJdHFR%2B36RYJo5J4w3g1BvDIcwRjiBml6GIKlfWCvImELxRZtS1%2FISds3stNENUJCTv%2FFgiygbuJrLKumDONeTFlAFgYnlNqM1uSuB2qt%2FJgJaYkoSuBlcPMQpU37bMe9TEYwJUnKjh4Fdqy9ywBQ8tiyJ51VtsJPfalWoboG8hNJ%2FnFv2INWwYeQ%3D%3D&cf_expiry=1681250201&region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host
[2023-04-11T21:46:41.720Z] Error: exit status 125
[2023-04-11T21:46:41.720Z] Usage:
[2023-04-11T21:46:41.720Z]   remote-session create [flags]
[2023-04-11T21:46:41.720Z] 
[2023-04-11T21:46:41.720Z] Flags:
[2023-04-11T21:46:41.720Z]       --expiration string   The amount of time before the remote-session auto-exits (default "infinity")
[2023-04-11T21:46:41.720Z]   -h, --help                help for create
[2023-04-11T21:46:41.720Z]       --image string        The COSA container image to use on the remote (default "quay.io/coreos-assembler/coreos-assembler:main")
[2023-04-11T21:46:41.720Z]       --workdir string      The COSA working directory to use inside the container (default "/srv")
[2023-04-11T21:46:41.720Z] 
[2023-04-11T21:46:41.720Z] error: exit status 125

We only seem to see this on our aarch64 builder, which is located in AWS, where Quay's infrastructure is also hosted, IIUC.
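
For anyone wanting to confirm the flake outside the pipeline, a rough spot-check like the following could show whether lookups of that CDN host fail intermittently on the builder (a sketch only; the host name is taken from the log above, and the loop count and interval are arbitrary):

# Hypothetical spot-check, not part of the pipeline: repeatedly resolve the
# CDN host from the log above and note any lookup failures.
for i in $(seq 1 100); do
    getent hosts cdn03.quay.io > /dev/null || echo "$(date -Is) lookup failed (attempt $i)"
    sleep 1
done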

@jlebon
Member

jlebon commented Apr 13, 2023

Should we add a retry knob to cosa remote-session create?
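
Until such a knob exists, a caller-side retry could look roughly like this (a sketch only; the cosa invocation is copied from the log above, and the attempt count and sleep are guesses, not tuned values):

# Hypothetical wrapper around the command from the log above; a real wrapper
# would also want to fail the job if all attempts fail.
for attempt in 1 2 3; do
    cosa remote-session create \
        --image quay.io/coreos-assembler/coreos-assembler:main \
        --expiration 4h \
        --workdir /home/jenkins/agent/workspace/kola-upgrade && break
    echo "remote-session create failed (attempt ${attempt}); retrying in 10s"
    sleep 10
done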

@dustymabe
Member Author

dustymabe commented Apr 13, 2023

Probably. I'm not sure how intermittent the network problem is; it might resolve itself in a second, or it might last tens of seconds, so we'd have to experiment.

@dustymabe
Member Author

We could also experiment with using DNS from outside AWS on that builder and see if that helps.
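
A sketch of what that experiment could look like with a systemd-resolved drop-in, run as root on the builder (the resolver addresses are illustrative placeholders, not a recommendation):

# Hypothetical drop-in pointing systemd-resolved at resolvers outside AWS;
# 1.1.1.1 / 8.8.8.8 are placeholder addresses.
mkdir -p /etc/systemd/resolved.conf.d
cat > /etc/systemd/resolved.conf.d/external-dns.conf <<'EOF'
[Resolve]
DNS=1.1.1.1 8.8.8.8
EOF
systemctl restart systemd-resolved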

@dustymabe
Member Author

At least podman build has the ability to retry when pulling from the registry; I see no such option for podman run.

@jlebon
Member

jlebon commented Apr 28, 2023

We discussed this out-of-band. There's no retry option for podman pull either, but we could wrap it in our own retry, e.g. 3 attempts.
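
A minimal sketch of that wrapper, assuming 3 attempts with a short pause between them (the image name is the one from the log above; the 5s delay is a guess):

# Retry the pull a few times before giving up, as discussed above.
ok=""
for attempt in 1 2 3; do
    if podman pull quay.io/coreos-assembler/coreos-assembler:main; then
        ok=1
        break
    fi
    echo "pull failed (attempt ${attempt}); retrying in 5s"
    sleep 5
done
[ -n "$ok" ] || exit 1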

@dustymabe
Member Author

@edsantiago

@dustymabe disabling systemd-resolved fixed everything for us. We went from dozens of flakes per day to zero in a month, except that we're still seeing the flake in Fedora gating tests, a different setup than Cirrus, on which I have not disabled systemd-resolved (it has been on my TODO list for two weeks). And no, this is not an AWS-only issue. Anywhere systemd-resolved is used, it will flake.
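
For reference, the workaround described here looks roughly like the following on a systemd-resolved host (a sketch only, run as root; the nameserver address is a placeholder):

# Workaround sketch: stop/disable systemd-resolved and replace the stub
# resolv.conf symlink with a static file (placeholder nameserver address).
systemctl disable --now systemd-resolved
rm -f /etc/resolv.conf
echo "nameserver 1.1.1.1" > /etc/resolv.conf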

@c4rt0 added the jira label (For syncing to JIRA) on Jul 20, 2023
@thrix

thrix commented Jul 24, 2023

In the same boat here: we disabled systemd-resolved on Testing Farm workers back in 2021 or so, and have had no more weird DNS issues since :(

I will follow up on this tomorrow; it seems it is time to find the root cause of this problem.

Until then, we will most probably just disable it as a workaround in Fedora CI, CentOS Stream CI, and Packit.

@dustymabe
Member Author

I chimed in over in containers/podman#19770 (comment)

@dustymabe
Member Author

We should be able to switch to running podman pull with --retry once containers/podman@80b1e95 lands in an FCOS release.
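
Assuming that commit adds a --retry flag to podman pull as described, the pull would then look something like this (the retry count is illustrative):

# Let podman itself retry the pull on transient registry/DNS errors.
podman pull --retry 3 quay.io/coreos-assembler/coreos-assembler:main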
