cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs #2567
Conversation
Force-pushed b205500 → bbc87bc
It seems like a lot of code for something that should, hopefully, only ever show up in RHCOS promotion jobs. Do we have more cases where getting the end of the bootkube logs, or something like that, would also make for a really good set of analysis?

/hold Let's make sure we have a GitHub issue that we use to discuss the actual use and scope of the analysis, or even an openshift/enhancements proposal, before we think about adding this.
Spun off #2569 for "do we want this?"

@wking is there a new tarball available anywhere to use as base data for testing? Thanks!
Force-pushed bbc87bc → e20102c
And yes, many, many of these in CI: https://search.ci.openshift.org/?search=Bootstrap%20gather%20logs%20captured%20here

Closing this as it has been stalled for a very long time. Please feel free to re-open in case you intend to keep pushing it, or consider tracking the work in an issue if this requires agreement before implementation. /close
@abhinavdahiya: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen I will keep pushing :) |
@wking: Reopened this PR.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
@wking: The following tests failed, say
Full PR test history. Your PR dashboard.
Force-pushed 61bf20c → 1c874dc

Force-pushed 759c627 → 2755198
I've pushed 759c627 -> 2755198, updating my fixtures to include more entry samples, and adjusted the parser to be more robust so it doesn't choke on any of those samples.
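One way to be "robust" against odd entry samples is to parse the journal line by line and skip entries that fail to decode rather than aborting the whole analysis. The sketch below assumes `journalctl -o json`-style input and illustrative field/function names (`entry`, `parseEntries`); it is not the PR's actual parser.

```go
// Hypothetical tolerant journal-entry parser: decode what we can,
// count what we can't, and never abort the analysis on a bad sample.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// entry holds the journal fields an analysis might care about.
type entry struct {
	Message  string `json:"MESSAGE"`
	Unit     string `json:"UNIT"`
	Priority string `json:"PRIORITY"`
}

// parseEntries decodes journal-export JSON one line at a time,
// skipping lines it cannot decode instead of choking on them.
func parseEntries(data string) (entries []entry, skipped int) {
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		var e entry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			skipped++ // tolerate odd samples (e.g. binary MESSAGE payloads)
			continue
		}
		entries = append(entries, e)
	}
	return entries, skipped
}

func main() {
	data := `{"MESSAGE":"crio.service: Failed with result 'exit-code'.","UNIT":"crio.service","PRIORITY":"3"}
not json at all
{"MESSAGE":"Started Kubernetes Kubelet.","UNIT":"kubelet.service","PRIORITY":"6"}`
	entries, skipped := parseEntries(data)
	fmt.Printf("parsed %d entries, skipped %d\n", len(entries), skipped)
}
```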
Instead of just dropping them into the user's lap ("here's a big tarball, have fun"), look through them for obvious things that we can summarize. With:

```diff
 func runGatherBootstrapCmd(directory string) error {
+	return analyzeGatheredBootstrap("/tmp/log-bundle.tar.gz")
```

to feed [1] into the analysis logic, the output looks like:

```text
WARNING control-plane/10.0.134.229 had failing systemd units: crio.service
WARNING control-plane/10.0.134.229: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:31 UTC; 320ms ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8491 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8491 (code=exited, status=1/FAILURE)
      CPU: 144ms

Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:31 ip-10-0-134-229 crio[8491]: time="2019-10-24 11:11:31.895986612Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Consumed 144ms CPU time

WARNING control-plane/10.0.134.243 had failing systemd units: crio.service
WARNING control-plane/10.0.134.243: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:35 UTC; 8s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8439 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8439 (code=exited, status=1/FAILURE)
      CPU: 151ms

Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:35 ip-10-0-134-243 crio[8439]: time="2019-10-24 11:11:35.238163016Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Consumed 151ms CPU time

WARNING control-plane/10.0.157.61 had failing systemd units: crio.service
WARNING control-plane/10.0.157.61: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:36 UTC; 1s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8379 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8379 (code=exited, status=1/FAILURE)
      CPU: 158ms

Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:36 ip-10-0-157-61 crio[8379]: time="2019-10-24 11:11:36.807284677Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Consumed 158ms CPU time
```

That's maybe a bit noisy, but mostly because all three control-plane machines failed the same way. Journal field docs are linked from the Go docstrings, but in case these evolve later, the ones I'm using are from [2], except for UNIT, where I have opened an upstream documentation request [3]. It might be worth exposing this as:

```console
$ openshift-install analyze bootstrap PATH
```

so folks could look at bootstrap logs which had been gathered by third parties, but I'm punting on that for now.
[1]: https://storage.googleapis.com/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.3/2455/artifacts/e2e-aws/installer/log-bundle-20191024111122.tar
[2]: https://github.com/systemd/systemd/blob/v246/man/systemd.journal-fields.xml
[3]: systemd/systemd#17538
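The "failing systemd units" warnings above could be derived by scanning each host's journal for systemd's failure messages. The sketch below is a simplified illustration of that idea — `failingUnits` and the regex are hypothetical, not the PR's implementation, which works from structured journal fields rather than rendered text.

```go
// Hypothetical sketch: flag failed systemd units from a rendered
// journal dump by matching "UNIT: Failed with result" lines.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// failedUnitRe matches systemd's failure message, e.g.
// "... systemd[1]: crio.service: Failed with result 'exit-code'."
var failedUnitRe = regexp.MustCompile(`systemd\[1\]: ([\w@.-]+\.(?:service|socket|mount)): Failed with result`)

// failingUnits returns the distinct units that failed in a journal dump.
func failingUnits(journal string) []string {
	seen := map[string]bool{}
	var units []string
	for _, m := range failedUnitRe.FindAllStringSubmatch(journal, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			units = append(units, m[1])
		}
	}
	return units
}

func main() {
	journal := `Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'.`
	units := failingUnits(journal)
	fmt.Printf("WARNING control-plane/10.0.134.229 had failing systemd units: %s\n", strings.Join(units, ", "))
}
```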
Force-pushed 46e6124 → f7a1d33
❤️ I've pushed 46e6124 -> f7a1d33 to scrape SSH errors out of #4345's
Force-pushed f7a1d33 → f7c9ff4
Because release-image access is apparently open:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324825813257818112/artifacts/e2e-aws/ipi-install-install/log-bundle-20201106221538.tar | tar xOz log-bundle-20201106221538/bootstrap/journals/release-image.log
-- Logs begin at Fri 2020-11-06 21:50:55 UTC, end at Fri 2020-11-06 22:15:38 UTC. --
Nov 06 21:51:12 ip-10-0-9-32 systemd[1]: Starting Download the OpenShift Release Image...
Nov 06 21:51:12 ip-10-0-9-32 release-image-download.sh[1667]: Pulling registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b...
Nov 06 21:51:22 ip-10-0-9-32 release-image-download.sh[1667]: d1c62a5431ad4b5886dc9dcb36ed1b86ade893ab6a12dd80d3807d65f19fe9ca
Nov 06 21:51:22 ip-10-0-9-32 systemd[1]: Started Download the OpenShift Release Image.
```

while CI-release images are not:

```console
$ cat auth.json
{}
$ podman pull --authfile auth.json registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b
Trying to pull registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b...Getting image source signatures
Copying blob sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30
 3.44 MB / 70.44 MB [==>-------------------------------------------------------]^C
$ podman pull --authfile auth.json registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141
Trying to pull registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141...Failed
error pulling image "registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141": unable to pull registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141: unable to pull image: Error determining manifest MIME type for docker://registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141: Error reading manifest 4.7.0-0.ci-2020-11-06-074141 in registry.svc.ci.openshift.org/ocp/release: unauthorized: authentication required
```

I've pushed f7a1d33 -> f7c9ff4, hard-coding the auth-requiring CI release.
Force-pushed f7c9ff4 → fa8d68e
```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324841426894196736/artifacts/e2e-aws/ipi-install-install/log-bundle-20201106231049.tar | tar xOz log-bundle-20201106231049/bootstrap/journals/release-image.log
-- Logs begin at Fri 2020-11-06 22:45:49 UTC, end at Fri 2020-11-06 23:10:49 UTC. --
Nov 06 22:46:06 ip-10-0-25-221 systemd[1]: Starting Download the OpenShift Release Image...
Nov 06 22:46:06 ip-10-0-25-221 release-image-download.sh[1651]: Pulling registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141...
Nov 06 22:46:17 ip-10-0-25-221 release-image-download.sh[1651]: 75e4113eaae0d4aa041d405dea7dfb3745961dae05421c0420b3965d66deb5d2
Nov 06 22:46:17 ip-10-0-25-221 systemd[1]: Started Download the OpenShift Release Image.
```

I dunno why my empty auth-file didn't break the pull, but I've pushed f7c9ff4 -> fa8d68e to slot in a
Force-pushed fa8d68e → fe43889
This is because we are looping in the pulling script, not failing and restarting the unit:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324857293598101504/artifacts/e2e-aws/ipi-install-install/log-bundle-20201107001451.tar | tar xOz log-bundle-20201107001451/bootstrap/journals/release-image.log | tail -n4
Nov 07 00:14:49 ip-10-0-63-105 release-image-download.sh[1624]: Error: error pulling image "registry.svc.ci.openshift.org/ocp/release:does-not-exist": unable to pull registry.svc.ci.openshift.org/ocp/release:does-not-exist: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ocp/release:does-not-exist: Error reading manifest does-not-exist in registry.svc.ci.openshift.org/ocp/release: manifest unknown: manifest unknown
Nov 07 00:14:49 ip-10-0-63-105 release-image-download.sh[1624]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp/release:does-not-exist...
Nov 07 00:14:51 ip-10-0-63-105 release-image-download.sh[1624]: Error: error pulling image "registry.svc.ci.openshift.org/ocp/release:does-not-exist": unable to pull registry.svc.ci.openshift.org/ocp/release:does-not-exist: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ocp/release:does-not-exist: Error reading manifest does-not-exist in registry.svc.ci.openshift.org/ocp/release: manifest unknown: manifest unknown
Nov 07 00:14:51 ip-10-0-63-105 release-image-download.sh[1624]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp/release:does-not-exist...
```

Ideally we'd have a oneshot unit that we could restart until it succeeded, but which would still remain around with its logs and block dependent units until it completed. Looks like systemd grew support for that in systemd/systemd#13754, which went out in v244. Unfortunately, 4.7 is currently running:

```console
$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-10-27-051128 | grep "Red Hat Enterprise Linux CoreOS"
  machine-os 47.82.202010270142-0 Red Hat Enterprise Linux CoreOS
$ curl -s https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.7/47.82.202010270142-0/x86_64/commitmeta.json | jq -c '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "systemd")'
["systemd","0","239","31.el8_2.2","x86_64"]
```

So we'll have to wait until we get to 244 (or get a backport) before we get nice reporting for failed release-image pulls. I've pushed fa8d68e -> fe43889, dropping my debugging commits, and I think with that we're good to land this.

/assign @staebler
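Once the host reaches systemd ≥ 244 (where systemd/systemd#13754 allows `Restart=` for `Type=oneshot` units), the retry loop could move out of the script and into the unit itself. A hypothetical sketch, with illustrative unit and script names:

```ini
# Hypothetical release-image download unit for systemd >= 244, where
# Restart= became valid for Type=oneshot (systemd/systemd#13754).
[Unit]
Description=Download the OpenShift Release Image

[Service]
Type=oneshot
# Retry by restarting the failed unit instead of looping in the script,
# so each failure is visible as a unit failure in the journal.
Restart=on-failure
RestartSec=5
ExecStart=/usr/local/bin/release-image-download.sh

[Install]
WantedBy=multi-user.target
```

With this shape, an analysis tool would see `release-image-download.service: Failed with result 'exit-code'` entries between retries, instead of only script-level "Pull failed. Retrying..." log lines.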
/retest

/hold cancel Debugging commits removed.
@wking: The following tests failed, say
Full PR test history. Your PR dashboard.
/hold I guess I'll take another shot at building consensus around this when 4.8 opens up ;).

Sorry that I did not get time to look at this for 4.7.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting; if this issue is safe to close now, please do so. /lifecycle stale
@wking: PR needs rebase.
Replaced by #4751. |