cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs #2567
Conversation
Force-pushed b205500 → bbc87bc
It seems like a lot of code for something that should, hopefully, only ever show up in RHCOS promotion jobs. Do we have more cases where getting the end of the bootkube logs, or something like that, would also make for a really good set of analysis?

/hold Let's make sure we have a GitHub issue that we use to discuss the actual use and scope of the analysis, or even an openshift/enhancements proposal, before we think about adding this.
Spun off #2569 for "do we want this?"

@wking is there a new tarball available anywhere to use as base data for testing? Thanks!
Force-pushed bbc87bc → e20102c
And yes, many, many of these in CI: https://search.ci.openshift.org/?search=Bootstrap%20gather%20logs%20captured%20here

Closing this as it has been stalled for a very long time. Please feel free to re-open in case you intend to keep pushing it, or consider tracking the work in an issue if this requires agreement before implementation. /close
@abhinavdahiya: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen I will keep pushing :) |
@wking: Reopened this PR.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
@wking: The following tests failed, say
Full PR test history. Your PR dashboard.
Force-pushed 61bf20c → 1c874dc

Force-pushed 759c627 → 2755198
I've pushed 759c627 -> 2755198, updating my fixtures to include more entry samples, and adjusted the parser to be more robust so it doesn't choke on any of those samples.
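One way to be "robust" against odd entry samples is to parse the journal line by line and skip entries that fail to decode rather than aborting the whole analysis. The sketch below assumes `journalctl -o json`-style input and illustrative field/function names (`entry`, `parseEntries`); it is not the PR's actual parser.

```go
// Hypothetical tolerant journal-entry parser: decode what we can,
// count what we can't, and never abort the analysis on a bad sample.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// entry holds the journal fields an analysis might care about.
type entry struct {
	Message  string `json:"MESSAGE"`
	Unit     string `json:"UNIT"`
	Priority string `json:"PRIORITY"`
}

// parseEntries decodes journal-export JSON one line at a time,
// skipping lines it cannot decode instead of choking on them.
func parseEntries(data string) (entries []entry, skipped int) {
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		var e entry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			skipped++ // tolerate odd samples (e.g. binary MESSAGE payloads)
			continue
		}
		entries = append(entries, e)
	}
	return entries, skipped
}

func main() {
	data := `{"MESSAGE":"crio.service: Failed with result 'exit-code'.","UNIT":"crio.service","PRIORITY":"3"}
not json at all
{"MESSAGE":"Started Kubernetes Kubelet.","UNIT":"kubelet.service","PRIORITY":"6"}`
	entries, skipped := parseEntries(data)
	fmt.Printf("parsed %d entries, skipped %d\n", len(entries), skipped)
}
```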
Instead of just dropping them into the user's lap ("here's a big tarball, have fun"), look through them for obvious things that we can summarize. With:

```diff
 func runGatherBootstrapCmd(directory string) error {
+	return analyzeGatheredBootstrap("/tmp/log-bundle.tar.gz")
```

to feed [1] into the analysis logic, the output looks like:

```text
WARNING control-plane/10.0.134.229 had failing systemd units: crio.service
WARNING control-plane/10.0.134.229: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:31 UTC; 320ms ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8491 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8491 (code=exited, status=1/FAILURE)
      CPU: 144ms

Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:31 ip-10-0-134-229 crio[8491]: time="2019-10-24 11:11:31.895986612Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Consumed 144ms CPU time

WARNING control-plane/10.0.134.243 had failing systemd units: crio.service
WARNING control-plane/10.0.134.243: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:35 UTC; 8s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8439 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8439 (code=exited, status=1/FAILURE)
      CPU: 151ms

Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:35 ip-10-0-134-243 crio[8439]: time="2019-10-24 11:11:35.238163016Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Consumed 151ms CPU time

WARNING control-plane/10.0.157.61 had failing systemd units: crio.service
WARNING control-plane/10.0.157.61: crio.service:
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:36 UTC; 1s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 8379 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 8379 (code=exited, status=1/FAILURE)
      CPU: 158ms

Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:36 ip-10-0-157-61 crio[8379]: time="2019-10-24 11:11:36.807284677Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Failed with result 'exit-code'.
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Failed to start Open Container Initiative Daemon.
Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Consumed 158ms CPU time
```

That's maybe a bit noisy, but mostly because all three control-plane machines failed the same way. Journal field docs are linked from the Go docstrings, but in case these evolve later, the ones I'm using are from [2], except for UNIT, where I have opened an upstream documentation request [3]. It might be worth exposing this as:

```console
$ openshift-install analyze bootstrap PATH
```

so folks could look at bootstrap logs which had been gathered by third parties, but I'm punting on that for now.
[1]: https://storage.googleapis.com/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.3/2455/artifacts/e2e-aws/installer/log-bundle-20191024111122.tar
[2]: https://github.com/systemd/systemd/blob/v246/man/systemd.journal-fields.xml
[3]: systemd/systemd#17538
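The "failing systemd units" warnings above could be derived by scanning each host's journal for systemd's failure messages. The sketch below is a simplified illustration of that idea — `failingUnits` and the regex are hypothetical, not the PR's implementation, which works from structured journal fields rather than rendered text.

```go
// Hypothetical sketch: flag failed systemd units from a rendered
// journal dump by matching "UNIT: Failed with result" lines.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// failedUnitRe matches systemd's failure message, e.g.
// "... systemd[1]: crio.service: Failed with result 'exit-code'."
var failedUnitRe = regexp.MustCompile(`systemd\[1\]: ([\w@.-]+\.(?:service|socket|mount)): Failed with result`)

// failingUnits returns the distinct units that failed in a journal dump.
func failingUnits(journal string) []string {
	seen := map[string]bool{}
	var units []string
	for _, m := range failedUnitRe.FindAllStringSubmatch(journal, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			units = append(units, m[1])
		}
	}
	return units
}

func main() {
	journal := `Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon...
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'.`
	units := failingUnits(journal)
	fmt.Printf("WARNING control-plane/10.0.134.229 had failing systemd units: %s\n", strings.Join(units, ", "))
}
```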
Force-pushed 46e6124 → f7a1d33
❤️ I've pushed 46e6124 -> f7a1d33 to scrape SSH errors out of #4345's
Force-pushed f7a1d33 → f7c9ff4
Because release-image access is apparently open:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324825813257818112/artifacts/e2e-aws/ipi-install-install/log-bundle-20201106221538.tar | tar xOz log-bundle-20201106221538/bootstrap/journals/release-image.log
-- Logs begin at Fri 2020-11-06 21:50:55 UTC, end at Fri 2020-11-06 22:15:38 UTC. --
Nov 06 21:51:12 ip-10-0-9-32 systemd[1]: Starting Download the OpenShift Release Image...
Nov 06 21:51:12 ip-10-0-9-32 release-image-download.sh[1667]: Pulling registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b...
Nov 06 21:51:22 ip-10-0-9-32 release-image-download.sh[1667]: d1c62a5431ad4b5886dc9dcb36ed1b86ade893ab6a12dd80d3807d65f19fe9ca
Nov 06 21:51:22 ip-10-0-9-32 systemd[1]: Started Download the OpenShift Release Image.
```

while CI-release images are not:

```console
$ cat auth.json
{}
$ podman pull --authfile auth.json registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b
Trying to pull registry.build02.ci.openshift.org/ci-op-x1lmrvsl/release@sha256:63e99e4df15cf875fbbf7b660227a1c6833cce624d5ec66f6767c81b4c96659b...Getting image source signatures
Copying blob sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30
 3.44 MB / 70.44 MB [==>-------------------------------------------------------]^C
$ podman pull --authfile auth.json registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141
Trying to pull registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141...Failed
error pulling image "registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141": unable to pull registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141: unable to pull image: Error determining manifest MIME type for docker://registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141: Error reading manifest 4.7.0-0.ci-2020-11-06-074141 in registry.svc.ci.openshift.org/ocp/release: unauthorized: authentication required
```

I've pushed f7a1d33 -> f7c9ff4, hard-coding the auth-requiring CI release.
Force-pushed f7c9ff4 → fa8d68e
```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324841426894196736/artifacts/e2e-aws/ipi-install-install/log-bundle-20201106231049.tar | tar xOz log-bundle-20201106231049/bootstrap/journals/release-image.log
-- Logs begin at Fri 2020-11-06 22:45:49 UTC, end at Fri 2020-11-06 23:10:49 UTC. --
Nov 06 22:46:06 ip-10-0-25-221 systemd[1]: Starting Download the OpenShift Release Image...
Nov 06 22:46:06 ip-10-0-25-221 release-image-download.sh[1651]: Pulling registry.svc.ci.openshift.org/ocp/release:4.7.0-0.ci-2020-11-06-074141...
Nov 06 22:46:17 ip-10-0-25-221 release-image-download.sh[1651]: 75e4113eaae0d4aa041d405dea7dfb3745961dae05421c0420b3965d66deb5d2
Nov 06 22:46:17 ip-10-0-25-221 systemd[1]: Started Download the OpenShift Release Image.
```

I dunno why my empty auth-file didn't break the pull, but I've pushed f7c9ff4 -> fa8d68e to slot in a
Force-pushed fa8d68e → fe43889
This is because we are looping in the pulling script, not failing and restarting the unit:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1324857293598101504/artifacts/e2e-aws/ipi-install-install/log-bundle-20201107001451.tar | tar xOz log-bundle-20201107001451/bootstrap/journals/release-image.log | tail -n4
Nov 07 00:14:49 ip-10-0-63-105 release-image-download.sh[1624]: Error: error pulling image "registry.svc.ci.openshift.org/ocp/release:does-not-exist": unable to pull registry.svc.ci.openshift.org/ocp/release:does-not-exist: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ocp/release:does-not-exist: Error reading manifest does-not-exist in registry.svc.ci.openshift.org/ocp/release: manifest unknown: manifest unknown
Nov 07 00:14:49 ip-10-0-63-105 release-image-download.sh[1624]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp/release:does-not-exist...
Nov 07 00:14:51 ip-10-0-63-105 release-image-download.sh[1624]: Error: error pulling image "registry.svc.ci.openshift.org/ocp/release:does-not-exist": unable to pull registry.svc.ci.openshift.org/ocp/release:does-not-exist: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ocp/release:does-not-exist: Error reading manifest does-not-exist in registry.svc.ci.openshift.org/ocp/release: manifest unknown: manifest unknown
Nov 07 00:14:51 ip-10-0-63-105 release-image-download.sh[1624]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp/release:does-not-exist...
```

Ideally we'd have a oneshot unit that we could restart until it succeeded, but which would still remain around with its logs and block dependent units until it completed. Looks like systemd grew support for that in systemd/systemd#13754, which went out in v244. Unfortunately, 4.7 is currently running:

```console
$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-10-27-051128 | grep "Red Hat Enterprise Linux CoreOS"
  machine-os 47.82.202010270142-0 Red Hat Enterprise Linux CoreOS
$ curl -s https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.7/47.82.202010270142-0/x86_64/commitmeta.json | jq -c '.["rpmostree.rpmdb.pkglist"][] | select(.[0] == "systemd")'
["systemd","0","239","31.el8_2.2","x86_64"]
```

So we'll have to wait until we get to 244 (or get a backport) before we get nice reporting for failed release-image pulls. I've pushed fa8d68e -> fe43889, dropping my debugging commits, and I think with that we're good to land this.

/assign @staebler
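Once the host reaches systemd ≥ 244 (where systemd/systemd#13754 allows `Restart=` for `Type=oneshot` units), the retry loop could move out of the script and into the unit itself. A hypothetical sketch, with illustrative unit and script names:

```ini
# Hypothetical release-image download unit for systemd >= 244, where
# Restart= became valid for Type=oneshot (systemd/systemd#13754).
[Unit]
Description=Download the OpenShift Release Image

[Service]
Type=oneshot
# Retry by restarting the failed unit instead of looping in the script,
# so each failure is visible as a unit failure in the journal.
Restart=on-failure
RestartSec=5
ExecStart=/usr/local/bin/release-image-download.sh

[Install]
WantedBy=multi-user.target
```

With this shape, an analysis tool would see `release-image-download.service: Failed with result 'exit-code'` entries between retries, instead of only script-level "Pull failed. Retrying..." log lines.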
/retest

/hold cancel Debugging commits removed.
@wking: The following tests failed, say
Full PR test history. Your PR dashboard.
/hold I guess I'll take another shot at building consensus around this when 4.8 opens up ;).

Sorry that I did not get time to look at this for 4.7.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting; if this issue is safe to close now, please do so. /lifecycle stale
@wking: PR needs rebase.
Replaced by #4751. |