-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs #2567
Closed
Commits on Nov 6, 2020
-
data/bootstrap/files/usr/local/bin/installer-gather: Look for unit re…
…starts From [1]: > Note that service restart is subject to unit start rate limiting > configured with StartLimitIntervalSec= and StartLimitBurst=, see > systemd.unit(5) for details. A restarted service enters the failed > state only after the start limits are reached. And [2]: > Configure unit start rate limiting. Units which are started more > than burst times within an interval time interval are not permitted > to start any more We don't set those StartLimit* properties on our units, so they are endlessly restarted without ever entering the 'failed' state and being collected by failed-units.txt [3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224/artifacts/e2e-aws/ipi-install-install/log-bundle-20201006155840.tar >log-bundle.tar.gz $ tar xOz log-bundle-20201006155840/bootstrap/journals/bootkube.log <log-bundle.tar.gz | tail Oct 06 15:58:33 ip-10-0-1-187 bootkube.sh[15702]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'. Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart. Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 273. Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Stopped Bootstrap a Kubernetes cluster. Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Started Bootstrap a Kubernetes cluster. Oct 06 15:58:38 ip-10-0-1-187 bootkube.sh[15762]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'. $ tar xOz log-bundle-20201006155840/failed-units.txt <log-bundle.tar.gz 0 loaded units listed. Pass --all to see loaded but inactive units, too. To show all installed unit files use 'systemctl list-unit-files'. With this commit, we look for log entries with automatic-restart events [4], and use those to identify units which may be having trouble, even if they aren't formally 'failed'. Recording json-pretty output in ${UNIT}.log.json will be a bit chatty, but JSON will allow us to to consumer-side processing to pull out the restart count (N_RESTARTS), logs from the final failed invocation (INVOCATION_ID) and similar useful subsets. json-pretty will take more transport space than '-o json', but it will also be easier to process with grep and other tools that are not JSON-aware, for folks who are just hunting for matching strings in the tarball and similar. [1]: https://www.freedesktop.org/software/systemd/man/systemd.service.html [2]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html [3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224 [4]: https://github.com/systemd/systemd/blob/4b28e50f9ef7655542a5ce5bc05857508ddf1495/catalog/systemd.catalog.in#L341-L342
Configuration menu - View commit details
-
Copy full SHA for 26cf75e - Browse repository at this point
Copy the full SHA 26cf75eView commit details -
cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs
Instead of just dropping them into the users lap "here's a big tarball, have fun", look through them for obvious things that we can summarize. With: func runGatherBootstrapCmd(directory string) error { + return analyzeGatheredBootstrap("/tmp/log-bundle.tar.gz") to feed [1] into the analysis logic, the output looks like: WARNING control-plane/10.0.134.229 had failing systemd units: crio.service WARNING control-plane/10.0.134.229: crio.service: ● crio.service - Open Container Initiative Daemon Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled) Drop-In: /etc/systemd/system/crio.service.d └─10-default-env.conf Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:31 UTC; 320ms ago Docs: https://github.com/cri-o/cri-o Process: 8491 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE) Main PID: 8491 (code=exited, status=1/FAILURE) CPU: 144ms Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon... Oct 24 11:11:31 ip-10-0-134-229 crio[8491]: time="2019-10-24 11:11:31.895986612Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory" Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'. Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Failed to start Open Container Initiative Daemon. Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Consumed 144ms CPU time WARNING control-plane/10.0.134.243 had failing systemd units: crio.service WARNING control-plane/10.0.134.243: crio.service: ● crio.service - Open Container Initiative Daemon Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled) Drop-In: /etc/systemd/system/crio.service.d └─10-default-env.conf Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:35 UTC; 8s ago Docs: https://github.com/cri-o/cri-o Process: 8439 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE) Main PID: 8439 (code=exited, status=1/FAILURE) CPU: 151ms Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Starting Open Container Initiative Daemon... Oct 24 11:11:35 ip-10-0-134-243 crio[8439]: time="2019-10-24 11:11:35.238163016Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory" Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Failed with result 'exit-code'. Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Failed to start Open Container Initiative Daemon. Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Consumed 151ms CPU time WARNING control-plane/10.0.157.61 had failing systemd units: crio.service WARNING control-plane/10.0.157.61: crio.service: ● crio.service - Open Container Initiative Daemon Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled) Drop-In: /etc/systemd/system/crio.service.d └─10-default-env.conf Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:36 UTC; 1s ago Docs: https://github.com/cri-o/cri-o Process: 8379 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE) Main PID: 8379 (code=exited, status=1/FAILURE) CPU: 158ms Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Starting Open Container Initiative Daemon... Oct 24 11:11:36 ip-10-0-157-61 crio[8379]: time="2019-10-24 11:11:36.807284677Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory" Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Failed with result 'exit-code'. Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Failed to start Open Container Initiative Daemon. Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Consumed 158ms CPU time That's maybe a bit noisy, but mostly because all three control-plane machines failed the same way. Journal field docs are linked from the Go docstrings, but if these evolve later, the ones I'm using are from [2], except for UNIT, where I have opened an upstream documentation request [3]. It might be worth exposing this as: $ openshift-install analyze bootstrap PATH so folks could look at bootstrap logs which had been gathered by third-parties, but I'm punting on that for now. [1]: https://storage.googleapis.com/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.3/2455/artifacts/e2e-aws/installer/log-bundle-20191024111122.tar [2]: https://github.com/systemd/systemd/blob/v246/man/systemd.journal-fields.xml [3]: systemd/systemd#17538
Configuration menu - View commit details
-
Copy full SHA for fe43889 - Browse repository at this point
Copy the full SHA fe43889View commit details
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.