FIXME: data/bootstrap/files/usr/local/bin/installer-gather: Look for …
…unit restarts

From [1]:

> Note that service restart is subject to unit start rate limiting
> configured with StartLimitIntervalSec= and StartLimitBurst=, see
> systemd.unit(5) for details. A restarted service enters the failed
> state only after the start limits are reached.

And [2]:

> Configure unit start rate limiting. Units which are started more
> than burst times within an interval time interval are not permitted
> to start any more

We don't set those StartLimit* properties on our units, so they are endlessly restarted without ever entering the 'failed' state and being collected by failed-units.txt [3]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224/artifacts/e2e-aws/ipi-install-install/log-bundle-20201006155840.tar >log-bundle.tar.gz
  $ tar xOz log-bundle-20201006155840/bootstrap/journals/bootkube.log <log-bundle.tar.gz | tail
  Oct 06 15:58:33 ip-10-0-1-187 bootkube.sh[15702]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
  Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
  Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'.
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 273.
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Stopped Bootstrap a Kubernetes cluster.
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Started Bootstrap a Kubernetes cluster.
  Oct 06 15:58:38 ip-10-0-1-187 bootkube.sh[15762]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'.
  $ tar xOz log-bundle-20201006155840/failed-units.txt <log-bundle.tar.gz
  0 loaded units listed. Pass --all to see loaded but inactive units, too.
  To show all installed unit files use 'systemctl list-unit-files'.

With this commit, we look for log entries with automatic-restart events [4], and use those to identify units which may be having trouble.

[1]: https://www.freedesktop.org/software/systemd/man/systemd.service.html
[2]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224
[4]: https://github.com/systemd/systemd/blob/4b28e50f9ef7655542a5ce5bc05857508ddf1495/catalog/systemd.catalog.in#L341-L342
Looks like a useful start, though with just a little more effort I think we can extract the failing unit name and get its logs as text.
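
A rough sketch of what extracting the unit name could look like, assuming the journal is available as short-format text like the bootkube.log excerpt in the commit message. It matches the rendered "Scheduled restart job, restart counter is at N." line rather than the catalog message ID the commit refers to, and `restarted_units` is a hypothetical helper name, not something in installer-gather:

```shell
#!/bin/sh
# restarted_units (hypothetical name): read short-format journal text on stdin
# and print "<unit> <highest restart counter>" for every unit that systemd
# scheduled for an automatic restart. The pattern is an assumption based on
# the log lines quoted in the commit message.
restarted_units() {
  # Keep only the restart lines, reduced to "<unit> <counter>".
  sed -n 's/^.* systemd\[[0-9][0-9]*\]: \(.*\): Scheduled restart job, restart counter is at \([0-9][0-9]*\)\.[[:space:]]*$/\1 \2/p' |
    # Group by unit, highest counter first within each unit.
    sort -k1,1 -k2,2nr |
    # Keep the first (highest-counter) line per unit.
    awk '!seen[$1]++'
}
```

On the bootstrap host this could be fed from `journalctl --no-pager`, and each unit name it prints could then be passed to `journalctl --no-pager --unit` to collect that unit's logs as text. Matching on message text is more fragile than matching on the message ID, so treat this as an illustration of the idea only.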