Add CI backstop timeout to limit hanging jobs #2593
Comments
Since the jobs can take up to an hour under normal circumstances (PyPy IT shards), one hour would be too short; I'd want the timeout to be at least 2 hours. What I really don't want is to band-aid over good signal. Before considering this, I'd like to examine the logs from the job you've linked.
Ok, used this on mac-it-1-of-2.log.txt above:

#!/usr/bin/env python3

import os
import re
import sys
from pathlib import Path
from typing import Any


def analyze(log: Path) -> Any:
    # Maps each test ID to whether a completion line was seen for it.
    tests: dict[str, bool] = {}
    with log.open() as fp:
        for line in fp:
            # A bare test ID line marks a test as started.
            # E.G.: 2024-11-13T06:29:33.3456360Z tests/integration/test_issue_1018.py::test_execute_module_alter_sys[ep-function-zipapp-VENV]
            match = re.match(r"^.*\d+Z (?P<test>tests/\S+(?:\[[^\]]+\])?).*", line)
            if match:
                test = match.group("test")
                if test not in tests:
                    tests[test] = False
                continue

            # A worker-prefixed result line marks a test as finished.
            # E.G.: 2024-11-13T06:29:33.3478200Z [gw3] PASSED tests/integration/venv_ITs/test_issue_1745.py::test_interpreter_mode_python_options[-c <code>-VENV]
            match = re.match(r"^.*\d+Z \[gw\d+\] [A-Z]+ (?P<test>tests/\S+(?:\[[^\]]+\])?).*", line)
            if match:
                tests[match.group("test")] = True
                continue

    # Any test that started but never reported a result is presumed hung.
    hung_tests = sorted(test for test, complete in tests.items() if not complete)
    if hung_tests:
        return f"The following tests never finished:\n{os.linesep.join(hung_tests)}"


def main() -> Any:
    if len(sys.argv) != 2:
        return f"Usage: {sys.argv[0]} <CI log file>"

    log = Path(sys.argv[1])
    if not log.exists():
        return f"The log specified at {sys.argv[1]} does not exist."

    return analyze(log)


if __name__ == "__main__":
    sys.exit(main())

And it yielded:

:; ./detect-hung.py mac-it-1-of-2.log.txt
The following tests never finished:
tests/integration/test_issue_2186.py::test_incompatible_resolve_error
Ok, that's just 1 data point. I'd like to get a few 6-hour timeouts on Mac shards analyzed to see if it's the same test hanging (looking at that test, it seems super unlikely that it in particular is hangy, but still). I'll get this script checked in.
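For comparing several timed-out shards, the per-log analysis could be tallied across logs along these lines. This is only a rough sketch, not the script that was checked in: it reuses the two regular expressions from the script above, and the helper names `unfinished_tests` and `tally_hangs` are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

# Same patterns as the single-log script above: a bare test ID line means
# "started", a "[gwN] STATUS <test ID>" line means "finished".
STARTED = re.compile(r"^.*\d+Z (?P<test>tests/\S+(?:\[[^\]]+\])?).*")
FINISHED = re.compile(r"^.*\d+Z \[gw\d+\] [A-Z]+ (?P<test>tests/\S+(?:\[[^\]]+\])?).*")


def unfinished_tests(lines: Iterable[str]) -> List[str]:
    """Return the sorted test IDs that started but never reported a result."""
    tests = {}
    for line in lines:
        match = STARTED.match(line)
        if match:
            tests.setdefault(match.group("test"), False)
            continue
        match = FINISHED.match(line)
        if match:
            tests[match.group("test")] = True
    return sorted(test for test, complete in tests.items() if not complete)


def tally_hangs(logs: Iterable[Iterable[str]]) -> Counter:
    """Count, across logs, how often each test ID was left unfinished."""
    counts: Counter = Counter()
    for lines in logs:
        counts.update(unfinished_tests(lines))
    return counts
```

Each log can be passed as an open file handle or a list of lines; `tally_hangs(...).most_common()` then lists the tests that were left unfinished in the most logs, which would show whether the same test keeps hanging.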
Ok, available on main here: 7e3c248
Ok, and the 1 guess I had as to where a hang could be produced in Pex code has a fix in on main at 30c2ec8. I won't close this, but I'll remove the in-progress label. If there are no similar issues in a month or so, I'll close. If there are, I'll try the new script to analyze which test is hung and see if it matches the data point above.
This job failed after six hours or so. CI should have a backstop timeout in place so hanging jobs fail much earlier than that. Say one hour or so.