Add CI backstop timeout to limit hanging jobs #2593

Open
tdyas opened this issue Nov 13, 2024 · 5 comments

@tdyas
Contributor

tdyas commented Nov 13, 2024

This job failed after six hours or so. CI should have a backstop timeout in place so hanging jobs fail much earlier than that. Say one hour or so.

@jsirois
Member

jsirois commented Nov 13, 2024

Since the jobs can take up to an hour under normal circumstances (pypy IT shards), one hour would be too short; I'd want the timeout to be at least 2 hours.

What I really don't want is to band-aid over good signal. Before considering this, I'd like to examine the log from the job you've linked (mac-it-1-of-2.log.txt) and see if the problem is a bad test or even bad production code, and then actually fix the real problem. I've observed hung mac shards over the past few months, so I'd like to spend more time ruling myself in or out as the problem before throwing up my hands and blaming GH hosted macs. I firmly suspect PEBKAC here.
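
For reference, the backstop idea itself is simple. On GitHub Actions it would normally be a `timeout-minutes` setting on the job; the sketch below shows the same idea in Python with `subprocess.run`, purely as an illustration. The name `run_with_backstop`, the 2 hour default (taken from the lower bound suggested above), and the 124 exit code are assumptions, not anything in the Pex CI setup.

```python
import subprocess
import sys
from typing import Sequence

# Assumption: 2 hours, the lower bound suggested in the comment above.
BACKSTOP_SECONDS = 2 * 60 * 60


def run_with_backstop(cmd: Sequence[str], timeout: float = BACKSTOP_SECONDS) -> int:
    """Run cmd and return its exit code, or 124 if it exceeds the timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        print(f"Timed out after {timeout}s: {' '.join(cmd)}", file=sys.stderr)
        # Hypothetical choice of exit code, mirroring coreutils `timeout`.
        return 124
```

Note that only the direct child is killed on timeout; any grandchildren a test runner spawned may survive, so this is a sketch of the idea rather than a robust harness.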

jsirois self-assigned this Nov 13, 2024
jsirois added the bug, tech-debt, and in progress labels Nov 13, 2024
@jsirois
Member

jsirois commented Nov 15, 2024

Ok, used this on mac-it-1-of-2.log.txt above:

#!/usr/bin/env python3

import os
import re
import sys
from pathlib import Path
from typing import Any


def analyze(log: Path) -> Any:
    tests: dict[str, bool] = {}
    with log.open() as fp:
        for line in fp:
            # E.G.: 2024-11-13T06:29:33.3456360Z tests/integration/test_issue_1018.py::test_execute_module_alter_sys[ep-function-zipapp-VENV]
            match = re.match(r"^.*\d+Z (?P<test>tests/\S+(?:\[[^\]]+\])?).*", line)
            if match:
                test = match.group("test")
                if test not in tests:
                    tests[test] = False
                continue

            # E.G.: 2024-11-13T06:29:33.3478200Z [gw3] PASSED tests/integration/venv_ITs/test_issue_1745.py::test_interpreter_mode_python_options[-c <code>-VENV]
            match = re.match(r"^.*\d+Z \[gw\d+\] [A-Z]+ (?P<test>tests/\S+(?:\[[^\]]+\])?).*", line)
            if match:
                tests[match.group("test")] = True
                continue

    hung_tests = sorted(test for test, complete in tests.items() if not complete)
    if hung_tests:
        return f"The following tests never finished:\n{os.linesep.join(hung_tests)}"



def main() -> Any:
    if len(sys.argv) != 2:
        return f"Usage: {sys.argv[0]} <CI log file>"

    log = Path(sys.argv[1])
    if not log.exists():
        # Report the missing log path, not the script name.
        return f"The log specified at {log} does not exist."

    return analyze(log)


if __name__ == "__main__":
    sys.exit(main())

And it yielded:

:; ./detect-hung.py mac-it-1-of-2.log.txt
The following tests never finished:
tests/integration/test_issue_2186.py::test_incompatible_resolve_error
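
One thing that may be worth checking in the patterns above: `\S+` also consumes the opening `[` of a parametrized test id, so an id with a space in it appears to be cut off at that space. A small sketch using the example result line from the script's own comment; the `VARIANT` pattern is a hypothetical tweak, not what is checked in.

```python
import re

# Pattern from the script above for a worker's result line.
ORIGINAL = re.compile(
    r"^.*\d+Z \[gw\d+\] [A-Z]+ (?P<test>tests/\S+(?:\[[^\]]+\])?).*"
)

# Hypothetical variant: stop the path part at the first "[" so the bracketed
# parameter id, spaces included, is captured by the optional group.
VARIANT = re.compile(
    r"^.*\d+Z \[gw\d+\] [A-Z]+ (?P<test>tests/[^\s\[]+(?:\[[^\]]+\])?).*"
)

# The example line from the comment in the script above.
LINE = (
    "2024-11-13T06:29:33.3478200Z [gw3] PASSED "
    "tests/integration/venv_ITs/test_issue_1745.py"
    "::test_interpreter_mode_python_options[-c <code>-VENV]"
)

original_id = ORIGINAL.match(LINE).group("test")
variant_id = VARIANT.match(LINE).group("test")
print(original_id)  # id as captured by the original pattern
print(variant_id)   # id as captured by the variant
```

Since both patterns in the script truncate the same way, the start and result lines still pair up. The risk is only that two parametrizations differing after the space share one key, so a hang in one could be masked by the other finishing. The variant still stops at the first `]`, so ids containing a literal `]` would need more care.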

@jsirois
Member

jsirois commented Nov 15, 2024

Ok, that's just 1 data point. I'd like to get a few 6 hour timeouts on mac shards analyzed to see if it's the same test hanging (looking at that test, it seems super unlikely that it in particular is hangy, but still). I'll get this script checked in.
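
Comparing several timed-out logs could look roughly like the sketch below, which tallies how many logs each test never finished in. It reuses the two patterns from the script above; the helper names (`unfinished`, `tally_hangs`) and the sample log contents are made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Same two patterns as the script above: a test starting, and a worker
# reporting a result for it.
STARTED = re.compile(r"^.*\d+Z (?P<test>tests/\S+(?:\[[^\]]+\])?).*")
FINISHED = re.compile(
    r"^.*\d+Z \[gw\d+\] [A-Z]+ (?P<test>tests/\S+(?:\[[^\]]+\])?).*"
)


def unfinished(lines: Iterable[str]) -> List[str]:
    """Return tests that were started but never reported a result."""
    done: Dict[str, bool] = {}
    for line in lines:
        started = STARTED.match(line)
        if started:
            done.setdefault(started.group("test"), False)
            continue
        finished = FINISHED.match(line)
        if finished:
            done[finished.group("test")] = True
    return [test for test, complete in done.items() if not complete]


def tally_hangs(logs: Dict[str, Iterable[str]]) -> Counter:
    """Count, per test, how many logs it never finished in."""
    counts: Counter = Counter()
    for lines in logs.values():
        counts.update(unfinished(lines))
    return counts


# Made-up log excerpts in the format shown in the script's comments.
LOGS = {
    "shard-a.log": [
        "2024-11-13T06:00:00.0000000Z tests/integration/test_x.py::test_one",
        "2024-11-13T06:00:01.0000000Z tests/integration/test_y.py::test_two",
        "2024-11-13T06:00:02.0000000Z [gw0] PASSED tests/integration/test_x.py::test_one",
    ],
    "shard-b.log": [
        "2024-11-14T06:00:00.0000000Z tests/integration/test_y.py::test_two",
    ],
}

for test, count in tally_hangs(LOGS).most_common():
    print(f"{count} of {len(LOGS)} logs: {test}")
```

A test that tops the tally across unrelated runs points at the test or the code it exercises; a different test each time points more toward the runner.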

@jsirois
Member

jsirois commented Nov 15, 2024

Ok, available on main here: 7e3c248

@jsirois
Member

jsirois commented Nov 17, 2024

Ok, and the 1 guess I had as to where a hang could be produced in Pex code has a fix in on main at 30c2ec8. I won't close this, but I'll remove the in-progress label. If there are no similar issues in a month or so, I'll close. If there are, I'll use the new script to find which test hung and see if it matches the data point above.
