
WIP: get CI timings on long-running jobs #4552

Closed

wants to merge 8 commits

Conversation

jameslamb
Collaborator

I've noticed recently that QEMU builds are regularly timing out after 3 hours. This slows down development on the repository to a degree I'm really starting to feel, which is especially impactful given the timezone difference between maintainers.

Opening this PR to continue the investigation in #4501 into which tests are taking the longest time to run, so I can propose possibly skipping some of them for QEMU builds.

I tried to do this testing on my fork, but Azure DevOps was not picking up PRs there (jameslamb#48).

@StrikerRUS
Collaborator

Thanks for starting work on this! I was planning to create an issue for this problem this week.

Just my observation: QEMU builds either finish fine within 2 hours, or take approximately double that time. I have a weak guess that this might be a hardware issue or a deep, system-wide software one.

@jameslamb
Collaborator Author

QEMU either works fine within 2 hours, or approximately doubles this time

Yeah I've seen this too! This most recent build only took 1h23m:

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10794&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9

I have a weak guess that this might be a hardware issue or deep system-wide software one.

I agree with you, that's my guess too. But I think we still might be able to at least temporarily avoid timeouts by skipping or reducing the size of some tests (#4501 (comment)).

@jameslamb
Collaborator Author

jameslamb commented Aug 27, 2021

To help in this investigation, I wrote up a quick script tonight that will figure out how much time passed between different log lines.

script (click me)

click "view raw log" in Azure DevOps, then set log_url in the code below to that URL

import math
import re
import requests

from datetime import datetime

# get logs from Azure
log_url = "https://dev.azure.com/lightgbm-ci/8461a79b-5dce-4085-ad70-4410b7135276/_apis/build/builds/10810/logs/284"
res = requests.get(log_url)
res.raise_for_status()
log_lines = [
    line for line in res.text.split("\n") if line != ""
]

print("")
print("--- parsing timings from logs ---")
print("")
section_key = "initial-setup"
next_section_key = ""
pattern = ""
start_time = None
end_time = None

timings = []

breakpoints = [
    ("lightgbm-build", r".*running bdist_wheel.*"),
    ("test-setup", r".*test session starts.*"),
    ("test_.py", r".*test_\.py.*"),
    ("test_basic.py", r".*test_basic\.py.*"),
    ("test_consistency.py", r".*test_consistency\.py.*"),
    ("test_dask.py", r".*test_dask\.py.*"),
    ("test_dual.py", r".*test_dual\.py.*"),
    ("test_engine.py", r".*test_engine\.py.*"),
    ("test_plotting.py", r".*test_plotting\.py.*"),
    ("test_sklearn.py", r".*test_sklearn\.py.*"),
    ("test_utilities.py", r".*test_utilities\.py.*"),
    ("rest-of-job", " ")
]

def _get_line_time(line: str) -> datetime:
    timestamp_pattern = re.compile(
        r"([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}).*"
    )
    timestamp_str = re.search(timestamp_pattern, line).groups()[0]
    return datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%S")

section_key = "initial-setup"
num_lines = len(log_lines)
for i in range(num_lines):
    if i == 0:
        start_time = _get_line_time(log_lines[i])
        next_section_key, pattern = breakpoints.pop(0)
        pattern = re.compile(pattern)
        continue
    
    # if the current pattern matches this line, finalize the timings
    # for the current section
    if bool(re.match(pattern, log_lines[i])):
        print(f"Found start of pattern '{pattern}' at line {i}")
        end_time = _get_line_time(log_lines[i])
        time_in_section = (end_time - start_time).total_seconds() / 60.0
        timings.append((section_key, round(time_in_section, 2)))
        section_key = next_section_key
        
        # if there aren't any left, just finish up and exit the loop
        if len(breakpoints) == 0:
            print(f"({i}): nothing left, exiting")
            start_time = end_time
            end_time = _get_line_time(log_lines[num_lines - 1])
            time_in_section = (end_time - start_time).total_seconds() / 60.0
            timings.append((section_key, round(time_in_section, 2)))
            break

        # if there are more things left to parse, start on the next one
        start_time = end_time
        next_section_key, pattern = breakpoints.pop(0)
        pattern = re.compile(pattern)
        continue
    
    # if you reached the end, do one more update
    if i == (num_lines - 1):
        end_time = _get_line_time(log_lines[num_lines - 1])
        time_in_section = (end_time - start_time).total_seconds() / 60.0
        timings.append((section_key, round(time_in_section, 2)))

print("")
print("--- total time: ---")
print("")
total_seconds = (
    _get_line_time(log_lines[num_lines - 1]) - _get_line_time(log_lines[0])
).total_seconds()
total_hours = math.floor(total_seconds / (60 * 60))
total_minutes = int((total_seconds / 60.0) % 60)
print(f"{total_hours}h{total_minutes}m")
        
print("")
print("--- minutes spend in each section: ---")
print("")

for timing in timings:
    print(timing)

Here's what it shows for the most recent build on this PR, https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10794&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9.

--- total time: ---

1h22m

--- minutes spent in each section: ---

('initial-setup', 8.43)
('lightgbm-build', 16.03)
('test-setup', 0.07)
('test_.py', 0.03)
('test_basic.py', 2.83)
('test_consistency.py', 3.63)
('test_dask.py', 41.32)
('test_dual.py', 1.0)
('test_engine.py', 7.9)
('test_plotting.py', 0.4)
('test_sklearn.py', 1.08)
('test_utilities.py', 0.1)

The Dask tests account for about half of the total time.

Looking at this build that timed out (https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10810&view=logs&j=380bc479-1396-5123-5f0f-4e5f44cf253e&t=d4eba53f-d934-5d3f-b687-2f058f88f797):

--- total time: ---

2h58m

--- minutes spent in each section: ---

('initial-setup', 10.48)
('lightgbm-build', 22.98)
('test-setup', 0.17)
('test_.py', 0.07)
('test_basic.py', 13.43)
('test_consistency.py', 6.15)
('test_dask.py', 125.65)

It looks like the setup and building stages were a bit slower but not 2x slower. But the test_basic tests took 5x longer, and then the build timed out before the Dask tests were able to finish (after taking at LEAST 3x longer).

@jameslamb
Collaborator Author

I'm not going to push anything for a while since there are several builds currently pending and waiting on QEMU to finish, but want to document some things I've been researching and am planning to try, maybe later tonight.

Specifying platform in docker run

I see the following warning in all QEMU builds here (both those that succeed and those that time out).

WARNING: The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64) and no specific platform was requested

After some research, I think this warning can be suppressed by explicitly telling Docker which platform you're attempting to emulate, and I want to test specifying --platform linux/arm/v7 right here (a sketch of the change follows the excerpt below).

LightGBM/.vsts-ci.yml

Lines 193 to 199 in ee5636f

docker run \
--rm \
--env-file docker.env \
-v "$(Build.SourcesDirectory)":"$ROOT_DOCKER_FOLDER" \
-v "$(Build.ArtifactStagingDirectory)":"$(Build.ArtifactStagingDirectory)" \
"quay.io/pypa/manylinux2014_$ARCH" \
/bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh

some links I referenced while investigating this
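To make the idea concrete, here's a rough sketch of the change I have in mind (untested; the hard-coded linux/arm/v7 value is an assumption for the ARM job, and it would need to vary with $ARCH, e.g. linux/arm64 for aarch64):

# sketch: same command as the excerpt above, with an explicit (assumed) --platform value added
docker run \
--rm \
--platform linux/arm/v7 \
--env-file docker.env \
-v "$(Build.SourcesDirectory)":"$ROOT_DOCKER_FOLDER" \
-v "$(Build.ArtifactStagingDirectory)":"$(Build.ArtifactStagingDirectory)" \
"quay.io/pypa/manylinux2014_$ARCH" \
/bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh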

Running fewer Dask tests on aarch64 builds

I'm thinking we could define something like the following in tests/python_package_test/test_dask.py:

from platform import machine

on_non_x86_arch = machine() != 'x86_64'

# run fewer tests on aarch64 builds, since they are slower in CI
if on_non_x86_arch:
    tasks = ['binary-classification', 'regression', 'ranking']
    distributed_training_algorithms = ['data']
    data_output = ['array', 'dataframe-with-categorical']
    boosting_types = ['gbdt']

I've been testing different versions of these choices by running the following, which reports how many tests will be run but doesn't actually run them.

cd tests/python_package_test
pytest --collect-only test_dask.py

The configuration above would reduce the number of Dask tests collected in QEMU builds from 351 to 127.
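For anyone following along, here's a minimal, illustrative sketch (not the actual contents of test_dask.py; the parameter values and test name are made up) of how module-level lists like these feed pytest parametrization, which is why shrinking the lists shrinks the collected test count:

from platform import machine

import pytest

# default parameter values (illustrative)
tasks = ['binary-classification', 'multiclass-classification', 'regression', 'ranking']

# run fewer combinations on non-x86 architectures, since CI is slower there
if machine() != 'x86_64':
    tasks = ['binary-classification', 'regression', 'ranking']

# pytest collects one test per entry in `tasks`
@pytest.mark.parametrize('task', tasks)
def test_example(task):
    assert task in tasks

Running pytest --collect-only against a file structured like this would report 3 collected tests on non-x86 machines instead of 4.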

Pinning to a specific pypa/manylinux tag

I've noticed that pypa/manylinux publishes images every few days. Many of those updates seem to be minor dependency bumps made by a bot like Dependabot.

https://github.com/pypa/manylinux/commits/master

It's possible that pinning to a specific tag instead of using :latest could eliminate one source of variability in builds for this CI job.
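For example (the date-style tag below is made up purely to show the shape of the change; a real tag would have to be copied from the pypa/manylinux repository or quay.io):

# current behaviour: no tag, so Docker implicitly pulls :latest each time
docker pull "quay.io/pypa/manylinux2014_${ARCH}"

# pinned: every build pulls exactly the same image
# (hypothetical tag shown; substitute a real published one)
docker pull "quay.io/pypa/manylinux2014_${ARCH}:2021-08-28-hypothetical"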

@jameslamb
Collaborator Author

@StrikerRUS since the QEMU timeout problem seems to be really affecting development velocity in this project recently, I think we should consider substantially reducing the number of Dask tests run in those builds, with something like the suggestion I left in "Running fewer Dask tests on aarch64 builds" in #4552 (comment).

Would you support that if I did some work tonight to repurpose this PR to add such skips (or some similar combination of skips)?

I'd prefer not to lose any test coverage across compiler / Python version / operating system / architecture combinations, but in this case I think it's necessary for a while, in order to get through the backlog of open pull requests and try to get a 3.3.0 release out soon (#4310).

@StrikerRUS
Collaborator

Yes, I support this.

Also, please consider the following option: revert a372ed5, create a new issue, and investigate further why the Dask tests get stuck during QEMU builds.

@jameslamb
Collaborator Author

Yes, I support this.

Also, please consider the following option: revert a372ed5, create a new issue, and investigate further why the Dask tests get stuck during QEMU builds.

ok sure, can work on it tonight!

I'm not really sure it's a case of "the Dask tests get stuck"; I think it's more like "MOST tests occasionally take significantly longer, and this hits Dask worst because there are simply a lot more Dask tests". But I think you're right, this is having such a significant impact on development that it might be better to just turn off the Dask tests there completely for now and investigate separately.

@StrikerRUS
Collaborator

StrikerRUS commented Sep 7, 2021

We can try setting timeoutInMinutes: 360 and check whether the percentage of completed tests differs from the current 49% on timed-out builds.
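For reference, a sketch of where that property would sit in an Azure Pipelines job definition (the job name and surrounding fields here are illustrative, not copied from .vsts-ci.yml):

jobs:
  - job: QEMU_multiarch          # illustrative job name
    timeoutInMinutes: 360        # raise the per-job limit above the current 3 hours
    pool:
      vmImage: ubuntu-latest     # illustrative
    steps:
      # ... existing build and test steps unchanged ...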

@jameslamb
Collaborator Author

Since we already have so much discussion here, I decided to keep this PR for testing ideas and opened #4600 to completely skip the Dask tests on other architectures for now.

@jameslamb
Collaborator Author

I'm going to close this PR for now. I'd like to resume running the Dask tests on QEMU CI builds in the future, but I'm not actively working on that.


This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023