
WIP: get CI timings on long-running jobs #4552

Closed

wants to merge 8 commits

Conversation

jameslamb
Collaborator

I've noticed recently that QEMU builds are regularly timing out after 3 hours. This slows down development on the repository to a degree I'm really starting to feel, which is especially impactful given the timezone difference between maintainers.

Opening this PR to continue the investigation in #4501 into which tests are taking the longest time to run, so I can propose possibly skipping some of them for QEMU builds.

I tried to do this testing on my fork, but Azure DevOps was not picking up PRs there (jameslamb#48).

@StrikerRUS
Collaborator

Thanks for starting work on this! I was planning to create an issue for this problem this week.

Just my observation: QEMU builds either finish fine within 2 hours, or take approximately double that time. I have a weak guess that this might be a hardware issue or a deep, system-wide software one.

@jameslamb
Collaborator Author

QEMU either works fine within 2 hours, or approximately doubles this time

Yeah I've seen this too! This most recent build only took 1h23m:

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10794&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9

I have a weak guess that this might be a hardware issue or deep system-wide software one.

I agree with you, that's my guess too. But I think we still might be able to at least temporarily avoid timeouts by skipping or reducing the size of some tests (#4501 (comment)).

@jameslamb
Collaborator Author

jameslamb commented Aug 27, 2021

To help in this investigation, I wrote up a quick script tonight that will figure out how much time passed between different log lines.

script (click me)

click "view raw log" in Azure DevOps, then set log_url in the code below to that URL

import math
import re
import requests

from datetime import datetime

# get logs from Azure
log_url = "https://dev.azure.com/lightgbm-ci/8461a79b-5dce-4085-ad70-4410b7135276/_apis/build/builds/10810/logs/284"
res = requests.get(log_url)
res.raise_for_status()
log_lines = [
    line for line in res.text.split("\n") if line != ""
]

print("")
print("--- parsing timings from logs ---")
print("")
section_key = "initial-setup"
next_section_key = ""
pattern = ""
start_time = None
end_time = None

timings = []

breakpoints = [
    ("lightgbm-build", r".*running bdist_wheel.*"),
    ("test-setup", r".*test session starts.*"),
    ("test_.py", r".*test_\.py.*"),
    ("test_basic.py", r".*test_basic\.py.*"),
    ("test_consistency.py", r".*test_consistency\.py.*"),
    ("test_dask.py", r".*test_dask\.py.*"),
    ("test_dual.py", r".*test_dual\.py.*"),
    ("test_engine.py", r".*test_engine\.py.*"),
    ("test_plotting.py", r".*test_plotting\.py.*"),
    ("test_sklearn.py", r".*test_sklearn\.py.*"),
    ("test_utilities.py", r".*test_utilities\.py.*"),
    ("rest-of-job", " ")
]

def _get_line_time(line: str) -> datetime:
    timestamp_pattern = re.compile(
        r"([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}).*"
    )
    timestamp_str = re.search(timestamp_pattern, line).groups()[0]
    return datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%S")

section_key = "initial-setup"
num_lines = len(log_lines)
for i in range(num_lines):
    if i == 0:
        start_time = _get_line_time(log_lines[i])
        next_section_key, pattern = breakpoints.pop(0)
        pattern = re.compile(pattern)
        continue
    
    # if the current pattern matches this line, finalize the timings
    # for the current section
    if bool(re.match(pattern, log_lines[i])):
        print(f"Found start of pattern '{pattern}' at line {i}")
        end_time = _get_line_time(log_lines[i])
        time_in_section = (end_time - start_time).total_seconds() / 60.0
        timings.append((section_key, round(time_in_section, 2)))
        section_key = next_section_key
        
        # if there aren't any left, just finish up and exit the loop
        if len(breakpoints) == 0:
            print(f"({i}): nothing left, exiting")
            start_time = end_time
            end_time = _get_line_time(log_lines[num_lines - 1])
            time_in_section = (end_time - start_time).total_seconds() / 60.0
            timings.append((section_key, round(time_in_section, 2)))
            break

        # if there are more things left to parse, start on the next one
        start_time = end_time
        next_section_key, pattern = breakpoints.pop(0)
        pattern = re.compile(pattern)
        continue
    
    # if you reached the end, do one more update
    if i == (num_lines - 1):
        end_time = _get_line_time(log_lines[num_lines - 1])
        time_in_section = (end_time - start_time).total_seconds() / 60.0
        timings.append((section_key, round(time_in_section, 2)))

print("")
print("--- total time: ---")
print("")
total_seconds = (
    _get_line_time(log_lines[num_lines - 1]) - _get_line_time(log_lines[0])
).total_seconds()
total_hours = math.floor(total_seconds / (60 * 60))
total_minutes = int((total_seconds / 60.0) % 60)
print(f"{total_hours}h{total_minutes}m")
        
print("")
print("--- minutes spend in each section: ---")
print("")

for timing in timings:
    print(timing)

Here's what it shows for the most recent build on this PR, https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10794&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9.

--- total time: ---

1h22m

--- minutes spent in each section: ---

('initial-setup', 8.43)
('lightgbm-build', 16.03)
('test-setup', 0.07)
('test_.py', 0.03)
('test_basic.py', 2.83)
('test_consistency.py', 3.63)
('test_dask.py', 41.32)
('test_dual.py', 1.0)
('test_engine.py', 7.9)
('test_plotting.py', 0.4)
('test_sklearn.py', 1.08)
('test_utilities.py', 0.1)

The Dask tests account for about half of the total time.

Looking at this build that timed out (https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10810&view=logs&j=380bc479-1396-5123-5f0f-4e5f44cf253e&t=d4eba53f-d934-5d3f-b687-2f058f88f797):

--- total time: ---

2h58m

--- minutes spent in each section: ---

('initial-setup', 10.48)
('lightgbm-build', 22.98)
('test-setup', 0.17)
('test_.py', 0.07)
('test_basic.py', 13.43)
('test_consistency.py', 6.15)
('test_dask.py', 125.65)

It looks like the setup and building stages were a bit slower but not 2x slower. But the test_basic tests took 5x longer, and then the build timed out before the Dask tests were able to finish (after taking at LEAST 3x longer).

@jameslamb
Collaborator Author

I'm not going to push anything for a while since there are several builds currently pending and waiting on QEMU to finish, but want to document some things I've been researching and am planning to try, maybe later tonight.

Specifying platform in docker run

I see the following warning in all QEMU builds here (both those that succeed and those that time out).

WARNING: The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64) and no specific platform was requested

After some research, I think this warning can be suppressed by explicitly telling Docker which platform you're attempting to emulate, and I want to test specifying --platform linux/arm/v7 right here (a sketch of the change follows the excerpt below).

LightGBM/.vsts-ci.yml

Lines 193 to 199 in ee5636f

docker run \
--rm \
--env-file docker.env \
-v "$(Build.SourcesDirectory)":"$ROOT_DOCKER_FOLDER" \
-v "$(Build.ArtifactStagingDirectory)":"$(Build.ArtifactStagingDirectory)" \
"quay.io/pypa/manylinux2014_$ARCH" \
/bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh

some links I referenced while investigating this
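To make the idea concrete, here's a rough sketch of the change I have in mind (untested; the hard-coded linux/arm/v7 value is an assumption for the ARM job, and it would need to vary with $ARCH, e.g. linux/arm64 for aarch64):

# sketch: same command as the excerpt above, with an explicit (assumed) --platform value added
docker run \
--rm \
--platform linux/arm/v7 \
--env-file docker.env \
-v "$(Build.SourcesDirectory)":"$ROOT_DOCKER_FOLDER" \
-v "$(Build.ArtifactStagingDirectory)":"$(Build.ArtifactStagingDirectory)" \
"quay.io/pypa/manylinux2014_$ARCH" \
/bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh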

Running fewer Dask tests on aarch64 builds

I'm thinking we could define something like the following in tests/python_package_test/test_dask.py:

from platform import machine

on_non_x86_arch = machine() != 'x86_64'

# run fewer tests on aarch64 builds, since they are slower in CI
if on_non_x86_arch:
    tasks = ['binary-classification', 'regression', 'ranking']
    distributed_training_algorithms = ['data']
    data_output = ['array', 'dataframe-with-categorical']
    boosting_types = ['gbdt']

I've been testing different versions of these choices by running the following, which reports how many tests will be run but doesn't actually run them.

cd tests/python_package_test
pytest --collect-only test_dask.py

The configuration above would reduce the number of Dask tests collected in QEMU builds from 351 to 127.
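For anyone following along, here's a minimal, illustrative sketch (not the actual contents of test_dask.py; the parameter values and test name are made up) of how module-level lists like these feed pytest parametrization, which is why shrinking the lists shrinks the collected test count:

from platform import machine

import pytest

# default parameter values (illustrative)
tasks = ['binary-classification', 'multiclass-classification', 'regression', 'ranking']

# run fewer combinations on non-x86 architectures, since CI is slower there
if machine() != 'x86_64':
    tasks = ['binary-classification', 'regression', 'ranking']

# pytest collects one test per entry in `tasks`
@pytest.mark.parametrize('task', tasks)
def test_example(task):
    assert task in tasks

Running pytest --collect-only against a file structured like this would report 3 collected tests on non-x86 machines instead of 4.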

Pinning to a specific pypa/manylinux tag

I've noticed that pypa/manylinux publishes images every few days. Many of those updates seem to be minor dependency bumps made by a bot like Dependabot.

https://github.com/pypa/manylinux/commits/master

It's possible that pinning to a specific tag instead of using :latest could eliminate one source of variability in builds for this CI job.
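For example (the date-style tag below is made up purely to show the shape of the change; a real tag would have to be copied from the pypa/manylinux repository or quay.io):

# current behaviour: no tag, so Docker implicitly pulls :latest each time
docker pull "quay.io/pypa/manylinux2014_${ARCH}"

# pinned: every build pulls exactly the same image
# (hypothetical tag shown; substitute a real published one)
docker pull "quay.io/pypa/manylinux2014_${ARCH}:2021-08-28-hypothetical"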

@jameslamb
Collaborator Author

@StrikerRUS since the QEMU timeout problem seems to be really affecting development velocity in this project recently, I think we should consider substantially reducing the number of Dask tests run in those builds, with something like the suggestion I left in "Running fewer Dask tests on aarch64 builds" in #4552 (comment).

Would you support that if I did some work tonight to repurpose this PR to add such skips (or some similar combination of skips)?

I'd prefer not to lose any test coverage across compiler / Python version / operating system / architecture combinations, but in this case I think it's necessary for a while, in order to get through the backlog of open pull requests and try to get a 3.3.0 release out soon (#4310).

@StrikerRUS
Collaborator

Yes, I support this.

Also, please consider the following option: revert a372ed5, create a new issue, and investigate further why the Dask tests get stuck during QEMU builds.

@jameslamb
Collaborator Author

Yes, I support this.

Also, please consider the following option: revert a372ed5, create a new issue, and investigate further why the Dask tests get stuck during QEMU builds.

ok sure, can work on it tonight!

I'm not really sure it's a case of "the Dask tests get stuck"; I think it's more like "MOST tests occasionally take significantly longer, and this hits Dask worst because there are simply a lot more Dask tests". But I think you're right, this is having such a significant impact on development that it might be better to just turn off the Dask tests there completely for now and investigate separately.

@StrikerRUS
Collaborator

StrikerRUS commented Sep 7, 2021

We can try setting timeoutInMinutes: 360 and check whether the percentage of completed tests differs from the current 49% on timed-out builds.
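For reference, a sketch of where that property would sit in an Azure Pipelines job definition (the job name and surrounding fields here are illustrative, not copied from .vsts-ci.yml):

jobs:
  - job: QEMU_multiarch          # illustrative job name
    timeoutInMinutes: 360        # raise the per-job limit above the current 3 hours
    pool:
      vmImage: ubuntu-latest     # illustrative
    steps:
      # ... existing build and test steps unchanged ...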

@jameslamb
Collaborator Author

Since we already have so much discussion here, I decided to keep this PR for testing ideas and opened #4600 to completely skip the Dask tests on other architectures for now.

@jameslamb
Collaborator Author

I'm going to close this PR for now. I'd like to resume running the Dask tests on QEMU CI builds in the future, but I'm not actively working on that.


This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023