Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce CI usage #14983

Open
straight-shoota opened this issue Sep 6, 2024 · 11 comments
Open

Reduce CI usage #14983

straight-shoota opened this issue Sep 6, 2024 · 11 comments

Comments

@straight-shoota
Copy link
Member

straight-shoota commented Sep 6, 2024

We're currently running 57 individual workflows in CI on every commit,1 and counting (#14964).

Some runs are fairly small, like the library compatiblity tests for OpenSSL and libpcre which take ~25 seconds each and most of that is setup. Not too much to worry about those. But most are orders of magnitude bigger and produce a quite noticable load.

The majority of workflows run std_spec, compiler_spec, build the compiler itself and std_spec again or a part of that. The full routine of bin/ci build usually takes 30-40 minutes.

Our CI runners are generously sponsored by GitHub, so using more resources doesn't incur an immediate cost for us. But we should still use the resources responsibly. And we suffer from significant congestion when there's lots of activity because parallel runners are limited.2

I think we have some potential to reduce the number of runs for some workflows. We don't need to test everything on every commit.

Reduce matrix in linux.yml

We currently run a matrix job to test forward compatibility for every single Crystal version since 1.0.0. That's a total of 14 versions and only getting more.
I think we can safely reduce that number. I do not recall if this has ever brought any valuable insight. If something breaks compatibility with older compilers, it's usually broken from a specific compiler version downwards. So testing the oldest and most recent versions (currently 1.0.0 and 1.13.2) should theoretically be sufficient. You only need the versions in between to pinpoint where exactly the breakage appears, but that's part of the debugging process and doesn't need to be in CI.
We could still keep a couple more versions in between for due diligence, but there's definitely no need to test all versions on every commit. Perhaps we could run the full set on release builds (maintenance and nightlies) just to be sure.

Limit llvm.yml & other library version tests

We're currently testing support with all major LLVM versions between 13 and 18. These jobs also run on every commit. We certainly want to keep testing all these versions as long as they're supported.

But it would not be necessary to do that on every commit. I think we could limit this workflow to only run when llvm related source code is directly affected (src/llvm).
Similar restrictions could apply to other workflows that test library-support across multiple versions.

The problem with these is that changes outside the code tree somewhere else in stdlib can have an effect as well. If a change in src/pointer.cr would break something in llvm it would get unnoticed because the workflow doesn't run. The chances for this are probably quite low and we have some general coverage with std_spec as well.
We should make sure to run all workflows on release builds (nightlies and maintenances) though.

Reduce smoke tests

We run smoke test for targets that are somewhat supported but we don't have any CI runners for these platforms. Currently, these are 9 platforms.

Smoke test means we only build the object files for std_spec, compiler_spec and compiler for the respective target, but do not actually link it or execute any code.

So these tests are naturally quite limited. They can only detect platform-specific compile time errors. These may happen when working on code related to a specific platform, but otherwise they're very unlikely. And changes to the platform-specific code should be expected to be tested on the respective platform anyway, so smoke test won't do much.

I think we can easily limit smoke tests to run only in release builds.

Prerequisites

In Windows CI we have a couple of workflows to build the requires libraries. Those are cached so these workflows usually just download the cached values and do nothing else. Later jobs directly pull the assets from cache. The lib jobs are just to ensure the cache is populated. They are quite lightweight at ~20 seconds, but these jobs are basically useless in ~99% of runs.
Perhaps we could find a more efficient way to provide lib assets to the build jobs? On Linux we're using Docker images which contain all necessary dependencies, and Nix on macos.

Other measurements

I'm sure there are other things we could do to improve the performance of individual workflows. But they may require more research and digging. Hard to say upfront what would be fruitful.

Ideas such as #13413 come to mind.

Footnotes

  1. All information is based on the state of the latest completed CI run on master, at this point that's https://github.com/crystal-lang/crystal/commit/a310dee1bbf30839964e798d7cd5653c5149ba3d

  2. For example, on Monday September 2, 2024 there were 7 successful runs of the Linux CI workflow with an average duration (time to completion, i.e. wait time + run time) of 64 minutes. On Thursday, September 5 there were 22 succesful runs with an average duration of 103 minutes.

@HertzDevil
Copy link
Contributor

HertzDevil commented Oct 4, 2024

For PRs like #14969 and #15052 coming from branches of this repository itself rather than from forks, is there any point in running the same set of jobs for both push and pull request triggers?

@straight-shoota
Copy link
Member Author

straight-shoota commented Oct 4, 2024

No, there's not really any point. But it's not trivial to deactivate that.
We had a proposal before in #9890 but that wasn't good enough. #10636 is pending review.

@oprypin
Copy link
Member

oprypin commented Oct 4, 2024

No, but there is a point to run tests on pull requests and there is a point to run tests when you push to a branch to check something.

There are alternatives such as

  1. the widespread bad one: not running push tests when pushing to a branch, which I hate with a passion. But if everyone else is really fine with always creating pull requests to try something out, then I can concede

  2. not running pull request tests when the repository is the same, so the push tests must be running already
    https://github.com/abseil/abseil-py/blob/296c08b0137fc3576cafbb0b85aea5fc06c391c1/.github/workflows/test.yml#L14
    This one is actually something I've had on my radar to try out, it's just worth double-checking what edge case effects it might have.
    But nobody seems to push to a branch of this repo anyway when creating pull requests, so this condition is not going to kick in, other than the bot's PRs

  3. maybe we want to just hardcode somehow (similarly to the above example?): if the repo is this particular one and the branch is not main or a release branch, then don't run tests on push. Then people can still try things out on their fork.

@oprypin
Copy link
Member

oprypin commented Oct 4, 2024

@straight-shoota
Copy link
Member Author

straight-shoota commented Oct 4, 2024

Regarding 3.

if the repo is this particular one and the branch is not main or a release branch, then don't run tests on push. Then people can still try things out on their fork.

That won't work. At least not without exceptions (and that would make it complicated).
We regularly push branches to this repo explicitly to run CI (without creating a PR because the code is still WIP and not ready for review, or testing some integration which will never lead to a PR).

@straight-shoota
Copy link
Member Author

CI usage is constantly growing and there's more and more congestion in the pipeline.

We have some good ideas which workflows we can reduce significantly. Generally, these jobs don't contribute much. They rarely fail on generic changes. So we can be fairly confident without them running every time.

We can reduce the total duration of CI runs for every single commit by approximatly 14 hours just by filtering some of the workflows:

  • Compatibility with all Crystal releases between the oldest forward-compatible (1.0.0) and latest: 14 * 0:35 = 9:20
  • Smoke tests: 9 * 0:11 = 1:40
  • LLVM: 7 * 0:25 = 2:55

The tricky part however is how to actually trigger these jobs with lower frequency. We still want to run them at some point in order to catch the odd issue. I think we can allow compromising master occasionally. It should be sufficient to run a full CI suite on master regularly. This would be very easy to implement with a nightly schedule.
That means in case a change breaks but the issue is only caught in the extensive workflows, we might not realize before merging the PR. But we'll know the day after which should be good enough and we can simply revert the change (and solve the issue).
We'll have to see how this plays out in practice, but I'm quite confident this should be a good start.

@straight-shoota
Copy link
Member Author

straight-shoota commented Feb 7, 2025

I have a concrete plan for this:

  • The matrix in linux.yml#x86_64-gnu-test only tests against latest and the oldest forward-compatible version on every commit. ([CI] Extract forward compatibility checks and run on nightly schedule #15437)
  • A new workflow ci-extensive.yml with a nightly schedule as well as manual trigger by pushing to special branches (ci-extensive/*?). This workflow contains:
    • Duplicate of x86_64-gnu-test with the removed entries to test against each forward-compatible version
    • call smoke.yml
    • call llvm.yml
  • smoke.yml push and pull_request triggers filter on platform-related paths: e.g. src/lib_c/*{bsd,i386,android,solaris,dragonfly}*/**, src/crystal/system/** ([CI] Filter runs of Smoke Test workflow #15457)
  • llvm.yml push and pull_request triggers filter on llvm-related paths: e.g. src/llvm/**, spec/std/llvm/** ([CI] Filter runs of LLVM Test workflow #15458)

This is a rough sketch. We might have to figure out a couple of details along the way. I'll probably split this up into separate PRs so we can discuss specifics individually.

@ysbaddaden
Copy link
Contributor

Sounds good.

Testing with Crystal 1.0 means that we already test an old enough LLVM release to catch issues early (so we can skip all the LLVM workflows) 👍

Nightly CI on master sounds good for starters. Maybe we can skip if nothing got merged since the last run?

Maybe we can configure manual workflow runs? Maybe also trigger actions from PR comments? For example type [ci:bsd] and that will trigger the smoke tests for *BSD?

Note: the OpenSSL and PCRE workflows are fast but they're not really useful to run always, only if we changed something related to Regex or OpenSSL. Maybe we can skip based on git paths?

@straight-shoota
Copy link
Member Author

straight-shoota commented Feb 8, 2025

Maybe we can skip if nothing got merged since the last run?

I don't think there's a trivial means for that. Also even if nothing in tree has changed, we're using external dependencies which may change between runs. So it certainly has some merit. And I believe master changes on the majority of days anyway.

Maybe we can configure manual workflow runs?

That's a good idea. Maybe this could even replace the branch name trigger in some cases (I guess we'll have to see).

Maybe also trigger actions from PR comments? For example type [ci:bsd] and that will trigger the smoke tests for *BSD?

This could be a convenience feature for a follow-up if we need to run some workflows often (and want non-commiters to trigger them).

I've left OpenSSL and PCRE out of the base proposal because the runtimes are insignificant. We can follow up with that later, of course.

@straight-shoota
Copy link
Member Author

Regarding the skip if nothing has changed: I think it might be feasible to reduce the frequency. Maybe we don't need all of these tests to run even daily. Every other day or weekly would probably be sufficient (especially when combined with explicit manual triggers for relevant changes).

@straight-shoota
Copy link
Member Author

One aspect to note is that nightly schedules only run on the default branch (i.e. master).
We currently don't have any automation to run expansive tests on release branches. I think this would be quite helpful though. At least for the most recent release branch which migth see some backports. Release branches usually don't have many changes directly, so it could be best to run on every push instead of on a schedule.

I don't think we can combine such a trigger with the path filtering in the workflows directly. So we'll need an orchestrator workflow that triggers on push to a release branch an then calls the actual workflows (forward-compat, llvm, smoke).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants