Reduce CI usage #14983
Comments
No, but there is a point to running tests on pull requests, and there is a point to running tests when you push to a branch to check something. There are alternatives such as

Regarding 3.: That won't work. At least not without exceptions (and that would make it complicated).
CI usage is constantly growing and there's more and more congestion in the pipeline. We have some good ideas about which workflows we can reduce significantly. Generally, these jobs don't contribute much: they rarely fail on generic changes, so we can be fairly confident even without running them on every commit. We can reduce the total duration of CI runs for every single commit by approximately 14 hours just by filtering some of the workflows:
The tricky part, however, is how to actually trigger these jobs at a lower frequency. We still want to run them at some point in order to catch the odd issue. I think we can allow master to break occasionally; it should be sufficient to run the full CI suite on master regularly. This would be very easy to implement with a nightly schedule.
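A minimal sketch of such a nightly trigger (the cron time is an arbitrary assumption, not a proposal):

```yaml
# Hypothetical addition to a workflow file: run the full suite once a day.
# Note: GitHub Actions evaluates schedule triggers only on the default branch.
on:
  schedule:
    - cron: "0 3 * * *" # daily at 03:00 UTC (time chosen arbitrarily)
```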
I have a concrete plan for this:

This is a rough sketch. We might have to figure out a couple of details along the way. I'll probably split this up into separate PRs so we can discuss specifics individually.
Sounds good. Testing with Crystal 1.0 means that we already test an old enough LLVM release to catch issues early (so we can skip all the LLVM workflows) 👍

Nightly CI on

Maybe we can configure manual workflow runs? Maybe also trigger actions from PR comments? For example type

Note: the OpenSSL and PCRE workflows are fast, but they're not really useful to run always, only if we changed something related to Regex or OpenSSL. Maybe we can skip based on git paths?
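Manual runs are available out of the box via the `workflow_dispatch` trigger; a minimal sketch (the input name is a hypothetical example, not part of any existing workflow):

```yaml
# Hypothetical: allow maintainers to trigger a workflow manually
# from the Actions tab or via the gh CLI.
on:
  workflow_dispatch:
    inputs:
      reason:
        description: "Why this run was triggered (hypothetical input)"
        required: false
```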
I don't think there's a trivial way to do that. Also, even if nothing in the tree has changed, we're using external dependencies which may change between runs, so it certainly has some merit. And I believe master changes on the majority of days anyway.
That's a good idea. Maybe this could even replace the branch name trigger in some cases (I guess we'll have to see).
This could be a convenience feature for a follow-up if we need to run some workflows often (and want non-committers to trigger them). I've left OpenSSL and PCRE out of the base proposal because their runtimes are insignificant. We can follow up with that later, of course.
Regarding skipping when nothing has changed: I think it might be feasible to reduce the frequency. Maybe we don't need all of these tests to run even daily. Every other day or weekly would probably be sufficient (especially when combined with explicit manual triggers for relevant changes).
One aspect to note is that nightly schedules only run on the default branch (i.e. `master`). I don't think we can combine such a trigger with the path filtering in the workflows directly. So we'll need an orchestrator workflow that triggers on push to a release branch and then calls the actual workflows (forward-compat, llvm, smoke).
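A rough sketch of such an orchestrator, assuming the called workflows are converted to reusable workflows (file names and the branch pattern are hypothetical):

```yaml
# Hypothetical orchestrator workflow. Each called workflow would need
# an `on: workflow_call:` trigger to be reusable like this.
name: Release branch CI
on:
  push:
    branches: ["release/*"] # hypothetical branch pattern
jobs:
  forward-compat:
    uses: ./.github/workflows/forward-compat.yml
  llvm:
    uses: ./.github/workflows/llvm.yml
  smoke:
    uses: ./.github/workflows/smoke.yml
```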
We're currently running 57 individual workflows in CI on every commit,[^1] and counting (#14964).
Some runs are fairly small, like the library compatibility tests for OpenSSL and `libpcre`, which take ~25 seconds each, and most of that is setup. Not too much to worry about those. But most are orders of magnitude bigger and produce quite a noticeable load.

The majority of workflows run `std_spec`, `compiler_spec`, build the compiler itself and run `std_spec` again, or some part of that. The full routine of `bin/ci build` usually takes 30-40 minutes.

Our CI runners are generously sponsored by GitHub, so using more resources doesn't incur an immediate cost for us. But we should still use the resources responsibly. And we suffer from significant congestion when there's lots of activity because parallel runners are limited.[^2]
I think we have some potential to reduce the number of runs for some workflows. We don't need to test everything on every commit.
### Reduce matrix in `linux.yml`
We currently run a matrix job to test forward compatibility with every single Crystal version since 1.0.0. That's a total of 14 versions, and the count keeps growing.
I think we can safely reduce that number. I do not recall if this has ever brought any valuable insight. If something breaks compatibility with older compilers, it's usually broken from a specific compiler version downwards. So testing the oldest and most recent versions (currently 1.0.0 and 1.13.2) should theoretically be sufficient. You only need the versions in between to pinpoint where exactly the breakage appears, but that's part of the debugging process and doesn't need to be in CI.
We could still keep a couple more versions in between for due diligence, but there's definitely no need to test all versions on every commit. Perhaps we could run the full set on release builds (maintenance and nightlies) just to be sure.
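Sketched as a workflow matrix (the job name and matrix key are assumptions for illustration, not the actual contents of `linux.yml`):

```yaml
# Hypothetical reduced matrix: test only the oldest and the most
# recent supported compiler instead of all 14 versions.
jobs:
  forward-compatibility: # hypothetical job name
    strategy:
      matrix:
        crystal_version: ["1.0.0", "1.13.2"] # hypothetical key
```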
### Limit `llvm.yml` & other library version tests

We're currently testing support for all major LLVM versions between 13 and 18. These jobs also run on every commit. We certainly want to keep testing all these versions as long as they're supported.
But it should not be necessary to do that on every commit. I think we could limit this workflow to only run when LLVM-related source code is directly affected (`src/llvm`).

Similar restrictions could apply to other workflows that test library support across multiple versions.
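With GitHub Actions' built-in path filtering this could look roughly like the following sketch (the workflow's own file is included so changes to the trigger itself still cause a run; this is not the actual `llvm.yml`):

```yaml
# Hypothetical trigger for llvm.yml: only run when LLVM-related
# files (or the workflow itself) change.
on:
  push:
    paths:
      - "src/llvm/**"
      - ".github/workflows/llvm.yml"
  pull_request:
    paths:
      - "src/llvm/**"
      - ".github/workflows/llvm.yml"
```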
The problem with this is that changes elsewhere in the stdlib can have an effect as well. If a change in `src/pointer.cr` broke something in LLVM support, it would go unnoticed because the workflow doesn't run. The chances of this are probably quite low, and we have some general coverage with `std_spec` as well.

We should make sure to run all workflows on release builds (nightlies and maintenance releases) though.
### Reduce smoke tests

We run smoke tests for targets that are somewhat supported but for which we don't have any CI runners. Currently, these are 9 platforms.
A smoke test means we only build the object files for `std_spec`, `compiler_spec` and `compiler` for the respective target, but do not actually link or execute any code.

So these tests are naturally quite limited. They can only detect platform-specific compile-time errors. These may happen when working on code related to a specific platform, but otherwise they're very unlikely. And changes to platform-specific code should be expected to be tested on the respective platform anyway, so smoke tests won't do much.
I think we can easily limit smoke tests to run only in release builds.
### Prerequisites

In Windows CI we have a couple of workflows to build the required libraries. Those are cached, so these workflows usually just download the cached values and do nothing else. Later jobs pull the assets directly from the cache; the lib jobs exist only to ensure the cache is populated. They are quite lightweight at ~20 seconds, but they're basically useless in ~99% of runs.
Perhaps we could find a more efficient way to provide the lib assets to the build jobs? On Linux we're using Docker images which contain all necessary dependencies, and Nix on macOS.
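One option (a sketch under assumptions, not a concrete proposal) would be to let build jobs restore the library cache directly and only fall back to the lib workflows on a miss; the path and cache key below are hypothetical:

```yaml
# Hypothetical: a build job restores the prebuilt libraries itself,
# skipping the separate lib workflows when the cache is warm.
steps:
  - uses: actions/cache/restore@v4
    with:
      path: libs/ # hypothetical location of the prebuilt libraries
      key: win-libs-${{ hashFiles('.github/workflows/win.yml') }} # hypothetical key
```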
### Other measures
I'm sure there are other things we could do to improve the performance of individual workflows. But they may require more research and digging. Hard to say upfront what would be fruitful.
Ideas such as #13413 come to mind.
Footnotes

[^1]: All information is based on the state of the latest completed CI run on master, at this point that's https://github.com/crystal-lang/crystal/commit/a310dee1bbf30839964e798d7cd5653c5149ba3d

[^2]: For example, on Monday, September 2, 2024 there were 7 successful runs of the Linux CI workflow with an average duration (time to completion, i.e. wait time + run time) of 64 minutes. On Thursday, September 5 there were 22 successful runs with an average duration of 103 minutes.