Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework the way we currently detect regressions in build time metrics #40076

Open
zakkak opened this issue Apr 15, 2024 · 5 comments
Open

Rework the way we currently detect regressions in build time metrics #40076

zakkak opened this issue Apr 15, 2024 · 5 comments

Comments

@zakkak
Copy link
Contributor

zakkak commented Apr 15, 2024

Description

#36108 was an attempt to detect regressions in native builds when certain metrics are outside a given range. Unfortunately this doesn't seem to work well in practice. The main reason seems to be that multiple PRs gradually increase the metrics without hitting the threshold. Then a new PR that happens to increase the metrics a bit more triggers a failure. Although this PR might not be responsible for the total increase (that resulted in hitting the threshold) it is the one being blocked.

Implementation ideas

A thought we had within the mandrel team (cc @Karm @jerboaa) and we are working towards it is the following.

We would like to start collecting data from Quarkus CI runs (initially from runs on main and lately probably from PRs as well). This will allow us to observe the change over time (as show in #39674 (comment)) instead of just when we hit a threshold.

Next we would ideally like to feed these data to a tool with anomaly detection (possibly https://horreum.hyperfoil.io/) in order to get automated alerts when something seems wrong. That could be:

  1. Create a generic GH issue when we have crossed a threshold from the last known "good state"
  2. Create a PR specific issue or comment in an open PR if it appears to be causing a sudden increase in the metrics we are interested in.

Related PRs:

  1. Introduce RunnerInfo to ImageStats Karm/collector#23
  2. Upload native build statistics #39784
Copy link

quarkus-bot bot commented Apr 15, 2024

/cc @Karm (mandrel), @ebullient (metrics), @galderz (mandrel), @jmartisk (metrics)

@maxandersen
Copy link
Member

Go for it! Lets spot those frogs boiling earlier :)

@dmlloyd
Copy link
Member

dmlloyd commented May 16, 2024

I love this idea!

@yrodiere
Copy link
Member

yrodiere commented May 16, 2024

So IIUC, the same checks, but with alerts that won't block merging, and a nice infra with lots of history to help investigation?

Sounds great, +1 :)

zakkak added a commit to zakkak/quarkus that referenced this issue Jun 11, 2024
As part of quarkusio#40076 is now
implemented and we are gathering the necessary stats in a database there
is no longer the need for these tests to exist.

Note that automated alerting in case of regressions is still not setup
but we now have a way to go back in time and see which (few) patches
resulted in the regression, even if we don't spot it immediately.
zakkak added a commit to zakkak/quarkus that referenced this issue Jun 12, 2024
As part of quarkusio#40076 is now
implemented and we are gathering the necessary stats in a database there
is no longer the need for these tests to exist.

Note that automated alerting in case of regressions is still not setup
but we now have a way to go back in time and see which (few) patches
resulted in the regression, even if we don't spot it immediately.
holly-cummins pushed a commit to holly-cummins/quarkus that referenced this issue Jul 31, 2024
As part of quarkusio#40076 is now
implemented and we are gathering the necessary stats in a database there
is no longer the need for these tests to exist.

Note that automated alerting in case of regressions is still not setup
but we now have a way to go back in time and see which (few) patches
resulted in the regression, even if we don't spot it immediately.
@zakkak
Copy link
Contributor Author

zakkak commented Sep 13, 2024

Status update:

✔️ Data are being collected from CI runs on main (see #39784)
✔️ Integration with Horreum Hyperfoil/Horreum#1703
📆 Automated reporting of anomalies waiting for Horreum new release to be deployed in production

danielsoro pushed a commit to danielsoro/quarkus that referenced this issue Sep 20, 2024
As part of quarkusio#40076 is now
implemented and we are gathering the necessary stats in a database there
is no longer the need for these tests to exist.

Note that automated alerting in case of regressions is still not setup
but we now have a way to go back in time and see which (few) patches
resulted in the regression, even if we don't spot it immediately.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants