Add an automatic PR-revert utility for failing buildbots #535

Closed
pablogsal wants to merge 5 commits

Conversation

@pablogsal
Member

No description provided.

@pablogsal requested a review from ambv on October 1, 2024 21:33
@pablogsal force-pushed the buildbothammer branch 6 times, most recently from 8677cac to ed4d20d on October 1, 2024 23:52
@pablogsal requested review from zware and encukou on October 2, 2024 20:28
@pablogsal
Member Author

This will run as a cronjob on the buildbot master server.

@ambv (Contributor) left a comment


Looks fine to me on principle, but we need to work out the kinks by testing this out.

buildbothammer/README.md (outdated; resolved)
buildbothammer/pyproject.toml (resolved)
buildbothammer/src/buildbothammer/__init__.py (resolved)
import os
from pathlib import Path

REPO_OWNER = "python"
REPO_NAME = "cpython"
FORK_OWNER = os.getenv("FORK_OWNER", REPO_OWNER)
REPO_CLONE_DIR = Path("/tmp/cpython")  # Replace with actual path
@ambv
Contributor

What process do you plan to use to replace all the placeholder directories, emails, etc.?

@pablogsal
Member Author

Not sure I understand the question. What do you mean by “what process”?

@pablogsal
Member Author

If you mean the values of the fork owner, the tokens, and all of that, I was planning to have a systemd launch script or a bash script with the environment variables and run that as the cron job.

If that is what you meant, do you have a better preference?
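Roughly something like this (a sketch only; the paths, the token variable name, and the entry-point invocation are placeholders rather than final values):

```bash
#!/bin/bash
# Sketch: wrapper installed as e.g. /usr/local/bin/run-buildbothammer.sh
# and invoked from cron, for example:
#   */15 * * * * buildbot /usr/local/bin/run-buildbothammer.sh
set -euo pipefail

export FORK_OWNER="..."    # fork where the revert branches get pushed
export GITHUB_TOKEN="..."  # placeholder name for the token used to open the revert PRs

# The entry point below is an assumption; adjust to however the package ends up installed.
exec /srv/buildbothammer/venv/bin/python -m buildbothammer
```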

@pablogsal
Member Author

Looks fine to me on principle, but we need to work out the kinks by testing this out.

I tested it in python/cpython#124870

I will clean it up before we land it, now that I know you like the approach.

@pablogsal force-pushed the buildbothammer branch 2 times, most recently from 003f324 to ed16034 on October 10, 2024 22:42
@encukou
Member

encukou commented Oct 11, 2024

I'd appreciate it if you could write down the problem this solves. I wasn't in the discussions for this, and it seems there are a few steps of reasoning leading to this solution that aren't clear to me.

A lot of the buildbot issues I see won't be caught by this script, since they test several commits: refleaks and other slow builders, high-activity events, or just an unrelated docs commit coming in at the same time.
I can't quite tell if that's fine.

@vstinner
Member

I have mixed feelings about this feature: creating a revert PR when a buildbot fails. In my experience, we need some investigation to decide whether a failure is random, a real regression, or something else. Also, when there is a regression, I prefer a discussion to decide what to do rather than just reverting. Having a change reverted is bad for the change author, and it's not always needed.

For me, this is more of a technical solution to a human issue, which should be solved by interacting rather than by a big "revert" hammer.

@pablogsal
Member Author

This PR is the result of some discussion we had at the core developer sprint. The idea of the tool is that it opens (but doesn't merge) revert PRs when a buildbot builder has failed more than N times, was green before, and the failure corresponds to a single commit.

The idea is not to force the revert on people (a human still needs to merge the PR, and the discussion still needs to happen) but to start the process in a way that isn't one person opening a revert PR against someone else's work: it's a bot opening the revert PR.

This won't catch everything, but it will catch the cases where we can be sure the failure is caused by a single commit, and the PRs will be opened by a bot and not a human, highlighting that this is an automated process to keep the CI from staying broken.

The problem this solves is the CI staying red for a long time, which makes fixes or reverts more challenging as time passes, to the point where the reverts are not clean anymore. We have discussed many times (not only at this last core dev sprint) that having a buildbot failing for a long time makes things much more difficult, especially when issues start to pile up and people are slow to act.
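To make those conditions concrete, here is a minimal sketch of the decision logic described above; the threshold, data shapes, and function name are illustrative, not the actual buildbothammer code:

```python
FAILURE_THRESHOLD = 3  # the "N" consecutive failures before we act (illustrative value)

def commit_to_revert(builds):
    """builds: most-recent-first list of (state, changes) tuples for one builder.

    Returns the single offending commit if all conditions hold, else None.
    """
    recent = builds[:FAILURE_THRESHOLD]
    # Condition 1: the builder has failed at least N times in a row.
    if len(recent) < FAILURE_THRESHOLD or any(state != "failed" for state, _ in recent):
        return None
    # Condition 2: the builder was green right before the current red streak.
    older = builds[FAILURE_THRESHOLD:]
    if not older or older[0][0] != "success":
        return None
    # Condition 3: the first failing build of the streak has exactly one
    # associated change, so we know which commit to propose reverting.
    first_failing_changes = recent[-1][1]
    if len(first_failing_changes) != 1:
        return None
    return first_failing_changes[0]
```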

@pablogsal
Member Author

@encukou @vstinner would you prefer that we don't go ahead with this and close the PR?

@encukou
Member

encukou commented Oct 11, 2024

For reference, this logic would not catch any of the 36 currently failing stable builders. For all of them, either they have always failed, or the first build of the most recent red streak has more or fewer than one associated Change.
(3 of them have been failing for so long that the change data was scrubbed; the hammer might have struck there if it had been active.)
Of course, the current situation is unusual -- the dashboard is still recovering from adding tier-3 and no-tier builders.

If the hammer does turn out to be reliable, I'd be worried about another problem: once people rely on it, its blind spots become CPython's blind spots. (That's a recent scar of course -- I caused this kind of issue with my daily reports!)

All in all, I do agree we need to do something, but I'm skeptical about this script. Sorry for telling you after you wrote it!

If we are to go ahead: the script won't hurt, and the best way to debug such three-way service integration is in production. Code-wise, ship it! :)

The logic here looks good, and I'll take a look at integrating it into the “release” dashboard (which we'll need anyway to catch the multi-commit builds):

  • the streak & cause logic
  • unstable builders that recently started failing

@ambv
Contributor

ambv commented Oct 11, 2024

I'm a bit puzzled by the pushback here.

This automation is implementing an existing policy that we haven't been enforcing very consistently:

  • if we can identify that a single change broke a buildbot consistently, we post a comment on the PR;
  • the PR author has 24 hours to react; otherwise, a revert will happen.

This policy is even mentioned explicitly in Victor's unofficial devguide.

Sure, the revert PRs this script makes will only solve a subset of the cases. But that's by design. It's a conservative approach to avoid producing PRs that revert changes we're not really sure about.

And even when a PR is created, we don't have to merge the revert. We can ping the original change author first, pointing to the PR and saying "hey, look, there's an urgent thing for you to look at".

@encukou
Member

encukou commented Oct 11, 2024

I agree with that. The script does the right thing. The question is whether it'll do the right thing often enough to be worth it.
But it'll be easier to evaluate that once it's merged and running.

@zware
Member

zware commented Oct 11, 2024

I have not had a chance to actually review the code, so all thoughts here assume that it's perfect :)

My biggest concern with this is just logistical; I'm not sure it should be part of the buildmaster-config repository. Any time this repository is updated, the buildbot master gets forcibly restarted (unless someone surreptitiously pulls the change on the host before Salt gets a chance to), and this really isn't part of the master's configuration. I'm fine with deploying this to the buildbot master host, but if it's not part of the master process itself it will need its own deployment set up in psf-salt and I think it would be best in its own repository.


If it doesn't already do so, this should definitely trigger a build of the revert PR on the failing bot(s).

@pablogsal
Member Author

My biggest concern with this is just logistical; I'm not sure it should be part of the buildmaster-config repository. Any time this repository is updated, the buildbot master gets forcibly restarted (unless someone surreptitiously pulls the change on the host before Salt gets a chance to), and this really isn't part of the master's configuration. I'm fine with deploying this to the buildbot master host, but if it's not part of the master process itself it will need its own deployment set up in psf-salt and I think it would be best in its own repository.

OK, we can place this in a new repo, no problem, but I worry about the discoverability and reusability of the common bits. But I am open to it if that is the consensus.

If it doesn't already do so, this should definitely trigger a build of the revert PR on the failing bot(s).

It does not; I will modify it if we decide to go forward 👍
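For reference, one way to do that would be to apply the label that requests buildbot runs to the newly opened revert PR via the GitHub API. A minimal sketch, assuming the usual "🔨 test-with-buildbots" label and an environment-provided token (both assumptions, not what's currently in this PR):

```python
# Sketch: ask the buildbot fleet to test the revert PR by applying the
# "test-with-buildbots" label. Label name and token handling are assumptions.
import os

import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
LABEL = "🔨 test-with-buildbots"  # assumed label name on python/cpython

def request_buildbot_runs(pr_number: int) -> None:
    response = requests.post(
        f"https://api.github.com/repos/python/cpython/issues/{pr_number}/labels",
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={"labels": [LABEL]},
        timeout=30,
    )
    response.raise_for_status()
```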

@vstinner
Member

@encukou @vstinner would you prefer that we don't go ahead with this and close the PR?

I'm neutral, I don't have a strong opinion on this PR :-) I'm afraid of two things.

If the policy is applied blindly, contributors who worked weeks on a PR will see their PR reverted in 24h without having an opportunity to dig into the issue, whereas the fix can take 5 minutes. It can be disappointing and send the wrong signal to contributors: "your work is not worth it, it has been rejected".

I'm also afraid of seeing tons of "revert" PRs created automatically, and someone will have to go through the list to decide which ones should be applied. Reviewing a revert PR requires fully understanding the issue, which can mean fixing the regression :-)

Maybe things will be smooth and people will understand the need to revert automatically if no one is available to write a fix.

@ambv:

I'm a bit puzzled by the pushback here. (...) This policy is even mentioned explicitly in Victor's unofficial devguide.

Right. But in practice, it's usually rare to revert, and easier to fix the regression. For me, a revert is a last-resort action.

Well, I just wanted to share my concerns; I don't want to hold up the work here. If you want to give it a try, go ahead, and we can adjust the process later depending on how it goes.

@vstinner
Member

Said differently: applying a revert in 24 hours is too short for most people's availability. If you want a strict policy, maybe we should give contributors a week to fix the issue. Usually, a red buildbot is not a major blocker and it can stay red for a few days.

@ambv
Contributor

ambv commented Oct 11, 2024

If the policy is applied blindly, contributors who worked weeks on a PR will see their PR reverted in 24h without having an opportunity to dig into the issue, whereas the fix can take 5 minutes.

Right, which is why I said nobody's first reaction will be to apply a proposed revert PR without talking to the person. In fact, the author is @mentioned in the PR, so they have a chance to respond.

Reviewing a revert PR requires fully understanding the issue, which can mean fixing the regression

This is false. Reverting = returning to a previously known state. If done quickly, there is minimal collateral that needs to be taken into account.
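Mechanically, it's nothing more than going back to the last known-good commit. A sketch, with <sha> and <fork> standing in for whatever offending commit and fork the bot identified:

```bash
# Sketch: create a revert branch for a single offending commit <sha>.
git switch -c revert-<sha> origin/main   # start from the current main
git revert --no-edit <sha>               # new commit that undoes <sha>
git push <fork> revert-<sha>             # push to a fork and open the PR from there
```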

If you want a strict policy, maybe we should give contributors a week to fix the issue.

We can definitely refine the magic number of hours before we apply the revert. The 24-hour period that you yourself implemented in the past was based on the following observations:

  • the longer the breakage is present, the harder it is to revert;
  • a failing buildbot makes us blind to further damage done after the fact, i.e. other PRs that would otherwise break that same buildbot will not receive an alert that they did so. This is particularly problematic for refleak buildbots.

@pablogsal
Member Author

Said differently: applying a revert in 24 hours is too short for most people's availability. If you want a strict policy, maybe we should give contributors a week to fix the issue. Usually, a red buildbot is not a major blocker and it can stay red for a few days.

Nobody will merge the revert automatically, so there is no risk of things happening without human intervention.

The PRs are only filed, and acting quickly is important to avoid piling up issues in the buildbot fleet that make reverts more difficult and challenging for the people involved. We have been doing this by hand for years. This just adds some help and makes the revert come from a bot, which a lot of people at several core dev sprints and language summit talks thought was a good idea.

@encukou
Member

encukou commented Oct 11, 2024

I won't block the PR, and I'll help deploy it if you make it my job -- just say so :)

Since I think that this will need a human in the loop way more often than not, I'm instead investing time in a dashboard that shows the “Breaking build” and its associated change(s). Here's a very much work-in-progress static preview (as in, it won't automatically update).
To not derail the discussion here, please direct any comments on that to its repo.

@pablogsal
Member Author

If the policy is applied blindly, contributors who worked weeks on a PR will see their PR reverted in 24h without having an opportunity to dig into the issue, whereas the fix can take 5 minutes. It can be disappointing and send the wrong signal to contributors: "your work is not worth it, it has been rejected".

The signal is: your work was good, which is why we merged it, but it unfortunately broke the CI, and that has consequences for everyone in the project, so we are opening this PR to potentially revert it and avoid further problems for everyone.

If you are worried about this, we can tweak the revert message as much as we want to make clear why this is happening, what the policy is, and what the consequences are. But the reality of the situation is that not reverting has doubled or tripled the amount of work needed to find and fix CI problems for years, and we have evidence of that (think of all the problems with ref leaks and crashes in subinterpreters in 3.11).

@pablogsal
Member Author

I'm also afraid of seeing tons of "revert" PRs created automatically, and someone will have to go through the list to decide which ones should be applied. Reviewing a revert PR requires fully understanding the issue, which can mean fixing the regression :-)

This won't happen: having all of the conditions be true at the same time (so that we are sure the PR is not a false positive) is actually quite rare. I ran a simulation, and across this year it would have filed 4 revert PRs.

@encukou
Member

encukou commented Oct 22, 2024

It's hard to say this when you have a working implementation ready, but: for 4 actions a year, I think this is too much code & infrastructure to maintain.

We do need to automate reverts, but the process needs a human in the loop in all but the simplest cases. I filed a plan here: python/core-workflow#559

@pablogsal
Member Author

pablogsal commented Oct 22, 2024

OK, I guess I will close the PR then. I do apologise if this sounds a bit negative, but I don't have more energy to keep defending my views on this, especially since, unfortunately, I am not feeling that this is a collaboration.

@pablogsal closed this on Oct 22, 2024