Add an automatic PR-revert utility for failing buildbots #535

Closed
pablogsal wants to merge 5 commits

Conversation

@pablogsal
Member

No description provided.

@pablogsal requested a review from ambv on October 1, 2024 21:33
@pablogsal force-pushed the buildbothammer branch 6 times, most recently from 8677cac to ed4d20d on October 1, 2024 23:52
@pablogsal requested review from zware and encukou on October 2, 2024 20:28
@pablogsal
Member Author

This will run as a cronjob on the buildbot master server.

@ambv (Contributor) left a comment


Looks fine to me on principle, but we need to work out the kinks by testing this out.

buildbothammer/README.md (outdated; resolved)
buildbothammer/pyproject.toml (resolved)
buildbothammer/src/buildbothammer/__init__.py (resolved)
import os
from pathlib import Path

REPO_OWNER = "python"
REPO_NAME = "cpython"
FORK_OWNER = os.getenv("FORK_OWNER", REPO_OWNER)
REPO_CLONE_DIR = Path("/tmp/cpython")  # Replace with actual path
@ambv
Contributor

What process do you plan to use to replace all the placeholder directories, emails, etc.?

@pablogsal
Member Author

Not sure I understand the question. What do you mean by “what process”?

@pablogsal
Member Author

If you mean the values of the fork owner, the tokens, and all of that, I was planning to have a systemd launch script or a bash script with the environment variables and run that as the cron job.

If that is what you meant, do you have a better preference?
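Roughly something like this (a sketch only; the paths, the token variable name, and the entry-point invocation are placeholders rather than final values):

```bash
#!/bin/bash
# Sketch: wrapper installed as e.g. /usr/local/bin/run-buildbothammer.sh
# and invoked from cron, for example:
#   */15 * * * * buildbot /usr/local/bin/run-buildbothammer.sh
set -euo pipefail

export FORK_OWNER="..."    # fork where the revert branches get pushed
export GITHUB_TOKEN="..."  # placeholder name for the token used to open the revert PRs

# The entry point below is an assumption; adjust to however the package ends up installed.
exec /srv/buildbothammer/venv/bin/python -m buildbothammer
```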

@pablogsal
Member Author

Looks fine to me on principle, but we need to work out the kinks by testing this out.

I tested it in python/cpython#124870

I will clean it up before we land it, now that I know you like the approach.

@pablogsal force-pushed the buildbothammer branch 2 times, most recently from 003f324 to ed16034 on October 10, 2024 22:42
@encukou
Member

encukou commented Oct 11, 2024

I'd appreciate it if you could write down the problem this solves. I wasn't in the discussions for this, and it seems there are a few steps of reasoning leading to this solution that aren't clear to me.

A lot of the buildbot issues I see won't be caught by this script, since they test several commits: refleaks and other slow builders, high-activity events, or just an unrelated docs commit coming in at the same time.
I can't quite tell if that's fine.

@vstinner
Member

I have mixed feelings about this feature: creating a revert PR when a buildbot fails. In my experience, we need some investigation to decide whether a failure is random, a real regression, or something else. Also, when there is a regression, I prefer a discussion to decide what to do rather than just reverting. Having a change reverted is bad for the change author, and it's not always needed.

For me, this is more of a technical solution to a human issue, which should be solved by interacting rather than by a big "revert" hammer.

@pablogsal
Member Author

This PR is the result of some discussion we had at the core developer sprint. The idea of the tool is that it opens (but doesn't merge) revert PRs when a buildbot builder has failed more than N times, was green before, and the failure corresponds to a single commit.

The idea is not to force the revert on people (a human still needs to merge the PR, and the discussion still needs to happen) but to start the process in a way that isn't one person opening a revert PR against someone else's work: it's a bot opening the revert PR.

This won't catch everything, but it will catch the cases where we can be sure the failure is caused by a single commit, and the PRs will be opened by a bot and not a human, highlighting that this is an automated process to keep the CI from staying broken.

The problem this solves is the CI staying red for a long time, which makes fixes or reverts more challenging as time passes, to the point where the reverts are not clean anymore. We have discussed many times (not only at this last core dev sprint) that having a buildbot failing for a long time makes things much more difficult, especially when issues start to pile up and people are slow to act.
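To make those conditions concrete, here is a minimal sketch of the decision logic described above; the threshold, data shapes, and function name are illustrative, not the actual buildbothammer code:

```python
FAILURE_THRESHOLD = 3  # the "N" consecutive failures before we act (illustrative value)

def commit_to_revert(builds):
    """builds: most-recent-first list of (state, changes) tuples for one builder.

    Returns the single offending commit if all conditions hold, else None.
    """
    recent = builds[:FAILURE_THRESHOLD]
    # Condition 1: the builder has failed at least N times in a row.
    if len(recent) < FAILURE_THRESHOLD or any(state != "failed" for state, _ in recent):
        return None
    # Condition 2: the builder was green right before the current red streak.
    older = builds[FAILURE_THRESHOLD:]
    if not older or older[0][0] != "success":
        return None
    # Condition 3: the first failing build of the streak has exactly one
    # associated change, so we know which commit to propose reverting.
    first_failing_changes = recent[-1][1]
    if len(first_failing_changes) != 1:
        return None
    return first_failing_changes[0]
```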

@pablogsal
Member Author

@encukou @vstinner would you prefer that we don't go ahead with this and close the PR?

@encukou
Member

encukou commented Oct 11, 2024

For reference, this logic would not catch any of the 36 currently failing stable builders. For all of them, either they have always failed, or the first build of the most recent red streak has more or fewer than one associated Change.
(3 of them have been failing for so long that the change data was scrubbed; the hammer might have struck there if it had been active.)
Of course, the current situation is unusual -- the dashboard is still recovering from adding tier-3 and no-tier builders.

If the hammer does turn out to be reliable, I'd be worried about another problem: once people rely on it, its blind spots become CPython's blind spots. (That's a recent scar of course -- I caused this kind of issue with my daily reports!)

All in all, I do agree we need to do something, but I'm skeptical about this script. Sorry for telling you after you wrote it!

If we are to go ahead: the script won't hurt, and the best way to debug such three-way service integration is in production. Code-wise, ship it! :)

The logic here looks good, and I'll take a look at integrating it into the “release” dashboard (which we'll need anyway to catch the multi-commit builds):

  • the streak & cause logic
  • unstable builders that recently started failing

@ambv
Contributor

ambv commented Oct 11, 2024

I'm a bit puzzled by the pushback here.

This automation is implementing an existing policy that we haven't been enforcing very consistently:

  • if we can identify that a single change broke a buildbot consistently, we post a comment on the PR;
  • the PR author has 24 hours to react; otherwise, a revert will happen.

This policy is even mentioned explicitly in Victor's unofficial devguide.

Sure, the revert PRs this script makes will only solve a subset of the cases. But that's by design. It's a conservative approach to avoid producing PRs that revert changes we're not really sure about.

And even when a PR is created, we don't have to merge the revert. We can ping the original change author first, pointing to the PR and saying "hey, look, there's an urgent thing for you to look at".

@encukou
Member

encukou commented Oct 11, 2024

I agree with that. The script does the right thing. The question is whether it'll do the right thing often enough to be worth it.
But it'll be easier to evaluate that once it's merged and running.

@zware
Member

zware commented Oct 11, 2024

I have not had a chance to actually review the code, so all thoughts here assume that it's perfect :)

My biggest concern with this is just logistical; I'm not sure it should be part of the buildmaster-config repository. Any time this repository is updated, the buildbot master gets forcibly restarted (unless someone surreptitiously pulls the change on the host before Salt gets a chance to), and this really isn't part of the master's configuration. I'm fine with deploying this to the buildbot master host, but if it's not part of the master process itself it will need its own deployment set up in psf-salt and I think it would be best in its own repository.


If it doesn't already do so, this should definitely trigger a build of the revert PR on the failing bot(s).

@pablogsal
Member Author

My biggest concern with this is just logistical; I'm not sure it should be part of the buildmaster-config repository. Any time this repository is updated, the buildbot master gets forcibly restarted (unless someone surreptitiously pulls the change on the host before Salt gets a chance to), and this really isn't part of the master's configuration. I'm fine with deploying this to the buildbot master host, but if it's not part of the master process itself it will need its own deployment set up in psf-salt and I think it would be best in its own repository.

OK, we can place this in a new repo, no problem, but I worry about the discoverability and reusability of the common bits. But I am open to it if that is the consensus.

If it doesn't already do so, this should definitely trigger a build of the revert PR on the failing bot(s).

It does not; I will modify it if we decide to go forward 👍
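For reference, one way to do that would be to apply the label that requests buildbot runs to the newly opened revert PR via the GitHub API. A minimal sketch, assuming the usual "🔨 test-with-buildbots" label and an environment-provided token (both assumptions, not what's currently in this PR):

```python
# Sketch: ask the buildbot fleet to test the revert PR by applying the
# "test-with-buildbots" label. Label name and token handling are assumptions.
import os

import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
LABEL = "🔨 test-with-buildbots"  # assumed label name on python/cpython

def request_buildbot_runs(pr_number: int) -> None:
    response = requests.post(
        f"https://api.github.com/repos/python/cpython/issues/{pr_number}/labels",
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={"labels": [LABEL]},
        timeout=30,
    )
    response.raise_for_status()
```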

@vstinner
Member

@encukou @vstinner would you prefer that we don't go ahead with this and close the PR?

I'm neutral, I don't have a strong opinion on this PR :-) I'm afraid of two things.

If the policy is applied blindly, contributors who worked weeks on a PR will see their PR reverted in 24h without having an opportunity to dig into the issue, whereas the fix can take 5 minutes. It can be disappointing and send the wrong signal to contributors: "your work is not worth it, it has been rejected".

I'm also afraid of seeing tons of "revert" PRs created automatically, and someone will have to go through the list to decide which ones should be applied. Reviewing a revert PR requires fully understanding the issue, which can mean fixing the regression :-)

Maybe things will be smooth and people will understand the need to revert automatically if no one is available to write a fix.

@ambv:

I'm a bit puzzled by the pushback here. (...) This policy is even mentioned explicitly in Victor's unofficial devguide.

Right. But in practice, it's usually rare to revert, and easier to fix the regression. For me, a revert is a last-resort action.

Well, I just wanted to share my concerns; I don't want to hold up the work here. If you want to give it a try, go ahead, and we can adjust the process later depending on how it goes.

@vstinner
Member

Said differently: applying a revert in 24 hours is too short for most people's availability. If you want a strict policy, maybe we should give contributors a week to fix the issue. Usually, a red buildbot is not a major blocker and it can stay red for a few days.

@ambv
Contributor

ambv commented Oct 11, 2024

If the policy is applied blindly, contributors who worked weeks on a PR will see their PR reverted in 24h without having an opportunity to dig into the issue, whereas the fix can take 5 minutes.

Right, which is why I said nobody's first reaction will be to apply a proposed revert PR without talking to the person. In fact, the author is @mentioned in the PR, so they have a chance to respond.

Reviewing a revert PR requires fully understanding the issue, which can mean fixing the regression

This is false. Reverting = returning to a previously known state. If done quickly, there is minimal collateral that needs to be taken into account.
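Mechanically, it's nothing more than going back to the last known-good commit. A sketch, with <sha> and <fork> standing in for whatever offending commit and fork the bot identified:

```bash
# Sketch: create a revert branch for a single offending commit <sha>.
git switch -c revert-<sha> origin/main   # start from the current main
git revert --no-edit <sha>               # new commit that undoes <sha>
git push <fork> revert-<sha>             # push to a fork and open the PR from there
```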

If you want a strict policy, maybe we should give contributors a week to fix the issue.

We can definitely refine the magic number of hours before we apply the revert. The 24-hour period that you yourself implemented in the past was based on the following observations:

  • the longer the breakage is present, the harder it is to revert;
  • a failing buildbot makes us blind to further damage done after the fact, i.e. other PRs that would otherwise break that same buildbot will not receive an alert that they did so. This is particularly problematic for refleak buildbots.

@pablogsal
Member Author

Said differently: applying a revert in 24 hours is too short for most people's availability. If you want a strict policy, maybe we should give contributors a week to fix the issue. Usually, a red buildbot is not a major blocker and it can stay red for a few days.

Nobody will merge the revert automatically, so there is no risk of things happening without human intervention.

The PRs are only filed, and acting quickly is important to avoid piling up issues in the buildbot fleet that make reverts more difficult and challenging for the people involved. We have been doing this by hand for years. This just adds some help and makes the revert come from a bot, which a lot of people at several core dev sprints and language summit talks thought was a good idea.

@encukou
Member

encukou commented Oct 11, 2024

I won't block the PR, and I'll help deploy it if you make it my job -- just say so :)

Since I think that this will need a human in the loop way more often than not, I'm instead investing time in a dashboard that shows the “Breaking build” and its associated change(s). Here's a very much work-in-progress static preview (as in, it won't automatically update).
To not derail the discussion here, please direct any comments on that to its repo.

@pablogsal
Member Author

If the policy is applied blindly, contributors who worked weeks on a PR will see their PR reverted in 24h without having an opportunity to dig into the issue, whereas the fix can take 5 minutes. It can be disappointing and send the wrong signal to contributors: "your work is not worth it, it has been rejected".

The signal is: your work was good, which is why we merged it, but it unfortunately broke the CI, and that has consequences for everyone in the project, so we are opening this PR to potentially revert it and avoid further problems for everyone.

If you are worried about this, we can tweak the revert message as much as we want to make clear why this is happening, what the policy is, and what the consequences are. But the reality of the situation is that not reverting has doubled or tripled the amount of work needed to find and fix CI problems for years, and we have evidence of that (think of all the problems with ref leaks and crashes in subinterpreters in 3.11).

@pablogsal
Member Author

I'm also afraid of seeing tons of "revert" PRs created automatically, and someone will have to go through the list to decide which ones should be applied. Reviewing a revert PR requires fully understanding the issue, which can mean fixing the regression :-)

This won't happen: having all of the conditions be true at the same time (so that we are sure the PR is not a false positive) is actually quite rare. I ran a simulation, and across this year it would have filed 4 revert PRs.

@encukou
Member

encukou commented Oct 22, 2024

It's hard to say this when you have a working implementation ready, but: for 4 actions a year, I think this is too much code & infrastructure to maintain.

We do need to automate reverts, but the process needs a human in the loop in all but the simplest cases. I filed a plan here: python/core-workflow#559

@pablogsal
Member Author

pablogsal commented Oct 22, 2024

OK, I guess I will close the PR then. I do apologise if this sounds a bit negative, but I don't have more energy to keep defending my views on this, especially since, unfortunately, I am not feeling that this is a collaboration.

@pablogsal closed this on Oct 22, 2024