Mitigate Dr.CI stale comments on PyTorch PRs #5963
Merged
This change fixes a couple of issues with the workflow that refreshes Dr.CI results for all open PRs. The key takeaway is that this API call scales with the number of open pull requests on a repo, and on PyTorch it now takes longer than 120 seconds to finish. When that limit is reached, the Vercel function (AWS Lambda) terminates the execution and all PRs still in the queue are dropped, so their Dr.CI comments become stale.
Here is an example of the failure: https://github.com/pytorch/test-infra/actions/runs/11943802339/job/33293533522. The error is FUNCTION_INVOCATION_TIMEOUT (https://github.com/pytorch/test-infra/actions/runs/11964503897/job/33356932041#step:3:136), and the run stops at 2 minutes sharp. The limit is defined at https://vercel.com/fbopensource/torchci/settings/functions.
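To make the failure mode concrete, here is a minimal sketch of a refresh loop that runs against a fixed time budget. It is only an illustration under stated assumptions: `refreshWithinBudget`, `RefreshResult`, and the injected clock are hypothetical names, not code from torchci, and the 50-second cost per PR is made up so the example runs deterministically.

```typescript
// Hypothetical sketch: none of these names come from the torchci codebase.
type Clock = () => number; // returns elapsed milliseconds

interface RefreshResult {
  processed: number[];
  dropped: number[];
}

// Refresh PRs one at a time until the time budget is exhausted.
// Anything still queued at that point is reported as dropped, which
// mirrors what happens when the platform terminates a long-running function.
function refreshWithinBudget(
  prNumbers: number[],
  budgetMs: number,
  refreshOne: (pr: number) => void,
  now: Clock = Date.now
): RefreshResult {
  const start = now();
  const queue = [...prNumbers];
  const processed: number[] = [];
  while (queue.length > 0) {
    if (now() - start >= budgetMs) {
      // Budget exceeded: the rest of the queue never gets refreshed.
      return { processed, dropped: queue };
    }
    const pr = queue.shift() as number;
    refreshOne(pr);
    processed.push(pr);
  }
  return { processed, dropped: [] };
}

// Simulated run: each refresh "costs" 50 s against a 120 s budget.
let fakeTime = 0;
const result = refreshWithinBudget(
  [101, 102, 103, 104],
  120000,
  () => {
    fakeTime += 50000;
  },
  () => fakeTime
);
console.log(result);
```

In this simulated run the first three PRs are refreshed and the fourth is dropped, because the budget is already spent when the loop reaches it. The more open PRs there are, the longer the dropped tail becomes.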
isTime0, where the value is now NaN instead of 0. Maybe this is related to our recent next.js upgrade (Upgrade next to 14.2.16 #5756).

A final note: while debugging, I saw this new failure show up flakily from time to time. I'll take a look at it in another PR as it doesn't happen frequently (although it also causes the Dr.CI comment on the PR in question to go stale temporarily).
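On the isTime0 point, one way to guard a "time is zero" check against NaN can be sketched as below. This is an assumption-laden illustration, not the torchci implementation: `isUnsetTime` is a hypothetical helper, and it assumes the timestamp arrives as a string parsed with the built-in `Date`.

```typescript
// Hypothetical helper, not the torchci implementation of isTime0.
// Treats both the epoch (0) and an unparsable timestamp (NaN) as "unset".
function isUnsetTime(time: string): boolean {
  const ms = new Date(time).getTime(); // NaN when the string cannot be parsed
  return ms === 0 || isNaN(ms);
}

console.log(isUnsetTime("1970-01-01T00:00:00.000Z")); // true
console.log(isUnsetTime("not a timestamp")); // true
console.log(isUnsetTime("2024-11-21T00:00:00.000Z")); // false
```

A strict comparison against 0 alone would miss the NaN case, since NaN is never equal to anything, including itself.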
Testing
The API now returns 200 OK even when the runtime is 3+ minutes (3:12.56 total); it returned 504 before.