Mitigate Dr.CI stale comments on PyTorch PRs (#5963)
This change fixes a couple of issues with the workflow that refreshes Dr.CI results for all open PRs. The key takeaway is that the cost of this API call scales with the number of open pull requests on a repo, and on PyTorch it now takes longer than 120 seconds to finish. When that limit is reached, the Vercel function (AWS Lambda) terminates the execution and all PRs still in the queue are dropped, so their Dr.CI comments go stale.

Here is an example of the failure: https://github.com/pytorch/test-infra/actions/runs/11943802339/job/33293533522. The error is FUNCTION_INVOCATION_TIMEOUT (https://github.com/pytorch/test-infra/actions/runs/11964503897/job/33356932041#step:3:136), and it stops at exactly 2 minutes. The limit is configured at https://vercel.com/fbopensource/torchci/settings/functions.

* Follow https://vercel.com/docs/functions/configuring-functions/duration to increase the max duration to 900 seconds, the maximum for an enterprise account as defined at https://vercel.com/docs/functions/runtimes#max-duration.
* Also fix a bug in `isTime0` where the value is now NaN instead of 0. This may be related to our recent next.js upgrade #5756.
* Refactor the workflow to get rid of a lot of duplicated code.
* Also surface the failure better via curl, as the current command returns successfully and masks the failure, e.g. https://github.com/pytorch/test-infra/actions/runs/11964503897/job/33356932041#step:3:136

A final note: while debugging, I saw the new failure below show up intermittently. I'll take a look at it in another PR as it doesn't happen frequently (although it also causes the Dr.CI comment on the PR in question to go stale temporarily).

```
Failed to update PR 139760
Error: Client network socket disconnected before secure TLS connection was established
    at TLSSocket.onConnectEnd (node:_tls_wrap:1732:19)
    at TLSSocket.emit (node:events:525:35)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:90:21) {
  code: 'ECONNRESET',
  path: null,
  host: 'hyt81izu0c.us-east-1.aws.clickhouse.cloud',
  port: 8443,
  localAddress: undefined
}
```

### Testing

```
time curl --request POST \
  --url 'https://torchci-git-address-drci-refresh-issue-fbopensource.vercel.app/api/drci/drci' \
  --header 'Authorization: REDACT' \
  --data 'repo=pytorch' \
  --silent --output /dev/null --show-error --fail
```

This now returns 200 OK even when the run takes more than 3 minutes (3:12.56 total); it returned 504 before.
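
To illustrate the `isTime0` bug listed above: a check that compares a parsed timestamp against 0 silently returns false once the parsed value becomes NaN, because `NaN === 0` is false. The following is only a minimal sketch of that failure mode and one way to guard against it. The name `isTime0Sketch`, its signature, and the use of the built-in `Date` parser are assumptions for illustration; the actual helper in torchci may be implemented differently.

```typescript
// Hypothetical stand-in for torchci's isTime0; not the real implementation.
// Treats a timestamp as "time 0" when it is missing, unparseable (NaN),
// or equal to the Unix epoch.
function isTime0Sketch(time: string | null | undefined): boolean {
  if (time === null || time === undefined || time === "") {
    return true;
  }
  const millis = new Date(time).getTime();
  // An unparseable string yields NaN, and NaN === 0 is false, so a bare
  // `millis === 0` check would miss it. Check for NaN explicitly.
  return Number.isNaN(millis) || millis === 0;
}

console.log(isTime0Sketch("1970-01-01T00:00:00Z")); // true
console.log(isTime0Sketch("not a timestamp")); // true
console.log(isTime0Sketch("2024-11-21T00:00:00Z")); // false
```

Whether an unparseable value should count as "time 0" is a design choice; the sketch assumes it should, since that matches the behavior before the value started coming back as NaN.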