Mitigate Dr.CI stale comments on PyTorch PRs #5963

Merged on Nov 22, 2024 (6 commits)

@huydhn huydhn commented Nov 22, 2024

This change fixes a couple of issues with the workflow that refreshes Dr.CI results for all open PRs. The key takeaway is that the runtime of this API call scales with the number of open pull requests in a repo, and on PyTorch it now takes longer than 120 seconds to finish. When that limit is reached, the Vercel function (AWS Lambda) terminates the execution and all PRs still in the queue are dropped, so their Dr.CI comments become stale.
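To illustrate the failure mode, here is a minimal Python sketch (not the actual Dr.CI code) of a refresh loop that tracks a time budget and reports which PRs were left unprocessed, rather than being cut off silently. The names `refresh_with_budget` and `update_pr`, the `clock` parameter, and the 10-second safety margin are all hypothetical; only the 120-second figure comes from the description above.

```python
import time
from typing import Callable, Iterable, List, Tuple


def refresh_with_budget(
    pr_numbers: Iterable[int],
    update_pr: Callable[[int], None],  # hypothetical per-PR update function
    budget_seconds: float = 120.0,     # the function limit described above
    safety_margin: float = 10.0,       # assumed headroom before the hard limit
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[int], List[int]]:
    """Update PRs until the time budget is nearly spent.

    Returns (updated, skipped) so the caller can log or requeue the PRs
    that would otherwise be dropped when the platform kills the function.
    """
    deadline = clock() + budget_seconds - safety_margin
    updated: List[int] = []
    skipped: List[int] = []
    for number in pr_numbers:
        if clock() >= deadline:
            # Out of time: record the PR instead of starting more work.
            skipped.append(number)
            continue
        update_pr(number)
        updated.append(number)
    return updated, skipped
```

Returning the skipped list makes the truncation visible in logs, which is useful when deciding whether the runtime grows with the number of open PRs.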

Here is an example of the failure: https://github.com/pytorch/test-infra/actions/runs/11943802339/job/33293533522. The error is FUNCTION_INVOCATION_TIMEOUT (https://github.com/pytorch/test-infra/actions/runs/11964503897/job/33356932041#step:3:136), and the run stops at exactly 2 minutes. That limit is configured at https://vercel.com/fbopensource/torchci/settings/functions.

One final note: while debugging, I saw a new failure that shows up intermittently. I'll look into it in a separate PR since it doesn't happen often, although it also causes the Dr.CI comment on the affected PR to go stale temporarily:

Failed to update PR 139760 Error: Client network socket disconnected before secure TLS connection was established
    at TLSSocket.onConnectEnd (node:_tls_wrap:1732:19)
    at TLSSocket.emit (node:events:525:35)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:90:21) {
  code: 'ECONNRESET',
  path: null,
  host: 'hyt81izu0c.us-east-1.aws.clickhouse.cloud',
  port: 8443,
  localAddress: undefined
}
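
One common way to tolerate a transient connection reset like the one above is to retry with exponential backoff. The following Python sketch is only an illustration under that assumption, not necessarily what the follow-up PR will do; `with_retries`, its default values, and the injected `sleep` parameter are hypothetical. Python raises `ConnectionResetError` for `ECONNRESET`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,          # assumed retry count
    base_delay: float = 0.5,    # assumed initial delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on connection resets with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionResetError:
            if attempt == attempts - 1:
                # Last attempt failed: surface the error to the caller.
                raise
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Note that retries add to the total runtime, so they would have to fit inside whatever time limit applies to the function.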

Testing

time curl --request POST \
  --url 'https://torchci-git-address-drci-refresh-issue-fbopensource.vercel.app/api/drci/drci' \
  --header 'Authorization: REDACT' \
  --data 'repo=pytorch' \
  --silent --output /dev/null --show-error --fail

The request now returns 200 OK even when the runtime exceeds 3 minutes (3:12.56 total); it returned 504 before.

@huydhn huydhn requested review from malfet, clee2000, yangw-dev and a team November 22, 2024 01:52

vercel bot commented Nov 22, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| torchci | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Nov 22, 2024 2:40am |

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 22, 2024
@huydhn huydhn changed the title from "Fix Dr.CI stale comments on PyTorch PRs" to "Mitigate Dr.CI stale comments on PyTorch PRs" Nov 22, 2024

huydhn commented Nov 22, 2024

cc @clee2000 I think we should profile the drci call to see whether the overall runtime can be improved, or whether the increase is only due to the higher number of open PRs on PyTorch. Running the workflow more often in #5956 helps a bit when the list of pull requests is shuffled, but there is no guarantee.
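
As a rough illustration of why shuffling helps but gives no guarantee, here is a small Python sketch. It assumes each run processes a uniformly random subset of the open PRs, covering the same fraction every time and independently of earlier runs; the function names and the fraction used in the note below are hypothetical, not measured values.

```python
import random
from typing import List, Optional, Sequence


def shuffled_order(pr_numbers: Sequence[int], seed: Optional[int] = None) -> List[int]:
    """Return the PR numbers in random order without mutating the input."""
    rng = random.Random(seed)
    order = list(pr_numbers)
    rng.shuffle(order)
    return order


def miss_probability(covered_fraction: float, runs: int) -> float:
    """Chance that a given PR is skipped in every one of `runs` shuffled runs.

    Assumes each run independently covers `covered_fraction` of the PRs.
    """
    return (1.0 - covered_fraction) ** runs
```

Under these assumptions, a run that covers half of the PRs leaves a given PR stale after three consecutive runs with probability `miss_probability(0.5, 3)`, which is 0.125. The probability shrinks with more runs but never reaches zero.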

@huydhn huydhn merged commit f60bd2e into main Nov 22, 2024
7 checks passed
@huydhn huydhn deleted the address-drci-refresh-issue branch November 22, 2024 02:53