Mitigate Dr.CI stale comments on PyTorch PRs (#5963)
This change fixes a couple of issues with the workflow that refreshes
Dr.CI results for all open PRs. The key takeaway is that the duration of
this API call scales with the number of open pull requests on a repo, and
on PyTorch it now takes longer than 120 seconds to finish. When that limit
is reached, the Vercel function (AWS Lambda) terminates the execution and
all PRs still in the queue are dropped, so their Dr.CI comments become
stale.

Here is an example of the failure:
https://github.com/pytorch/test-infra/actions/runs/11943802339/job/33293533522.
The error is FUNCTION_INVOCATION_TIMEOUT
(https://github.com/pytorch/test-infra/actions/runs/11964503897/job/33356932041#step:3:136),
and the run stops at exactly 2 minutes. The limit is defined at
https://vercel.com/fbopensource/torchci/settings/functions.

* Follow
https://vercel.com/docs/functions/configuring-functions/duration to
increase the max duration to 900 seconds, the maximum for an enterprise
account as defined at
https://vercel.com/docs/functions/runtimes#max-duration.
* Also fix a bug in `isTime0` where the value is now NaN instead of 0;
this may be related to our recent Next.js upgrade
#5756.
* Refactor the workflow to get rid of lots of duplicated code.
* Also surface the failure better via curl, as the current command
returns successfully and masks the failure, e.g.
https://github.com/pytorch/test-infra/actions/runs/11964503897/job/33356932041#step:3:136
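
To illustrate the `isTime0` item in the list above, here is a minimal sketch. It assumes the built-in `Date.parse` as a stand-in for `dayjs.utc(time).valueOf()` so that it has no dependencies, and both function names are hypothetical, not the code in this commit. The point is that an empty string parses to NaN, and `NaN == 0` is false, so a plain comparison with 0 misses that case.

```typescript
// Hypothetical stand-in for isTime0: Date.parse replaces
// dayjs.utc(time).valueOf() only to keep the sketch dependency-free.
function isTime0Sketch(time: string): boolean {
  const v = Date.parse(time);
  // An empty or unparseable string yields NaN, and NaN == 0 is false,
  // so NaN has to be checked explicitly.
  return isNaN(v) || v === 0;
}

// Old-style check, kept for comparison: it misses the NaN case.
function isTime0Old(time: string): boolean {
  return Date.parse(time) == 0;
}
```

With this sketch, an empty timestamp is treated as time 0 by `isTime0Sketch` but not by `isTime0Old`, which is the behavior difference the fix targets.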

A final note: while debugging, I saw this new failure show up flakily
from time to time. I'll take a look at it in another PR as it doesn't
happen frequently (although it also causes the Dr.CI comment on the PR
in question to go stale temporarily):

```
Failed to update PR 139760 Error: Client network socket disconnected before secure TLS connection was established
    at TLSSocket.onConnectEnd (node:_tls_wrap:1732:19)
    at TLSSocket.emit (node:events:525:35)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:90:21) {
  code: 'ECONNRESET',
  path: null,
  host: 'hyt81izu0c.us-east-1.aws.clickhouse.cloud',
  port: 8443,
  localAddress: undefined
}
```

### Testing

```
time curl --request POST \
  --url 'https://torchci-git-address-drci-refresh-issue-fbopensource.vercel.app/api/drci/drci' \
  --header 'Authorization: REDACT' \
  --data 'repo=pytorch' \
  --silent --output /dev/null --show-error --fail
```

This returns 200 OK now even when the runtime is 3+ minutes (3:12.56 total);
it was 504 before.
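
As a rough model of why `--fail` matters for the command above: without it, curl exits with 0 once it has received a response, even a 504, so the workflow step passes. With `--fail`, curl typically exits with code 22 when the HTTP status is 400 or above. The helper below is a hypothetical illustration of that rule, not part of this commit.

```typescript
// Hypothetical model of curl's exit status for an HTTP response.
// With --fail, a status of 400 or above maps to exit code 22;
// otherwise a received response counts as success (exit code 0).
function curlExitCode(httpStatus: number, failFlag: boolean): number {
  return failFlag && httpStatus >= 400 ? 22 : 0;
}
```

Under this model, the old command maps the 504 timeout to exit code 0 and the new one maps it to 22, which is what makes the GitHub Actions step fail visibly.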
huydhn authored Nov 22, 2024
1 parent f12c0d4 commit f60bd2e
Showing 3 changed files with 31 additions and 75 deletions.
95 changes: 21 additions & 74 deletions .github/workflows/update-drci-comments.yml
@@ -10,83 +10,30 @@ on:
 jobs:
   update-drci-comments:
     runs-on: ubuntu-22.04
+    strategy:
+      fail-fast: false
+      matrix:
+        repo: [
+          ao,
+          audio,
+          data,
+          executorch,
+          pytorch,
+          rl,
+          text,
+          torchchat,
+          torchtune,
+          tutorials,
+          vision,
+        ]
     steps:
       - name: Checkout
         uses: actions/checkout@v4

-      - name: Retrieve rockset query results and update Dr. CI comments for the PyTorch repo
+      - name: Retrieve rockset query results and update Dr. CI comments
         run: |
           curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=pytorch'
-      - name: Retrieve rockset query results and update Dr. CI comments for the Vision repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=vision'
-      - name: Retrieve rockset query results and update Dr. CI comments for the Text repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=text'
-      - name: Retrieve rockset query results and update Dr. CI comments for the Data repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=data'
-      - name: Retrieve rockset query results and update Dr. CI comments for the Audio repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=audio'
-      - name: Retrieve rockset query results and update Dr. CI comments for the Tutorials repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=tutorials'
-      - name: Retrieve the Rockset query results and update Dr. CI comments for the ExecuTorch repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=executorch'
-      - name: Retrieve the Rockset query results and update Dr. CI comments for the RL repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=rl'
-      - name: Retrieve the Rockset query results and update Dr. CI comments for the TorchTune repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=torchtune'
-      - name: Retrieve the Rockset query results and update Dr. CI comments for the torchao repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=ao'
-      - name: Retrieve the Rockset query results and update Dr. CI comments for the torchchat repo
-        run: |
-          curl --request POST \
-          --url 'https://www.torch-ci.com/api/drci/drci' \
-          --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
-          --data 'repo=torchchat'
+            --url 'https://www.torch-ci.com/api/drci/drci' \
+            --header 'Authorization: ${{ secrets.DRCI_BOT_KEY }}' \
+            --data 'repo=${{ matrix.repo }}' \
+            --silent --output /dev/null --show-error --fail
4 changes: 3 additions & 1 deletion torchci/lib/bot/utils.ts
@@ -4,7 +4,9 @@ import { Context, Probot } from "probot";
 import urllib from "urllib";

 export function isTime0(time: string): boolean {
-  return dayjs.utc(time).valueOf() == 0;
+  const v = dayjs.utc(time).valueOf();
+  // NB: This returns NaN when the string is empty
+  return isNaN(v) || v === 0;
 }

 export const TIME_0 = "1970-01-01 00:00:00.000000000";
7 changes: 7 additions & 0 deletions torchci/pages/api/drci/drci.ts
@@ -60,6 +60,13 @@ export interface UpdateCommentBody {
   repo: string;
 }

+// Attempt to set the maxDuration of this serveless function on Vercel https://vercel.com/docs/functions/configuring-functions/duration,
+// also according to https://vercel.com/docs/functions/runtimes#max-duration, the max duration
+// for an enterprise account is 900
+export const config = {
+  maxDuration: 900,
+};
+
 export default async function handler(
   req: NextApiRequest,
   res: NextApiResponse<{