Rebase your PRs: Unstable CUDA signal in CI caused by cudnn 9 update #128221

Closed
atalman opened this issue Jun 7, 2024 · 1 comment
Labels
ci: sev critical failure affecting PyTorch CI

Comments

atalman commented Jun 7, 2024

NOTE: Remember to label this issue with "ci: sev"
Related to: #128180

Current Status

Mitigated. Remove cudnn failures on your PR by rebasing past https://hud2.pytorch.org/pytorch/pytorch/commit/54fe2d0e89e1d7c64c1fb2ab120e966a750aff4d

Error looks like

Failures on CI related to cudnn

Incident timeline (all times pacific)

Update Try 1 (reverted):
Jun 4, 2024, 8:55 AM PST - Builder PR Merged
Jun 4, 2024, 9:33 AM PST - pytorch/pytorch PR Merged
Jun 5, 2024, 1:59 AM PST - pytorch/pytorch PR Reverted
Jun 5, 2024, 4:24 AM PST - Builder PR Reverted
Update Try 2 (landed):
Jun 6, 2024, 11:11 PM PST - Builder PR Landed
Jun 6, 2024, 11:45 PM PST - pytorch/pytorch PR Merged
Jun 6, 2024, 2:43 PM PST - Followup fix for qlinear failure landed

User impact

How does this affect users of PyTorch CI?

Root cause

Update to cudnn 9

Mitigation

Mitigated, rebase past: https://hud2.pytorch.org/pytorch/pytorch/commit/54fe2d0e89e1d7c64c1fb2ab120e966a750aff4d
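
To check whether a local branch already sits past the fix commit, something like the sketch below could be used. It is not part of PyTorch tooling: `contains_commit` and its `run` parameter are hypothetical names introduced here for illustration, and it assumes `git` is on the PATH and the script is run inside a pytorch/pytorch checkout. It relies on `git merge-base --is-ancestor`, which exits with 0 when the first commit is an ancestor of the second and 1 when it is not.

```python
import subprocess

# Fix commit quoted in the Mitigation section of this issue.
FIX_COMMIT = "54fe2d0e89e1d7c64c1fb2ab120e966a750aff4d"


def contains_commit(commit, ref="HEAD", repo=".", run=subprocess.run):
    """Return True if `commit` is an ancestor of `ref` in the repo at `repo`.

    Hypothetical helper, not a PyTorch API. `run` is injectable so the
    function can be exercised without a real git checkout.
    """
    result = run(
        ["git", "merge-base", "--is-ancestor", commit, ref],
        cwd=repo,
        capture_output=True,
    )
    if result.returncode == 0:
        return True   # branch already includes the commit
    if result.returncode == 1:
        return False  # branch predates the commit: rebase needed
    # Any other exit code means git itself failed (e.g. unknown commit).
    stderr = result.stderr.decode(errors="replace").strip()
    raise RuntimeError(f"git merge-base failed: {stderr}")
```

Calling `contains_commit(FIX_COMMIT)` from the checkout returns `False` when the PR branch still needs a rebase. In that case, fetching the latest `main` and rebasing onto it (for example `git fetch origin main` then `git rebase origin/main`, assuming `origin` points at pytorch/pytorch) should move the branch past the fix.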

Prevention/followups

To prevent this in the future, we need to address this issue: pytorch/builder#1849

@atalman atalman added the ci: sev critical failure affecting PyTorch CI label Jun 7, 2024
@ZainRizvi ZainRizvi changed the title Unstable CUDA signal in CI caused by cudnn 9 update Rebase your PRs: Unstable CUDA signal in CI caused by cudnn 9 update Jun 7, 2024

atalman commented Jun 10, 2024

Closing this now. It has been 3+ days since the merge of hud2.pytorch.org/pytorch/pytorch/commit/54fe2d0e89e1d7c64c1fb2ab120e966a750aff4d

@atalman atalman closed this as completed Jun 10, 2024