Update numerical verification for SPMD Linear checkpointing #9113

Open
wants to merge 1 commit into
base: master
from

Conversation

sdasgup3
Collaborator

@sdasgup3 sdasgup3 commented May 7, 2025

This PR tracks an issue where an internal TPU CI is failing on v5p hardware. A specific test fails its assertions at test_train_spmd_linear_model.py#L49 and test_train_spmd_linear_model.py#L51, with maximum absolute differences of 0.0042718649 and 0.0000191778 respectively.

The fix here is to update the corresponding atols.
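
For illustration only, here is a minimal sketch of how an absolute-tolerance check behaves around the first reported difference. It assumes the test compares results in the style of `torch.allclose` (`|a - b| <= atol + rtol * |b|`); the helper names, the sample values, and the two `atol` values are hypothetical and are not taken from the test file.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def within_atol(a, b, atol, rtol=0.0):
    """Element-wise check |a - b| <= atol + rtol * |b| (allclose-style)."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# Hypothetical outputs: the last element differs by the first difference
# reported in this PR (0.0042718649). The base values are made up.
baseline = [1.0, 2.0, 3.0]
checkpointed = [1.0, 2.0, 3.0 + 0.0042718649]

observed = max_abs_diff(baseline, checkpointed)

# Hypothetical tolerances: one below and one above the observed difference.
passes_tight = within_atol(baseline, checkpointed, atol=1e-3)
passes_loose = within_atol(baseline, checkpointed, atol=5e-3)

print(f"max abs diff: {observed:.10f}")
print(f"atol=1e-3 passes: {passes_tight}")
print(f"atol=5e-3 passes: {passes_loose}")
```

Any `atol` above the observed maximum makes the assertion pass, which is why the choice of value matters more than the mechanism.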

@sdasgup3 sdasgup3 requested a review from bhavya01 May 7, 2025 23:53
@sdasgup3 sdasgup3 added the CI CI related change label May 7, 2025
@bhavya01
Collaborator

bhavya01 commented May 8, 2025

I see that both results are produced on TPU, one with gradient checkpointing and one without. The results should be exactly the same.

Generally, we allow some tolerance when running on different hardware, but in this case I expected the results to be exactly the same. I think we should take a closer look at this problem; we might find that something is wrong with the way we checkpoint.
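
As a point of reference for that investigation, the sketch below shows one generic way two runs on the same hardware can stop being bit-identical: floating-point addition is not associative, so evaluating the same sum in a different order can change the result. This is only an illustration with made-up values; it is not a diagnosis of this test, and whether checkpointing changes operation order here is an open assumption.

```python
import math

# Made-up values chosen so that the effect is visible in float64.
values = [1e16, 1.0, -1e16, 1.0]

# Order 1: plain left-to-right accumulation.
left_to_right = 0.0
for v in values:
    left_to_right += v

# Order 2: the same numbers, grouped so the large terms cancel first.
reordered = (values[0] + values[2]) + (values[1] + values[3])

# Correctly rounded reference sum from the standard library.
reference = math.fsum(values)

bit_identical = left_to_right == reordered

print(left_to_right, reordered, reference, bit_identical)
```

If the two runs really execute the same operations in the same order, an exact comparison should hold, and a nonzero difference would point to a real discrepancy worth tracking down.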

3 participants