
Fix (scaling/standalone): better switch from runtime stats to param #1099

Open

wants to merge 7 commits into dev
Conversation

@Giuseppe5 Giuseppe5 commented Nov 21, 2024

Reason for this PR

Currently, if we switch from training to eval mode before stats_collection_steps is complete, we never update the value parameter to store the buffer value. This has a few side effects (illustrated in the sketch after this list):

  • When applying learned round, we might keep the model in eval mode but still accumulate gradients. Since the value parameter is not being used, no gradients are accumulated for it.
  • When exporting state_dict, value is not exported
  • When doing PTQ calibration, the current setup is such that the buffer is never converted to its corresponding value parameter, which causes some of the issues mentioned above.
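
To make the failure mode concrete, here is a minimal sketch (illustrative only, not the Brevitas source) of a scaling module that collects a runtime statistic into a buffer for `stats_collection_steps` training iterations and only then copies it into a learnable `value` parameter. The names and the abs-max statistic are assumptions chosen to mirror the behaviour described above; calling `.eval()` before collection finishes leaves `value` untouched.

```python
# Minimal sketch of the problematic behaviour (illustrative, not Brevitas code).
# `stats_collection_steps`, `buffer_value`, `value` and `init_done` are assumed
# names mirroring the behaviour described in this PR.
import torch
import torch.nn as nn


class RuntimeStatsToParamScaling(nn.Module):
    def __init__(self, stats_collection_steps: int = 30):
        super().__init__()
        self.stats_collection_steps = stats_collection_steps
        self.counter = 0
        self.init_done = False
        # Running statistic collected during the first training iterations.
        self.register_buffer('buffer_value', torch.tensor(1.0))
        # Learnable parameter that should eventually replace the buffer.
        self.value = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and not self.init_done:
            if self.counter < self.stats_collection_steps:
                # Update the running statistic (simplified to an abs-max here).
                self.buffer_value = torch.maximum(
                    self.buffer_value, x.detach().abs().max())
                self.counter += 1
                return self.buffer_value
            # Only after enough training steps is the buffer copied into the parameter.
            self.value.data.copy_(self.buffer_value)
            self.init_done = True
        # Issue: if .eval() is called before `stats_collection_steps` training
        # iterations, `init_done` stays False, `value` never receives the collected
        # statistic, and it neither accumulates gradients (learned round) nor holds
        # a meaningful value in the exported state_dict.
        return self.value if self.init_done else self.buffer_value
```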

Changes Made in this PR

At eval time, during the first iteration, the buffer is always converted to a parameter.
The side effect of this arises if the user wants to switch multiple times between training and evaluation mode very early in the training process. Although it is common to switch between training and eval to check the loss on the validation set, this is usually done after enough iterations that the buffer has already been converted to a parameter anyway. I would admit that this could be marked as a breaking change for that edge case.

This behaviour has been removed in a more recent commit, so I believe there are no more breaking changes at this point. After calibration, we now forcefully convert the buffer to its corresponding parameter; a sketch of the resulting behaviour follows below.
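
As a rough illustration of the fix (a sketch under the same assumed names as above, not the actual diff in standalone.py), the conversion can be triggered on the first eval-mode forward and forced explicitly at the end of PTQ calibration:

```python
# Sketch of the fixed behaviour, reusing the RuntimeStatsToParamScaling sketch
# above (illustrative only; the real change lives in
# src/brevitas/core/scaling/standalone.py).
import torch


class FixedRuntimeStatsToParamScaling(RuntimeStatsToParamScaling):
    def convert_buffer_to_param(self) -> None:
        # Forced buffer -> parameter conversion, e.g. called at the end of
        # PTQ calibration regardless of how many stats steps have run.
        self.value.data.copy_(self.buffer_value)
        self.init_done = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training and not self.init_done:
            # First eval-mode forward: convert immediately so that `value` is
            # meaningful in the state_dict and can accumulate gradients later.
            self.convert_buffer_to_param()
        return super().forward(x)
```

With this, switching to eval() after only a handful of training iterations still produces a usable value parameter, at the cost of freezing whatever statistic has been collected up to that point.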

Testing Summary

Risk Highlight

  • This PR includes code from another work (please detail).
  • This PR contains API-breaking changes.
  • This PR depends on work in another PR (please provide links/details).
  • This PR introduces new dependencies (please detail).
  • There are coverage gaps not covered by tests.
  • Documentation updates required in subsequent PR.

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.


@Giuseppe5 Giuseppe5 requested review from nickfraser and removed request for nickfraser November 21, 2024 14:57

@nickfraser nickfraser left a comment


The side effect of this happens in the case the user would want to switch multiple times between training/evaluation mode very early on in the training process.

Could the accuracy difference in the LLM tests be caused by this? I think the tests run at some very small seqlen (2?)

Otherwise, LGTM!

Fix
@Giuseppe5 Giuseppe5 requested a review from nickfraser December 16, 2024 17:04

@nickfraser nickfraser left a comment


Check comment about self.init_done, otherwise LGTM!

src/brevitas/core/scaling/standalone.py (review comment resolved)
