
Fix: ensure deepspeed.initialize can only be called once. #6848

Closed

Conversation

traincheck-team

Partially fixes #6772, #6771, and #6770 by forbidding repeated initialization.

  • Introduced a module-level variable _deepspeed_initialized to track initialization state.
  • Raised a RuntimeError if initialize is called more than once.
  • Ensured initialization logic is executed only on the first call.
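A minimal sketch of the guard described above, assuming the module-level flag is named `_deepspeed_initialized`; the real `deepspeed.initialize` signature and body are elided:

```python
_deepspeed_initialized = False

def initialize(*args, **kwargs):
    """Guarded entry point; real argument handling elided."""
    global _deepspeed_initialized
    if _deepspeed_initialized:
        raise RuntimeError(
            "deepspeed.initialize has already been called in this interpreter"
        )
    _deepspeed_initialized = True
    # ... the original initialization logic runs only on this first call ...
```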

This is a draft PR, as we have not yet included unit tests for it.

@tjruwase
We added a global variable to enforce that deepspeed.initialize can only be called once per interpreter.
However, this seems to interfere with existing tests that intentionally call this API multiple times:
(screenshot of the resulting test failures)

One solution might be to record the ids of the models and optimizers that have already been used, so the semantics are enforced only for individual models and optimizers. That could get trickier and more complicated, though, so we'd like feedback on how this should be implemented before proceeding; a rough sketch of the idea follows.
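A hypothetical sketch of that id-based alternative, for discussion only (`_seen_ids` and `_check_not_reinitialized` are illustrative names, not the actual implementation):

```python
# Hypothetical alternative: remember the ids of objects already passed in.
# Note that id() values can be reused once an object is garbage-collected,
# which is part of why this approach gets tricky.
_seen_ids = set()

def _check_not_reinitialized(model=None, optimizer=None):
    for kind, obj in (("model", model), ("optimizer", optimizer)):
        if obj is None:
            continue
        if id(obj) in _seen_ids:
            raise RuntimeError(
                f"this {kind} was already passed to deepspeed.initialize"
            )
        _seen_ids.add(id(obj))
```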

@tjruwase
Contributor

@traincheck-team, thanks for creating this PR so quickly. Unfortunately, it seems there is some miscommunication of the expectation. We want to maintain support for multiple deepspeed.initialize() calls, as it is a widely used feature, for example in DeepSpeed-Chat and Accelerate. What we want to prevent instead is repeated deepspeed.initialize() on a specific model, optimizer, lr_scheduler, etc. instance. So the deepspeed-specific attribute indicating a prior deepspeed.initialize() should be attached to those objects.

What do you think?
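A hedged sketch of that per-object check, assuming an attribute name of `_deepspeed_initialized` (the helper name and attribute are illustrative, not the final API):

```python
def _mark_initialized(obj, kind):
    # Attach a deepspeed-specific flag to the model / optimizer /
    # lr_scheduler instance itself, so the guard is per-object rather
    # than per-interpreter.
    if obj is None:
        return
    if getattr(obj, "_deepspeed_initialized", False):
        raise RuntimeError(
            f"deepspeed.initialize was already called on this {kind} instance"
        )
    obj._deepspeed_initialized = True
```

This works for `torch.nn.Module` and `torch.optim.Optimizer` instances, since both allow arbitrary attributes, and it leaves multiple deepspeed.initialize() calls on distinct objects untouched.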

@traincheck-team
Author

That makes sense and is clear. We will update the code soon.

Successfully merging this pull request may close these issues.

[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs