Question about fault tolerance threshold (f) and zero3 #27
Comments
Hi @lhy101 ! Thank you for your interest in Oobleck. Re: fault tolerance threshold, please refer to: Oobleck/oobleck/engine/plugin.py Line 47 in 9d4e3b1
Re: Zero3, first of all, Zero3 is no longer used after the refactoring; traditional 3D parallelism (DP+TP+PP) is used instead, where DP provides redundancy. Second, when Zero3 was used, it was not an alternative to DP but to TP. So Zero3 + PP + DP was used, and the outermost DP provided redundancy.
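To illustrate why the outermost DP dimension provides redundancy, here is a minimal sketch. It is not Oobleck code: the node names, the layer-to-node layout, and the helper `is_recoverable` are all hypothetical. The idea is only that model state survives a failure as long as every layer is still held by at least one surviving node.

```python
def surviving_layers(pipelines, failed_nodes):
    """Collect the layer indices still held by some non-failed node.

    pipelines: list of dicts mapping node id -> set of layer indices
               (hypothetical layout, one dict per data-parallel replica).
    """
    alive = set()
    for pipeline in pipelines:
        for node, layers in pipeline.items():
            if node not in failed_nodes:
                alive |= layers
    return alive


def is_recoverable(pipelines, failed_nodes, num_layers):
    """True if every layer has at least one surviving copy."""
    return surviving_layers(pipelines, failed_nodes) == set(range(num_layers))


# Two data-parallel replicas, each a 2-stage pipeline over 8 layers (assumed layout).
replicated = [
    {"node0": set(range(0, 4)), "node1": set(range(4, 8))},
    {"node2": set(range(0, 4)), "node3": set(range(4, 8))},
]

# A single sharded copy with no outer DP replica: each node holds a unique slice.
sharded_only = [
    {"node0": {0, 1}, "node1": {2, 3}, "node2": {4, 5}, "node3": {6, 7}},
]

print(is_recoverable(replicated, {"node1"}, 8))           # copy of layers 4-7 remains on node3
print(is_recoverable(replicated, {"node1", "node3"}, 8))  # both copies of layers 4-7 are gone
print(is_recoverable(sharded_only, {"node1"}, 8))         # unique slice lost
```

In the sharded-only layout any single failure loses state, which is the concern raised in the question; with an outer DP replica, a layer is lost only when all nodes holding it fail together.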
Thank you for your explanation, I understand now! I also have a practical question and hope you can take some time to help me with it. My experimental setup consists of 4 machines, each equipped with 8 A800 GPUs (80GB), and the model size is 32B. I am using a configuration with
With this setup, the training runs successfully, and the generated pipeline templates are as follows:
Based on my understanding of the log, all supported configurations are as follows:
Next, I shut down
It seems like the system is trying to find a pipeline with only 1 stage, but such a pipeline is not generated (perhaps due to high memory usage by the model?).
This worked fine. However, transitioning from the first scenario to the second scenario via reconfiguration seems to cause problems.
The paper and our early version of the code included node borrowing and pipeline merging; in this case, a pipeline with 1 node would be merged with another to form a 3-node pipeline. During refactoring the feature was removed due to incompatibility with the new framework structure, and the related issue #23 is still open. Sorry for the inconvenience, but for now the feature is not provided. It should still work if the new pipeline configuration is in the initial set of pipeline templates, as in your second experiment.
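As a rough way to reason about which node counts can be reached without pipeline merging, the sketch below checks whether the remaining nodes can be split exactly into pipelines whose sizes all appear in the initial template set. This is an assumption-laden illustration, not Oobleck's planner: `cover_nodes` and the template sizes are hypothetical, and the real system chooses among templates by more than feasibility.

```python
def cover_nodes(num_nodes, template_sizes):
    """Return a list of template sizes (nodes per pipeline) summing to
    num_nodes, or None if no exact cover exists.

    template_sizes: node counts of the initially generated pipeline
                    templates (hypothetical values in the examples below).
    """
    # cover[t] holds one way to cover exactly t nodes.
    cover = {0: []}
    for total in range(1, num_nodes + 1):
        for size in sorted(template_sizes):
            prev = total - size
            if prev in cover:
                cover[total] = cover[prev] + [size]
                break
    return cover.get(num_nodes)


# Assumed template set with 2-node and 4-node pipelines only.
print(cover_nodes(4, [2, 4]))     # 4 nodes can be covered
print(cover_nodes(3, [2, 4]))     # 3 nodes would need a 1-node pipeline

# If a 3-node template was generated up front, losing one node is feasible.
print(cover_nodes(3, [2, 3, 4]))
```

Under these assumptions, a drop from 4 to 3 nodes is reachable only when a 3-node template exists from the start; otherwise the leftover single node has no template to run, which matches the failure described above.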
Thanks for clearing that up. I really appreciate your help!
Hi @insujang,
Thank you for open-sourcing Oobleck—it’s an impressive piece of work!
I noticed in the paper that there is a parameter f that controls the fault tolerance threshold. However, I couldn’t find it in the codebase. Is there a way to configure or control this parameter? Additionally, is there a default value set for f?
Another question I have is that I noticed Zero3 is being used. In this case, each GPU should hold a unique model slice for optimizer states (as is the case with traditional Zero3). If one node fails, the corresponding Zero3 slice would also be lost. How can this be recovered? If my understanding is incorrect, please feel free to point it out.
Looking forward to your response. Thanks again for your contributions!