
Question about fault tolerance threshold (f) and zero3 #27

Open · lhy101 opened this issue Dec 24, 2024 · 4 comments

lhy101 commented Dec 24, 2024

Hi @insujang,

Thank you for open-sourcing Oobleck—it’s an impressive piece of work!

I noticed in the paper that there is a parameter f that controls the fault tolerance threshold. However, I couldn’t find it in the codebase. Is there a way to configure or control this parameter? Additionally, is there a default value set for f?

Another question: I noticed that Zero3 is being used. In that case, each GPU should hold a unique model slice for optimizer states (as in traditional Zero3). If one node fails, the corresponding Zero3 slice would also be lost. How can it be recovered? If my understanding is incorrect, please feel free to point it out.

Looking forward to your response. Thanks again for your contributions!

insujang (Member) commented

Hi @lhy101 ! Thank you for your interest in Oobleck.

Re: fault tolerance threshold, please refer to:

fault_tolerance_threshold: int = 3,

Re: Zero3, first of all, Zero3 is no longer used after refactoring; traditional 3D parallelism (DP+TP+PP) is used instead, where DP provides redundancy. Second, when Zero3 was used, it was not an alternative to DP but to TP. So Zero3 + PP + DP was used, and the outermost DP provided redundancy.
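
To illustrate the layout (a minimal sketch, not Oobleck's actual process-group code; the function and numbers are made up for illustration): with DP as the outermost dimension, each DP replica holds a complete copy of the model, so losing a node breaks at most one replica while the others retain full parameter and optimizer state.

def build_rank_groups(world_size: int, dp: int, pp: int, shard: int):
    """Partition ranks into dp replicas, each with pp stages of `shard`
    ranks; the innermost dimension is TP (or ZeRO-3 in the old design)."""
    assert world_size == dp * pp * shard
    replicas = []
    for d in range(dp):
        stages = []
        for p in range(pp):
            base = (d * pp + p) * shard
            stages.append(list(range(base, base + shard)))
        replicas.append(stages)
    return replicas

# 16 ranks, dp=2, pp=2, shard=4: any single failure leaves one intact replica.
for replica in build_rank_groups(16, dp=2, pp=2, shard=4):
    print(replica)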

lhy101 (Author) commented Dec 24, 2024

Thank you for your explanation; I understand now! I also have a practical question and hope you can take some time to help me with it. My experimental setup consists of 4 machines, each equipped with 8 A800 GPUs (80GB), and the model size is 32B. I am using a configuration with tp=4, so my hostfile looks like this:

30.207.99.20 slots=4 devices=0,1,2,3 port=22
30.207.99.20 slots=4 devices=4,5,6,7 port=22
30.207.99.21 slots=4 devices=0,1,2,3 port=22
30.207.99.21 slots=4 devices=4,5,6,7 port=22
30.207.99.22 slots=4 devices=0,1,2,3 port=22
30.207.99.22 slots=4 devices=4,5,6,7 port=22
30.207.99.23 slots=4 devices=0,1,2,3 port=22
30.207.99.23 slots=4 devices=4,5,6,7 port=22

With this setup, the training runs successfully, and the generated pipeline templates are as follows:

2024-12-24 11:27:33.385 | DEBUG | oobleck.engine.execution_engine:prepare:151 - Pipeline templates: {
    2: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages),
    3: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages),
    4: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages),
    5: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages),
    6: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages),
    7: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 7 stages),
    8: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 8 stages)
}

Based on my understanding of the log, all supported configurations are as follows:

2024-12-24 11:27:33.385 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:94 - Enumerating all feasible sets of pipeline templates for 8 nodes.
2024-12-24 11:27:33.386 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:121 - Dynamic programming result: [
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 4}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 2}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 2, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 1}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 2}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages): 1}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages): 1}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 8 stages): 1})
]
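
If I read the log correctly, these feasible sets are exactly the ways of splitting 8 nodes across pipelines whose sizes have a template (2-8 stages here). A brute-force sketch of that enumeration (my own illustration; the actual code in pipeline_instantiator.py uses dynamic programming):

def feasible_sets(num_nodes, sizes, smallest=0):
    """Yield every non-decreasing multiset of pipeline sizes that sums
    to num_nodes, using only sizes for which a template exists."""
    if num_nodes == 0:
        yield []
        return
    for s in sizes:
        if smallest <= s <= num_nodes:
            for rest in feasible_sets(num_nodes - s, sizes, s):
                yield [s] + rest

for combo in feasible_sets(8, sizes=range(2, 9)):
    print(combo)
# -> [2, 2, 2, 2], [2, 2, 4], [2, 3, 3], [2, 6], [3, 5], [4, 4], [8]
# i.e. the same 7 sets as in the dynamic programming result above.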

Next, I shut down 30.207.99.23 (i.e., the last two tp4 groups). In theory, if the default f is 3, reconfiguration should work. However, I encountered the following error:

File "/jizhicfs/hymiezhao/lhy/Oobleck/examples/run_gpt2.py", line 152, in main model, optimizer, dataloader = engine.reconfigure( File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/execution_engine.py", line 289, in reconfigure model, optimizer, dataloader, _ = self.plugin.reconfigure( File "/jizhicfs/hymiezhao/miniconda3/envs/Oobleck_new/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 228, in reconfigure new_pipelines, new_num_microbatches = self._instantiate_pipelines( File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 125, in _instantiate_pipelines pipelines = [ File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 126, in <listcomp> pipeline_templates[num_stages] KeyError: 1

It seems like the system is trying to find a pipeline with only 1 stage, but such a pipeline is not generated (perhaps due to high memory usage by the model?).
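
To make the failure concrete (a hypothetical reconstruction on my part; I don't know which instantiation the planner actually chose): if the chosen set was four 2-stage pipelines, shutting down 30.207.99.23 leaves two pipelines with a single surviving node each, and the template lookup then fails exactly as in the traceback:

# Hypothetical reconstruction of the KeyError, not Oobleck's actual code path.
# Templates only exist for 2..8 stages, so a pipeline shrunk to one node
# has no matching entry.
pipeline_templates = {n: f"PipelineTemplate({n} stages)" for n in range(2, 9)}

# Four 2-stage pipelines; the two failed nodes belonged to two of them:
surviving_nodes_per_pipeline = [2, 2, 1, 1]

pipelines = [
    pipeline_templates[num_stages]  # raises KeyError: 1, as in the log
    for num_stages in surviving_nodes_per_pipeline
]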
Additionally, I tried starting the training with only the remaining nodes from the beginning:

30.207.99.20 slots=4 devices=0,1,2,3 port=22
30.207.99.20 slots=4 devices=4,5,6,7 port=22
30.207.99.21 slots=4 devices=0,1,2,3 port=22
30.207.99.21 slots=4 devices=4,5,6,7 port=22
30.207.99.22 slots=4 devices=0,1,2,3 port=22
30.207.99.22 slots=4 devices=4,5,6,7 port=22

2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.execution_engine:prepare:151 - Pipeline templates: {
    2: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages),
    3: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages),
    4: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages),
    5: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages),
    6: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages)
}
2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:94 - Enumerating all feasible sets of pipeline templates for 6 nodes.
2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:121 - Dynamic programming result: [
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 3}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 2}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 1}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages): 1})
]

This worked fine. However, transitioning from the first scenario to the second scenario via reconfiguration seems to cause problems.
I would greatly appreciate it if you could provide further clarification or guidance on this issue at your convenience. Thank you once again for your assistance!

insujang (Member) commented

The paper and our early version of the code included node borrowing and pipeline merging; in this case, a pipeline with 1 node would be merged with another to form a 3-node pipeline. During refactoring the feature was removed due to incompatibility with the new framework structure, and the related issue #23 is still open. Sorry for the inconvenience, but for now the feature is not provided. It should still work if the new pipeline configuration is in the initial set of pipeline templates, as in your second experiment.
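
Conceptually, the merge would work like this (a hypothetical sketch of the idea from the paper, not the removed implementation): a pipeline whose surviving node count matches no template donates its nodes to another pipeline so that the merged size does have a template.

def merge_orphan_pipelines(node_counts, template_sizes):
    """Fold pipelines with no matching template into other pipelines.
    Conceptual sketch only; the removed implementation was different."""
    counts = sorted(node_counts)  # e.g. [1, 1, 2, 2] after two failures
    result = []
    while counts:
        c = counts.pop(0)
        if c in template_sizes:
            result.append(c)
            continue
        # Orphan: merge into the smallest valid pipeline such that the
        # combined node count still has a template.
        candidates = [i for i, other in enumerate(counts)
                      if other in template_sizes and c + other in template_sizes]
        if not candidates:
            raise RuntimeError(f"no pipeline can absorb a {c}-node orphan")
        i = min(candidates, key=lambda j: counts[j])
        counts[i] += c
    return sorted(result)

# Two 2-stage pipelines each lost one node; each orphaned node is borrowed
# by a surviving 2-node pipeline, forming two 3-node pipelines:
print(merge_orphan_pipelines([2, 1, 2, 1], {2, 3, 4, 5, 6, 7, 8}))  # [3, 3]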

lhy101 (Author) commented Dec 26, 2024

Thanks for clearing that up. I really appreciate your help!
