
Question about fault tolerance threshold (f) and zero3 #27

Open · lhy101 opened this issue Dec 24, 2024 · 4 comments

lhy101 commented Dec 24, 2024

Hi @insujang,

Thank you for open-sourcing Oobleck—it’s an impressive piece of work!

I noticed in the paper that there is a parameter f that controls the fault tolerance threshold. However, I couldn’t find it in the codebase. Is there a way to configure or control this parameter? Additionally, is there a default value set for f?

Another question: I noticed that Zero3 is being used. In that case, each GPU should hold a unique model slice for optimizer states (as in traditional Zero3). If one node fails, the corresponding Zero3 slice would also be lost. How can it be recovered? If my understanding is incorrect, please feel free to point it out.

Looking forward to your response. Thanks again for your contributions!

insujang (Member) commented

Hi @lhy101 ! Thank you for your interest in Oobleck.

Re: fault tolerance threshold, please refer to:

fault_tolerance_threshold: int = 3,

Re: Zero3, first of all, Zero3 is no longer used after refactoring; traditional 3D parallelism (DP+TP+PP) is used instead, where DP provides redundancy. Second, when Zero3 was used, it was not an alternative to DP but to TP. So Zero3 + PP + DP was used, and the outermost DP provided redundancy.
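
To illustrate the layout (a minimal sketch, not Oobleck's actual process-group code; the function and numbers are made up for illustration): with DP as the outermost dimension, each DP replica holds a complete copy of the model, so losing a node breaks at most one replica while the others retain full parameter and optimizer state.

def build_rank_groups(world_size: int, dp: int, pp: int, shard: int):
    """Partition ranks into dp replicas, each with pp stages of `shard`
    ranks; the innermost dimension is TP (or ZeRO-3 in the old design)."""
    assert world_size == dp * pp * shard
    replicas = []
    for d in range(dp):
        stages = []
        for p in range(pp):
            base = (d * pp + p) * shard
            stages.append(list(range(base, base + shard)))
        replicas.append(stages)
    return replicas

# 16 ranks, dp=2, pp=2, shard=4: any single failure leaves one intact replica.
for replica in build_rank_groups(16, dp=2, pp=2, shard=4):
    print(replica)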

lhy101 (Author) commented Dec 24, 2024

Thank you for your explanation; I understand now! I also have a practical question and hope you can take some time to help me with it. My experimental setup consists of 4 machines, each equipped with 8 A800 GPUs (80GB), and the model size is 32B. I am using a configuration with tp=4, so my hostfile looks like this:

30.207.99.20 slots=4 devices=0,1,2,3 port=22
30.207.99.20 slots=4 devices=4,5,6,7 port=22
30.207.99.21 slots=4 devices=0,1,2,3 port=22
30.207.99.21 slots=4 devices=4,5,6,7 port=22
30.207.99.22 slots=4 devices=0,1,2,3 port=22
30.207.99.22 slots=4 devices=4,5,6,7 port=22
30.207.99.23 slots=4 devices=0,1,2,3 port=22
30.207.99.23 slots=4 devices=4,5,6,7 port=22

With this setup, the training runs successfully, and the generated pipeline templates are as follows:

2024-12-24 11:27:33.385 | DEBUG | oobleck.engine.execution_engine:prepare:151 - Pipeline templates: {
    2: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages),
    3: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages),
    4: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages),
    5: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages),
    6: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages),
    7: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 7 stages),
    8: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 8 stages)
}

Based on my understanding of the log, all supported configurations are as follows:

2024-12-24 11:27:33.385 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:94 - Enumerating all feasible sets of pipeline templates for 8 nodes.
2024-12-24 11:27:33.386 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:121 - Dynamic programming result: [
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 4}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 2}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 2, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 1}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 2}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages): 1}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages): 1}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 8 stages): 1})
]
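
If I read the log correctly, these feasible sets are exactly the ways of splitting 8 nodes across pipelines whose sizes have a template (2-8 stages here). A brute-force sketch of that enumeration (my own illustration; the actual code in pipeline_instantiator.py uses dynamic programming):

def feasible_sets(num_nodes, sizes, smallest=0):
    """Yield every non-decreasing multiset of pipeline sizes that sums
    to num_nodes, using only sizes for which a template exists."""
    if num_nodes == 0:
        yield []
        return
    for s in sizes:
        if smallest <= s <= num_nodes:
            for rest in feasible_sets(num_nodes - s, sizes, s):
                yield [s] + rest

for combo in feasible_sets(8, sizes=range(2, 9)):
    print(combo)
# -> [2, 2, 2, 2], [2, 2, 4], [2, 3, 3], [2, 6], [3, 5], [4, 4], [8]
# i.e. the same 7 sets as in the dynamic programming result above.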

Next, I shut down 30.207.99.23 (i.e., the last two tp4 groups). In theory, if the default f is 3, reconfiguration should work. However, I encountered the following error:

File "/jizhicfs/hymiezhao/lhy/Oobleck/examples/run_gpt2.py", line 152, in main model, optimizer, dataloader = engine.reconfigure( File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/execution_engine.py", line 289, in reconfigure model, optimizer, dataloader, _ = self.plugin.reconfigure( File "/jizhicfs/hymiezhao/miniconda3/envs/Oobleck_new/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 228, in reconfigure new_pipelines, new_num_microbatches = self._instantiate_pipelines( File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 125, in _instantiate_pipelines pipelines = [ File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 126, in <listcomp> pipeline_templates[num_stages] KeyError: 1

It seems like the system is trying to find a pipeline with only 1 stage, but such a pipeline is not generated (perhaps due to high memory usage by the model?).
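
To make the failure concrete (a hypothetical reconstruction on my part; I don't know which instantiation the planner actually chose): if the chosen set was four 2-stage pipelines, shutting down 30.207.99.23 leaves two pipelines with a single surviving node each, and the template lookup then fails exactly as in the traceback:

# Hypothetical reconstruction of the KeyError, not Oobleck's actual code path.
# Templates only exist for 2..8 stages, so a pipeline shrunk to one node
# has no matching entry.
pipeline_templates = {n: f"PipelineTemplate({n} stages)" for n in range(2, 9)}

# Four 2-stage pipelines; the two failed nodes belonged to two of them:
surviving_nodes_per_pipeline = [2, 2, 1, 1]

pipelines = [
    pipeline_templates[num_stages]  # raises KeyError: 1, as in the log
    for num_stages in surviving_nodes_per_pipeline
]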
Additionally, I tried starting the training with only the remaining nodes from the beginning:

30.207.99.20 slots=4 devices=0,1,2,3 port=22
30.207.99.20 slots=4 devices=4,5,6,7 port=22
30.207.99.21 slots=4 devices=0,1,2,3 port=22
30.207.99.21 slots=4 devices=4,5,6,7 port=22
30.207.99.22 slots=4 devices=0,1,2,3 port=22
30.207.99.22 slots=4 devices=4,5,6,7 port=22

2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.execution_engine:prepare:151 - Pipeline templates: {
    2: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages),
    3: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages),
    4: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages),
    5: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages),
    6: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages)
}
2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:94 - Enumerating all feasible sets of pipeline templates for 6 nodes.
2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:121 - Dynamic programming result: [
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 3}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 2}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 1}),
    defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages): 1})
]

This worked fine. However, transitioning from the first scenario to the second scenario via reconfiguration seems to cause problems.
I would greatly appreciate it if you could provide further clarification or guidance on this issue at your convenience. Thank you once again for your assistance!

insujang (Member) commented

The paper and our early version of the code included node borrowing and pipeline merging; in this case, a pipeline with 1 node would be merged with another to form a 3-node pipeline. During refactoring the feature was removed due to incompatibility with the new framework structure, and the related issue #23 is still open. Sorry for the inconvenience, but for now the feature is not provided. It should still work if the new pipeline configuration is in the initial set of pipeline templates, as in your second experiment.
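
Conceptually, the merge would work like this (a hypothetical sketch of the idea from the paper, not the removed implementation): a pipeline whose surviving node count matches no template donates its nodes to another pipeline so that the merged size does have a template.

def merge_orphan_pipelines(node_counts, template_sizes):
    """Fold pipelines with no matching template into other pipelines.
    Conceptual sketch only; the removed implementation was different."""
    counts = sorted(node_counts)  # e.g. [1, 1, 2, 2] after two failures
    result = []
    while counts:
        c = counts.pop(0)
        if c in template_sizes:
            result.append(c)
            continue
        # Orphan: merge into the smallest valid pipeline such that the
        # combined node count still has a template.
        candidates = [i for i, other in enumerate(counts)
                      if other in template_sizes and c + other in template_sizes]
        if not candidates:
            raise RuntimeError(f"no pipeline can absorb a {c}-node orphan")
        i = min(candidates, key=lambda j: counts[j])
        counts[i] += c
    return sorted(result)

# Two 2-stage pipelines each lost one node; each orphaned node is borrowed
# by a surviving 2-node pipeline, forming two 3-node pipelines:
print(merge_orphan_pipelines([2, 1, 2, 1], {2, 3, 4, 5, 6, 7, 8}))  # [3, 3]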

lhy101 (Author) commented Dec 26, 2024

Thanks for clearing that up. I really appreciate your help!
