Parallel Inference with xDiT unsuccessful #129

Open
BestKuan opened this issue Dec 16, 2024 · 15 comments

@BestKuan

BestKuan commented Dec 16, 2024

Hello, I have a problem: I can't successfully run parallel inference in an environment with 8 L40S GPUs (48GB of VRAM each). The run fails with an out-of-memory error on rank 0. However, single-card operation runs successfully, although it takes significantly longer.

@feifeibear
Contributor

We will check this issue ASAP.

@ximo2002

(HunyuanVideo) root@dd22:~/project/HunyuanVideo# torchrun --nproc_per_node=8 sample_video.py --video-size 1280 720 --video-length 129 --infer-steps 50 --prompt "A cat walks on the grass, realistic style." --flow-reverse --seed 42 --ulysses-degree 8 --ring-degree 1 --save-path ./results
W1216 08:53:53.827000 140297779558208 torch/distributed/run.py:779]
W1216 08:53:53.827000 140297779558208 torch/distributed/run.py:779] *****************************************
W1216 08:53:53.827000 140297779558208 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1216 08:53:53.827000 140297779558208 torch/distributed/run.py:779] *****************************************
Namespace(model='HYVideo-T/2-cfgdistill', latent_channels=16, precision='bf16', rope_theta=256, vae='884-16c-hy', vae_precision='fp16', vae_tiling=True, text_encoder='llm', text_encoder_precision='fp16', text_states_dim=4096, text_len=256, tokenizer='llm', prompt_template='dit-llm-encode', prompt_template_video='dit-llm-encode-video', hidden_state_skip_layer=2, apply_final_norm=False, text_encoder_2='clipL', text_encoder_precision_2='fp16', text_states_dim_2=768, tokenizer_2='clipL', text_len_2=77, denoise_type='flow', flow_shift=7.0, flow_reverse=True, flow_solver='euler', use_linear_quadratic_schedule=False, linear_schedule_end=25, model_base='ckpts', dit_weight='ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt', model_resolution='540p', load_key='module', use_cpu_offload=False, batch_size=1, infer_steps=50, disable_autocast=False, save_path='./results', save_path_suffix='', name_suffix='', num_videos=1, video_size=[1280, 720], video_length=129, prompt='A cat walks on the grass, realistic style.', seed_type='auto', seed=42, neg_prompt=None, cfg_scale=1.0, embedded_cfg_scale=6.0, reproduce=False, ulysses_degree=8, ring_degree=1)
2024-12-16 08:53:56.565 | INFO | hyvideo.inference:from_pretrained:153 - Got text-to-video model root path: ckpts
DEBUG 12-16 08:53:56 [parallel_state.py:179] world_size=8 rank=1 local_rank=-1 distributed_init_method=env:// backend=nccl
[the Namespace dump and the INFO/DEBUG lines repeat once for each of the 8 ranks; omitted for brevity]
2024-12-16 08:54:02.117 | INFO | hyvideo.inference:from_pretrained:188 - Building model...
[repeated on all 8 ranks]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/root/project/HunyuanVideo/sample_video.py", line 58, in <module>
[rank3]:     main()
[rank3]:   File "/root/project/HunyuanVideo/sample_video.py", line 25, in main
[rank3]:     hunyuan_video_sampler = HunyuanVideoSampler.from_pretrained(models_root_path, args=args)
[rank3]:   File "/root/project/HunyuanVideo/hyvideo/inference.py", line 193, in from_pretrained
[rank3]:     model = load_model(
[rank3]:   File "/root/project/HunyuanVideo/hyvideo/modules/__init__.py", line 17, in load_model
[rank3]:     model = HYVideoDiffusionTransformer(
[rank3]:   File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 665, in inner_init
[rank3]:     init(self, *args, **init_kwargs)
[rank3]:   File "/root/project/HunyuanVideo/hyvideo/modules/models.py", line 561, in __init__
[rank3]:     [
[rank3]:   File "/root/project/HunyuanVideo/hyvideo/modules/models.py", line 562, in <listcomp>
[rank3]:     MMSingleStreamBlock(
[rank3]:   File "/root/project/HunyuanVideo/hyvideo/modules/models.py", line 291, in __init__
[rank3]:     self.linear2 = nn.Linear(
[rank3]:   File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 99, in __init__
[rank3]:     self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 3 has a total capacity of 23.64 GiB of which 6.81 MiB is free. Including non-PyTorch memory, this process has 23.63 GiB memory in use. Of the allocated memory 23.18 GiB is allocated by PyTorch, and 15.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[ranks 0, 1, 2, 4, 5, 6, and 7 raised identical torch.OutOfMemoryError tracebacks, differing only in the GPU index; omitted for brevity]
W1216 08:54:04.627000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879711 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879712 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879713 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879715 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879716 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879717 closing signal SIGTERM
W1216 08:54:04.629000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879718 closing signal SIGTERM
E1216 08:54:05.550000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 3 (pid: 2879714) of binary: /root/miniconda3/envs/HunyuanVideo/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/HunyuanVideo/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sample_video.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-16_08:54:04
host : dd22
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2879714)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(HunyuanVideo) root@dd22:~/project/HunyuanVideo#
Hit this same multi-GPU error on an 8x 4090 setup as well; it reports insufficient VRAM.
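
For what it's worth, the allocator hint printed in the OOM message can be tried, though it only mitigates fragmentation; it will not recover the ~23 GiB of model weights that each rank loads here. A sketch of the same launch with the flag set:

    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    torchrun --nproc_per_node=8 sample_video.py \
        --video-size 1280 720 \
        --video-length 129 \
        --infer-steps 50 \
        --prompt "A cat walks on the grass, realistic style." \
        --flow-reverse \
        --seed 42 \
        --ulysses-degree 8 \
        --ring-degree 1 \
        --save-path ./results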

@feifeibear
Contributor

@BestKuan @ximo2002 could you provide the script you used to run the parallel version? What is the resolution of the video?

@ximo2002

> Could you provide the script you used to run the parallel version? What is the resolution of the video?

torchrun --nproc_per_node=8 sample_video.py \
    --video-size 1280 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --ulysses-degree 8 \
    --ring-degree 1 \
    --save-path ./results

It's the latest version.

@jash101

jash101 commented Dec 16, 2024

I'm facing the same issue. I'm using a g6.12xlarge instance on AWS, which has 4 L4 GPUs (24GB VRAM each).

The command I run is

torchrun --nproc_per_node=4 sample_video.py \
    --video-size 1280 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --ulysses-degree 4 \
    --ring-degree 1 \
    --save-path ./results

I also tried 2x2 and 1x4, but I'm still getting an out-of-memory error (the 2x2 launch is spelled out below the screenshot).

[screenshot of the out-of-memory error]
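
For reference, the 2x2 layout means splitting the sequence parallelism as ulysses-degree 2 x ring-degree 2 (the product presumably has to equal --nproc_per_node); the launch differs from the one above only in those two flags:

    torchrun --nproc_per_node=4 sample_video.py \
        --video-size 1280 720 \
        --video-length 129 \
        --infer-steps 50 \
        --prompt "A cat walks on the grass, realistic style." \
        --flow-reverse \
        --seed 42 \
        --ulysses-degree 2 \
        --ring-degree 2 \
        --save-path ./results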

@feifeibear
Contributor

> [quotes @jash101's comment above]

I suppose you cannot run it successfully with 1 GPU either? Currently, the VRAM usage should be the same as in the single-GPU version.
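
One quick way to confirm this is to print each rank's memory right after the model loads; a minimal diagnostic sketch (not part of the repo, just a few lines to drop into sample_video.py after from_pretrained returns):

    # Diagnostic sketch: report per-rank GPU memory after model load.
    import os
    import torch

    rank = int(os.environ.get("LOCAL_RANK", 0))
    alloc = torch.cuda.memory_allocated(rank) / 2**30
    reserved = torch.cuda.memory_reserved(rank) / 2**30
    print(f"rank {rank}: {alloc:.2f} GiB allocated, {reserved:.2f} GiB reserved")

If every rank reports roughly the full single-GPU footprint, the weights are not being sharded.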

@ximo2002

> [quotes the exchange between @jash101 and @feifeibear above]
So would it work on 4090s then, with 8 cards?

@HenryBao91

The same error with 8 x L20, but I can run it successfully on a single L20.

@BestKuan
Author

BestKuan commented Dec 17, 2024

@feifeibear thanks for your reply! This is my command script:

export TOKENIZERS_PARALLELISM=false

export NPROC_PER_NODE=4
export ULYSSES_DEGREE=2
export RING_DEGREE=2
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nproc_per_node=$NPROC_PER_NODE sample_video.py \
	--video-size  544 960 \
	--video-length 129 \
	--infer-steps 50 \
	--prompt "A baby walks on the grass, realistic style." \
	--seed 42 \
	--embedded-cfg-scale 6.0 \
	--flow-shift 7.0 \
	--flow-reverse \
	--ulysses-degree=$ULYSSES_DEGREE \
	--ring-degree=$RING_DEGREE \
	--save-path ./results

and here is my log file:
log_4cards.txt
I can run successfully on a single card with the same video-size and video-length.

@jash101

jash101 commented Dec 17, 2024

> [quotes the earlier exchange between @jash101 and @feifeibear]

@feifeibear thanks for your reply. I changed to a g6e.12xlarge (4x 48GB), and while I'm able to run single-GPU inference at 544 x 960, I'm unable to run parallel inference.

@Jeff123z

I just git cloned https://github.com/tencent/HunyuanVideo directly and read through the code. What I'm curious about is the implementation of parallel inference.
sample_video.py, core code:
def main():
    args = parse_args()
    print(args)
    models_root_path = Path(args.model_base)
    models_root_path = "/home/sw4sever3/xxxx/hunyuan"
    # Create save folder to save the samples
    save_path = args.save_path if args.save_path_suffix == "" else f'{args.save_path}_{args.save_path_suffix}'
    if not os.path.exists(args.save_path):
        os.makedirs(save_path, exist_ok=True)

    # Load models
    hunyuan_video_sampler = HunyuanVideoSampler.from_pretrained(models_root_path, args=args)  # <== this is the line that loads the model

    # Get the updated args
    args = hunyuan_video_sampler.args

    # Start sampling
    # TODO: batch inference check
    outputs = hunyuan_video_sampler.predict()

    ...

Then looking at the concrete implementation of HunyuanVideoSampler.from_pretrained():

def from_pretrained(cls, pretrained_model_path, args, device=None, **kwargs):

    # ==================== Initialize Distributed Environment ================
    if args.ulysses_degree > 1 or args.ring_degree > 1:

        init_distributed_environment(rank=dist.get_rank(), world_size=dist.get_world_size())
        
        initialize_model_parallel(
            sequence_parallel_degree=dist.get_world_size(),
            ring_degree=args.ring_degree,
            ulysses_degree=args.ulysses_degree,
        )
        device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}")
        
    else:
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"

    parallel_args = {"ulysses_degree": args.ulysses_degree, "ring_degree": args.ring_degree}

    # ======================== Get the args path =============================

    # Disable gradient
    torch.set_grad_enabled(False)

    # =========================== Build main model ===========================
    logger.info("Building model...")
    
    #initialize_megatron_env()

    factor_kwargs = {"device": device, "dtype": PRECISION_TO_TYPE[args.precision]}
   
    in_channels = args.latent_channels
    out_channels = args.latent_channels

    model = load_model(
        args,
        in_channels=in_channels,
        out_channels=out_channels,
        factor_kwargs=factor_kwargs,
    )
    # load_model() is essentially the constructor of HYVideoDiffusionTransformer.
    # I don't see it partitioning the model according to the --ulysses-degree /
    # --ring-degree set in the launch script, i.e. distributing it already at the
    # loading stage. It looks to me like this line still loads the entire model
    # onto each single GPU, which is where the insufficient-VRAM error comes from.
   
    
    model = model.to(device)  # this is the original code: it moves the full model directly onto a single GPU -> OOM error!
    model = Inference.load_state_dict(args, model, pretrained_model_path)
    model.eval()
    
    # ============================= Build extra models ========================
    ........
    return model
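
To make the distinction concrete, here is a minimal standalone sketch (illustrative names only, not the repo's code) of what the load path above amounts to: Ulysses/ring sequence parallelism shards the sequence dimension of the activations across ranks, but every rank still materializes the complete parameter set.

    # Minimal sketch of the behavior described above; launch with torchrun.
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Sequence parallelism does not shard parameters, so this puts the
    # complete stack of blocks on every GPU -- the same weight footprint
    # as a single-GPU run:
    blocks = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(40)]).cuda()

    # Each rank later processes only seq_len // world_size tokens, which
    # reduces activation memory, not parameter memory.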

@xibosun
Contributor

xibosun commented Dec 19, 2024

> [quotes the thread above, ending with @jash101's g6e.12xlarge follow-up]

@jash101 Did you run single-GPU inference with the --use-cpu-offload flag? I'm not able to run single-GPU inference when CPU offload is disabled.

@jash101

jash101 commented Dec 19, 2024

> [quotes the thread above, ending with @xibosun's question about the --use-cpu-offload flag]

@xibosun yes, I used the command from the README for single-GPU inference:

python3 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --use-cpu-offload \
    --save-path ./results

@xibosun
Contributor

xibosun commented Dec 19, 2024

The OOM issue arises because CPU offloading is not yet supported in multi-GPU inference, so it is expected that multi-GPU inference consumes more GPU memory than a single-GPU setup running with offloading enabled.

Nevertheless, we are actively exploring alternative strategies such as FSDP to mitigate memory demands during multi-GPU inference.
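
For readers wondering what that would look like: a rough sketch of FSDP-style sharding around the existing load path (illustrative only; the actual integration, wrapping policy, and dtype handling are still being worked out, and this assumes the process group is already initialized as in from_pretrained above):

    # Hedged sketch: shard the DiT weights across ranks with PyTorch FSDP.
    # `load_model`, `args`, and `factor_kwargs` follow the repo's names above;
    # everything else here is an assumption, not the repo's actual code.
    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    factor_kwargs = {"device": "cpu", "dtype": torch.bfloat16}  # build on CPU first
    model = load_model(
        args,
        in_channels=args.latent_channels,
        out_channels=args.latent_channels,
        factor_kwargs=factor_kwargs,
    )
    model = FSDP(model, device_id=torch.cuda.current_device())  # each rank keeps a shard
    model.eval()
    # Forward passes all-gather each layer's shard on the fly instead of
    # keeping the full weights resident, trading communication for memory.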

@jash101

jash101 commented Dec 20, 2024

> [quotes @xibosun's comment above]

Thanks for pointing this out; yes, that makes sense. I tested without CPU offload on a single GPU and it gives an OOM error.
Tested with a smaller resolution and it runs both on a single GPU and in parallel, so there's no real issue with the repo.
Thanks!
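
For anyone landing here: the shape of a reduced-resolution parallel run is the same command with --video-size lowered. The values below are placeholders (the thread does not say which size fit; what works depends on your cards):

    torchrun --nproc_per_node=4 sample_video.py \
        --video-size 544 960 \
        --video-length 129 \
        --infer-steps 50 \
        --prompt "A cat walks on the grass, realistic style." \
        --flow-reverse \
        --seed 42 \
        --ulysses-degree 4 \
        --ring-degree 1 \
        --save-path ./results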
