
[QUESTION] Why do we need both "train_valid_test_datasets_provider.is_distributed = True" and batched data broadcasting? #1196

Open
rayleizhu opened this issue Oct 4, 2024 · 0 comments

Comments


I noticed that when train_valid_test_datasets_provider.is_distributed = True, the data loader is created in all processes, regardless of their tensor parallel rank:

train_valid_test_datasets_provider.is_distributed = True

if is_distributed or mpu.get_tensor_model_parallel_rank() == 0:
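For clarity, here is a minimal, self-contained sketch of how I read that gating condition (the helper name should_build_dataloader is mine, not Megatron-LM's):

```python
# Sketch of the gating condition above (hypothetical helper name, not the
# actual Megatron-LM code): with is_distributed=True every tensor parallel
# rank builds a data loader; otherwise only TP rank 0 does.
def should_build_dataloader(is_distributed: bool, tp_rank: int) -> bool:
    """Mirrors `if is_distributed or mpu.get_tensor_model_parallel_rank() == 0:`."""
    return is_distributed or tp_rank == 0

# is_distributed=True: every TP rank builds a loader.
assert all(should_build_dataloader(True, r) for r in range(4))
# is_distributed=False: only TP rank 0 builds a loader.
assert [should_build_dataloader(False, r) for r in range(4)] == [True, False, False, False]
```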

However, in get_batch(), the batched data is still broadcast across the tensor-parallel group:

data_i = tensor_parallel.broadcast_data(["tokens", "position_ids", "labels"], data, torch.int64)
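For reference, my (simplified) understanding of what that broadcast does: pack the selected tensors into one flat buffer on TP rank 0, broadcast the buffer within the tensor-parallel group, and unpack it on every rank. The sketch below only shows the pack/unpack round trip locally; the helper names are mine and no process group is set up:

```python
import torch
from math import prod

def pack(keys, data, dtype):
    """Flatten the selected tensors into one 1-D buffer (done on TP rank 0)."""
    shapes = {k: tuple(data[k].shape) for k in keys}
    flat = torch.cat([data[k].to(dtype).reshape(-1) for k in keys])
    return flat, shapes

def unpack(flat, keys, shapes):
    """Rebuild the per-key tensors from the flat buffer (done on every TP rank)."""
    out, offset = {}, 0
    for k in keys:
        numel = prod(shapes[k])
        out[k] = flat[offset:offset + numel].reshape(shapes[k])
        offset += numel
    return out

# In training, `flat` (and the shapes) would be sent with a broadcast over the
# tensor-parallel group; here we just round-trip locally to show that the
# batch survives intact.
batch = {"tokens": torch.arange(8).reshape(2, 4), "labels": torch.arange(8).reshape(2, 4)}
flat, shapes = pack(["tokens", "labels"], batch, torch.int64)
restored = unpack(flat, ["tokens", "labels"], shapes)
assert torch.equal(restored["tokens"], batch["tokens"])
```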

I am confused about why we need both. My understanding is that we need either distributed data access on every rank or broadcasting from TP rank 0, not both of them.
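To make the either/or point concrete, these are the two designs I would expect to be mutually exclusive (schematic only, hypothetical names):

```python
def get_batch_distributed(data_iterator):
    # Design A (is_distributed=True): every TP rank owns its own iterator,
    # so no tensor-parallel broadcast should be needed.
    return next(data_iterator)

def get_batch_broadcast(data_iterator, tp_rank, broadcast_fn):
    # Design B: only TP rank 0 owns an iterator and broadcasts the batch
    # to the other TP ranks.
    batch = next(data_iterator) if tp_rank == 0 else None
    return broadcast_fn(batch)
```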
