I noticed that when train_valid_test_datasets_provider.is_distributed = True, the data loader is created in all processes, ignoring their tensor parallel rank:
Megatron-LM/pretrain_vlm.py
Line 333 in c02b335
Megatron-LM/megatron/training/training.py
Line 1685 in c02b335
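For context, the gating I am describing can be sketched roughly like this. The function name and structure below are illustrative assumptions, not Megatron-LM's actual code; is_distributed stands in for train_valid_test_datasets_provider.is_distributed:

```python
# Illustrative sketch of the dataloader-creation decision described above.
# This is NOT Megatron-LM's actual implementation, just the behavior I observed.
def rank_builds_dataloader(tp_rank: int, is_distributed: bool) -> bool:
    """Return True if this tensor-parallel rank builds its own data loader."""
    if is_distributed:
        # Every process builds a loader, regardless of tensor-parallel rank.
        return True
    # Otherwise only TP rank 0 loads data; other ranks rely on a broadcast.
    return tp_rank == 0
```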
However, in get_batch() the batched data is still broadcast from TP rank 0:
Megatron-LM/pretrain_vlm.py
Line 242 in c02b335
This confuses me: why do we need both? My understanding is that we need either distributed access on every rank or broadcasting from TP rank 0, not both.
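To illustrate why the combination looks redundant to me, here is a minimal pure-Python simulation (no torch.distributed, not Megatron-LM code): if every rank already loads its own batch, a subsequent broadcast from TP rank 0 just overwrites that work.

```python
# Minimal simulation of "every rank loads, then broadcast anyway".
# Ranks and batches are plain Python stand-ins; names are hypothetical.

def load_batch(rank: int) -> list:
    """Each rank draws its own batch from its loader (is_distributed=True)."""
    return [rank * 10, rank * 10 + 1]

def broadcast_from_rank0(per_rank_batches: dict) -> dict:
    """Mimic a tensor-parallel broadcast: all ranks end up with rank 0's batch."""
    src = per_rank_batches[0]
    return {rank: list(src) for rank in per_rank_batches}

batches = {rank: load_batch(rank) for rank in range(4)}  # all ranks load
batches = broadcast_from_rank0(batches)                  # then all get rank 0's data
# The batches loaded on ranks 1..3 are discarded — hence the question above.
```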