
[QUESTION] Why do we need both "train_valid_test_datasets_provider.is_distributed = True" and batched data broadcasting? #1196

Open
rayleizhu opened this issue Oct 4, 2024 · 0 comments

Comments


I noticed that when train_valid_test_datasets_provider.is_distributed = True, the data loader is created in all processes, regardless of their tensor parallel rank:

train_valid_test_datasets_provider.is_distributed = True

if is_distributed or mpu.get_tensor_model_parallel_rank() == 0:
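For clarity, here is a minimal, self-contained sketch of how I read that gating condition (the helper name should_build_dataloader is mine, not Megatron-LM's):

```python
# Sketch of the gating condition above (hypothetical helper name, not the
# actual Megatron-LM code): with is_distributed=True every tensor parallel
# rank builds a data loader; otherwise only TP rank 0 does.
def should_build_dataloader(is_distributed: bool, tp_rank: int) -> bool:
    """Mirrors `if is_distributed or mpu.get_tensor_model_parallel_rank() == 0:`."""
    return is_distributed or tp_rank == 0

# is_distributed=True: every TP rank builds a loader.
assert all(should_build_dataloader(True, r) for r in range(4))
# is_distributed=False: only TP rank 0 builds a loader.
assert [should_build_dataloader(False, r) for r in range(4)] == [True, False, False, False]
```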

However, in get_batch(), the batched data is still broadcast across the tensor-parallel group:

data_i = tensor_parallel.broadcast_data(["tokens", "position_ids", "labels"], data, torch.int64)
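For reference, my (simplified) understanding of what that broadcast does: pack the selected tensors into one flat buffer on TP rank 0, broadcast the buffer within the tensor-parallel group, and unpack it on every rank. The sketch below only shows the pack/unpack round trip locally; the helper names are mine and no process group is set up:

```python
import torch
from math import prod

def pack(keys, data, dtype):
    """Flatten the selected tensors into one 1-D buffer (done on TP rank 0)."""
    shapes = {k: tuple(data[k].shape) for k in keys}
    flat = torch.cat([data[k].to(dtype).reshape(-1) for k in keys])
    return flat, shapes

def unpack(flat, keys, shapes):
    """Rebuild the per-key tensors from the flat buffer (done on every TP rank)."""
    out, offset = {}, 0
    for k in keys:
        numel = prod(shapes[k])
        out[k] = flat[offset:offset + numel].reshape(shapes[k])
        offset += numel
    return out

# In training, `flat` (and the shapes) would be sent with a broadcast over the
# tensor-parallel group; here we just round-trip locally to show that the
# batch survives intact.
batch = {"tokens": torch.arange(8).reshape(2, 4), "labels": torch.arange(8).reshape(2, 4)}
flat, shapes = pack(["tokens", "labels"], batch, torch.int64)
restored = unpack(flat, ["tokens", "labels"], shapes)
assert torch.equal(restored["tokens"], batch["tokens"])
```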

I am confused about why we need both. My understanding is that we need either distributed data access on every rank or broadcasting from TP rank 0, not both of them.
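To make the either/or point concrete, these are the two designs I would expect to be mutually exclusive (schematic only, hypothetical names):

```python
def get_batch_distributed(data_iterator):
    # Design A (is_distributed=True): every TP rank owns its own iterator,
    # so no tensor-parallel broadcast should be needed.
    return next(data_iterator)

def get_batch_broadcast(data_iterator, tp_rank, broadcast_fn):
    # Design B: only TP rank 0 owns an iterator and broadcasts the batch
    # to the other TP ranks.
    batch = next(data_iterator) if tp_rank == 0 else None
    return broadcast_fn(batch)
```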
