Write multiple items to output file at once, in distributed data analyzer. #5169

Conversation

bm-synth
Contributor

@bm-synth bm-synth commented Feb 21, 2024

Minor improvements to #5129.

  • Writes all buffers to the output file in a single call instead of one at a time (indexed_dataset.py, method add_items()).
  • Fixes the initialisation of num_workers and worker_id, which were ignored when the user provided them.
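
The first change can be illustrated with a minimal sketch. This is not the code from indexed_dataset.py: the function names `add_items_iteratively` and `add_items_at_once` are hypothetical, and it assumes each item is a NumPy array whose raw bytes are appended to a binary file. It only shows that joining the buffers and issuing one write yields the same bytes as writing them one by one.

```python
import io

import numpy as np


def add_items_iteratively(out_file, arrays):
    """Hypothetical baseline: one write call per array."""
    sizes = []
    for arr in arrays:
        out_file.write(arr.tobytes(order="C"))
        sizes.append(arr.size)
    return sizes


def add_items_at_once(out_file, arrays):
    """Hypothetical batched variant: join all buffers, then write once."""
    out_file.write(b"".join(arr.tobytes(order="C") for arr in arrays))
    return [arr.size for arr in arrays]


# In-memory files stand in for the real output file.
arrays = [np.arange(n, dtype=np.int64) for n in (3, 5, 2)]
iterative, batched = io.BytesIO(), io.BytesIO()
sizes_iterative = add_items_iteratively(iterative, arrays)
sizes_batched = add_items_at_once(batched, arrays)
```

Both variants produce identical file contents and per-item sizes. The batched one trades a temporary concatenated buffer for fewer write calls, which may matter when there are many small items.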

@bm-synth bm-synth marked this pull request as ready for review February 21, 2024 12:31
@bm-synth bm-synth requested a review from conglongli as a code owner February 21, 2024 12:31
@bm-synth bm-synth changed the title Write multiple items to file at once in distributed data analyzer Write multiple items to output file at once, in DistributedDataAnalyzer. Feb 21, 2024
@bm-synth bm-synth changed the title Write multiple items to output file at once, in DistributedDataAnalyzer. Write multiple items to output file at once, in distributed data analyzer. Feb 21, 2024
@conglongli conglongli self-assigned this Feb 21, 2024
@conglongli conglongli enabled auto-merge February 22, 2024 11:31
@conglongli conglongli added this pull request to the merge queue Feb 22, 2024
Merged via the queue into microsoft:master with commit d5fa87f Feb 22, 2024
12 checks passed
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this pull request Mar 11, 2024
…yzer. (microsoft#5169)

Minor improvements of
[https://github.com/microsoft/DeepSpeed/pull/5129](https://github.com/microsoft/DeepSpeed/pull/5129).
- Writes all buffers at once to the output file, instead of iteratively
(`indexed_dataset.py`, method `add_items()`).
- Fixes the wrong initialisation of `num_workers` and `worker_id` that
were being ignored when they were provided by the user.

---------

Co-authored-by: Conglong Li <[email protected]>
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
…yzer. (microsoft#5169)