Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shuffle Tokenized Data #290

Merged
merged 13 commits into from
Jan 21, 2025
Merged

Shuffle Tokenized Data #290

merged 13 commits into from
Jan 21, 2025

Conversation

mali-git
Copy link
Member

@mali-git mali-git commented Jan 14, 2025

What does this PR do?

This PR introduces functionality to shuffle both the data and index segments in a packed file format. The primary goal is to eliminate the reliance on random reads during later stages when the data is accessed via mmap.

By shuffling the index and aligning it with a shuffled byte stream of the data, this implementation ensures that data can be read sequentially.

General Changes

The shuffle_tokenized_data function loads the input data and index into memory, shuffles the index entries, and processes them in batches to recduce the memory footprint.

Breaking Changes

None.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py) - Please see Tokenization Test Fails #293
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@mali-git mali-git marked this pull request as ready for review January 15, 2025 22:04
@mali-git mali-git requested a review from le1nux January 15, 2025 22:04
Copy link
Member

@le1nux le1nux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it a high-level review and I think we should discuss if we really need this multiprocessing setup.
Once decided, I will also look into reading and writing of the pbin files.

src/modalities/dataloader/shuffle_tokenized_data.py Outdated Show resolved Hide resolved
src/modalities/dataloader/shuffle_tokenized_data.py Outdated Show resolved Hide resolved
Copy link
Member

@fromm-m fromm-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@le1nux le1nux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :) Nice work!
Left minor comments but nothing major.
Almost mergeable!

src/modalities/__main__.py Outdated Show resolved Hide resolved
src/modalities/dataloader/shuffle_tokenized_data.py Outdated Show resolved Hide resolved
src/modalities/dataloader/shuffle_tokenized_data.py Outdated Show resolved Hide resolved
src/modalities/dataloader/shuffle_tokenized_data.py Outdated Show resolved Hide resolved
tests/dataloader/test_shuffle_tokenized_data.py Outdated Show resolved Hide resolved
tests/dataloader/test_shuffle_tokenized_data.py Outdated Show resolved Hide resolved
@mali-git mali-git merged commit 43b6cc2 into main Jan 21, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants