Shuffle Tokenized Data #290
I gave it a high-level review and I think we should discuss if we really need this multiprocessing setup.
Once decided, I will also look into reading and writing of the pbin files.
LGTM
LGTM :) Nice work!
Left minor comments but nothing major.
Almost mergeable!
Co-authored-by: Max Lübbering <[email protected]>
What does this PR do?
This PR introduces functionality to shuffle both the data and index segments in a packed file format. The primary goal is to eliminate the reliance on random reads during later stages when the data is accessed via mmap.
By shuffling the index and aligning it with a shuffled byte stream of the data, this implementation ensures that data can be read sequentially.
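As a rough illustration of the idea, and not the PR's actual implementation, the sketch below shuffles index entries and rewrites the data bytes in the same order, so that walking the new index front to back reads the data sequentially. The function name `shuffle_packed`, the `(offset, length)` index layout, and the `batch_size` parameter are all assumptions made for this example; the real pbin format and the signature of `shuffle_tokenized_data` may differ.

```python
import random


def shuffle_packed(data: bytes, index: list[tuple[int, int]], seed: int, batch_size: int = 2):
    """Hypothetical sketch: return (new_data, new_index) with documents in shuffled order.

    `index` is assumed to hold one (offset, length) pair per document.
    """
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    shuffled = list(index)     # copy, so the caller's index is left untouched
    rng.shuffle(shuffled)

    new_data = bytearray()
    new_index = []
    write_offset = 0

    # Process the shuffled entries in batches. In a real setting each batch
    # could be flushed to disk so that only one batch is buffered at a time.
    for start in range(0, len(shuffled), batch_size):
        for old_offset, length in shuffled[start:start + batch_size]:
            new_data += data[old_offset:old_offset + length]
            new_index.append((write_offset, length))
            write_offset += length

    return bytes(new_data), new_index


data = b"aabbbcdddd"
index = [(0, 2), (2, 3), (5, 1), (6, 4)]
new_data, new_index = shuffle_packed(data, index, seed=0)
```

Because each new entry starts where the previous one ends, a reader that iterates over `new_index` in order touches the bytes of `new_data` strictly front to back, which is the access pattern that suits mmap.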
General Changes
The shuffle_tokenized_data function loads the input data and index into memory, shuffles the index entries, and processes them in batches to reduce the memory footprint.
Breaking Changes
None.
Checklist before submitting final PR
python tests/tests.py (please see Tokenization Test Fails #293)
CHANGELOG_DEV.md