
Time of tokenize data too long #23

Open · obananas opened this issue Feb 19, 2025 · 2 comments
@obananas:

Hi, when I use dolma to tokenize the OLMoE-mix-0924 data, it takes a long time, probably more than a week. Is this normal?

@Muennighoff (Collaborator):

I think you can parallelize it across more processes? cc @soldni for more insights.

@obananas (Author):

We set --processes to 32, but the DCLM subset still takes a very long time. Is there another way to speed it up?
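For reference, a tokenization invocation would look roughly like the sketch below, modeled on the OLMoE-style dolma setup. The input and output paths are hypothetical placeholders, and the tokenizer name, token IDs, and --max_size value are assumptions to verify against your own configuration. Raising --processes toward the machine's physical core count is the main lever for throughput:

```sh
# Sketch of a dolma tokenization run (paths are hypothetical placeholders).
# The tokenizer name, token IDs, and max_size follow an OLMoE-style setup;
# confirm them against your own configuration before running.
dolma tokens \
    --documents "data/olmoe-mix-0924/**/*.json.gz" \
    --destination "data/olmoe-mix-0924-tokenized" \
    --tokenizer.name_or_path "allenai/gpt-neox-olmo-dolma-v1_5" \
    --tokenizer.eos_token_id 50279 \
    --tokenizer.pad_token_id 1 \
    --max_size "2_147_483_648" \
    --seed 0 \
    --processes 64   # scale this toward the number of physical CPU cores
```

One caveat, assuming dolma parallelizes at the file level as its parallel processors generally do: a few very large input files can bottleneck individual workers no matter how high --processes is set, so splitting the DCLM subset into more, smaller files before tokenizing may also help.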
