
Time of tokenize data too long #23

Open · obananas opened this issue Feb 19, 2025 · 2 comments
@obananas:

Hi, when I use dolma to tokenize the OLMoE-mix-0924 data, it takes a long time, probably more than a week. Is this normal?

@Muennighoff (Collaborator):

I think you can parallelize it across more processes? cc @soldni for more insights.

@obananas (Author):

We set --processes to 32, but the DCLM subset still takes a very long time. Is there another way to speed it up?
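For reference, a tokenization invocation would look roughly like the sketch below, modeled on the OLMoE-style dolma setup. The input and output paths are hypothetical placeholders, and the tokenizer name, token IDs, and --max_size value are assumptions to verify against your own configuration. Raising --processes toward the machine's physical core count is the main lever for throughput:

```sh
# Sketch of a dolma tokenization run (paths are hypothetical placeholders).
# The tokenizer name, token IDs, and max_size follow an OLMoE-style setup;
# confirm them against your own configuration before running.
dolma tokens \
    --documents "data/olmoe-mix-0924/**/*.json.gz" \
    --destination "data/olmoe-mix-0924-tokenized" \
    --tokenizer.name_or_path "allenai/gpt-neox-olmo-dolma-v1_5" \
    --tokenizer.eos_token_id 50279 \
    --tokenizer.pad_token_id 1 \
    --max_size "2_147_483_648" \
    --seed 0 \
    --processes 64   # scale this toward the number of physical CPU cores
```

One caveat, assuming dolma parallelizes at the file level as its parallel processors generally do: a few very large input files can bottleneck individual workers no matter how high --processes is set, so splitting the DCLM subset into more, smaller files before tokenizing may also help.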
