Tokenizer training implementation #207

joanna-baran-allegro · 2025-02-26T12:31:40Z

Hi! I could not find the implementation details for tokenizer training. Could you point me to the files that contain them, or are they outside of this repository?
I am interested in which OLMo tokenizer was used as a base, how you evaluated tokenizer performance, and what data splitting and sampling strategy was applied during training. :)

NohTow · 2025-02-26T13:14:08Z

Hello,

The tokenizer files can be found alongside the model.
I am not 100% sure which OLMo tokenizer is the base; @bclavie should be able to give the exact version that was used.
Note that we did not train the tokenizer: we used it as is and only added unused tokens so that the vocabulary size is a power of 2. As a result, we did not evaluate the tokenizer in much depth either; we tested a few existing ones during ablations and decided to use OLMo's.
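
For readers who want to see what padding a vocabulary with unused tokens could look like, here is a minimal sketch in plain Python. It is not the code used for this model: the function names, the `[unusedN]` naming scheme, and the example sizes are all assumptions made for illustration.

```python
from typing import List, Optional


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def unused_tokens_for(vocab_size: int, target_size: Optional[int] = None) -> List[str]:
    """Build placeholder token strings that pad a vocabulary up to target_size.

    If target_size is omitted, the next power of two is used.
    The "[unusedN]" naming is a hypothetical convention, not taken from the thread.
    """
    if target_size is None:
        target_size = next_power_of_two(vocab_size)
    if target_size < vocab_size:
        raise ValueError("target_size must be >= vocab_size")
    return [f"[unused{i}]" for i in range(target_size - vocab_size)]


# Illustrative numbers only: a 1000-entry vocabulary padded to 1024.
placeholders = unused_tokens_for(1000)
print(len(placeholders), placeholders[0], placeholders[-1])
```

The resulting strings would then be appended to an existing tokenizer's vocabulary (for example, through whatever "add tokens" mechanism the tokenizer library provides), so the embedding matrix gets a hardware-friendly size without changing how any real text is tokenized.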
