Hi! I could not find the implementation details on tokenizer training. Could you point me to the files that contain them, or are they outside of this repository?
I am interested in which OLMo tokenizer was used as a base, how you evaluated the tokenizer's performance, and what data splitting & sampling strategy was applied during the training procedure. :)
The tokenizer files can be found alongside the model.
I am not 100% sure which OLMo tokenizer is the base; @bclavie should be able to give the exact version that was used.
Note that we did not perform any training of the tokenizer; we used it as is, with additional unused tokens to bring the vocabulary size up to a power of 2. We thus did not evaluate the tokenizer much either: we just tested a few existing ones during ablations and decided to use OLMo's.
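For illustration, here is a minimal sketch of that padding step in Python, assuming the Hugging Face `transformers` API; the checkpoint id and the `<|unused_i|>` token names are placeholders, not necessarily what was actually used:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint id -- substitute the actual OLMo tokenizer release.
tok = AutoTokenizer.from_pretrained("allenai/OLMo-7B")

vocab_size = len(tok)
# Round the vocabulary size up to the next power of 2.
target = 1 << (vocab_size - 1).bit_length()

# Pad the vocabulary with unused placeholder tokens that never occur in text.
# The naming scheme here is hypothetical.
tok.add_tokens([f"<|unused_{i}|>" for i in range(target - vocab_size)])

assert len(tok) == target
```

The model's embedding matrix is then sized to match; a common motivation for this kind of padding is that power-of-2 (or multiple-of-64) dimensions tend to be friendlier to GPU kernels than an arbitrary vocabulary size.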