Is your feature request related to a problem? Please describe.
I need to calculate the number of tokens, but TokenizerGpt3 produces incorrect counts for GPT-3.5 and newer models.
TokenizerGpt3 is mainly based on openai-tools. After reading the source code, its implementation follows data_gym_to_mergeable_bpe_ranks, which requires an encoder.json and a vocab.bpe file at runtime. According to openai_public, this method is intended mainly for gpt-2, and my test results show it also works for r50k_base and p50k_base. However, it does not work for cl100k_base (GPT-4 and GPT-3.5).
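To make the difference concrete: the current approach needs two artifacts at runtime, encoder.json (a JSON map from token string to id) and vocab.bpe (a version header line followed by space-separated merge pairs). A rough C# sketch of reading the vocab.bpe merges (illustrative only; the real data_gym_to_mergeable_bpe_ranks also remaps bytes to unicode and rebuilds the ranks) could look like this:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sketch only: read the merge pairs from a GPT-2 style vocab.bpe file.
// The first line is a version header (e.g. "#version: 0.2"); every
// following non-empty line is "<left> <right>", in merge-priority order.
static List<(string Left, string Right)> ReadMerges(string path)
{
    return File.ReadLines(path)
        .Skip(1)                                   // skip the "#version" header
        .Where(l => !string.IsNullOrWhiteSpace(l))
        .Select(l => l.Split(' '))
        .Select(p => (p[0], p[1]))
        .ToList();
}

// Example usage (file name is hypothetical):
var merges = ReadMerges("vocab.bpe");
Console.WriteLine($"{merges.Count} merge rules");
```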
Starting from r50k_base, the tokenizer implementation switched to load_tiktoken_bpe, which relies on a .tiktoken file at runtime. Currently, there are two tokenizer projects that support GPT-3.5, TiktokenSharp and SharpToken, and both are implemented this way.
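For context, a .tiktoken file is a plain-text vocabulary: each non-empty line holds a token's bytes encoded in base64, a space, and the token's integer rank. A minimal C# sketch of reading that format (the helper and file name are illustrative, not the actual TiktokenSharp/SharpToken code) could be:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Sketch only: parse a .tiktoken vocabulary file.
// Each non-empty line is "<base64-encoded token bytes> <rank>".
// A real tokenizer would decode the base64 into raw bytes and key the
// dictionary on the byte sequence (with a structural equality comparer).
static Dictionary<string, int> LoadTiktokenRanks(string path)
{
    var ranks = new Dictionary<string, int>();
    foreach (var line in File.ReadLines(path))
    {
        if (string.IsNullOrWhiteSpace(line)) continue;
        var parts = line.Split(' ', 2);
        ranks[parts[0]] = int.Parse(parts[1]);
    }
    return ranks;
}

// Example usage (file name is hypothetical):
var ranks = LoadTiktokenRanks("cl100k_base.tiktoken");
Console.WriteLine($"{ranks.Count} ranks loaded");
```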
Describe the solution you'd like
It is difficult to modify the current TokenizerGpt3 to support cl100k_base; a rewrite may be the only way. Do you think it is necessary? If so, I am willing to undertake the rewriting work. Please let me know your opinion.
Describe alternatives you've considered
Alternatively, we could just use TiktokenSharp.
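For reference, counting tokens with TiktokenSharp would look roughly like the sketch below (API names are taken from TiktokenSharp's README as I understand them, so treat them as an assumption rather than a verified signature):

```csharp
using System;
using TiktokenSharp;

// Sketch: token counting for a cl100k_base model via TiktokenSharp
// (assumed API, not the current TokenizerGpt3 interface).
var tikToken = TikToken.EncodingForModel("gpt-3.5-turbo"); // resolves to cl100k_base
var tokens = tikToken.Encode("hello world");
Console.WriteLine(tokens.Count); // token count for the input string
```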
Hi, thanks for creating the issue. Both solutions are okay with me (porting or using a different library), but first I need to do a bit of research about it.