Tokenizer for multiple encodings #213

Open
Frogley opened this issue Mar 31, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@Frogley

Frogley commented Mar 31, 2023

Is your feature request related to a problem? Please describe.
I need to calculate the number of tokens, but TokenizerGpt3 produces incorrect counts for GPT-3.5 and later models.

TokenizerGpt3 is based mainly on openai-tools. From reading the source code, its implementation follows data_gym_to_mergeable_bpe_ranks, which requires an encoder.json and a vocab.bpe file at runtime. According to openai_public, this method is intended mainly for gpt-2, and my tests show it also gives correct results for r50k_base and p50k_base. However, it does not work for cl100k_base (GPT-4 and GPT-3.5).
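To illustrate why the encoder.json / vocab.bpe format needs its own handling, here is a minimal Python sketch of the byte-to-printable-character mapping that the GPT-2 vocabulary files use. Token strings in encoder.json are stored in this remapped form, so they have to be converted back to raw bytes before they can serve as BPE ranks. The helper name `data_gym_token_to_bytes` is hypothetical and only for illustration; this is not the TokenizerGpt3 code.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each byte 0..255 to a printable character, GPT-2 style."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points from 256 upward.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, map(chr, code_points)))


def data_gym_token_to_bytes(token: str) -> bytes:
    """Hypothetical helper: turn an encoder.json token string back into raw bytes."""
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    return bytes(char_to_byte[ch] for ch in token)


# The space byte (32) is stored as "Ġ" (U+0120) in encoder.json.
print(data_gym_token_to_bytes("Ġhello"))  # b' hello'
```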

Starting from r50k_base, the tokenizer implementation changed to load_tiktoken_bpe, which relies on a .tiktoken file at runtime. There are currently two tokenizer projects that support GPT-3.5, TiktokenSharp and SharpToken, and both are implemented this way.
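For comparison, a .tiktoken file is much simpler to load: each line holds a base64-encoded token and its integer rank, separated by a space. The Python sketch below parses that layout from an in-memory sample. The function name `parse_tiktoken_bpe` and the three-entry sample are assumptions made for illustration, not the library's own loader (which also handles downloading and caching the file).

```python
import base64


def parse_tiktoken_bpe(contents: bytes) -> dict[bytes, int]:
    """Parse .tiktoken-style content into a {token_bytes: rank} table."""
    ranks: dict[bytes, int] = {}
    for line in contents.splitlines():
        if not line:
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Tiny made-up sample in the same layout as a real .tiktoken file.
tokens = [b"!", b" the", b"hello"]
sample = b"\n".join(
    base64.b64encode(tok) + b" " + str(rank).encode()
    for rank, tok in enumerate(tokens)
)

ranks = parse_tiktoken_bpe(sample)
print(ranks[b" the"])  # 1
```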

Describe the solution you'd like
Modifying the current TokenizerGpt3 to support cl100k_base would be difficult, and a rewrite may be the only way. Do you think it's necessary? If so, I'm willing to take on the rewrite. Please let me know what you think.
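For a sense of what the core of such a rewrite involves, here is a rough Python sketch of a byte-pair merge loop driven by a rank table. It is a simplified illustration, not the tiktoken or TiktokenSharp implementation: real tokenizers first split the text into pieces with an encoding-specific regex and use a faster merge strategy. The function name `bpe_encode` and the toy rank table are hypothetical.

```python
def bpe_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Greedily merge the lowest-ranked adjacent pair until none is left."""
    parts = [bytes([b]) for b in piece]  # start from single bytes
    while len(parts) > 1:
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no adjacent pair is in the vocabulary
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    # Assumes every single byte of the input has an entry in `ranks`.
    return [ranks[p] for p in parts]


# Toy rank table, made up for this example only.
toy_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}

print(bpe_encode(b"abc", toy_ranks))  # [4]
print(bpe_encode(b"cab", toy_ranks))  # [2, 3]
```

The token count for a piece is then simply the length of the returned list.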

Describe alternatives you've considered
Alternatively, we could just use TiktokenSharp.

@kayhantolga
Member

Hi, thanks for creating the issue. Both solutions (porting or using a different library) are fine with me, but I need to do a bit of research first.

@kayhantolga kayhantolga added the bug Something isn't working label Apr 15, 2023
@kayhantolga kayhantolga added this to the 8.0.2 milestone Apr 6, 2024
@kayhantolga kayhantolga modified the milestones: 8.0.2, 8.0.4 Apr 15, 2024
@kayhantolga kayhantolga removed this from the 8.4.3 milestone Oct 10, 2024

2 participants