
Why fast tokenizer is disabled? #301

Closed

dyang415 opened this issue Jun 24, 2024 · 2 comments

@dyang415

Hi there, nice work on InternVL! We're really impressed by the new InternVL-V1.5.

One thing we noticed is that the backing language model internlm/internlm2-chat-20b ships with a fast tokenizer (https://huggingface.co/internlm/internlm2-chat-20b/blob/main/tokenizer_config.json#L89), but in InternVL the fast tokenizer entry was removed (https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5/blob/main/tokenizer_config.json#L162). Is there a specific reason the fast tokenizer isn't enabled?
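For reference, a minimal sketch (using the standard Hugging Face `AutoTokenizer` API) showing which tokenizer the released checkpoint actually loads; the `is_fast` attribute reports whether a Rust-backed fast tokenizer is in use:

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the InternVL checkpoint; trust_remote_code
# is needed because the InternLM2 tokenizer class lives in the repo itself.
tok = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL-Chat-V1-5", trust_remote_code=True
)

# Prints False here, since the fast tokenizer entry was removed from
# tokenizer_config.json.
print(type(tok).__name__, tok.is_fast)
```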

@dyang415 dyang415 changed the title Why faster tokenizer is disabled? Why fast tokenizer is disabled? Jun 24, 2024
@Weiyun1025
Collaborator

We previously found that the tokenization results of the fast tokenizer sometimes differed from those of the slow tokenizer. Since the speed benefit of the fast tokenizer is not significant in our scenario, we decided not to use it, to ensure the correctness of the code.
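A minimal sketch of the kind of consistency check this refers to, assuming the standard Hugging Face `AutoTokenizer` API; the sample text is illustrative:

```python
from transformers import AutoTokenizer

# The base LM, which still ships a fast tokenizer alongside the slow one.
path = "internlm/internlm2-chat-20b"

slow = AutoTokenizer.from_pretrained(path, use_fast=False, trust_remote_code=True)
fast = AutoTokenizer.from_pretrained(path, use_fast=True, trust_remote_code=True)

text = "Hello, InternVL! 你好"
ids_slow = slow(text).input_ids
ids_fast = fast(text).input_ids

# If the two ever disagree, the fast tokenizer is not a safe drop-in
# replacement for this model.
print(ids_slow == ids_fast)
print(ids_slow, ids_fast)
```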

@czczup
Member

czczup commented Jul 30, 2024

You can turn it back on yourself when using the model.
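If you do want the fast tokenizer, a sketch of how to opt back in via `use_fast=True`; this assumes the fast tokenizer files (e.g. `tokenizer.json`) are available, or convertible, for the checkpoint you use:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL-Chat-V1-5",
    trust_remote_code=True,
    use_fast=True,  # request the Rust-backed fast tokenizer explicitly
)
# Worth verifying its outputs against the slow tokenizer before relying
# on it, given the mismatches mentioned above.
```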

@czczup czczup closed this as completed Jul 30, 2024