
Why fast tokenizer is disabled? #301

Closed

dyang415 opened this issue Jun 24, 2024 · 2 comments

@dyang415

Hi there, nice work on InternVL! We're really impressed by the new InternVL-V1.5.

One thing we noticed is that the backing language model internlm/internlm2-chat-20b ships with a fast tokenizer (https://huggingface.co/internlm/internlm2-chat-20b/blob/main/tokenizer_config.json#L89), but in InternVL the fast tokenizer entry was removed (https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5/blob/main/tokenizer_config.json#L162). Is there a specific reason the fast tokenizer isn't enabled?
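For reference, a minimal sketch (using the standard Hugging Face `AutoTokenizer` API) showing which tokenizer the released checkpoint actually loads; the `is_fast` attribute reports whether a Rust-backed fast tokenizer is in use:

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the InternVL checkpoint; trust_remote_code
# is needed because the InternLM2 tokenizer class lives in the repo itself.
tok = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL-Chat-V1-5", trust_remote_code=True
)

# Prints False here, since the fast tokenizer entry was removed from
# tokenizer_config.json.
print(type(tok).__name__, tok.is_fast)
```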

@dyang415 dyang415 changed the title Why faster tokenizer is disabled? Why fast tokenizer is disabled? Jun 24, 2024
@Weiyun1025
Collaborator

We previously found that the tokenization results of the fast tokenizer sometimes differed from those of the slow tokenizer. Since the speed benefit of the fast tokenizer is not significant in our scenario, we decided not to use it, to ensure the correctness of the code.
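A minimal sketch of the kind of consistency check this refers to, assuming the standard Hugging Face `AutoTokenizer` API; the sample text is illustrative:

```python
from transformers import AutoTokenizer

# The base LM, which still ships a fast tokenizer alongside the slow one.
path = "internlm/internlm2-chat-20b"

slow = AutoTokenizer.from_pretrained(path, use_fast=False, trust_remote_code=True)
fast = AutoTokenizer.from_pretrained(path, use_fast=True, trust_remote_code=True)

text = "Hello, InternVL! 你好"
ids_slow = slow(text).input_ids
ids_fast = fast(text).input_ids

# If the two ever disagree, the fast tokenizer is not a safe drop-in
# replacement for this model.
print(ids_slow == ids_fast)
print(ids_slow, ids_fast)
```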

@czczup
Member

czczup commented Jul 30, 2024

You can turn it back on yourself when using the model.
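If you do want the fast tokenizer, a sketch of how to opt back in via `use_fast=True`; this assumes the fast tokenizer files (e.g. `tokenizer.json`) are available, or convertible, for the checkpoint you use:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL-Chat-V1-5",
    trust_remote_code=True,
    use_fast=True,  # request the Rust-backed fast tokenizer explicitly
)
# Worth verifying its outputs against the slow tokenizer before relying
# on it, given the mismatches mentioned above.
```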

@czczup czczup closed this as completed Jul 30, 2024