'[UNK]' during tokenization when word starts with 'q' #34

shubhanshu786 · 2021-05-20T02:41:16Z

Hi,
I ran into an issue with the tokenizer when tokenizing text that starts with 'q'. The lowercase q is missing from the vocab.txt file (uppercase Q is present).

tokenizer.tokenize('q qnm')
['[UNK]', '[UNK]']

A simple fix I tried: add q to the tokenizer with the add_tokens method (Hugging Face), but this does not produce the correct tokenization.

tokenizer.add_tokens(['q'])
tokenizer.tokenize('q qnm')
['q', 'q', 'n', '##m']

Here n should be ##n. Because q was added as a separate token, the tokenizer splits it off as its own token and tokenizes the rest of the word as if it were a new word. This is not a correct solution down the line.
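To make the difference concrete, here is a minimal pure-Python sketch of greedy WordPiece matching with an optional "added tokens" pre-split step. It is a simplified model of the behavior described in this issue, not the library's actual code: the names wordpiece and tokenize and the tiny vocab are assumptions made up for illustration.

```python
import re


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # one unmatched position makes the whole word unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab, added_tokens=()):
    """Cut added tokens out of each word first, then WordPiece what is left."""
    pattern = None
    if added_tokens:
        pattern = "(" + "|".join(re.escape(t) for t in added_tokens) + ")"
    tokens = []
    for word in text.split():
        segments = re.split(pattern, word) if pattern else [word]
        for seg in segments:
            if not seg:
                continue
            if seg in added_tokens:
                tokens.append(seg)  # emitted as-is, never merged with neighbours
            else:
                tokens.extend(wordpiece(seg, vocab))  # treated as a fresh word
    return tokens


# Hypothetical toy vocab: lowercase q is missing, as in the report.
vocab = {"Q", "n", "m", "##n", "##m"}

missing = tokenize("q qnm", vocab)
with_added = tokenize("q qnm", vocab, added_tokens=("q",))
in_vocab = tokenize("q qnm", vocab | {"q"})
```

In this sketch, missing reproduces the two [UNK] tokens, with_added reproduces the 'n' without the ## prefix (the remainder "nm" is tokenized as a new word), and in_vocab gives the ##n and ##m continuation pieces.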

Suggested solution:
Add q to the vocab.txt file so that tokenization comes out correctly. (I appended it to the end of vocab.txt and updated the model's embedding size. I am not sure how this will behave with the model down the line; I have yet to test it.)

tokenizer.tokenize('q qnm')
['q', 'q', '##n', '##m']
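For anyone who wants to try the same workaround, the vocab edit can be sketched as below. This assumes vocab.txt holds one token per line and that a token's id is its zero-based line number; append_token is a hypothetical helper name, not part of any library.

```python
from pathlib import Path


def append_token(vocab_path, token):
    """Append `token` as the last line of a one-token-per-line vocab file.

    Returns the token's index (zero-based line number). If the token is
    already present, the file is left untouched and its index is returned.
    """
    path = Path(vocab_path)
    lines = path.read_text(encoding="utf-8").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty entry produced by a trailing newline
    if token in lines:
        return lines.index(token)
    lines.append(token)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines) - 1
```

After reloading the tokenizer from the edited file, the model's embedding matrix needs one extra row. In Hugging Face transformers this is done with model.resize_token_embeddings(len(tokenizer)). The new row is freshly initialized rather than trained, so the embedding for q carries no learned meaning until the model is fine-tuned.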

I hope you will release an updated tokenizer vocab.txt file that includes the token q.
