Hi,

I faced an issue with the tokenizer while tokenizing text that starts with `q`. I found that `q` is missing from the `vocab.txt` file (`Q` is present).

Simple fix I tried: add `q` to the tokenizer with the Hugging Face `add_tokens` method, but it failed to produce exact/correct tokenization. Here `n` should be `##n`, but because `q` is added as a separate token, the tokenizer splits it off on its own, which is not a correct solution down the line.
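A rough sketch of that attempt (the checkpoint path is a placeholder for the affected model, and the sample word is only illustrative):

```python
from transformers import AutoTokenizer

# Placeholder path for the affected checkpoint (the one whose vocab.txt lacks "q").
tokenizer = AutoTokenizer.from_pretrained("path/to/affected-model")

# Register "q" as a standalone added token.
tokenizer.add_tokens(["q"])

# The input is now split around the added "q" before WordPiece runs, so the
# characters after it are tokenized as a fresh word and lose the "##"
# continuation prefix (e.g. "n" instead of "##n").
print(tokenizer.tokenize("quartz"))
```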
Solution suggested:
Add `q` to the `vocab.txt` file itself; that way the tokenization comes out correct. (I appended it at the end of the `vocab.txt` file and updated the model's embedding size. I'm not sure how the model will behave downstream; yet to test.)

I hope you will release an updated tokenizer `vocab.txt` file with the token `q` added.
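For reference, roughly what I did (paths and the model class are placeholders; the new embedding row is randomly initialised and untested):

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder path to a local copy of the affected checkpoint.
MODEL_DIR = "path/to/affected-model"

# 1. Append the missing token to the end of vocab.txt so WordPiece can match a
#    leading "q" and keep emitting "##"-prefixed continuation pieces after it.
with open(f"{MODEL_DIR}/vocab.txt", "a", encoding="utf-8") as f:
    f.write("q\n")

# 2. Reload the tokenizer; the slow tokenizer reads vocab.txt directly
#    (a fast tokenizer's tokenizer.json, if present, would take precedence).
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)

# 3. Grow the model's token-embedding matrix to the new vocab size. The extra
#    row is randomly initialised, which is why the downstream effect is untested.
model = AutoModel.from_pretrained(MODEL_DIR)
model.resize_token_embeddings(len(tokenizer))
```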