I'm running into a potentially serious problem with tokenization for indic-bert. It seems that all vowel matras (dependent vowel signs) are dropped during tokenization, which loses a lot of information about the word. Perhaps this is some sort of Unicode normalization issue?
Minimal example (prints `True`) where two very different words get treated as the same token.
`bert-base-multilingual-cased` does not have this issue.
Is this an issue on my end? I have this problem on Colab and on my machine (Mac, Python 3.8.8). @nitinvwaran also has this issue. I had to install `sentencepiece` to get the tokenizer to work btw.
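To make the suspected Unicode angle concrete, here is a rough standard-library sketch of one way matras could disappear: an accent-stripping step that decomposes text with NFKD and then drops every character in a Unicode Mark category. This is not the minimal example referred to above, and it does not call the indic-bert tokenizer; the helper `strip_marks` and the two Hindi words are hypothetical, chosen only for illustration. Whether indic-bert's preprocessing actually does this is an assumption I have not verified.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical stand-in for an accent-stripping normalizer.

    Decomposes with NFKD, then drops every character whose Unicode
    general category is a Mark (Mn, Mc or Me). Devanagari vowel signs
    (matras) fall into these categories, so they are removed too.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )


# Illustrative words only, written as escapes so the code points are explicit.
word_a = "\u0915\u0932\u093e"  # कला: KA, LA, VOWEL SIGN AA
word_b = "\u0915\u0940\u0932"  # कील: KA, VOWEL SIGN II, LA

print(word_a == word_b)                            # False
print(strip_marks(word_a) == strip_marks(word_b))  # True

# List the code points of one word to see which characters are marks.
for ch in word_b:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

If the real tokenizer output matches what `strip_marks` produces for the same input, that would point to a normalization step rather than the vocabulary itself.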