Tokenization issues #26

aryamanarora · 2021-06-03T15:28:58Z

I'm running into a troublesome issue with the indic-bert tokenizer. It seems that all vowel matras (diacritics) are dropped during tokenization, which loses a lot of information about the word. Perhaps this is some sort of Unicode issue?

Minimal example (prints True) where two very different words produce the same token sequence:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))

bert-base-multilingual-cased does not have this issue.

Is this an issue on my end? I have this problem on Colab and on my machine (Mac, Python 3.8.8). @nitinvwaran also has this issue. I had to install sentencepiece to get the tokenizer to work btw.


aryamanarora · 2021-06-03T20:24:52Z

Found a fix thanks to @pranavmaneriker.

import transformers
-tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
+tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह")) # prints False

Would be nice to mention this in the README.
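
For anyone wondering why `keep_accents` matters for Devanagari, here is a rough sketch using only the standard library. It assumes (this is not confirmed anywhere in this thread) that the accent-stripping step behaves like "NFKD-normalize, then drop every Unicode mark character". `strip_marks` is a hypothetical helper written for illustration, not a function from transformers or sentencepiece.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical helper: NFKD-normalize, then drop every Unicode mark (Mn, Mc, Me).

    This only imitates what an accent-stripping step might do; it is an
    assumption for illustration, not the tokenizer's actual code.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


yahan = "\u092f\u0939\u093e\u0901"  # यहाँ
yah = "\u092f\u0939"  # यह

# Show which code points in यहाँ are marks: the matra and the candrabindu
# have categories starting with "M", the two consonants do not.
for ch in yahan:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Once the marks are removed, the two words are identical.
print(strip_marks(yahan) == yah)
```

If the tokenizer does something along these lines, the vowel sign and the candrabindu are treated like Latin accents and removed before the sentencepiece model ever sees them, which would match the behaviour in the minimal example.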

gowtham1997 · 2021-06-09T16:16:43Z

Sorry for the late reply.
Added this to the README and referenced your issue.

Thanks

gowtham1997 closed this as completed Jun 9, 2021

gowtham1997 mentioned this issue Jan 8, 2022

Tokenization doesn't preserve diacritics #40

Closed
