Andrej noted in his YouTube video that LLMs perform worse on non-English languages, partly because of tokenization. For under-represented languages, even frequent character pairs occur less often in the training corpus than most English pairs, so BPE learns fewer merges for them and their token sequences end up much longer.
Wouldn't it be a good idea to build separate token vocabularies per language, and for distinct domains, e.g., Python code?
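To make the frequency argument concrete, here is a toy, self-contained byte-level BPE sketch (not this repo's tokenizer; the corpora are placeholder strings I made up). With an English-dominated corpus, nearly all of the merge budget is spent on English byte pairs, so a sample from the under-represented language stays close to raw UTF-8 bytes; a tokenizer trained only on that language compresses the same sample far better.

```python
from collections import Counter

def get_stats(ids):
    """Count occurrences of each adjacent token pair."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new token id, in the order learned
    for new_id in range(256, 256 + num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = stats.most_common(1)[0][0]  # most frequent pair wins
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges

def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

# English-heavy corpus with a small amount of Korean (under-represented)
english = "the quick brown fox jumps over the lazy dog " * 50
korean = "빠른 갈색 여우가 게으른 개를 뛰어넘는다 " * 2
mixed_merges = train_bpe(english + korean, 50)   # shared, English-dominated vocab
korean_merges = train_bpe(korean * 25, 50)       # per-language vocab, same budget

sample = "게으른 개"
print(len(encode(sample, mixed_merges)))   # long: few or no Korean pairs got merged
print(len(encode(sample, korean_merges)))  # short: merges tailored to Korean
```

This is the same trade-off the question is about: a shared vocabulary allocates merges by global frequency, so whatever dominates the corpus wins, while per-language or per-domain vocabularies spend the full budget on the text they actually need to compress.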