
LLM is worse at non-English languages #92

Open
7CD opened this issue Nov 10, 2024 · 0 comments

7CD commented Nov 10, 2024

In his YouTube video on tokenization, Andrej notes that LLMs are worse at non-English languages, partly because of tokenization. Basically, for less represented languages, even frequent character pairs appear less often in the corpus than most English pairs do. Hence, BPE performs fewer merges for these languages, and their token representations end up being lengthy.
Wouldn't it be a good idea to build a separate token vocabulary for each language, and for distinct domains, e.g., Python code?
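To make the effect concrete, here is a minimal sketch (my own illustration, using the third-party tiktoken package rather than anything in this repo) that compares token counts for the same sentence in English and Russian under a vocabulary trained mostly on English-heavy data:

```python
# Illustration of the effect: the same sentence, tokenized with a
# predominantly English-trained BPE vocabulary (cl100k_base), typically
# needs more tokens per character in a less represented language.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The weather is nice today and we are going for a walk."
russian = "Сегодня хорошая погода, и мы идём на прогулку."  # same meaning

for label, text in [("English", english), ("Russian", russian)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens "
          f"({len(tokens) / len(text):.2f} tokens/char)")
```

The exact counts depend on the vocabulary, but the Russian sentence ends up with noticeably more tokens per character, which means shorter effective context and more compute per unit of text.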
