Andrej noted in his YouTube video that LLMs perform worse on non-English languages, partly because of tokenization. For under-represented languages, even frequent character pairs occur less often in the training corpus than most English pairs, so BPE learns fewer merges for them and their token sequences end up much longer.
Wouldn't it be a good idea to build separate token vocabularies per language, and for distinct domains, e.g., Python code?
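To make the frequency argument concrete, here is a toy, self-contained byte-level BPE sketch (not this repo's tokenizer; the corpora are placeholder strings I made up). With an English-dominated corpus, nearly all of the merge budget is spent on English byte pairs, so a sample from the under-represented language stays close to raw UTF-8 bytes; a tokenizer trained only on that language compresses the same sample far better.

```python
from collections import Counter

def get_stats(ids):
    """Count occurrences of each adjacent token pair."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new token id, in the order learned
    for new_id in range(256, 256 + num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = stats.most_common(1)[0][0]  # most frequent pair wins
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges

def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

# English-heavy corpus with a small amount of Korean (under-represented)
english = "the quick brown fox jumps over the lazy dog " * 50
korean = "빠른 갈색 여우가 게으른 개를 뛰어넘는다 " * 2
mixed_merges = train_bpe(english + korean, 50)   # shared, English-dominated vocab
korean_merges = train_bpe(korean * 25, 50)       # per-language vocab, same budget

sample = "게으른 개"
print(len(encode(sample, mixed_merges)))   # long: few or no Korean pairs got merged
print(len(encode(sample, korean_merges)))  # short: merges tailored to Korean
```

This is the same trade-off the question is about: a shared vocabulary allocates merges by global frequency, so whatever dominates the corpus wins, while per-language or per-domain vocabularies spend the full budget on the text they actually need to compress.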