
Extending Llama 3 Tokenizer to a New Language

This repository contains code for extending the Llama 3 tiktoken-based tokenizer to any new language.

About Llama 3 Tokenizer

  • Uses tiktoken instead of SentencePiece (a loading sketch follows this list)
  • 128,256-token vocabulary (128,000 BPE tokens plus 256 special tokens)
  • Supports multiple languages, including Hindi
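
For reference, the base tokenizer can be loaded directly with tiktoken. A minimal sketch, assuming a locally downloaded tokenizer.model in Llama 3's tiktoken format; the split pattern is the one published in Meta's llama3 reference code, and only two of the 256 special tokens are shown:

import tiktoken
from tiktoken.load import load_tiktoken_bpe

# Path is an assumption; point it at your downloaded Llama 3 tokenizer.model.
mergeable_ranks = load_tiktoken_bpe("tokenizer.model")  # 128,000 base BPE ranks

# Split pattern from Meta's llama3 reference implementation.
pat_str = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

# Llama 3 reserves 256 special tokens; only the two used below are shown.
num_base = len(mergeable_ranks)
special_tokens = {"<|begin_of_text|>": num_base, "<|end_of_text|>": num_base + 1}

enc = tiktoken.Encoding(
    name="llama3_base",
    pat_str=pat_str,
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)
print(enc.n_vocab)  # base vocab plus the special tokens registered above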

Building a New Tokenizer

  1. Generate Target Language Corpus

    • Collect diverse, representative text in your target language
  2. Train a SentencePiece Tokenizer

    • Use the SentencePiece library on your new-language data; we use SentencePiece because it works well for most languages (./multilingual/hindi/train_tokenizer.py). A training sketch follows this list.
  3. Merge with Base Tokenizer

    • Load the Llama 3 tokenizer and integrate the new-language tokens (extend_tokenizer_llama3.py); a merge sketch follows the project structure below.
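
A minimal training sketch in the spirit of train_tokenizer.py; the file paths, vocabulary size, and options below are illustrative assumptions rather than the script's exact settings:

import sentencepiece as spm

# Train a BPE model on the Hindi corpus (paths and sizes are assumptions).
spm.SentencePieceTrainer.train(
    input="multilingual/hindi/data/corpus.txt",  # one sentence per line
    model_prefix="multilingual/hindi/tokenizer_model/hindi",
    model_type="bpe",
    vocab_size=16000,
    character_coverage=1.0,  # keep every Devanagari character
)

# Sanity check: load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(
    model_file="multilingual/hindi/tokenizer_model/hindi.model"
)
print(sp.encode("मैं उसे एक ऐसा प्रस्ताव दूँगा", out_type=str))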

Project Structure

.
├── extend_tokenizer_llama3.py
└── multilingual/
    └── hindi/
        ├── data/
        ├── tokenizer_model/
        └── train_tokenizer.py
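
The merge in step 3 roughly amounts to appending the new SentencePiece pieces to the base tiktoken ranks and rebuilding the Encoding. A minimal sketch under the same assumptions as the earlier sketches; extend_tokenizer_llama3.py may differ in detail, and in practice the ordering of merges matters for tiktoken's BPE to actually reach the new long tokens:

import sentencepiece as spm
import tiktoken
from tiktoken.load import load_tiktoken_bpe

base_ranks = load_tiktoken_bpe("tokenizer.model")  # Llama 3 base ranks
sp = spm.SentencePieceProcessor(
    model_file="multilingual/hindi/tokenizer_model/hindi.model"
)

# Same split pattern as the base tokenizer (see the loading sketch above).
pat_str = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

# Append each new piece as raw bytes with the next free rank.
extended = dict(base_ranks)
next_rank = len(extended)
for i in range(sp.get_piece_size()):
    piece = sp.id_to_piece(i)
    if piece in ("<unk>", "<s>", "</s>"):  # skip SentencePiece control pieces
        continue
    # SentencePiece marks word-initial pieces with '▁'; map it to a space.
    token_bytes = piece.replace("▁", " ").encode("utf-8")
    if token_bytes not in extended:
        extended[token_bytes] = next_rank
        next_rank += 1

ext_enc = tiktoken.Encoding(
    name="llama3_hindi",
    pat_str=pat_str,
    mergeable_ranks=extended,
    # Special-token ids shift to sit after the enlarged vocabulary.
    special_tokens={"<|begin_of_text|>": next_rank, "<|end_of_text|>": next_rank + 1},
)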

Llama 3 vs Extended Tokenizer

Consider the Hindi sentence:

मैं उसे एक ऐसा प्रस्ताव दूँगा जिसे वह अस्वीकार नहीं कर सकेगा।

Number of words = 13 (counting the sentence-final danda "।" separately)

Llama 3 tokenizer:

Target Tokens: ['म', 'ैं', ' उस', 'े', ' एक', ' ऐस', 'ा', ' प', '्रस', '्त', 'ाव', ' द', 'ूँ', 'ग', 'ा', ' ज', 'िस', 'े', ' वह', ' अस', '्व', 'ीक', 'ार', ' नह', 'ीं', ' कर', ' सक', 'ेग', 'ा।']

Target tokens produced: 30 tokens
Fertility score ≈ 28/13 ≈ 2.15 tokens per word (counting the 28 text tokens, excluding special tokens)

Extended tokenizer:

Target Tokens: ['मैं', ' उसे', ' एक', ' ऐसा', ' प्रस्ताव', ' दू', 'ँ', 'गा', ' जिसे', ' वह', ' अस', '्वी', 'कार', ' नहीं', ' कर', ' सकेगा', '।']

Target tokens produced: 18 tokens
Fertility score ≈ 16/13 ≈ 1.23 tokens per word (counting the 16 text tokens, excluding special tokens)
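
The comparison can be reproduced with a small helper, assuming the enc and ext_enc encodings built in the sketches above. Note that encode() here adds no special tokens and words are split on whitespace, so the counts may differ slightly from the hand counts above:

def fertility(encoding, sentence: str) -> float:
    """Tokens per whitespace-separated word, ignoring special tokens."""
    return len(encoding.encode(sentence)) / len(sentence.split())

sentence = "मैं उसे एक ऐसा प्रस्ताव दूँगा जिसे वह अस्वीकार नहीं कर सकेगा।"
for name, encoding in (("base", enc), ("extended", ext_enc)):
    tokens = encoding.encode(sentence)  # encode() adds no special tokens by default
    print(name, len(tokens), [encoding.decode([t]) for t in tokens])
    print(name, "fertility:", round(fertility(encoding, sentence), 2))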

Key Observations

  1. Efficiency and Fertility: The new tokenizer produces 40% fewer tokens on this sentence (18 vs. 30) and has a lower fertility score, producing fewer subword units per word on average.

  2. Improved Segmentation: The new tokenizer demonstrates better understanding of Hindi morphology:

  • It keeps more words intact: "मैं", "उसे", "एक", "ऐसा", "जिसे", "वह"
  • It makes more linguistically sensible splits: "प्रस्ताव" instead of "प", "्रस", "्त", "ाव"
  3. Diacritic Handling: The new tokenizer handles diacritical marks more effectively, often keeping them attached to the base character (e.g., "दू" instead of "द", "ूँ").

Overall, this improves both training and inference: with fewer tokens per sentence, more meaningful content fits within the model's context window.
