GPT2-Nepali (Pretrained from scratch) #485
-
Awesome project! I don't speak Nepali, unfortunately, but I find this super interesting as a case study for adapting LLMs to new languages, scripts, structures, etc. May I ask what tool you used for training the tokenizer?

PS: I have some code for implementing and training a BPE tokenizer from scratch (this was one of the outtakes that didn't fit into the book / was way too long for chapter 2). I have ample notes for it but totally forgot to upload it as part of the bonus materials, thanks for reminding me. I will probably share it here in a few days. It's not meant for efficiency but more for educational purposes.
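To give a rough idea of what such an educational from-scratch BPE implementation involves (this is only a minimal sketch of the general technique, not the notebook mentioned above), the core training loop can be written as a greedy merge of the most frequent adjacent symbol pair:

```python
from collections import Counter

def get_pair_counts(word_freqs):
    # Count how often each adjacent symbol pair occurs across the corpus
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # Replace every occurrence of `pair` with the fused symbol
    merged = {}
    for symbols, freq in word_freqs.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged[tuple(new_symbols)] = freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from characters: each word is a tuple of single-character symbols
    word_freqs = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges

# Toy usage; works for Nepali as well, since it operates on Unicode characters
print(train_bpe("नमस्ते नमस्ते संसार", num_merges=5))
```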
-
Hi @Aananda-giri, this is fantastic work! I'm particularly interested in the tokenizer comparison. Could you share any insights on how the performance of your new BPE tokenizer compares to the original GPT-2 pretrained tokenizer? Did the BPE tokenizer lead to more accurate text generation or better understanding of Nepali nuances?
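For context, one common way to quantify the difference (this is only a sketch of the usual approach, not code from the project; the tokenizer file name is a placeholder) is to compare how many tokens each tokenizer needs for the same Nepali sentence:

```python
import tiktoken                     # original GPT-2 BPE tokenizer
from tokenizers import Tokenizer    # Hugging Face `tokenizers` library

# Original GPT-2 tokenizer (trained mostly on English text)
gpt2_tok = tiktoken.get_encoding("gpt2")

# Custom Nepali BPE tokenizer -- the file path is an assumption
nepali_tok = Tokenizer.from_file("tokenizer.json")

text = "नेपाल एक सुन्दर देश हो।"   # "Nepal is a beautiful country."

gpt2_ids = gpt2_tok.encode(text)
nepali_ids = nepali_tok.encode(text).ids

# Fewer tokens per character means longer effective context
# and subwords that align better with the target language.
print(f"GPT-2 tokenizer:  {len(gpt2_ids)} tokens for {len(text)} characters")
print(f"Nepali tokenizer: {len(nepali_ids)} tokens for {len(text)} characters")
```

Since the original GPT-2 vocabulary contains few Devanagari merges, it tends to fall back to byte-level pieces and typically needs several times more tokens per sentence, which shortens the effective context window for Nepali text.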
-
Hi everyone! 👋
I’m excited to share my recent project: GPT2-Nepali, a GPT-2 model pretrained from scratch for the Nepali language. It builds on the GPT-2 training code detailed in Build a Large Language Model (From Scratch), adapted specifically for Nepali.
Project Highlights:
🔗 Chat Interface: GPT2-Nepali Chat Interface on Hugging Face
📦 Pre-Trained Model: GPT2-Nepali on Hugging Face
💻 Training Code: GitHub Repository
📊 Dataset: 12GB of Nepali text derived from the NepBERTa project.
Modifications from Original Code
1️⃣ Tokenizer: a new BPE tokenizer trained for Nepali, replacing the original GPT-2 tokenizer (a sketch of one possible setup follows after this list).
2️⃣ Dataloader: adapted to the Nepali training corpus (a sketch follows after this list).
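For readers curious about the tokenizer step, here is a minimal sketch of how a Nepali BPE tokenizer can be trained with the Hugging Face tokenizers library; the corpus file name, vocabulary size, and special tokens below are assumptions rather than the project's exact settings:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE, similar in spirit to the original GPT-2 tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                  # assumed; chosen to match the model's embedding size
    special_tokens=["<|endoftext|>"],   # same end-of-text token as GPT-2
)

# "nepali_corpus.txt" is a placeholder for the 12GB Nepali text corpus
tokenizer.train(files=["nepali_corpus.txt"], trainer=trainer)
tokenizer.save("nepali_tokenizer.json")
```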
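And a sketch of the dataloader side, following the sliding-window GPTDatasetV1 pattern from the book but taking an already-tokenized ID stream; the class name and parameters here are illustrative, not the project's exact implementation:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NepaliGPTDataset(Dataset):
    """Sliding-window dataset over a pre-tokenized stream of token IDs."""
    def __init__(self, token_ids, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        # Chunk the token stream into overlapping windows;
        # targets are the inputs shifted by one position (next-token prediction).
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# Usage (token_ids would come from the Nepali tokenizer above):
# dataset = NepaliGPTDataset(token_ids, max_length=1024, stride=1024)
# loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)
```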
A huge thank you to @rasbt for the inspiration and for writing such an incredible resource—easily the best book on LLMs I’ve ever read!