GitHub

A minimal, clean, and fast Python implementation of the BPE algorithm.

I just watched Karpathy's BPE video (thanks Andrej for all the fun videos!) and got inspired. Karpathy's code is really nice (as always), and his implementation is obviously not meant to be fast, but with the right data structures, we should be able to make the code much faster without sacrificing clarity too much. Hopefully, faster code can make experimentation (e.g. with different scores, instead of always taking the most popular pair) easier. So... here's my late night take on a simple, clean, and fast BPE implementation!

For example, on my laptop, training/tokenizing the taylorswift.txt file with a vocab size of 10K takes:

	minbpe (Karpathy's)	fast_minbpe (this repo)
Training	110.10 secs	1.00 secs
Tokenizing	190.91 secs	0.52 secs

So 100X faster training in this case.

Training a vocab size of 100K on Swann's Way (~1MB) takes 8.15 seconds, on the bible (~4.3MB) 24.49 seconds, and on a corpus of reddit jokes (~69MB) just under 5 minutes.

bpe.py - the BPE impl.
fast_minbpe.ipynb - brief analysis and results.
datastructures/ - IndexedList and Multiset (the data structures that make this fast).
my website - short writeup.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
data		data
datastructures		datastructures
util		util
.gitignore		.gitignore
README.md		README.md
bke.py		bke.py
bpe.py		bpe.py
bpe_less_greedy.ipynb		bpe_less_greedy.ipynb
bpe_less_greedy.py		bpe_less_greedy.py
bpe_vs_bke_vs_bxe.ipynb		bpe_vs_bke_vs_bxe.ipynb
bxe_with_score.py		bxe_with_score.py
fast_minbpe.ipynb		fast_minbpe.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

yanivle/fast_minbpe

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages