A minimal, clean, and fast Python implementation of the BPE algorithm.
I just watched Karpathy's BPE video (thanks Andrej for all the fun videos!) and got inspired. Karpathy's code is really nice (as always), and his implementation is obviously not meant to be fast, but with the right data structures, we should be able to make the code much faster without sacrificing clarity too much. Hopefully, faster code can make experimentation (e.g. with different scores, instead of always taking the most popular pair) easier. So... here's my late night take on a simple, clean, and fast BPE implementation!
For example, on my laptop, training/tokenizing the taylorswift.txt file with a vocab size of 10K takes:
minbpe (Karpathy's) | fast_minbpe (this repo) | |
---|---|---|
Training | 110.10 secs | 1.00 secs |
Tokenizing | 190.91 secs | 0.52 secs |
So 100X faster training in this case.
Training a vocab size of 100K on Swann's Way (~1MB) takes 8.15 seconds, on the bible (~4.3MB) 24.49 seconds, and on a corpus of reddit jokes (~69MB) just under 5 minutes.
- bpe.py - the BPE impl.
- fast_minbpe.ipynb - brief analysis and results.
- datastructures/ - IndexedList and Multiset (the data structures that make this fast).
- my website - short writeup.