Skip to content

Latest commit

 

History

History
20 lines (13 loc) · 1.7 KB

README.md

File metadata and controls

20 lines (13 loc) · 1.7 KB

A minimal, clean, and fast Python implementation of the BPE algorithm.

I just watched Karpathy's BPE video (thanks Andrej for all the fun videos!) and got inspired. Karpathy's code is really nice (as always), and his implementation is obviously not meant to be fast, but with the right data structures, we should be able to make the code much faster without sacrificing clarity too much. Hopefully, faster code can make experimentation (e.g. with different scores, instead of always taking the most popular pair) easier. So... here's my late night take on a simple, clean, and fast BPE implementation!

For example, on my laptop, training/tokenizing the taylorswift.txt file with a vocab size of 10K takes:

minbpe (Karpathy's) fast_minbpe (this repo)
Training 110.10 secs 1.00 secs
Tokenizing 190.91 secs 0.52 secs

So 100X faster training in this case.

Training a vocab size of 100K on Swann's Way (~1MB) takes 8.15 seconds, on the bible (~4.3MB) 24.49 seconds, and on a corpus of reddit jokes (~69MB) just under 5 minutes.