Skip to content

yanivle/fast_minbpe

Repository files navigation

A minimal, clean, and fast Python implementation of the BPE algorithm.

I just watched Karpathy's BPE video (thanks Andrej for all the fun videos!) and got inspired. Karpathy's code is really nice (as always), and his implementation is obviously not meant to be fast, but with the right data structures, we should be able to make the code much faster without sacrificing clarity too much. Hopefully, faster code can make experimentation (e.g. with different scores, instead of always taking the most popular pair) easier. So... here's my late night take on a simple, clean, and fast BPE implementation!

For example, on my laptop, training/tokenizing the taylorswift.txt file with a vocab size of 10K takes:

minbpe (Karpathy's) fast_minbpe (this repo)
Training 110.10 secs 1.00 secs
Tokenizing 190.91 secs 0.52 secs

So 100X faster training in this case.

Training a vocab size of 100K on Swann's Way (~1MB) takes 8.15 seconds, on the bible (~4.3MB) 24.49 seconds, and on a corpus of reddit jokes (~69MB) just under 5 minutes.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published