Unofficial Rust implementation of SuperBPE: Space Travel for Language Models. This is just for fun and may contain mistakes; use with caution.
- OS: macOS Sequoia 15.3.2
- Chip: Apple M2 Max
To install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain=1.79.0 -y

Open a new terminal so the PATH picks up the Rust installation, then check it by running:

rustc --version

Set up the Python environment, build the Rust extension, and run the example:

conda create -n bpe python=3.10
conda activate bpe
git clone https://github.com/willxxy/superbpe.git
pip install -r requirements.txt
cd bpe
maturin develop --release
python main.py
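If `maturin develop --release` completed without errors, the compiled extension should be importable from the `bpe` environment. Here is a quick, hedged sanity check; the module name `superbpe` is an assumption and the name maturin actually builds for this repo may differ:

```python
# Hypothetical sanity check; adjust the module name to whatever maturin actually builds.
import importlib.util

spec = importlib.util.find_spec("superbpe")  # assumed module name, may differ in this repo
if spec is not None:
    print("Rust extension found at:", spec.origin)
else:
    print("Rust extension not found; re-run `maturin develop --release` in the crate directory.")
```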
For now, here are some results on a small dataset (Pride and Prejudice, 400,000 characters).
- BPE training time: 24.97 seconds
- SuperBPE training time: 8261.13 seconds
The final vocab size is ≈ 256 + successful_merges (256 base byte tokens plus one token per successful merge). These results use 5,000 merges, with the transition point, i.e. the merge count at which whitespace pretokenization is lifted so merges may cross word boundaries, set to 3,000 merges.
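To make the transition point concrete, here is a minimal, hedged Python sketch of the two-stage idea (ordinary byte-level BPE below the transition, whitespace-crossing merges allowed above it). The function name and the simplified whitespace rule are illustrative assumptions and do not reflect this repo's actual Rust implementation.

```python
from collections import Counter

def train_superbpe_toy(text: str, num_merges: int, transition: int):
    """Toy two-stage SuperBPE-style training (not this repo's API).

    Stage 1 (merge index < transition): pairs whose right-hand token starts
    with a space are skipped, a rough stand-in for whitespace pretokenization.
    Stage 2 (merge index >= transition): the restriction is lifted, so
    "superword" tokens spanning spaces can form.
    """
    # Start from raw bytes: 256 base tokens.
    seq = [bytes([b]) for b in text.encode("utf-8")]
    vocab = {bytes([i]) for i in range(256)}

    for merge_idx in range(num_merges):
        allow_space = merge_idx >= transition  # stage 2: allow cross-space merges
        counts = Counter()
        for a, b in zip(seq, seq[1:]):
            if not allow_space and b.startswith(b" "):
                continue  # stage 1: do not merge into the next word's leading space
            counts[(a, b)] += 1
        if not counts:
            break
        (a, b), _ = counts.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Replace every occurrence of the chosen pair with the merged token.
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                new_seq.append(merged)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq

    return vocab, seq  # len(vocab) ≈ 256 + successful merges
```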
- Training is quite slow, which is expected.
- The compression rate is worse than BPE's? Here, compression rate is just a simple inversion of the bytes-per-token ratio (see the sketch after this list).
- Looking at the tokenization visualization, SuperBPE seems to be merging okay (look at the word "sentence")?
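For reference, the compression rate quoted above is computed as a simple inversion of bytes per token, along the lines of this sketch (the function names are illustrative, not this repo's API):

```python
def bytes_per_token(text: str, token_ids: list[int]) -> float:
    """Bytes-per-token ratio: UTF-8 bytes of the input divided by the number of tokens."""
    return len(text.encode("utf-8")) / len(token_ids)

def compression_rate(text: str, token_ids: list[int]) -> float:
    """Compression rate as used above: the simple inverse of bytes per token."""
    return 1.0 / bytes_per_token(text, token_ids)
```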
However, Figure 1 in the original paper suggests that SuperBPE does not show clear benefits until a certain vocab size (~25k), so a 5,000-merge run may simply be too small to see them. Also, the training is super slow, so maybe I am just too impatient XD. More analysis is needed to check for bugs, and training on a larger dataset may help. Feel free to contribute!