Unofficial Rust implementation of SuperBPE: Space Travel for Language Models. This is just for fun and may contain mistakes; use with caution.
- OS: macOS Sequoia 15.3.2
- Chip: Apple M2 Max
To install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain=1.79.0 -y

Open a new terminal so the PATH picks up the Rust installation, then check it by running:

rustc --version

Set up the Python environment, build the Rust extension, and run the example:

conda create -n bpe python=3.10
conda activate bpe
git clone https://github.com/willxxy/superbpe.git
pip install -r requirements.txt
cd bpe
maturin develop --release
python main.py
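If `maturin develop --release` completed without errors, the compiled extension should be importable from the `bpe` environment. Here is a quick, hedged sanity check; the module name `superbpe` is an assumption and the name maturin actually builds for this repo may differ:

```python
# Hypothetical sanity check; adjust the module name to whatever maturin actually builds.
import importlib.util

spec = importlib.util.find_spec("superbpe")  # assumed module name, may differ in this repo
if spec is not None:
    print("Rust extension found at:", spec.origin)
else:
    print("Rust extension not found; re-run `maturin develop --release` in the crate directory.")
```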
For now, here are some results on a small dataset (Pride and Prejudice, 400,000 characters).
- BPE training time: 24.97 seconds
- SuperBPE training time: 8261.13 seconds
The final vocab size is ≈ 256 + successful_merges (256 base byte tokens plus one token per successful merge). These results use 5,000 merges, with the transition point, i.e. the merge count at which whitespace pretokenization is lifted so merges may cross word boundaries, set to 3,000 merges.
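To make the transition point concrete, here is a minimal, hedged Python sketch of the two-stage idea (ordinary byte-level BPE below the transition, whitespace-crossing merges allowed above it). The function name and the simplified whitespace rule are illustrative assumptions and do not reflect this repo's actual Rust implementation.

```python
from collections import Counter

def train_superbpe_toy(text: str, num_merges: int, transition: int):
    """Toy two-stage SuperBPE-style training (not this repo's API).

    Stage 1 (merge index < transition): pairs whose right-hand token starts
    with a space are skipped, a rough stand-in for whitespace pretokenization.
    Stage 2 (merge index >= transition): the restriction is lifted, so
    "superword" tokens spanning spaces can form.
    """
    # Start from raw bytes: 256 base tokens.
    seq = [bytes([b]) for b in text.encode("utf-8")]
    vocab = {bytes([i]) for i in range(256)}

    for merge_idx in range(num_merges):
        allow_space = merge_idx >= transition  # stage 2: allow cross-space merges
        counts = Counter()
        for a, b in zip(seq, seq[1:]):
            if not allow_space and b.startswith(b" "):
                continue  # stage 1: do not merge into the next word's leading space
            counts[(a, b)] += 1
        if not counts:
            break
        (a, b), _ = counts.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Replace every occurrence of the chosen pair with the merged token.
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                new_seq.append(merged)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq

    return vocab, seq  # len(vocab) ≈ 256 + successful merges
```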
- Training is quite slow, which is expected.
- The compression rate is worse than BPE's? Here, compression rate is just a simple inversion of the bytes-per-token ratio (see the sketch after this list).
- Looking at the tokenization visualization, SuperBPE seems to be merging okay (look at the word "sentence")?
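For reference, the compression rate quoted above is computed as a simple inversion of bytes per token, along the lines of this sketch (the function names are illustrative, not this repo's API):

```python
def bytes_per_token(text: str, token_ids: list[int]) -> float:
    """Bytes-per-token ratio: UTF-8 bytes of the input divided by the number of tokens."""
    return len(text.encode("utf-8")) / len(token_ids)

def compression_rate(text: str, token_ids: list[int]) -> float:
    """Compression rate as used above: the simple inverse of bytes per token."""
    return 1.0 / bytes_per_token(text, token_ids)
```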
However, Figure 1 in the original paper suggests that SuperBPE does not show clear benefits until a certain vocab size (~25k), so a 5,000-merge run may simply be too small to see them. Also, the training is super slow, so maybe I am just too impatient XD. More analysis is needed to check for bugs, and training on a larger dataset may help. Feel free to contribute!