`YouTokenToMe` will be compared with Hugging Face, SentencePiece and fastBPE. These three libraries are considered to be among the fastest.
Data from Wikipedia was used to evaluate algorithm speed. In a similar way to enwik8 and enwik9, the experiments were run on the first 10^8 and 10^9 bytes of the datasets for English, Russian, Chinese and Japanese.
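For reference, slicing a dump down to its first 10^8 bytes takes only a couple of lines of Python (the file names below are hypothetical):

```python
# Keep only the first 10**8 bytes of a Wikipedia dump,
# mirroring the enwik8/enwik9 convention. File names are placeholders.
with open("wiki_dump.txt", "rb") as src, open("wiki_100mb.txt", "wb") as dst:
    dst.write(src.read(10**8))
```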
`vocab_size` was set to 30000, a commonly used value.
In this benchmark, `YouTokenToMe` used 4 threads for both training and tokenization. SentencePiece doesn't support multithreading for BPE at all. fastBPE doesn't support multithreading for training; for tokenization, it also used 4 threads.
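For concreteness, the `YouTokenToMe` side of this setup looks roughly like the sketch below; the calls follow its public Python API, while the file names are hypothetical:

```python
import youtokentome as yttm

# Train a BPE model with the benchmark settings: vocab_size=30000, 4 threads.
# "wiki_100mb.txt" and "wiki.model" are placeholder file names.
yttm.BPE.train(data="wiki_100mb.txt", model="wiki.model",
               vocab_size=30000, n_threads=4)

# Load the trained model (also with 4 threads) and tokenize to token ids.
bpe = yttm.BPE(model="wiki.model", n_threads=4)
ids = bpe.encode(["Some text to tokenize"], output_type=yttm.OutputType.ID)
```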
Source code for the benchmark can be found here. The results of the experiments are below; time is measured in seconds, and the multiplier in parentheses shows the slowdown relative to `YouTokenToMe`.
Training on the first 10^8 bytes:

| | Russian | English | Chinese | Japanese |
|---|---|---|---|---|
| SentencePiece | 86.5 (x18.8) | 82.1 (x17.4) | 963.6 (x68.8) | 1026.4 (x91.6) |
| fastBPE | 41.0 (x8.9) | 36.1 (x7.6) | 700.9 (x50.0) | 485.1 (x43.3) |
| Hugging Face | 73.7 (x16.0) | 60.9 (x12.9) | 574.5 (x43.0) | 466.2 (x41.6) |
| YouTokenToMe | 4.6 (x1) | 4.7 (x1) | 14.0 (x1) | 11.2 (x1) |
Tokenization of the first 10^8 bytes:

| | Russian | English | Chinese | Japanese |
|---|---|---|---|---|
| SentencePiece | 31.1 (x9.7) | 51.5 (x10.3) | 22.9 (x6.9) | 23.2 (x8.0) |
| fastBPE | 10.1 (x3.1) | 10.3 (x2.0) | 56.3 (x17.0) | 55.3 (x19.0) |
| Hugging Face | 11.4 (x3.5) | 11.1 (x2.2) | 12.2 (x3.7) | 11.0 (x3.8) |
| YouTokenToMe | 3.2 (x1) | 5.0 (x1) | 3.3 (x1) | 2.9 (x1) |
Training on the first 10^9 bytes:

| | Russian | English | Chinese | Japanese |
|---|---|---|---|---|
| SentencePiece | 455.9 (x14.9) | 454.5 (x15.5) | 3035.7 (x31.7) | 5485.7 (x53.0) |
| fastBPE | 293.3 (x9.6) | 253.4 (x8.6) | 4388.2 (x45.8) | 4554.8 (x44.0) |
| Hugging Face | 274.8 (x9.0) | 295.1 (x10.0) | 2149.0 (x22.4) | 1949.9 (x18.8) |
| YouTokenToMe | 30.4 (x1) | 29.3 (x1) | 95.7 (x1) | 103.5 (x1) |
Tokenization of the first 10^9 bytes:

| | Russian | English | Chinese | Japanese |
|---|---|---|---|---|
| SentencePiece | 319.2 (x10.0) | 543.1 (x10.9) | 228.1 (x7.3) | 220.5 (x7.1) |
| fastBPE | 92.5 (x2.9) | 116.0 (x2.3) | 444.3 (x14.3) | 545.7 (x17.7) |
| Hugging Face | 96.2 (x3.0) | 117.8 (x2.3) | 154.6 (x5.0) | 130.7 (x4.3) |
| YouTokenToMe | 31.7 (x1) | 49.5 (x1) | 30.9 (x1) | 30.7 (x1) |
`YouTokenToMe` performed really well in this benchmark, especially for languages with large alphabets. The tables below show how `YouTokenToMe`'s performance depends on the number of threads.
Training on the first 10^9 bytes, varying the number of threads:

| YouTokenToMe | Russian | English | Chinese | Japanese |
|---|---|---|---|---|
| 1 thread | 60.5 | 71.2 | 236.9 | 233.0 |
| 2 threads | 38.0 | 41.5 | 137.7 | 141.7 |
| 4 threads | 30.4 | 29.3 | 95.7 | 103.5 |
| 8 threads | 23.4 | 25.2 | 74.1 | 77.5 |
| 16 threads | 25.4 | 23.5 | 84.2 | 85.5 |
Tokenization of the first 10^9 bytes, varying the number of threads:

| YouTokenToMe | Russian | English | Chinese | Japanese |
|---|---|---|---|---|
| 1 thread | 86.7 | 135.3 | 62.4 | 65.9 |
| 2 threads | 51.5 | 81.9 | 43.8 | 42.6 |
| 4 threads | 33.2 | 52.6 | 32.4 | 32.5 |
| 8 threads | 25.4 | 38.9 | 28.9 | 25.8 |
| 16 threads | 20.1 | 32.4 | 24.9 | 20.6 |
Training performance stops improving significantly after 8 threads, so the number of threads used for training is capped at `min(8, n_threads)`. In almost any situation, `n_threads=-1` is a good value to use; the number of threads will then be determined automatically.
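A minimal sketch of relying on that default (the file names are hypothetical, the API calls are YouTokenToMe's documented ones):

```python
import youtokentome as yttm

# n_threads=-1 lets the library pick the thread count automatically;
# for training it is additionally capped at 8, per the note above.
yttm.BPE.train(data="wiki.txt", model="wiki.model",
               vocab_size=30000, n_threads=-1)

bpe = yttm.BPE(model="wiki.model", n_threads=-1)
print(bpe.encode(["An example sentence"], output_type=yttm.OutputType.SUBWORD))
```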
All experiments were run on the following machine: 24-core Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 256 GB memory, Debian 8.10.