Word2Bits extends the Word2Vec algorithm to output high-quality quantized word vectors that take 8x-16x less storage than regular word vectors. Read the details at https://arxiv.org/abs/1803.05651.
Quantized word vectors are word vectors where each parameter is one of 2^bitlevel values.
For example, the 1-bit quantized vector for "king" looks something like
0.33333334 0.33333334 0.33333334 -0.33333334 -0.33333334 -0.33333334 0.33333334 0.33333334 -0.33333334 0.33333334 0.33333334 ...
Since parameters are limited to one of 2^bitlevel values, each parameter takes only bitlevel bits to represent; this drastically reduces the amount of storage that word vectors take.
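The storage argument can be sketched in a few lines of Python. This is an illustration only: the ±1/3 levels come from the "king" example above, while the sign-threshold rule and the helper names (`quantize_1bit`, `pack_bits`, `unpack_bits`) are assumptions made for this sketch, not the Word2Bits implementation.

```python
import struct

LEVEL = 1.0 / 3.0  # magnitude seen in the 1-bit "king" example

def quantize_1bit(values):
    """Map each value to +LEVEL or -LEVEL by sign (assumed rule)."""
    return [LEVEL if v >= 0 else -LEVEL for v in values]

def pack_bits(quantized):
    """Store one bit per parameter: 1 for +LEVEL, 0 for -LEVEL."""
    out = bytearray((len(quantized) + 7) // 8)
    for i, v in enumerate(quantized):
        if v > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def unpack_bits(packed, dim):
    """Recover the quantized values from the packed bits."""
    return [LEVEL if (packed[i // 8] >> (i % 8)) & 1 else -LEVEL
            for i in range(dim)]

# Made-up full-precision values, chosen only to illustrate the idea.
full = [0.12, 0.80, 0.05, -0.40, -0.01, -0.90, 0.33, 0.27]
q = quantize_1bit(full)
packed = pack_bits(q)
float32_bytes = struct.calcsize(f"<{len(full)}f")

print(len(packed), float32_bytes)  # prints "1 32"
```

Eight 1-bit parameters fit in one byte, where the same eight parameters stored as 32-bit floats need 32 bytes.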
- All word vectors are in GloVe/fastText format (format details here). Files are compressed using gzip.
| # Bits per parameter | Dimension | Trained on | Vocabulary size | File Size (Compressed) | Download Link |
|---|---|---|---|---|---|
| 1 | 800 | English Wikipedia 2017 | Top 400k | 86M | w2b_bitlevel1_size800_vocab400K.tar.gz |
| 1 | 1000 | English Wikipedia 2017 | Top 400k | 106M | w2b_bitlevel1_size1000_vocab400K.tar.gz |
| 1 | 1200 | English Wikipedia 2017 | Top 400k | 126M | w2b_bitlevel1_size1200_vocab400K.tar.gz |
| 2 | 400 | English Wikipedia 2017 | Top 400k | 67M | w2b_bitlevel2_size400_vocab400K.tar.gz |
| 2 | 800 | English Wikipedia 2017 | Top 400k | 134M | w2b_bitlevel2_size800_vocab400K.tar.gz |
| 2 | 1000 | English Wikipedia 2017 | Top 400k | 168M | w2b_bitlevel2_size1000_vocab400K.tar.gz |
| 32 | 200 | English Wikipedia 2017 | Top 400k | 364M | w2b_bitlevel0_size200_vocab400K.tar.gz |
| 32 | 400 | English Wikipedia 2017 | Top 400k | 724M | w2b_bitlevel0_size400_vocab400K.tar.gz |
| 32 | 800 | English Wikipedia 2017 | Top 400k | 1.4G | w2b_bitlevel0_size800_vocab400K.tar.gz |
| 32 | 1000 | English Wikipedia 2017 | Top 400k | 1.8G | w2b_bitlevel0_size1000_vocab400K.tar.gz |
| 1 | 800 | English Wikipedia 2017 | 3.7M (Full) | 812M | w2b_bitlevel1_size800_vocab3.7M.tar.gz |
| 2 | 400 | English Wikipedia 2017 | 3.7M (Full) | 671M | w2b_bitlevel2_size400_vocab3.7M.tar.gz |
| 32 | 400 | English Wikipedia 2017 | 3.7M (Full) | 6.7G | w2b_bitlevel0_size400_vocab3.7M.tar.gz |
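A minimal sketch of reading vectors in a GloVe/fastText-style text layout. It assumes each line holds a word followed by space-separated values, optionally preceded by a fastText-style "vocab_size dimension" header line. `parse_text_vectors` is a hypothetical helper written for this example and is not part of Word2Bits; the sample lines are made up.

```python
def parse_text_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for n, line in enumerate(lines):
        parts = line.split()
        # Assumption: a first line made of exactly two integers is a
        # "<vocab_size> <dimension>" header and carries no vector.
        if n == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Made-up sample in the assumed layout (header line plus two words).
sample = [
    "2 4\n",
    "king 0.33333334 0.33333334 -0.33333334 0.33333334\n",
    "queen 0.33333334 -0.33333334 -0.33333334 0.33333334\n",
]
vectors = parse_text_vectors(sample)
print(len(vectors), vectors["king"])
```

For a real file, extract the archive first and pass the opened text file in place of `sample`; iterating over a file object yields the same kind of lines.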
(Note: every fifth word vector is labelled; the turquoise line marks the boundary between the nearest and furthest word vectors from the target.)
Compile with
make word2bits
Run with
./word2bits -train input -bitlevel 1 -size 200 -window 10 -negative 12 -threads 2 -iter 5 -min-count 5 -output 1bit_200d_vectors -binary 0
Description of the most common flags:
-train Input corpus text file
-bitlevel Number of bits for each parameter. 0 is full precision (or 32 bits).
-size Word vector dimension
-window Window size
-negative Negative sample size
-threads Number of threads to use to train
-iter Number of epochs to train
-min-count Minimum count value. Words appearing fewer than this many times are removed from the corpus.
-output Path to write output word vectors
-binary 0 to write in GloVe (text) format; 1 to write in binary format.
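If training is driven from a script, the flags above can be assembled programmatically. The sketch below only builds the argument list; `word2bits_command` is a hypothetical helper, and the flag names are the ones listed above (Python keyword underscores are turned into hyphens, so `min_count` becomes `-min-count`).

```python
import shlex

def word2bits_command(train, output, **flags):
    """Build an argv list for ./word2bits from keyword arguments."""
    argv = ["./word2bits", "-train", train, "-output", output]
    for name, value in flags.items():
        argv += ["-" + name.replace("_", "-"), str(value)]
    return argv

argv = word2bits_command(
    "input", "1bit_200d_vectors",
    bitlevel=1, size=200, window=10, negative=12,
    threads=2, iter=5, min_count=5, binary=0,
)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` from the directory that contains the compiled binary.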
- Download and preprocess text8 (make sure you're in the Word2Bits base directory).
bash data/download_text8.sh
- Compile Word2Bits and the compute_accuracy tool
make word2bits
make compute_accuracy
- Train 1-bit, 200-dimensional word vectors for 5 epochs using 4 threads (save in binary format so that compute_accuracy can read them)
./word2bits -bitlevel 1 -size 200 -window 8 -negative 24 -threads 4 -iter 5 -min-count 5 -train text8 -output 1b200d_vectors -binary 1
(This will take several minutes. Run with more threads if you have more cores!)
- Evaluate the vectors on the Google Analogy Task
./compute_accuracy ./1b200d_vectors < data/google_analogies_test_set/questions-words.txt
You should see output like:
Starting eval...
capital-common-countries:
ACCURACY TOP1: 19.76 % (100 / 506)
Total accuracy: 19.76 % Semantic accuracy: 19.76 % Syntactic accuracy: -nan %
capital-world:
ACCURACY TOP1: 8.81 % (239 / 2713)
Total accuracy: 10.53 % Semantic accuracy: 10.53 % Syntactic accuracy: -nan %
...
gram8-plural:
ACCURACY TOP1: 19.92 % (251 / 1260)
Total accuracy: 11.48 % Semantic accuracy: 13.27 % Syntactic accuracy: 10.25 %
gram9-plural-verbs:
ACCURACY TOP1: 6.09 % (53 / 870)
Total accuracy: 11.20 % Semantic accuracy: 13.27 % Syntactic accuracy: 9.88 %
Questions seen / total: 16284 19544 83.32 %
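The percentages in this sample can be reproduced by hand. The sketch below assumes that "Total accuracy" is a running total over all sections evaluated so far, which is consistent with the first two sections shown. The `-nan` syntactic figures are likewise consistent with a 0/0 division before any syntactic section has been evaluated, though that is an inference from the output rather than something the output states.

```python
# (section name, correct answers, questions seen), copied from the sample output
sections = [
    ("capital-common-countries", 100, 506),
    ("capital-world", 239, 2713),
]

def pct(numerator, denominator):
    """Percentage rounded to two decimals, as in the printed report."""
    return round(100.0 * numerator / denominator, 2)

total_correct = total_seen = 0
report = []
for name, correct, seen in sections:
    total_correct += correct
    total_seen += seen
    # Per-section TOP1 accuracy and the running total so far.
    report.append((name, pct(correct, seen), pct(total_correct, total_seen)))

for name, top1, running in report:
    print(f"{name}: TOP1 {top1:.2f} %, total so far {running:.2f} %")

# Share of analogy questions whose words were all in the vocabulary.
coverage = pct(16284, 19544)
print(f"Questions seen / total: {coverage:.2f} %")
```

The second section's running total is (100 + 239) / (506 + 2713) = 339 / 3219, which rounds to the 10.53 % shown above.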
Inspecting the vector file in hex should show something like:
$ od --format=x1 --read-bytes=160 1b200d_vectors
0000000 36 30 32 33 38 20 32 30 30 0a 3c 2f 73 3e 20 ab
0000020 aa aa 3e ab aa aa 3e ab aa aa be ab aa aa be ab
0000040 aa aa 3e ab aa aa 3e ab aa aa 3e ab aa aa be ab
...
0000160 aa aa be ab aa aa 3e ab aa aa be ab aa aa 3e ab
0000200 aa aa be ab aa aa 3e ab aa aa 3e ab aa aa 3e ab
0000220 aa aa be ab aa aa 3e ab aa aa be ab aa aa 3e ab
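To see what those bytes mean, the first two rows of the dump can be decoded in Python. The sketch assumes the layout the dump suggests: an ASCII "vocab_size dimension" header ending in a newline, then a word terminated by a space, then little-endian 32-bit floats. Treat that layout as an assumption read off the dump, not a format specification.

```python
import struct

# First 32 bytes of the dump above (offsets 0000000 and 0000020).
dump = bytes.fromhex(
    "36 30 32 33 38 20 32 30 30 0a 3c 2f 73 3e 20 ab "
    "aa aa 3e ab aa aa 3e ab aa aa be ab aa aa be ab"
)

# Header: ASCII text up to the first newline.
header_end = dump.index(b"\n")
vocab_size, dim = (int(x) for x in dump[:header_end].split())

# First word: ASCII text up to the next space.
word_end = dump.index(b" ", header_end + 1)
word = dump[header_end + 1:word_end].decode("ascii")

# Remaining bytes: as many complete little-endian float32 values as fit.
payload = dump[word_end + 1:]
n = len(payload) // 4
values = struct.unpack(f"<{n}f", payload[:4 * n])

print(vocab_size, dim, word)
print(values)
```

The byte pattern `ab aa aa 3e` is the 32-bit float closest to 1/3 and `ab aa aa be` is its negative, so every stored parameter is one of the two 1-bit values, matching the 0.33333334 / -0.33333334 entries in the text example.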