Skip to content

Latest commit

 

History

History
74 lines (51 loc) · 1.54 KB

README.md

File metadata and controls

74 lines (51 loc) · 1.54 KB

Using pytorch to implement word2vec algorithm Skip-gram Negative Sampling (SGNS), and refer paper Distributed Representations of Words and Phrases and their Compositionality.

Dependency

  • python 3.6
  • pytorch 0.4+

Usage

Run main.py.

Initialize the dataset and model.

# init dataset and model
word2vec = Word2Vec(data_path='text8',
                    vocabulary_size=50000,
                    embedding_size=300)

# the index of the whole corpus
print(word2vec.data[:10])

# word_count like this [['word', word_count], ...]
# the index of list correspond index of word
print(word2vec.word_count[:10])

# index to word
print(word2vec.index2word[34])

# word to index
print(word2vec.word2index['hello'])

Train and get the vector.

# train model
word2vec.train(train_steps=200000,
               skip_window=1,
               num_skips=2,
               num_neg=20,
               output_dir='out/run-1')

# save vector txt file
word2vec.save_vector_txt(path_dir='out/run-1')

# get vector list
vector = word2vec.get_list_vector()
print(vector[123])
print(vector[word2vec.word2index['hello']])

# get top k similar word
sim_list = word2vec.most_similar('one', top_k=8)
print(sim_list)

# load pre-train model
word2vec.load_model('out/run-1/model_step200000.pt')

Evaluate

Refer repository eval-word-vectors. Like this:

eval/wordsim.py vector.txt eval/data/EN-MTurk-287.txt
eval/wordsim.py vector.txt eval/data/EN-MC-30.txt