The fastText model, from Facebook, can be used both for embeddings and for text classification.
Key points of fastText:
- sub-words and n-grams: exploit subword-level and phrase-level information, and make it possible to handle out-of-vocabulary (OOV) words at test time (see the sketch after the reference below).
- hierarchical softmax and negative sampling, used for speed.
ref: Bag of Tricks for Efficient Text Classification, https://arxiv.org/abs/1607.01759
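To make the sub-word point concrete, below is a minimal sketch (an illustrative simplification, not fastText's actual code) of the character n-gram decomposition; fastText represents a word as the sum of its n-gram vectors, which is why OOV words still get a vector at test time:

def char_ngrams(word, minn=3, maxn=6):
    """Split a word into character n-grams, with < and > marking word boundaries."""
    wrapped = '<' + word + '>'
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams('where', minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']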
The main fastText subcommands:
- [supervised] train a supervised classifier for text classification: fastText's main business.
- [skipgram] learn word representations (word vectors) with the skipgram technique.
- [quantize] quantize a model to reduce its memory usage during prediction, shrinking the model size.
- [predict] predict labels for a given text (text classification).
- [predict-prob] predict probabilities along with the labels for a given text.
- [cbow] learn word representations with the CBOW (Continuous Bag Of Words) technique.
- [print-word-vectors] print the word vectors of a trained model, one word vector per line.
- [print-sentence-vectors] print sentence vectors, one per line of input text; every text gets a vector of the same length, summarizing the features of all its words.
- [nn] query the nearest neighbors of a word.
- [analogies] query analogies of the form A - B + C, e.g. Berlin - Germany + France = Paris.
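For reference, the classification-related subcommands are invoked roughly as follows (model.bin, test.txt and k are placeholders for the model file, the test file and the number of labels to output; exact flags may vary by version):

fasttext predict model.bin test.txt k
fasttext predict-prob model.bin test.txt k
fasttext quantize -output model    # reads model.bin, writes a smaller model.ftz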
1 Train word2vec-style word vectors on your own corpus
fasttext skipgram -input xxxcorpus -output xxxmodel
Training produces two files: xxxmodel.bin and xxxmodel.vec, which are the binary model file and the word vectors in plain-text form, respectively.
The subcommand can be skipgram or cbow, corresponding to the Skip-gram and CBOW models.
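The .bin model can also be loaded from the Python bindings; a minimal sketch (xxxmodel.bin as above):

import fasttext

model = fasttext.load_model('xxxmodel.bin')
vec = model.get_word_vector('king')   # works even for OOV words, via subwords
print(vec.shape)                      # (dim,), e.g. (100,) with default settings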
2 Query a word's nearest neighbors with a trained model
fasttext nn xxxmodel.bin
At the Query word? prompt, type a word to get its nearest neighbors.
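Analogy queries work the same way; run

fasttext analogies xxxmodel.bin

and type a triplet such as berlin germany france at the prompt to get words close to A - B + C (here, hopefully, paris).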
3 Some other parameters:
-minn and -maxn: length range of the subwords; defaults are 3 and 6.
-epoch and -lr: number of epochs and learning rate; defaults are 5 and 0.05.
-dim: dimensionality of the word vectors; larger is generally stronger, but uses more memory and slows down computation.
-thread: number of threads to run; self-explanatory.
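Putting these flags together, a skipgram run might look like the following (corpus.txt is a placeholder; the values are illustrative, not tuned recommendations):

fasttext skipgram -input corpus.txt -output model -minn 2 -maxn 5 -dim 300 -epoch 10 -lr 0.05 -thread 8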
The Python API's parameters have essentially the same meanings and functions. Here is an example:
import fasttext

def train_text_classifier(train_fname, epoch, lr, save_model_fname):
    """
    Train a supervised text classifier and save the model.
    """
    dim = 500             # size of word vectors [100]
    ws = 5                # size of the context window [5]
    minCount = 500        # minimal number of word occurrences [1]
    minCountLabel = 1     # minimal number of label occurrences [1]
    minn = 1              # min length of char ngram [0]
    maxn = 2              # max length of char ngram [0]
    neg = 5               # number of negatives sampled [5]
    wordNgrams = 2        # max length of word ngram [1]
    loss = 'softmax'      # loss function {ns, hs, softmax, ova} [softmax]
    lrUpdateRate = 100    # change the rate of updates for the learning rate [100]
    t = 0.0001            # sampling threshold [0.0001]
    label = '__label__'   # label prefix ['__label__']
    model = fasttext.train_supervised(train_fname, lr=lr, epoch=epoch, dim=dim, ws=ws,
                                      minCount=minCount, minCountLabel=minCountLabel,
                                      minn=minn, maxn=maxn, neg=neg,
                                      wordNgrams=wordNgrams, loss=loss,
                                      lrUpdateRate=lrUpdateRate,
                                      t=t, label=label, verbose=True)
    model.save_model(save_model_fname)
    return model
if __name__ == "__main__":
    # placeholder settings: substitute your own paths and hyperparameters
    train_fname = 'train.txt'        # one example per line: __label__X <text>
    save_model_fname = 'model.bin'
    epoch, lr = 5, 0.05
    model = train_text_classifier(train_fname, epoch, lr, save_model_fname)
    model.get_nearest_neighbors('word')   # nearest neighbors of a word
    model.predict('sentence')             # returns (labels, probabilities)
    model.test('test.txt')                # returns (N, precision@1, recall@1); for single-label data
                                          # precision@1 == recall@1, i.e. accuracy
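To make the return values explicit, here is a short sketch of how predict and test are typically unpacked (the label names are made up):

labels, probs = model.predict('some sentence to classify', k=2)
# labels: tuple of the top-k label strings, e.g. ('__label__pos', '__label__neg')
# probs:  numpy array with the corresponding probabilities

n, p_at_1, r_at_1 = model.test('test.txt')
# n: number of test examples; p_at_1 / r_at_1: precision and recall at k=1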