The fastText model, from Facebook, can be used both for embeddings and for text classification.
Key points of fastText:
- sub-words and n-grams: exploit subword-level and phrase-level information, and make it possible to handle out-of-vocabulary (OOV) words at test time (see the sketch after the reference below).
- hierarchical softmax and negative sampling, used for speed.
ref: Bag of Tricks for Efficient Text Classification, https://arxiv.org/abs/1607.01759
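To make the sub-word point concrete, below is a minimal sketch (an illustrative simplification, not fastText's actual code) of the character n-gram decomposition; fastText represents a word as the sum of its n-gram vectors, which is why OOV words still get a vector at test time:

def char_ngrams(word, minn=3, maxn=6):
    """Split a word into character n-grams, with < and > marking word boundaries."""
    wrapped = '<' + word + '>'
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams('where', minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']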
The main fastText subcommands:
- [supervised] train a supervised classifier for text classification: fastText's main business.
- [skipgram] learn word representations (word vectors) with the skipgram technique.
- [quantize] quantize a model to reduce its memory usage during prediction, shrinking the model size.
- [predict] predict labels for a given text (text classification).
- [predict-prob] predict probabilities along with the labels for a given text.
- [cbow] learn word representations with the CBOW (Continuous Bag Of Words) technique.
- [print-word-vectors] print the word vectors of a trained model, one word vector per line.
- [print-sentence-vectors] print sentence vectors, one per line of input text; every text gets a vector of the same length, summarizing the features of all its words.
- [nn] query the nearest neighbors of a word.
- [analogies] query analogies of the form A - B + C, e.g. Berlin - Germany + France = Paris.
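For reference, the classification-related subcommands are invoked roughly as follows (model.bin, test.txt and k are placeholders for the model file, the test file and the number of labels to output; exact flags may vary by version):

fasttext predict model.bin test.txt k
fasttext predict-prob model.bin test.txt k
fasttext quantize -output model    # reads model.bin, writes a smaller model.ftz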
1 Train word2vec-style word vectors on your own corpus
fasttext skipgram -input xxxcorpus -output xxxmodel
Training produces two files: xxxmodel.bin and xxxmodel.vec, which are the binary model file and the word vectors in plain-text form, respectively.
The subcommand can be skipgram or cbow, corresponding to the Skip-gram and CBOW models.
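The .bin model can also be loaded from the Python bindings; a minimal sketch (xxxmodel.bin as above):

import fasttext

model = fasttext.load_model('xxxmodel.bin')
vec = model.get_word_vector('king')   # works even for OOV words, via subwords
print(vec.shape)                      # (dim,), e.g. (100,) with default settings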
2 Query a word's nearest neighbors with a trained model
fasttext nn xxxmodel.bin
At the Query word? prompt, type a word to get its nearest neighbors.
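Analogy queries work the same way; run

fasttext analogies xxxmodel.bin

and type a triplet such as berlin germany france at the prompt to get words close to A - B + C (here, hopefully, paris).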
3 Some other parameters:
-minn and -maxn: length range of the subwords; defaults are 3 and 6.
-epoch and -lr: number of epochs and learning rate; defaults are 5 and 0.05.
-dim: dimensionality of the word vectors; larger is generally stronger, but uses more memory and slows down computation.
-thread: number of threads to run; self-explanatory.
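Putting these flags together, a skipgram run might look like the following (corpus.txt is a placeholder; the values are illustrative, not tuned recommendations):

fasttext skipgram -input corpus.txt -output model -minn 2 -maxn 5 -dim 300 -epoch 10 -lr 0.05 -thread 8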
The Python API's parameters have essentially the same meanings and functions. Here is an example:
import fasttext

def train_text_classifier(train_fname, epoch, lr, save_model_fname):
    """
    Train a supervised text classifier and save the model.
    """
    dim = 500             # size of word vectors [100]
    ws = 5                # size of the context window [5]
    minCount = 500        # minimal number of word occurrences [1]
    minCountLabel = 1     # minimal number of label occurrences [1]
    minn = 1              # min length of char ngram [0]
    maxn = 2              # max length of char ngram [0]
    neg = 5               # number of negatives sampled [5]
    wordNgrams = 2        # max length of word ngram [1]
    loss = 'softmax'      # loss function {ns, hs, softmax, ova} [softmax]
    lrUpdateRate = 100    # change the rate of updates for the learning rate [100]
    t = 0.0001            # sampling threshold [0.0001]
    label = '__label__'   # label prefix ['__label__']
    model = fasttext.train_supervised(train_fname, lr=lr, epoch=epoch, dim=dim, ws=ws,
                                      minCount=minCount, minCountLabel=minCountLabel,
                                      minn=minn, maxn=maxn, neg=neg,
                                      wordNgrams=wordNgrams, loss=loss,
                                      lrUpdateRate=lrUpdateRate,
                                      t=t, label=label, verbose=True)
    model.save_model(save_model_fname)
    return model
if __name__ == "__main__":
    # placeholder settings: substitute your own paths and hyperparameters
    train_fname = 'train.txt'        # one example per line: __label__X <text>
    save_model_fname = 'model.bin'
    epoch, lr = 5, 0.05
    model = train_text_classifier(train_fname, epoch, lr, save_model_fname)
    model.get_nearest_neighbors('word')   # nearest neighbors of a word
    model.predict('sentence')             # returns (labels, probabilities)
    model.test('test.txt')                # returns (N, precision@1, recall@1); for single-label data
                                          # precision@1 == recall@1, i.e. accuracy
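To make the return values explicit, here is a short sketch of how predict and test are typically unpacked (the label names are made up):

labels, probs = model.predict('some sentence to classify', k=2)
# labels: tuple of the top-k label strings, e.g. ('__label__pos', '__label__neg')
# probs:  numpy array with the corresponding probabilities

n, p_at_1, r_at_1 = model.test('test.txt')
# n: number of test examples; p_at_1 / r_at_1: precision and recall at k=1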