SelfAttentiveSentenceEmbedding-TF

A text classification and representation learning network based on BiLSTM and Self-Attention

An implementation of the model from the paper "A Structured Self-attentive Sentence Embedding" (ICLR 2017)

Model

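This implementation differs slightly from the paper. Once the sentence representation (the matrix M in the figure) is obtained, it is fed directly into a softmax classifier rather than the two-layer MLP described in the paper. The intent is to reduce the complexity of the final classifier, which forces the model to learn a more effective representation (the matrix M) and in turn helps downstream tasks.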
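
The attention step from the paper, followed by the single softmax classifier described above, can be sketched in NumPy. This is an illustration only, not the TensorFlow code in this repository. It follows the paper's formulas A = softmax(W_s2 tanh(W_s1 H^T)) and M = A H. The weights are random, the names `W_c` and `b_c` are hypothetical, and mapping the sizes to `lstm_size=64`, `dim_a=10`, `dim_r=30` and `classes=4` from the usage example is an assumption.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attentive_embedding(H, W_s1, W_s2):
    """H: (n, 2u) BiLSTM states, W_s1: (d_a, 2u), W_s2: (r, d_a)."""
    # Each of the r rows of A is an attention distribution over the n tokens.
    A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=1)  # (r, n)
    M = A @ H                                        # (r, 2u)
    return A, M


def attention_penalty(A):
    """Penalty from the paper: squared Frobenius norm of (A A^T - I)."""
    r = A.shape[0]
    return np.linalg.norm(A @ A.T - np.eye(r), ord="fro") ** 2


rng = np.random.default_rng(0)
n, u, d_a, r, classes = 12, 64, 10, 30, 4   # assumed sizes, see lead-in

H = rng.normal(size=(n, 2 * u))             # stand-in for BiLSTM outputs
W_s1 = rng.normal(size=(d_a, 2 * u)) * 0.1
W_s2 = rng.normal(size=(r, d_a)) * 0.1
A, M = self_attentive_embedding(H, W_s1, W_s2)

# Single softmax layer on the flattened M, instead of a two-layer MLP.
W_c = rng.normal(size=(r * 2 * u, classes)) * 0.01   # hypothetical weights
b_c = np.zeros(classes)
probs = softmax(M.reshape(-1) @ W_c + b_c)
```

Whether this repository also applies the penalty term during training is not stated here; `attention_penalty` is included only because the paper defines it.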

Usage

This project mainly provides a SentencePresentation class that is convenient to use in a variety of scenarios. Training a model takes only a few lines of code:

    network = SentencePresentation(wv, wv_dim=100, lstm_size=64, layers=1, dim_r=30, classes=4, dim_a=10, norm=0.5, lr=0.01)
    with tf.Session() as sess:
        network.fit(sess, './train_data.csv', epoch=2)

The predict method returns the classification results, the attention weights, and the sentence embeddings:

    with tf.Session() as sess:
        dataset = Dataset(sess, './test_data.csv', 200, '\t', max_len=500, epoch=1)
        for c, ws, lens in dataset:
            labels, attentions, embedding = network.predict(sess, ws, lens)

See main.py for a complete example. (Note: the train_data.csv file used in the example is too large to upload to GitHub; please contact me if you need it.)
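
The predict call above receives a batch of word ids (`ws`) together with their lengths (`lens`). How Dataset builds these is not shown in this README. A minimal sketch, assuming that sequences are truncated to `max_len` and padded with a fixed id, might look like this. The function `pad_batch` and its `pad_id` parameter are hypothetical and not part of this project.

```python
import numpy as np


def pad_batch(sequences, max_len, pad_id=0):
    """Truncate or pad lists of word ids to max_len.

    Returns (ws, lens): ws has shape (batch, max_len), lens holds the
    number of real (unpadded) ids in each row.
    """
    lens = [min(len(seq), max_len) for seq in sequences]
    ws = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    for i, seq in enumerate(sequences):
        ws[i, :lens[i]] = seq[:lens[i]]
    return ws, np.array(lens, dtype=np.int32)


ws, lens = pad_batch([[660, 68441, 11], [2267, 378]], max_len=4)
```

Since 0 also appears as a real word id in the sample data, a padding id of 0 is only safe if the model masks padded positions using `lens`.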

Model input

Training the model requires two inputs:

Training samples
word2vec vocabulary

Each training sample has the format <label><\tab><word-id-1><\space>...<word-id-n>. Here are two examples:

0	660 68441 11 2839 501 15241 893 1 85 17 311 18 17 77 18 18105 1 501 85 1004 161 19

1	2267 378 1322 10 917 1588 9 14859 6692 326 94534 2101 105 9019 4 341 541 28 0 341
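
A line in this format can be parsed with a few lines of Python. This is a minimal sketch based only on the format described above; `parse_sample` is a hypothetical helper, not a function from this project.

```python
def parse_sample(line, sep="\t"):
    """Split '<label><tab><id> <id> ...' into (label, list of word ids)."""
    label_text, ids_text = line.rstrip("\n").split(sep, 1)
    return int(label_text), [int(tok) for tok in ids_text.split()]


label, word_ids = parse_sample("0\t660 68441 11 2839")
```

The default separator matches the '\t' passed to Dataset in the usage example.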

The word2vec vocabulary is a mapping from <word-id> to <word vector>. In practice it is simply the output of word2vec.
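
One common way to hold such a mapping is a dense matrix whose row index is the word id. The sketch below assumes the mapping is already loaded as a Python dict. The on-disk format of the vocabulary and the exact type that SentencePresentation expects for `wv` are not stated here, so `build_embedding_matrix` is a hypothetical helper.

```python
import numpy as np


def build_embedding_matrix(id_to_vector, dim):
    """Place each word vector in the row given by its word id.

    Ids missing from the mapping keep an all-zero row.
    """
    vocab_size = max(id_to_vector) + 1
    matrix = np.zeros((vocab_size, dim), dtype=np.float32)
    for word_id, vector in id_to_vector.items():
        matrix[word_id] = vector
    return matrix


# Toy mapping with 3-dimensional vectors (real vectors would be e.g. 100-d).
toy_vectors = {0: [0.1, 0.2, 0.3], 1: [0.4, 0.5, 0.6], 3: [0.7, 0.8, 0.9]}
embeddings = build_embedding_matrix(toy_vectors, dim=3)
```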

If you have any questions about the input data, please contact me.

Training results

Attention on gaming text:

Attention on entertainment text:

Attention on politics text:

About the data source

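The data comes from the THUCTC text classification dataset. This project uses three of its categories: politics, online games, and entertainment.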
