kaggle - Quora insincere classification

Note: this project is an entry for a Kaggle competition; work started on 2018.11.09.

Competition link: Quora Insincere Classification
Competition format: kernel only

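Directory layout

Work in progress... The plan: data processing first, then an LSTM baseline, then a CNN, then attention, then hyperparameter tuning.

input holds the input data
working/main_process.py does the data preprocessing and returns the cleaned train and test data
working/bi-lstm_basline.py is the baseline model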

EDA

Data exploration: EDA

Tips and tricks

Tips and tricks: tricks
Notes: Keras study notes
LSTM Baseline, LSTM+Attention, 2D CNN, 1D CNN, Blending
Embeddings blending

Kernel Baseline Boosting

Notable kernels: kernels
- Pre-processing when using embeddings
- LSTM is all you need
- LSTM+Attention
- Different embeddings with attention
- LSTM + CNN
- LSTM Attention
- Blending with Linear Regression (9 models 0.688LB)
- 2D CNN TextClassifier
- inceptionCNN with flip
- Single RNN with 4 folds 0.692LB(good coding style) 5581 seconds

Submission results

LSTM baseline no tune: 0.573 2000s
Pre-processing + LSTM Baseline no tune: 0.631
LSTM + Attention + 256Dense + 0.5Dropout + SEQ_LEN(30->50) + thresh(0.5->0.35) no tune: 0.664
2D CNN Baseline 0.664 800s
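
The move from a 0.5 to a 0.35 threshold above suggests tuning the decision cut-off on validation data. The following is a minimal sketch of such a search, assuming F1 is the target metric; the function names, the candidate grid and the toy data are illustrative and not taken from the project code.

```python
def f1_score(y_true, y_prob, threshold):
    """F1 for binary labels given predicted probabilities and a cut-off."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        pred = 1 if prob > threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob, candidates=None):
    """Return (threshold, f1) with the highest F1 over a grid of candidates."""
    if candidates is None:
        candidates = [i / 100 for i in range(10, 91)]  # 0.10 .. 0.90
    scored = [(f1_score(y_true, y_prob, t), t) for t in candidates]
    best_f1, best_t = max(scored)  # ties resolve to the larger threshold
    return best_t, best_f1


# Toy validation data, made up for illustration only.
y_val = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]
p_val = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.80, 0.15, 0.02]
threshold, f1 = best_threshold(y_val, p_val)
print(threshold, round(f1, 3))
```

The threshold should be chosen on held-out predictions only; picking it on the training set would overstate the score.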

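Notes

Check the counts of positive and negative examples
Missing values
Split the data into training and validation sets
Train with different embeddings (GloVe, word2vec, self-trained), then blend them
- The Keras Embedding layer works like word2vec: it turns the 2-D input (sentences encoded as dictionary ids) into a 3-D tensor (i.e. it learns a vector for each id)
Evaluate with the F1 score
LSTM + CNN + attention
The tokenizer converts a query into a sequence (it first scans the corpus to build the dictionary, then assigns ids by frequency rank)
Don't use standard preprocessing steps like stemming or stopword removal when you have pre-trained embeddings in deep learning methods
Get your vocabulary as close to the embeddings as possible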
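
The tokenizer and vocabulary notes above can be sketched in plain Python. This is an illustrative sketch, not the project's preprocessing code: it assumes whitespace tokenization, ids starting at 1 (with 0 left for padding), and a hypothetical `embeddings` dict standing in for a loaded pre-trained embedding file.

```python
from collections import Counter


def build_vocab(sentences):
    """Count whitespace-separated tokens across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def word_index(counts):
    """Assign ids by descending frequency, starting at 1 (0 is left for padding)."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}


def coverage(counts, embeddings):
    """Return (share of unique words covered, share of all tokens covered, OOV list)."""
    known = {w: c for w, c in counts.items() if w in embeddings}
    oov = sorted(
        ((w, c) for w, c in counts.items() if w not in embeddings),
        key=lambda kv: (-kv[1], kv[0]),
    )
    vocab_share = len(known) / len(counts)
    text_share = sum(known.values()) / sum(counts.values())
    return vocab_share, text_share, oov


# Toy data: real embeddings would be loaded from a file.
sentences = ["what is bitcoin", "what is a blockchain", "is bitcoin safe"]
embeddings = {"what": [0.1], "is": [0.2], "a": [0.3], "safe": [0.4]}
counts = build_vocab(sentences)
ids = word_index(counts)
vocab_share, text_share, oov = coverage(counts, embeddings)
print(ids)
print(vocab_share, text_share, oov)
```

Tracking both shares is useful because a few frequent out-of-vocabulary words can hurt token coverage far more than many rare ones.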

Open issues

New words have no word vectors

     [('bitcoin', 987), ('Quorans', 858), ('cryptocurrency', 822), ('Snapchat', 807), ('btech', 632), ('Brexit', 493), (
        'cryptocurrencies', 481), ('blockchain', 474), ('behaviour', 468), ('upvotes', 432), ('programme', 402), (
      'Redmi', 379), ('realise', 371), ('defence', 364), ('KVPY', 349), ('Paytm', 334), ('grey', 299), ('mtech', 281), (
      'Btech', 262), ('bitcoins', 254)]
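
Several of the words listed above are British spellings (behaviour, programme, realise, defence, grey) that could plausibly be mapped to forms the embeddings are more likely to contain. The sketch below shows one way to do that with a whole-word replacement table. The mapping targets are assumptions for illustration, not the project's actual table, and genuinely new words such as bitcoin would still need separate handling.

```python
import re

# Illustrative mapping only: the targets are assumptions, not the project's table.
REPLACEMENTS = {
    "behaviour": "behavior",
    "programme": "program",
    "realise": "realize",
    "defence": "defense",
    "grey": "gray",
}

# One alternation pattern with word boundaries, so "programmer" is left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b")


def normalize_spelling(text):
    """Replace each whole-word match with its mapped form."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1)], text)


print(normalize_spelling("Is grey a colour of defence?"))
```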

After the usual cleaning, unusual (non-English) characters still remain

odict_keys(
    ['w', 'h', 'a', 't', ' ', 'v', 'e', 'b', 'n', 's', 'x', 'i', 'm', 'u', 'o', 'd', 'l', 'p', 'r', '?', 'c', 'g', 'f',
     ',', 'y', 'j', '9', '8', '%', '1', '0', '2', 'k', 'q', '5', '$', '6', '.', 'z', '(', ')', "'", '-', '’', '3', '7',
     '/', '!', '"', 'é', '4', '…', '&amp;', '“', '”', '+', '\\', '=', '{', '^', '}', ';', '[', ']', '|', ':', '*',
     '&lt;', '₹', 'á', '²', 'ế', '청', '하', '¨', '‘', '√', '×', '−', '´', '\xa0', '`', 'θ', '高', '端', '大', '气', '上', '档',
     '次', '_', '½', 'π', '#', '小', '鹿', '乱', '撞', '成', '语', 'ë', 'à', 'ç', '@', 'ü', 'č', 'ć', 'ž', 'đ', '&gt;', '°',
     'द', 'े', 'श', '्', 'र', 'ो', 'ह', 'ि', 'प', 'स', 'थ', 'त', 'न', 'व', 'ा', 'ल', 'ं', '林', '彪', '€', '\u200b', '˚',
     'ö', '~', '—', '越', '人', 'च', 'म', 'क', 'ु', 'य', 'ी', 'ê', 'ă', 'ễ', '∞', '抗', '日', '神', '剧', '，', '\uf02d', '–',
     '？', 'ご', 'め', 'な', 'さ', 'い', 'す', 'み', 'ま', 'せ', 'ん', 'ó', 'è', '£', '¡', 'ś', '≤', '¿', 'λ', '魔', '法', '师', '）',
     'ğ', 'ñ', 'ř', '그', '자', '식', '멀', '쩡', '다', '인', '공', '호', '흡', '데', '혀', '밀', '어', '넣', '는', '거', '보', '니', 'ǒ',
     'ú', '️', 'ش', 'ه', 'ا', 'د', 'ة', 'ل', 'ت', 'َ', 'ع', 'م', 'ّ', 'ق', 'ِ', 'ف', 'ي', 'ب', 'ح', 'ْ', 'ث', '³', '饭',
     '可', '以', '吃', '话', '不', '讲', '∈', 'ℝ', '爾', '汝', '文', '言', '∀', '禮', 'इ', 'ब', 'छ', 'ड', '़', 'ʒ', '有', '「', '寧',
     '錯', '殺', '一', '千', '絕', '放', '過', '」', '之', '勢', '㏒', '㏑', 'ू', 'â', 'ω', 'ą', 'ō', '精', '杯', 'í', '生', '懸', '命',
     'ਨ', 'ਾ', 'ਮ', 'ੁ', '₁', '₂', 'ϵ', 'ä', 'к', 'с', 'ш', 'ɾ', '\ufeff', 'ã', '©', '\x9d', 'ū', '™', '＝', 'ù', 'ɪ',
     'ŋ', 'خ', 'ر', 'س', 'ن', 'ḵ', 'ā', 'ѕ', ...])
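
The character list above contains HTML entities (`&amp;`, `&lt;`, `&gt;`), typographic punctuation and invisible characters alongside genuinely non-English text. A minimal sketch of one possible clean-up pass follows; the punctuation mapping is an assumption for illustration, and what to do with the remaining non-ASCII characters is left open.

```python
import html
from collections import Counter

# Assumed mapping of typographic or invisible characters to ASCII equivalents.
PUNCT_MAP = {
    "\u2019": "'",    # right single quote
    "\u2018": "'",    # left single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2026": "...",  # ellipsis
    "\xa0": " ",      # non-breaking space
    "\u200b": "",     # zero-width space
    "\ufeff": "",     # byte order mark
}


def clean_text(text):
    """Unescape HTML entities, then map typographic characters to ASCII."""
    text = html.unescape(text)
    for src, dst in PUNCT_MAP.items():
        text = text.replace(src, dst)
    return text


def non_ascii_chars(texts):
    """Count the characters outside ASCII that remain in the texts."""
    return Counter(ch for text in texts for ch in text if ord(ch) > 127)


print(clean_text("What\u2019s R&amp;D\u2026"))
print(non_ascii_chars(["caf\u00e9 \u9ad8"]))
```

Running `non_ascii_chars` after `clean_text` gives a frequency table that helps decide which remaining characters are worth mapping and which can be dropped.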

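The validation set is not fully used yet
XGB has not been used yet
SEQ_LEN still needs to be settled
Model ensembling has not been used yet
- Simple blending
- stacking
  - 5-4-1 sampling?
Embedding blending methods
- Take the average
- Weight each embedding according to its loss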
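
The two embedding blending options above can be sketched as an element-wise weighted average of embedding matrices. This is a rough illustration under stated assumptions: all matrices share the same shape and row order, and the loss-based weights use an inverse-loss scheme, which is one plausible reading of the note rather than a confirmed method. The names and numbers are made up.

```python
def blend_embeddings(matrices, weights=None):
    """Element-wise weighted average of equally shaped embedding matrices."""
    if weights is None:
        weights = [1.0] * len(matrices)  # plain average
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [
            sum(w * m[r][c] for w, m in zip(weights, matrices)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def inverse_loss_weights(losses):
    """Assumed scheme: a lower validation loss gives a larger weight."""
    return [1.0 / loss for loss in losses]


# Toy 2-word, 2-dimensional matrices standing in for real embeddings.
glove = [[1.0, 2.0], [3.0, 4.0]]
word2vec = [[3.0, 0.0], [1.0, 0.0]]
mean = blend_embeddings([glove, word2vec])
weighted = blend_embeddings([glove, word2vec], inverse_loss_weights([0.1, 0.3]))
print(mean)
print(weighted)
```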

References

Understanding CNN for NLP
Implementing a CNN for Text Classification in TensorFlow
- TensorFlow's conv2d operation expects a 4-dimensional tensor with dimensions corresponding to batch, height, width and channel (the default NHWC layout).
- for example, [None(batch_size), sequence_length, embedding_size, channel_size]
- TensorBoard graph (figure not included)
Multilingual Hierarchical Attention Networks for Document Classification
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Effective Approaches to Attention-based Neural Machine Translation
Neural Machine Translation by Jointly Learning to Align and Translate
Attention Is All You Need
Weighted Transformer Network for Machine Translation
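Understanding Attention in Text Classification
The Role of an Attention-Based Context Classification Algorithm in Question Answering Systems
Five Attention Models: Methods and Applications
Two Ways to Initialize the Embedding Layer in Keras
- Random initialization
- Passing pre-trained vectors in through weights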

The above is updated continuously...

Bootstrap sampling to enlarge the dataset: draw n samples with replacement
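
A minimal sketch of bootstrap sampling using only the standard library; the function name, the toy rows and the seed are illustrative assumptions.

```python
import random


def bootstrap_sample(rows, n, seed=None):
    """Draw n rows uniformly at random with replacement."""
    rng = random.Random(seed)
    return rng.choices(rows, k=n)


# Toy (question, label) pairs, made up for illustration.
train = [("q1", 0), ("q2", 0), ("q3", 1)]
sample = bootstrap_sample(train, n=5, seed=42)
print(sample)
```

Because rows are drawn with replacement, the sample can be larger than the source set and will usually contain duplicates.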

Repository: pkusp/kaggle-quora-classification-2018