说明:本项目为
kaggle
比赛内容,开始时间2018.11.09
- 比赛链接:Quora Insincere Classification
- 比赛形式: kernel only
更新中...先做数据处理,再跑LSTM的baseline,再用CNN, 再加attention,再调超参数
-
input为输入数据
-
working/main_process.py 为数据预处理,返回
train
和test
清洗后的数据
- 数据探测:EDA
- 优秀的kernel梳理:kernels
- [Pre-processing when using embeddings]
- LSTM is all you need
- LSTM+Attention
- Different embeddings with attention
- LSTM + CNN
- LSTM Attention
- Blending with Linear Regression (9 models 0.688LB)
- 2D CNN TextClassifier
- inceptionCNN with flip
- Single RNN with 4 folds 0.692LB(good coding style)
5581 seconds
- LSTM baseline no tune:
0.573
2000s - Pre-processing + LSTM Baseline no tune:
0.631
- LSTM + Attention + 256Dense + 0.5Dropout + SEQ_LEN(30->50) + thresh(0.5->0.35) no tune:
0.664
- 2D CNN Baseline
0.664
800s
-
观察正负例数量
-
空缺值
-
拆分训练集、验证集
-
使用不同的
embeddings
(glove,word2vec,自己训练)训练后,进行融合Keras-embedding
层类似word2vec,将输入的二维(用字典id表示的句子)转化为三维张量(即将id训练成vec)
-
f1score评价
-
LSTM + CNN + attention
-
tokenizer
用来将query转化为序列(先遍历得到字典,然后按频率排序得到id) -
Don't use standard preprocessing steps like stemming or stopword removal when you have pre-trained embeddings in deeplearning methods
-
Get your vocabulary as close to the embeddings as possible
- 新词没有词向量
[('bitcoin', 987), ('Quorans', 858), ('cryptocurrency', 822), ('Snapchat', 807), ('btech', 632), ('Brexit', 493), (
'cryptocurrencies', 481), ('blockchain', 474), ('behaviour', 468), ('upvotes', 432), ('programme', 402), (
'Redmi', 379), ('realise', 371), ('defence', 364), ('KVPY', 349), ('Paytm', 334), ('grey', 299), ('mtech', 281), (
'Btech', 262), ('bitcoins', 254)]
- 正常的清洗之后还有异常文字(非英文)
odict_keys(
['w', 'h', 'a', 't', ' ', 'v', 'e', 'b', 'n', 's', 'x', 'i', 'm', 'u', 'o', 'd', 'l', 'p', 'r', '?', 'c', 'g', 'f',
',', 'y', 'j', '9', '8', '%', '1', '0', '2', 'k', 'q', '5', '$', '6', '.', 'z', '(', ')', "'", '-', '’', '3', '7',
'/', '!', '"', 'é', '4', '…', '&', '“', '”', '+', '\\', '=', '{', '^', '}', ';', '[', ']', '|', ':', '*',
'<', '₹', 'á', '²', 'ế', '청', '하', '¨', '‘', '√', '×', '−', '´', '\xa0', '`', 'θ', '高', '端', '大', '气', '上', '档',
'次', '_', '½', 'π', '#', '小', '鹿', '乱', '撞', '成', '语', 'ë', 'à', 'ç', '@', 'ü', 'č', 'ć', 'ž', 'đ', '>', '°',
'द', 'े', 'श', '्', 'र', 'ो', 'ह', 'ि', 'प', 'स', 'थ', 'त', 'न', 'व', 'ा', 'ल', 'ं', '林', '彪', '€', '\u200b', '˚',
'ö', '~', '—', '越', '人', 'च', 'म', 'क', 'ु', 'य', 'ी', 'ê', 'ă', 'ễ', '∞', '抗', '日', '神', '剧', ',', '\uf02d', '–',
'?', 'ご', 'め', 'な', 'さ', 'い', 'す', 'み', 'ま', 'せ', 'ん', 'ó', 'è', '£', '¡', 'ś', '≤', '¿', 'λ', '魔', '法', '师', ')',
'ğ', 'ñ', 'ř', '그', '자', '식', '멀', '쩡', '다', '인', '공', '호', '흡', '데', '혀', '밀', '어', '넣', '는', '거', '보', '니', 'ǒ',
'ú', '️', 'ش', 'ه', 'ا', 'د', 'ة', 'ل', 'ت', 'َ', 'ع', 'م', 'ّ', 'ق', 'ِ', 'ف', 'ي', 'ب', 'ح', 'ْ', 'ث', '³', '饭',
'可', '以', '吃', '话', '不', '讲', '∈', 'ℝ', '爾', '汝', '文', '言', '∀', '禮', 'इ', 'ब', 'छ', 'ड', '़', 'ʒ', '有', '「', '寧',
'錯', '殺', '一', '千', '絕', '放', '過', '」', '之', '勢', '㏒', '㏑', 'ू', 'â', 'ω', 'ą', 'ō', '精', '杯', 'í', '生', '懸', '命',
'ਨ', 'ਾ', 'ਮ', 'ੁ', '₁', '₂', 'ϵ', 'ä', 'к', 'с', 'ш', 'ɾ', '\ufeff', 'ã', '©', '\x9d', 'ū', '™', '=', 'ù', 'ɪ',
'ŋ', 'خ', 'ر', 'س', 'ن', 'ḵ', 'ā', 'ѕ', ...])
-
验证集还没用完
-
XGB还未使用
-
SEQ_LEN还需要确定
-
模型融合还未使用
- 简单融合
- stacking
- 5-4-1采样?
-
Embeddings融合方法
- 取平均
- 求每个loss按权重取
-
implementing a CNN for Text Classification in TensorFlow
- TensorFlow’s convolutional conv2d operation expects a 4-dimensional tensor with dimensions corresponding to
batch
,width
,height
andchannel
. - for example,
[None(batch_size), sequence_length, embedding_size, channel_size]
- tensorboard bellow:
- TensorFlow’s convolutional conv2d operation expects a 4-dimensional tensor with dimensions corresponding to
-
Multilingual Hierarchical Attention Networks for Document Classification
-
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
-
Effective Approaches to Attention-based Neural Machine Translation
-
Neural Machine Translation by Jointly Learning to Align and Translate
-
- 随机初始化
- 使用
weights
传入
自助采样扩充数据集,有放回的采n个