DeepTweets

Real or Not? NLP with Disaster Tweets

Link: https://www.kaggle.com/c/nlp-getting-started

History of Word Embeddings

Traditionally, text is represented with bag-of-words (BoW) features such as TF-IDF or count vectors. Besides BoW, we can apply LDA or LSA to word features. However, these approaches have limitations such as high-dimensional, sparse feature vectors. A word embedding is instead a dense, low-dimensional vector. Word embeddings have been shown to give better features on most NLP problems.
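To make the sparsity point concrete, here is a minimal sketch of bag-of-words count vectors in plain Python. The example sentences and the helper names `build_vocabulary` and `count_vector` are made up for illustration; they are not taken from this project's code, and in practice a library vectorizer would usually be used instead.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def count_vector(document, vocabulary):
    """Return a bag-of-words count vector with one slot per vocabulary entry."""
    counts = Counter(document.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical toy corpus, not drawn from the competition data.
docs = [
    "forest fire near the village",
    "the concert was fire",
    "flood warning issued for the coast",
]

vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]

# Fraction of entries that are zero across all vectors.
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vectors) * len(vocab))
print(f"vocabulary size: {len(vocab)}, sparsity: {sparsity:.2f}")
```

Every vector has one slot per vocabulary word, so the dimensionality grows with the corpus while each tweet fills only a few slots. A dense embedding avoids this by using a fixed, much smaller number of dimensions.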

In 2013, Mikolov et al. made word embeddings popular, and they eventually became state of the art in NLP. They released the word2vec toolkit, which gave us access to pre-trained models. Later on, gensim provided a convenient wrapper for loading different pre-trained word embedding models, including Word2Vec (by Google), GloVe (by Stanford), and fastText (by Facebook).

Twelve years before Mikolov et al. introduced Word2Vec, Bengio et al. published a paper [1] on language modeling that contains the initial idea of word embeddings. At that time, they called this process "learning a distributed representation for words".

  • 2001: Bengio et al. introduced the concept of word embeddings.
  • 2008: Ronan Collobert and Jason Weston introduced the concept of a pre-trained model.
  • 2013: Mikolov et al. released Word2Vec, a pre-trained model.

Approaches

  • Bag of Words, N-grams, and their TF-IDF.
  • Shallow Neural Net
  • Attempt to use ConvNets (Zhang and LeCun, 2015)
  • CNNs for Sentence Classification, Yoon Kim
  • Very Deep CNN Architecture, Facebook AI Research
  • Fine tuning of BERT for text classification.

Dataset

  • id
  • keyword
  • location
  • text
  • target: 1 (real disaster) or 0 (fake disaster)

Toolkit: TensorFlow, sklearn

Dataset Analysis

  • 7613 examples for training
  • 3263 examples for testing

Dataset Cleaning

  • Convert all uppercase letters to lowercase
  • Remove all punctuation marks
  • Remove URLs, emojis and HTML text
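The cleaning steps above could be sketched roughly as follows, using only the standard library. The function name `clean_tweet`, the regular expressions, and the order of the steps are assumptions made for illustration rather than the project's actual code. In particular, the emoji pattern covers only some common Unicode ranges and is an approximation.

```python
import html
import re
import string

# Illustrative patterns; the project's real patterns may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Approximate emoji coverage: pictographs, emoticons, flags, misc symbols.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF]"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, HTML, emojis, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove URLs before punctuation is stripped
    text = TAG_RE.sub(" ", text)        # remove HTML tags such as <b>...</b>
    text = html.unescape(text)          # turn entities like &amp; into characters
    text = EMOJI_RE.sub(" ", text)      # remove emojis in the covered ranges
    text = text.translate(PUNCT_TABLE)  # remove ASCII punctuation
    return " ".join(text.split())       # collapse repeated whitespace

sample = "Forest FIRE near La Ronge!! <b>Evacuate</b> now 🔥 http://t.co/abc123 &amp; stay safe"
print(clean_tweet(sample))
```

URLs are removed before punctuation in this sketch because stripping punctuation first would break a link into ordinary-looking word fragments that the URL pattern could no longer match.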

Text encoding

  1. Character Level Encoding
  2. Word Level Encoding
    • Bag of Words
    • GloVe
  3. Sentence Level Encoding
    • Google Universal Sentence Encoder
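As a rough illustration of how character-level and word-level encoding differ, the sketch below turns the same kind of text into integer id sequences at both granularities. The helper names `build_index` and `encode`, the reserved ids (0 for padding, 1 for unknown), and the toy corpus are all assumptions for this example. It does not cover GloVe or the Universal Sentence Encoder, which map text to pre-trained dense vectors rather than ids.

```python
def build_index(symbols):
    """Assign ids starting at 2; 0 is reserved for padding and 1 for unknown symbols."""
    return {s: i for i, s in enumerate(sorted(set(symbols)), start=2)}

def encode(symbols, index, max_len):
    """Map symbols to ids, truncate to max_len, and right-pad with 0."""
    ids = [index.get(s, 1) for s in symbols][:max_len]
    return ids + [0] * (max_len - len(ids))

# Hypothetical toy corpus, not drawn from the competition data.
corpus = ["fire in the hills", "the hills are alive"]

# Character level: every distinct character gets an id.
char_index = build_index(ch for text in corpus for ch in text)
# Word level: every distinct whitespace-separated token gets an id.
word_index = build_index(w for text in corpus for w in text.split())

char_ids = encode("fire", char_index, max_len=6)
word_ids = encode("fire on the hills".split(), word_index, max_len=6)

print(char_ids)
print(word_ids)
```

The character vocabulary stays small and rarely meets an unknown symbol, while the word vocabulary is larger and maps unseen words such as "on" to the unknown id. Either id sequence would typically be fed to an embedding layer in a neural model.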