BERT, introduced in this paper, stands for Bidirectional Encoder Representations from Transformers. If you don't know what most of that means, you've come to the right place! Let's unpack the main ideas:
- Bidirectional - to understand the text you're looking at, you'll have to look back (at the previous words) and forward (at the next words)
- Transformers - the Attention Is All You Need paper presented the Transformer model. The Transformer reads entire sequences of tokens at once, so in a sense the model is non-directional, while LSTMs read sequentially (left-to-right or right-to-left). The attention mechanism allows the model to learn contextual relations between words (e.g. that "his" in a sentence refers to Jim).
- Contextualized word embeddings - the ELMo paper introduced a way to encode words based on their meaning/context. "Nails" has multiple meanings - fingernails and metal nails (see the sketch right after this list).
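To make the "contextualized" part concrete, here is a minimal sketch (assuming the transformers and torch packages and the bert-base-cased checkpoint used later in this project) that compares the vectors BERT produces for "nails" in two different sentences. The helper `embedding_of` is just for illustration, not part of any library.

```python
# Compare the contextual embedding of the same word in two different sentences.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
model.eval()

def embedding_of(sentence, word):
    # Return the hidden state of the first sub-token of `word` in `sentence`.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    first_subtoken = tokenizer.encode(word, add_special_tokens=False)[0]
    position = (enc["input_ids"][0] == first_subtoken).nonzero()[0].item()
    return hidden[position]

finger_nails = embedding_of("She painted her nails bright red.", "nails")
metal_nails = embedding_of("He hammered the nails into the board.", "nails")

# The two vectors differ because the surrounding context differs.
print(torch.cosine_similarity(finger_nails, metal_nails, dim=0))
```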
BERT was trained by masking 15% of the tokens with the goal of guessing them. An additional objective was to predict the next sentence. Let's look at examples of these tasks:
The objective of this task is to guess the masked tokens. Let's look at an example, and try not to make it harder than it has to be:

That's [mask] she [mask] -> That's what she said
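As a quick illustration (assuming the transformers package is installed), the fill-mask pipeline lets you watch BERT guess a masked token; note that BERT's actual mask token is written [MASK]:

```python
# A minimal sketch of the masked-token objective using the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")

# The pipeline returns the top candidates for the [MASK] position with scores.
for prediction in fill_mask("That's [MASK] she said."):
    print(prediction["token_str"], round(prediction["score"], 3))
```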
Given a pair of sentences, the task is to predict whether or not the second sentence follows the first (binary classification). Let's continue with the example (a runnable sketch follows these inputs):

Input = [CLS] That's [mask] she [mask]. [SEP] Hahaha, nice! [SEP]
Label = IsNext
Input = [CLS] That's [mask] she [mask]. [SEP] Dwight, you ignorant [mask]! [SEP]
Label = NotNext
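Here is a minimal sketch of next sentence prediction (assuming transformers and torch, and the bert-base-cased checkpoint): the model outputs two logits, where index 0 corresponds to IsNext and index 1 to NotNext.

```python
# Score whether sentence B plausibly follows sentence A.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-cased")
model.eval()

# The tokenizer builds the [CLS] A [SEP] B [SEP] input shown above.
encoding = tokenizer("That's what she said.", "Hahaha, nice!", return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits

# Index 0 = IsNext, index 1 = NotNext.
print(torch.softmax(logits, dim=1))
```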
The training corpus comprised two datasets: the Toronto Book Corpus (800M words) and English Wikipedia (2,500M words). While the original Transformer has an encoder (for reading the input) and a decoder (that makes the prediction), BERT uses only the encoder.
BERT is simply a pre-trained stack of Transformer encoders. How many encoders? We have two versions - one with 12 (BERT Base) and one with 24 (BERT Large).
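A small sanity check (assuming the transformers package) confirms the layer count; the same code with 'bert-large-cased' would report 24 layers and a hidden size of 1024:

```python
# Inspect the encoder stack of BERT Base.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
print(model.config.num_hidden_layers)   # 12 encoder layers
print(model.config.hidden_size)         # 768-dimensional hidden states
```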
The BERT paper was released along with the source code and pre-trained models. The best part is that you can do Transfer Learning (thanks to the ideas from OpenAI Transformer) with BERT for many NLP tasks - Classification, Question Answering, Entity Recognition, etc. You can train with small amounts of data and achieve great performance!
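To make the transfer learning point concrete, here is a hedged sketch (assuming the transformers package) showing that the same pre-trained weights can be loaded behind different task-specific heads; the num_labels values are purely illustrative:

```python
# The same pre-trained encoder, reused with different task-specific heads.
from transformers import (
    BertForSequenceClassification,   # classification
    BertForQuestionAnswering,        # question answering
    BertForTokenClassification,      # entity recognition
)

# Each call reuses the pre-trained encoder and adds a freshly initialized head
# (transformers will warn that the head weights are untrained - that is expected).
classifier = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-cased")
ner_model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)
```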
I used the BERT-BASE-CASED pre-trained model for this project. The main reason for going with the cased model was that words written in uppercase carry more severity, and ignoring the casing would complicate the classification. After loading the dataset I cleaned it, as it contained some NaN values and an extra column that was not needed. For tokenization I used BertTokenizer.from_pretrained with 'bert-base-cased'; a rough sketch of these steps follows. After doing the necessary data prep and training the model, a few of the outputs obtained are shown after the sketch.
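Below is a rough sketch of the loading, cleaning, and tokenization steps described above (assuming pandas and transformers); the file name and column names are placeholders, not the project's actual ones.

```python
# Load, clean, and tokenize the dataset (placeholder file and column names).
import pandas as pd
from transformers import BertTokenizer

df = pd.read_csv("dataset.csv")           # hypothetical file name
df = df.dropna()                          # drop rows containing NaN values
df = df.drop(columns=["Unnamed: 0"])      # drop the extra, unneeded column (placeholder name)

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encodings = tokenizer(
    df["text"].tolist(),                  # "text" is a placeholder column name
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(encodings["input_ids"].shape)       # (num_examples, 128)
```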