This repository contains a fork of the Transformer model from https://github.com/huggingface/transformers. Unlike the original source code, which is a library, this repository is runnable stand-alone code for language modeling tasks.
The source code of such open-source libraries is great, but it can be hard to read through all of it just to run and experiment with a model. Preprocessing data, defining a train-eval loop, and integrating a model into that loop are essential parts of any machine learning program, yet they are not always easy to trace through a large open-source project.
This repository combines the open-source model code with tutorial-level preprocessing and train-eval code. As a result, we expect readers can easily understand how Transformer models are implemented and how they work, from scratch.
This repository partly uses source code from the following sources:
- Preprocessing, train-eval loop (except the model): https://pytorch.org/tutorials/beginner/transformer_tutorial.html
- Transformer model: https://github.com/huggingface/transformers/blob/a7d46a060930242cd1de7ead8821f6eeebb0cd06/src/transformers/models/bert/modeling_bert.py
What is a model? What is its role? I think a good model needs a good inductive bias, meaning it generalizes well to examples unseen during training.
The difference between neural network learning and other learning paradigms is that a neural network learns from data by building a good representation of that data. In contrast, many other methods learn from features that are manually selected by humans.
The Transformer model is one of the most popular representation generators among neural network methods. Because of its general representation-processing mechanism, attention-based learning, many recent advances in deep learning rely on it.
So what do Transformers actually do? What modules comprise them? What are their implications? These are natural questions for a beginner like me.
Because the Transformer model is a special case of a deep learning model, let's first look at what a deep learning model generally does. A deep learning model can be viewed as a directed acyclic graph (DAG; a graph structure with no path leading back to an ancestor node).
A deep learning model processes matrices, which serve as a computable representation of the input (such as text or images). The computation of the model retrieves progressively more abstract representations of that input.
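As a minimal sketch of this view (not the repository's code), the model below is just a chain of matrix transformations: each layer maps the current representation to a new one, and the chain of layers forms a simple DAG.

```python
import torch
import torch.nn as nn

# A toy "DAG" of matrix transformations: earlier layers produce low-level
# features, later layers transform them into higher-level ones.
model = nn.Sequential(
    nn.Linear(768, 768),  # earlier layer
    nn.ReLU(),
    nn.Linear(768, 768),  # later layer
    nn.ReLU(),
)

x = torch.randn(2, 10, 768)   # (batch, sequence length, feature dim) input matrix
hidden = model(x)             # transformed representation, same shape
print(hidden.shape)           # torch.Size([2, 10, 768])
```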
As the ImageNet example shows, the earlier part of the DAG (the matrix-transformation 'layers' of the model) generates low-level features from the input data (e.g. the stroke shapes that make up a car), while the latter part of the DAG transforms them into high-level features (e.g. the wheels of a car, or "this is a Ford Model T").
We can observe the same in natural language processing. The example above shows a step in building a representation of an input sentence for a machine translation task with the Transformer model. Encoder layer 2 shows attention in the earlier part of the DAG: it visualizes what to attend to within the given sentences. The diagonal line shows that most words attend to themselves, which is natural because the low-level features of text are at the word level. In the latter part of the DAG, however, attention concentrates on specific words, which corresponds to an abstract representation of the sentence that is crucial for the given task (for example, the word 'bad' is important for movie review sentiment analysis).
You may wonder what rules shape this abstract representation; that is the role of the learning algorithm. A neural network is a learning algorithm whose behavior depends on the task setup. A task like classifying text or images can be viewed as classic supervised learning. In the supervised deep learning case, shaping the abstract representation is the job of the optimization method (e.g. backpropagation with learning rate scheduling) that uses labeled data. More concretely, during training the model outputs an inference result, that result is compared against the label, and the difference is backpropagated through the model; this drives (thus trains) the model parameters to generate a more task-oriented representation. (Note that there can be multiple tasks, as in multi-task learning, or a much more general task, as in few-shot learning.)
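A minimal sketch of one such supervised training step is shown below; the model, data, and hyperparameters are illustrative placeholders, not taken from this repository.

```python
import torch
import torch.nn as nn

# Toy classifier and one supervised training step.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(8, 768)            # batch of 8 input representations
labels = torch.randint(0, 2, (8,))      # classification labels

optimizer.zero_grad()
logits = model(inputs)                  # inference result
loss = criterion(logits, labels)        # compare against the labels
loss.backward()                         # backpropagate the difference
optimizer.step()                        # adjust parameters toward the task
```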
To recap: a significant property of a machine learning model is that a good model has a good inductive bias and therefore good generalization performance. Deep learning is the neural network approach to learning, focused on deriving task-oriented representations from data. The Transformer is a popular deep learning method that is good at learning textual or visual representations from data. A deep learning model can be viewed as a matrix-processing DAG, where the input data is a matrix and the DAG gradually learns representations of the data from low-level to high-level features. The high-level features are determined by the learning algorithm, in particular backpropagation with labeled data.
I think all of the above matters for building good machine learning algorithms and systems. Let's look at how these concepts are reflected in the deep learning model (the DAG) of the Transformer encoder.
The figure above shows the Transformer model and its mapping to Python classes.
We only look at the Transformer encoder, because the encoder and decoder share the same core DAG functionality.
The role of the input embedding layer is to create a meaningful (task-oriented) representation of sentences - embeddings - from natural language. In this example, it receives word ids and creates 768-dimensional vectors.
This 768-dimensional vector, the embedding, is a low-level representation of the sentence; it is shaped by a mixture of the literal representation of the text and weights adjusted by backpropagation toward the task objective.
The embedding layer adds (sums in) embeddings from other data, such as token type and positional information.
To regularize the embeddings, it applies layer normalization and dropout, which reduce training time and improve generalization respectively.
This is implemented in the BertEmbedding Python class.
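Below is a simplified sketch of this embedding layer (the real implementation lives in the forked modeling_bert.py; the hyperparameter values here follow the BERT-base defaults and are assumptions, not the repository's exact settings).

```python
import torch
import torch.nn as nn

class SimpleEmbeddings(nn.Module):
    """Word + positional + token-type embeddings, then layer norm and dropout."""

    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_position=512, type_vocab_size=2, dropout=0.1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, token_type_ids=None):
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # word, positional, and token-type embeddings are summed element-wise
        embeddings = (self.word_embeddings(input_ids)
                      + self.position_embeddings(position_ids)
                      + self.token_type_embeddings(token_type_ids))
        # layer normalization and dropout for stability and regularization
        return self.dropout(self.layer_norm(embeddings))

emb = SimpleEmbeddings()
ids = torch.randint(0, 30522, (2, 10))   # a batch of word ids
print(emb(ids).shape)                     # torch.Size([2, 10, 768])
```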
Multiple layers of transformation are required to make a high-level representation.
The BertEncoder class stacks multiple self-attention layers of a transformer to implement this.
It initializes the transformer layers and feeds the output matrix of one layer as the input of the next.
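A minimal sketch of this stacking idea is shown below; a single Linear layer stands in for the repository's full BertLayer.

```python
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """Stack of layers where each layer's output is the next layer's input."""

    def __init__(self, hidden_size=768, num_hidden_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_hidden_layers)
        )

    def forward(self, hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states)  # feed each output into the next layer
        return hidden_states

encoder = SimpleEncoder()
print(encoder(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 768])
```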
A single transformer layer is comprised of an attention part, an intermediate processing part, and an output processing part.
The first module, BertAttention, applies self-attention by calling BertSelfAttention and normalizes its output with the BertSelfOutput module.
The second module, BertIntermediate, increases model capacity by adding one wider layer (hidden_size 768 dims to intermediate_size 3072 dims).
The last module, BertOutput, is a bottleneck layer (intermediate_size 3072 dims back to hidden_size 768 dims) that also applies additional regularization.
BertLayer connects the above three modules so that they can be stacked.
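The expand-then-contract pattern of BertIntermediate and BertOutput can be sketched as below (a simplification under BERT-base default sizes, not the forked code itself).

```python
import torch
import torch.nn as nn

class SimpleFeedForward(nn.Module):
    """Widen the representation, then project it back with dropout and layer norm."""

    def __init__(self, hidden_size=768, intermediate_size=3072, dropout=0.1):
        super().__init__()
        self.intermediate = nn.Linear(hidden_size, intermediate_size)  # like BertIntermediate
        self.act = nn.GELU()
        self.output = nn.Linear(intermediate_size, hidden_size)        # like BertOutput
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states):
        expanded = self.act(self.intermediate(hidden_states))
        # residual connection plus normalization, as in the BERT reference code
        return self.layer_norm(hidden_states + self.dropout(self.output(expanded)))

ff = SimpleFeedForward()
print(ff(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 768])
```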
I think the concept of attention is no more than a matrix transformation that uses representations from relevant information sources. The figure above shows early work on attention. In that case, a bidirectional sequential encoder (an RNN) generates hidden-state matrices, and attention probabilities weight them at each step of the decoding process to build a representation relevant to the given task.
Informally, attention is just a matrix of probabilities used to scale a representation so that it becomes more appropriate for the task. Attention is a signal for adjusting the representation, and it can therefore reduce the model size needed to achieve the same quality.
If we view attention as scaling a representation, what is a good information source for that scaling? The idea of self-attention is to find the importance of each word, score it, and use the scores to scale the representation for the given task. The scores are found by multiplying the query and key matrices: more specifically, the attention score matrix is obtained from the dot products between a single query word and all key words. Finally, each value vector is scaled by the attention score of its word. A higher score implies that the given word has a strong relation to the other words in the sentence; the training procedure lets the neural network define these relationships by adjusting weights, guided by the loss function through backpropagation. The attention scores depend entirely on the representation, which is the result of linear transformations of the embedding. During training, the weights of the embedding and of these linear transformations are adjusted to generate attention scores that are maximally helpful for resolving the given task.
The BertSelfAttention class implements this. It initializes linear layers for the query, key, and value.
It performs matrix multiplication between the query and the key.
It scales the result and turns it into probabilities with the softmax function before applying it to the value.
The value is then the representation adjusted for the specific task.
Note that the adjustment is gradually refined across stacked layers.
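Below is a minimal sketch of this query/key/value computation for a single attention head; shapes and names are illustrative, not the repository's exact code.

```python
import math
import torch

def self_attention(query, key, value):
    # attention scores: each query position scored against every key position
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
    probs = torch.softmax(scores, dim=-1)   # normalize scores into probabilities
    return torch.matmul(probs, value)       # scale each value by its attention weight

q = k = v = torch.randn(2, 10, 64)          # (batch, sequence length, head dim)
print(self_attention(q, k, v).shape)        # torch.Size([2, 10, 64])
```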
The image above shows encoder self-attention (input-input) for machine translation. It illustrates which words are related to it_. At layer 1, the word is related to most other words; but as later layers compute more abstract representations, it shows a strong relationship to nouns like animal_ or street_.
Note that multi-head attention simply splits the input representation by the number of heads, which allows the model to attend to different subspaces of the representation without averaging them (see the references and the original paper).
So the multi-head property can be viewed as a regularizer that attends to different subspaces of the representation by preventing the attention scores from collapsing into a single average.
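The split itself is just a reshape, sketched below with the BERT-base sizes (768 hidden dims, 12 heads) as assumed values.

```python
import torch

hidden = torch.randn(2, 10, 768)                    # (batch, sequence, hidden)
num_heads, head_dim = 12, 768 // 12                 # 12 heads of 64 dims each

heads = hidden.view(2, 10, num_heads, head_dim)     # split the last dim into heads
heads = heads.permute(0, 2, 1, 3)                   # (batch, heads, sequence, head dim)
print(heads.shape)                                  # torch.Size([2, 12, 10, 64])
```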
In summary, self-attention is just matrix multiplication that adjusts the importance of each word in a sentence to make a better representation.
After retrieving the self-attention-adjusted representation from the previous step, the BertSelfOutput module applies an additional linear transformation with layer normalization and dropout.
I think the role of this component is to provide additional model capacity and regularization.
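A simplified sketch of this pattern, assuming the usual residual-plus-layer-norm arrangement of the BERT reference code:

```python
import torch
import torch.nn as nn

class SimpleSelfOutput(nn.Module):
    """Linear projection of the attention output, dropout, residual, layer norm."""

    def __init__(self, hidden_size=768, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, attention_output, attention_input):
        projected = self.dropout(self.dense(attention_output))
        return self.layer_norm(projected + attention_input)  # residual + normalization

out = SimpleSelfOutput()
x = torch.randn(2, 10, 768)
print(out(x, x).shape)  # torch.Size([2, 10, 768])
```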
We use our model for language modeling. Language modeling is an NLP task that predicts the next word from a given sequence of words. The Transformer approach to language modeling treats it as text classification, calculating the conditional probability of the next word given the previous words.
BertModel creates the embeddings and passes them to the encoder.
The encoder provides a representation of the sentence, and the accumulated hidden states from all layers are also passed to the pooler to build a representation of the whole sentence.
BertForLM takes the last hidden state (since language modeling is classification over the next word), applies dropout, and passes it to a linear layer to obtain the conditional distribution.
It calculates the loss after inference is done.
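A simplified sketch of such a language-modeling head is given below; the class name and sizes are placeholders, not the repository's exact API.

```python
import torch
import torch.nn as nn

class SimpleLMHead(nn.Module):
    """Dropout on the last hidden state, linear projection to the vocabulary, loss."""

    def __init__(self, hidden_size=768, vocab_size=30522, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.decoder = nn.Linear(hidden_size, vocab_size)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, last_hidden_state, labels=None):
        logits = self.decoder(self.dropout(last_hidden_state))  # conditional distribution
        if labels is None:
            return logits
        loss = self.criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
        return logits, loss

head = SimpleLMHead()
hidden = torch.randn(2, 10, 768)              # stand-in for the encoder output
labels = torch.randint(0, 30522, (2, 10))     # next-word targets
logits, loss = head(hidden, labels)
print(logits.shape, loss.item())
```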
First of all, the data is split into batches for the train and eval loops.
The figure above shows the data prepared for training. Each row of the tensor contains a batch-size number of words (20 in our case). As the example shows, a single sentence can span multiple batches while staying in the same column.
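This batching scheme can be sketched as below, in the spirit of the PyTorch transformer tutorial's batchify (assumed here; the repository may differ in details): the flat stream of word ids is cut into batch-size columns.

```python
import torch

def batchify(data, batch_size):
    """Reshape a flat stream of word ids into (n_rows, batch_size)."""
    n_rows = data.size(0) // batch_size
    data = data[: n_rows * batch_size]                   # drop the remainder
    return data.view(batch_size, -1).t().contiguous()    # each column is a contiguous chunk

stream = torch.arange(100)          # stand-in for a tokenized corpus
batched = batchify(stream, 20)
print(batched.shape)                # torch.Size([5, 20])
```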
- Set model to training mode
- Define criterion, optimizer, and scheduler
- Fetch data
- Pass to model
- Calculate loss
- Backpropagation
- Gradient clipping
- Optimizer stepping
- Loss calculation
- Logging (epoch number, batches, loss, ppl)
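The training steps listed above can be sketched as one function; it assumes a model that returns logits over the vocabulary and an iterable of (data, targets) batches, with placeholder names and hyperparameters rather than the repository's exact ones.

```python
import math
import torch
import torch.nn as nn

def train_one_epoch(model, train_batches, vocab_size, lr=5.0, clip=0.5, log_interval=200):
    model.train()                                              # set model to training mode
    criterion = nn.CrossEntropyLoss()                          # define criterion,
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)     # optimizer,
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)  # and scheduler

    total_loss = 0.0
    for step, (data, targets) in enumerate(train_batches):     # fetch data
        optimizer.zero_grad()
        output = model(data)                                   # pass to model
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))  # calculate loss
        loss.backward()                                        # backpropagation
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # gradient clipping
        optimizer.step()                                       # optimizer stepping
        total_loss += loss.item()                              # loss accumulation
        if step % log_interval == 0 and step > 0:              # logging
            avg = total_loss / log_interval
            print(f"batch {step} | loss {avg:.2f} | ppl {math.exp(avg):.2f}")
            total_loss = 0.0
    scheduler.step()
```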
- Set model to evaluation mode
- No grad definition
- Fetch data
- Pass to model (deterministic process)
- Total loss calculation
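The evaluation steps can be sketched similarly, under the same assumptions as the training sketch above.

```python
import torch
import torch.nn as nn

def evaluate(model, eval_batches, vocab_size):
    model.eval()                                   # set model to evaluation mode
    criterion = nn.CrossEntropyLoss()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():                          # no-grad context
        for data, targets in eval_batches:         # fetch data
            output = model(data)                   # deterministic forward pass
            total_loss += criterion(output.view(-1, vocab_size), targets.view(-1)).item()
            n_batches += 1
    return total_loss / max(n_batches, 1)          # average loss over the eval set
```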
- train_eval.log shows the numbers. Since the training data is small, the model is close to underfitting.
- With more transformer layers, underfitting gets worse, so the loss gets higher.
- 12-layer transformer encoder for the LM task (underfitting due to the large model capacity)

A common configuration for constructing the BERT model. Options include:
https://huggingface.co/transformers/main_classes/configuration.html#transformers.PretrainedConfig
- vocab_size: vocabulary size for the word embedding
- hidden_size: size of the model's internal (hidden) representation
- num_hidden_layers: number of transformer layers that transform the hidden representation
- num_attention_heads: number of heads the feedforward output is split into for different attention heads
- intermediate_size: size of the intermediate layer after self-attention
- hidden_act: activation function for the intermediate layer after self-attention
- hidden_dropout_prob: hidden-layer dropout probability for regularization
- attention_probs_dropout_prob: self-attention dropout probability for regularization
- max_position_embeddings: maximum positional embedding length
- type_vocab_size: maximum token type vocabulary size (...)
- layer_norm_eps: layer norm epsilon
- output_attentions: option to output attention probabilities
- output_hidden_states: option to output hidden states
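For reference, these options can be set through the Hugging Face BertConfig, shown below with its BERT-base defaults; the forked code may expose the same options through its own configuration class.

```python
from transformers import BertConfig

config = BertConfig(
    vocab_size=30522,                 # word embedding vocabulary size
    hidden_size=768,                  # internal (hidden) representation size
    num_hidden_layers=12,             # number of stacked transformer layers
    num_attention_heads=12,           # heads the hidden representation is split into
    intermediate_size=3072,           # size of the layer after self-attention
    hidden_act="gelu",                # activation for the intermediate layer
    hidden_dropout_prob=0.1,          # hidden-layer dropout for regularization
    attention_probs_dropout_prob=0.1, # attention dropout for regularization
    max_position_embeddings=512,      # max positional embedding length
    type_vocab_size=2,                # max token-type vocabulary size
    layer_norm_eps=1e-12,             # layer norm epsilon
    output_attentions=False,          # whether to return attention probabilities
    output_hidden_states=False,       # whether to return all hidden states
)
```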
- Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
- Sequence-to-Sequence Modeling with nn.Transformer and TorchText: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
- transformers/modeling_bert.py at a7d46a060930242cd1de7ead8821f6eeebb0cd06 · huggingface/transformers (GitHub)
- Configuration - transformers 3.5.0 documentation: https://huggingface.co/transformers/main_classes/configuration.html#transformers.PretrainedConfig
- The Annotated Transformer
- Tensor2Tensor Intro
- Inductive bias - Wikipedia
- Directed acyclic graph - Wikipedia
- Transfer learning and Image classification using Keras on Kaggle kernels.
- Attention? Attention!: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#multi-head-self-attention
- Attention Is All You Need