Coursera Deep Learning Specialization notes and code [Specialization Page]
Notes are taken in Q&A format.
Moved to Note.md since some formula images are not rendered 😢
- Course 1: Neural Networks and Deep Learning
- Week 3: Shallow Neural Networks
- Week 4: Deep Neural Networks
- Programming: Building your Deep Neural Network: Step by Step (Code)
- Programming: Deep Neural Network - Application
- Course 2: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
- Course 3: Structuring Machine Learning Projects
- Course 4: Convolutional Neural Networks
- Week 1: Foundations of Convolutional Neural Networks
- Week 2: Deep Convolutional Models: Case Studies
- Programming: Residual Networks
- Programming: Transfer Learning with MobileNet
- Week 3: Detection Algorithms
- Programming: Car Detection with YOLO
- Programming: Image Segmentation with U-Net
- Week 4: Face Recognition
- Programming: Face Recognition
- Programming: Art Generation with Neural Style Transfer
- Course 5: Sequence Model
- Week 1: Recurrent Neural Network
- Programming: Building your Recurrent Neural Network - Step by Step
- Programming: Dinosaur Island-Character-Level Language Modeling
- Programming: Jazz Improvisation with LSTM
- Week 2: Natural Language Processing & Word Embeddings
- Programming: Operations on Word Vectors - Debiasing
- Programming: Emojify
- Week 3: Sequence Models & Attention Mechanism
- Programming: Neural Machine Translation
- Programming: Trigger Word Detection
- Week 4: Transformer Network
- Programming: Transformers Architecture with TensorFlow (Keras)
- Lab: Transformer Pre-processing
- Lab: Transformer Network Application: Named-Entity Recognition
- Lab: Transformer Network Application: Question Answering
- Lab: Transformers using Trax Library
- Further Exploration
Week 1 is the overview of the course and specialization.
What is the difference between structured and unstructured data?

| | Features | Example |
|---|---|---|
| Structured | columns of a database | house price |
| Unstructured | pixel values, individual words | audio, image, text |
What are the dimensions of the input matrix and weights?

| Param | Description |
|---|---|
| $m$ | number of observations |
| $n_x$ | number of features (input data) |
| $L$ | number of layers ($l = 0$: input layer) |
| $n^{[l]}$ | number of units (features) at layer $l$ |

| Matrix | Shape |
|---|---|
| $W^{[l]}$ | $(n^{[l]}, n^{[l-1]})$ |
| $b^{[l]}$ | $(n^{[l]}, 1)$ |
| $Z^{[l]}$, $A^{[l]}$ | $(n^{[l]}, m)$ |

To better memorize:

$W^{[l]}$:
- num of rows: number of units of the next layer
- num of cols: number of units of the current layer

$Z^{[l]}$, $A^{[l]}$:
- num of rows: number of units of the next layer
- num of cols: number of observations
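A minimal numpy sketch of these shapes, assuming a hypothetical 2-4-1 network (the names `layer_dims` and `params` are just illustrative):

```python
import numpy as np

# Hypothetical layer sizes: 2 input features, 4 hidden units, 1 output unit
layer_dims = [2, 4, 1]
params = {}
for l in range(1, len(layer_dims)):
    # W[l]: (units of this layer, units of the previous layer); b[l]: (units of this layer, 1)
    params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    params["b" + str(l)] = np.zeros((layer_dims[l], 1))

print(params["W1"].shape, params["b1"].shape)  # (4, 2) (4, 1)
print(params["W2"].shape, params["b2"].shape)  # (1, 4) (1, 1)
```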
What are the pros and cons of activation functions?

| Activation | Pros | Cons |
|---|---|---|
| Sigmoid | output in (0, 1), natural for binary output layers | saturates (vanishing gradient), not zero-centered |
| Tanh | zero-centered, usually better than sigmoid for hidden layers | still saturates at the extremes |
| ReLU | cheap to compute, no saturation for positive inputs, default choice | units can "die" for negative inputs |
| Leaky ReLU | avoids dead units | slope for negative inputs is one more choice to make |
Why non-linear activation functions?
If we use linear activation functions, no matter how many layers you have, the NN is just computing a linear function.
Why do we usually initialize $W$ as small random values?

large W -> large Z (Z = WX + b) -> end up at the flat parts of the Sigmoid/Tanh function -> gradient will be small -> gradient descent will be slow -> learning will be slow

If you're not using Sigmoid or Tanh activation functions, it is less of an issue. But note that if you're doing binary classification, the output layer will still be a Sigmoid function.
What distribution should we draw from?

Normal distribution. In Python we should use `np.random.randn` (normal distribution) instead of `np.random.rand` (uniform distribution).
Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to “break symmetry”, True/False?
False. Logistic regression doesn't have a hidden layer. If you initialize the weights to zeros, the first example x fed into logistic regression will output zero, but the derivatives of logistic regression depend on the input x (because there's no hidden layer), which is not zero.

So at the second iteration, the weight values follow x's distribution and are different from each other if x is not a constant vector.

But in deep learning we should randomly initialize either $W$ or $b$ to "break symmetry". If both $W$ and $b$ are zero, $a^{[1]}$ will be 0 because tanh(0) = 0. Using non-zero initialization but making all the values the same does not work either: we can still learn new values, but the values we get are symmetric, which means the network behaves the same as one with a single neuron.
Reference: Symmetry Breaking versus Zero Initialization
A = np.random.randn(4,3); B = np.sum(A, axis = 1, keepdims = True). What will be B.shape?
Click to see answer
(4, 1)
We use (keepdims = True) to make sure that B.shape is (4, 1) and not (4,). It makes our code more robust.
What is the relationship between # of hidden units and # of layers?
Informally: for equal performance shallower networks require exponentially more hidden units to compute.
What is the intuition about deep representation?
Intuitively, deeper layers compute more complex things such as eyes instead of edges.
Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l = 1, 2, …, L. True/False?

False. Forward propagation propagates the input through the layers. Although for a shallow network we may simply write out all the lines, in a deeper network we cannot avoid a for loop iterating over the layers.
What are the differences when creating train, dev, test sets in traditional ML and DL?
In traditional ML, train/dev/test split may be 60% / 20% / 20%.
In DL, since the data is large, train/dev/test may be 99.5% / 0.4% / 0.1%
Side note: not having a test set might be okay.
What should we do if the variance or bias is high?

| Problem | Try |
|---|---|
| High bias | Bigger network; train longer; (NN architecture search) |
| High variance | More data; regularization; (NN architecture search) |
Why does regularization reduce overfitting?

If lambda is large, the weights will be small or close to zero, because gradient descent minimizes the cost function, which now penalizes large weights.

small weights -> smaller impact of some hidden units -> simpler network -> less overfitting
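A small sketch of how the L2 penalty enters the cost, assuming `cross_entropy_cost` and the list of weight matrices are computed elsewhere (names are illustrative):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add (lambda / 2m) * sum of squared weights to the unregularized cost."""
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_term
```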
What are the differences between L1 and L2 regularization?

| Regularization | Penalizes | Weights | Feature selection |
|---|---|---|---|
| L1 | sum of absolute values of the weights | sparse | Yes |
| L2 | sum of squares of the weights | non-sparse | No |
What is dropout regularization? Why does it work?

Dropout regularization randomly switches off some hidden units at each training iteration, so the network cannot rely on any single unit and effectively behaves like a simpler network, which reduces overfitting.
What should we pay attention to when implementing dropout during train / test time?

| | Apply dropout | Scale by keep_prob |
|---|---|---|
| Train | Yes | Yes |
| Test | No | No |
```python
# Inverted dropout on layer 1 (forward pass)
D1 = np.random.rand(A1.shape[0], A1.shape[1])  # random mask with the same shape as A1
D1 = (D1 < keep_prob).astype(int)              # keep each unit with probability keep_prob
A1 = A1 * D1                                   # shut down the dropped units
A1 = A1 / keep_prob                            # scale up so the expected value is unchanged
```

Note the derivatives during the backward pass also need the same mask and scaling:

```python
dA1 = dA1 * D1          # apply the mask used in the forward pass
dA1 = dA1 / keep_prob   # scale by the same factor
```
What is weight decay?
A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.
Why do we normalize the inputs x?
It makes the cost function easier and faster to optimize
What are vanishing / exploding gradients?

Derivatives of each layer are multiplied layer by layer (inputs times gradients). With a sigmoid or tanh activation function, the derivatives are always fractions smaller than 1.

During backpropagation these fractions are multiplied many times, so the gradient decreases exponentially and the weights of the early layers barely get updated, which makes them hard to learn. Conversely, if the per-layer factors are larger than 1, the gradients grow exponentially and explode.
How to deal with vanishing gradients?

A partial solution: force the variance of the weights $w_i$ to be small and constant. A recommended value is $\mathrm{Var}(w_i) = \frac{2}{n^{[l-1]}}$ for ReLU (He initialization) or $\frac{1}{n^{[l-1]}}$ for tanh (Xavier initialization), so it depends on the activation function.
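A minimal sketch of He initialization for one ReLU layer (layer sizes are hypothetical):

```python
import numpy as np

n_prev, n_curr = 128, 64                                      # hypothetical layer sizes
W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)   # Var(w_i) = 2 / n_prev (He)
b = np.zeros((n_curr, 1))
```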
What are the differences between batch, mini-batch, and stochastic gradient descent?

| GD | Size | Train time |
|---|---|---|
| Batch | m | too long per iteration |
| Stochastic | 1 | loses the speed-up from vectorization |
| Mini-batch | between 1 and m | fastest in practice |
How to choose mini-batch size?

```
if m <= 2000:
    use batch gradient descent
else:
    use a mini-batch size that is a power of two: 2^4, 2^5, 2^6, ...
```

It depends on the context; we should test with different sizes.
Formula of bias correction in exponentially weighted averages?

$$v_t^{\text{corrected}} = \frac{v_t}{1 - \beta^t}$$
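A small sketch of an exponentially weighted average with bias correction (function name and inputs are illustrative):

```python
import numpy as np

def ewa_with_bias_correction(thetas, beta=0.9):
    """Exponentially weighted average of a sequence, with bias correction."""
    v, corrected = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))   # counteracts the bias of starting v at 0
    return corrected

print(ewa_with_bias_correction([10.0, 11.0, 12.0]))
```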
What is momentum?
Momentum is a method that damps oscillations in the gradients and accelerates the gradient vectors in the right direction, using exponentially weighted averages.
Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
What is the process of parameter update in Adam?

1. Momentum-like term: $v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW$
2. RMSprop-like term: $s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^2$
3. Bias correction: $v^{corrected}_{dW} = \frac{v_{dW}}{1 - \beta_1^t}$, $s^{corrected}_{dW} = \frac{s_{dW}}{1 - \beta_2^t}$
4. Update: $W := W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \varepsilon}$
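A minimal sketch of one Adam step for a single parameter array, following the formulas above (names and defaults are illustrative, not the assignment's exact code):

```python
import numpy as np

def adam_update(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter w with gradient dw; v, s are running moments."""
    v = beta1 * v + (1 - beta1) * dw          # momentum-like (1st moment)
    s = beta2 * s + (1 - beta2) * dw ** 2     # RMSprop-like (2nd moment)
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```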
Why batch normalization?
Normalization can make training faster and hyperparameters more robust.
The values of each hidden layer are changing all the time because of changes in $W$ and $b$, suffering from the problem of covariate shift. Batch normalization guarantees that the mean and variance of the features of each layer (e.g., $z^{[l]}_1$, $z^{[l]}_2$) stay the same no matter how the actual values of each node change. It allows each layer to learn more independently of the other layers, and speeds up learning.

The mean and variance are governed by two learnable parameters, $\gamma$ and $\beta$. We add $\gamma$ and $\beta$ because we don't want all the layers to have the same mean and variance (mean = 0, variance = 1).
Batch normalization formula?

$$\mu = \frac{1}{m}\sum_i z^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_i \left(z^{(i)} - \mu\right)^2$$

$$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \tilde{z}^{(i)} = \gamma\, z^{(i)}_{norm} + \beta$$
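A minimal numpy sketch of the forward formula above for one mini-batch (at test time, running averages of the mean and variance would be used instead):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_units, m) pre-activations; gamma, beta: (n_units, 1) learnable scale/shift."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance per unit
    return gamma * Z_norm + beta             # learnable mean and variance
```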
If searching among a large number of hyperparameters, should you try values in a grid or at random? Why?

Random.

The grid method is okay if the number of hyperparameters is small. In DL, it is difficult to know in advance which hyperparameter is more important. The random method allows us to try more distinct values that are potentially important.
What are the hyperparameters and their common default values?

| Hyperparameter | Common value |
|---|---|
| learning rate $\alpha$ | needs tuning (most important) |
| momentum $\beta$ | around 0.9 |
| mini-batch size | 64, 128, 256, ... |
| # of hidden units | - |
| learning rate decay | - |
| # of layers | - |
| Adam $\beta_1$, $\beta_2$, $\varepsilon$ | 0.9, 0.999, $10^{-8}$ |
What are the types of metrics?
Optimizing metric: the metric you want to be as good as possible, e.g., accuracy.

Satisficing metric: good enough as long as it reaches a threshold, e.g., run time, memory.
How should we make decisions based on train/dev set error?

We should always compare against Bayes error to estimate the avoidable bias. Human-level error is often used as a proxy for Bayes error.

A learning algorithm's performance can be better than human-level performance, but it can never be better than Bayes error.
We should not add data from a different distribution to the train set. True / False?

False.

Sometimes we'll need to train the model on the data that is available, and its distribution may not be the same as the data that will occur in production. Also, adding training data that differs from the dev set may still help the model improve performance on the dev set. What matters is that the dev and test sets have the same distribution.

We should not add data from a different distribution to the test set. True / False?

True.

This would cause the dev and test set distributions to become different.
What should you do if another metric (e.g., false negative rate) should be taken into account?
Rethink the appropriate metric for this task, and ask your team to tune to the new metric.
A softmax activation would be a good choice for the output layer if this is a multi-task learning problem. True/False?
False. Softmax would be a good choice if one and only one of the possibilities (stop sign, speed bump, pedestrian crossing, green light and red light) was present in each image.
Should you correct mislabeled data in train and test set after you did so in dev set?
You should correct mislabeled data in test set because test and dev set should come from the same distribution.
You do not necessarily need to fix the mislabeled data in the train set because it's okay for the train set distribution to differ from the dev and test sets.
Let's say you have 100,000 images taken by cars' front camera (you care about) and 900,000 images from the internet. How should you split train/test set?
One example:
Train set: 900,000 images from the internet + 80,000 images from car’s front-facing camera.
Dev / Test set: the 20,000 remaining front-camera images, split between the dev and test sets (e.g., 10,000 each).
As seen in lecture, it is important that your dev and test set have the closest possible distribution to “real”-data. It is also important for the training set to contain enough “real”-data to avoid having a data-mismatch problem.
What are the problems of convolution?
- Each time you apply a convolution operator, the image shrinks.
- Pixels on the corner or edge will be used much less than those in the middle.
Notations and dimensions of input matrix and parameters

| Param | Description |
|---|---|
| $f^{[l]}$ | filter size |
| $p^{[l]}$ | padding |
| $s^{[l]}$ | stride |
| $n_c^{[l]}$ | number of filters |

| Metric | Dimension |
|---|---|
| Filter | $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$ |
| Activations | $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$ |
| Weights | $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$ |
| Bias | $1 \times 1 \times 1 \times n_c^{[l]}$ |
What are valid and same convolutions?

Valid: no padding.

Same: pad so that the output size is the same as the input size.

How to calculate the dimension of the next conv layer?

$$\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1$$
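A tiny helper implementing this formula (names are illustrative):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """n: input height/width, f: filter size, p: padding, s: stride."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(300, 5, p=0, s=1))  # 296
```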
Input is a 300 by 300 color (RGB) image, and you use a convolutional layer with 100 filters that are each 5x5. How many parameters does this hidden layer have (including the bias parameters)?
Click to see answer
(5 * 5 * 3 + 1) * 100 = 7,600
Each filter is a volume whose number of channels matches the number of channels of the input volume.
Parameters are the variables that need to be learnt when training a model.
Parameter count of one filter: (Height × Width × Depth) + bias
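The same count as a quick Python check, using the values from the question above:

```python
f, n_c_prev, n_filters = 5, 3, 100
params = (f * f * n_c_prev + 1) * n_filters   # (filter volume + 1 bias) per filter
print(params)  # 7600
```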
What are the benefits of CNN?

- Parameter sharing: a feature detector can be used in multiple locations throughout the whole input volume.
- Sparsity of connections.
- Translation invariance.
What does “sparsity of connections” mean?
Each activation in the next layer depends on only a small number of activations from the previous layer.
Each activation of the output volume is computed by multiplying the parameters of only one filter with a small slice of the input volume and then summing them up.
LeNet - 5
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
hard to read
AlexNet
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
easy to read
VGG - 16
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
ResNet
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
What does skip-connection do?
Skip-connections make it easy for the network to learn an identity mapping between the input and the output within the ResNet block.
Why does ResNet work?
It helps with gradient vanishing and exploding problems and allows people to train deep neural networks without loss in performance.
"The skip connections in ResNet solve the problem of vanishing gradient in deep neural networks by allowing this alternate shortcut path for the gradient to flow through. The other way that these connections help is by allowing the model to learn the identity functions which ensures that the higher layer will perform at least as good as the lower layer, and not worse. "
Inception Network
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
MobileNet
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Suppose that in a MobileNet v2 bottleneck block we have an $n \times n \times 5$ input volume. We use 30 filters for the expansion, $3 \times 3$ filters in the depthwise convolutions, and 20 filters for the projection.

How many parameters are used in the complete block, assuming we don't use bias?

Expansion (1 × 1 × 5, 30 filters): 5 * 30 = 150

Depthwise (3 × 3, applied per channel on 30 channels): 3 * 3 * 30 = 270

Pointwise / projection (1 × 1 × 30, 20 filters): 30 * 20 = 600

Total = 150 + 270 + 600 = 1020
EfficientNet
Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114). PMLR.
YOLO
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788).
What are the advantages of the YOLO algorithm?

- It can output more accurate bounding boxes.
- It is a single convolutional computation: one ConvNet with lots of shared computation across all the grid cells, which makes it a very efficient algorithm.
How to evaluate object localization?

Intersection over Union (IoU): the area of the intersection of the predicted and ground-truth boxes divided by the area of their union. By convention, a prediction is usually judged correct if IoU ≥ 0.5.
How does non-max suppression work?

- Discard all boxes with a low confidence score (e.g., $p_c \leq 0.6$).
- While there are remaining boxes:
  - Pick the box with the highest confidence score and output it as a prediction.
  - Discard any remaining box whose IoU with the box just output is above a threshold (e.g., ≥ 0.5).
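A minimal numpy sketch of greedy non-max suppression over boxes given as (x1, y1, x2, y2); the function names are illustrative, not the assignment's TensorFlow version:

```python
import numpy as np

def iou(box1, box2):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```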
What is the dimension of one grid cell's output in YOLO? Suppose there are $n_c$ classes and $n_a$ anchor boxes.

$n_a \times (5 + n_c)$: for each anchor box we predict $p_c$, $b_x$, $b_y$, $b_h$, $b_w$ and $n_c$ class probabilities.
How does Transposed Convolution work?
Transposed Convolutions are used to upsample the input feature map to a desired output feature map using some learnable parameters.
- pick the top left corner element of input, multiply it with every element in the kernel
- put the result (the same size with kernel) on the top left corner of output matrix
- pick the second element of input, multiply it with every element in the kernel
- put the result in the output matrix based on stride
- repeat the steps above
- if there is overlap of results, add the elements
- ignore elements in the padding
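A minimal numpy sketch of these steps for a single-channel input and kernel, ignoring padding (names are illustrative):

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """x: (H, W) input, kernel: (k, k). Returns the upsampled output."""
    H, W = x.shape
    k = kernel.shape[0]
    out = np.zeros(((H - 1) * stride + k, (W - 1) * stride + k))
    for i in range(H):
        for j in range(W):
            # multiply the input element by the whole kernel, place it by stride,
            # and add overlapping results together
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

print(transposed_conv2d(np.array([[1., 2.], [3., 4.]]), np.ones((2, 2))).shape)  # (4, 4)
```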
Read more:
Towards Data Science | Transposed Convolution Demystified
U-net
Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham.
What is the dimension of the U-Net architecture output?

$h \times w \times k$, where $h \times w$ is the input image size and $k$ is the number of classes: each pixel gets a score for every class.
What are the differences between face verification and face recognition?

| | Input | Output | Comparison |
|---|---|---|---|
| Verification | an image and a name/ID | whether the input image is that of the claimed person | 1:1 |
| Recognition | an image | whether the image is any of the K persons in the database | 1:K |
Siamese Network
Taigman, Y., Yang, M., Ranzato, M. A., & Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1701-1708).
Triplet Loss
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 815-823).
Neural Style Transfer
Zeiler, M. D., & Fergus, R. (2014, September). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833). Springer, Cham.
Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.
What is the cost function of style transfer?

$$J(G) = \alpha\, J_{content}(C, G) + \beta\, J_{style}(S, G)$$

Content cost, for one chosen layer, summed over all entries of the activations:

$$J_{content}(C, G) = \frac{1}{4\, n_H n_W n_C} \sum_{\text{all entries}} \left(a^{(C)} - a^{(G)}\right)^2$$

Style cost, for each layer (comparing the Gram matrices of the style image $S$ and the generated image $G$):

$$J^{[l]}_{style}(S, G) = \frac{1}{(2\, n_H n_W n_C)^2} \sum_{k}\sum_{k'} \left(G^{(S)}_{kk'} - G^{(G)}_{kk'}\right)^2$$

Style cost, over all chosen layers:

$$J_{style}(S, G) = \sum_{l} \lambda^{[l]}\, J^{[l]}_{style}(S, G)$$
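A small numpy sketch of the per-layer style cost via Gram matrices, mirroring the formulas above (the array layouts are assumptions):

```python
import numpy as np

def gram_matrix(A):
    """A: (n_C, n_H * n_W) unrolled activations of one layer."""
    return A @ A.T

def layer_style_cost(a_S, a_G):
    """a_S, a_G: (n_H, n_W, n_C) activations of the style / generated image."""
    n_H, n_W, n_C = a_S.shape
    A_S = a_S.reshape(n_H * n_W, n_C).T        # (n_C, n_H * n_W)
    A_G = a_G.reshape(n_H * n_W, n_C).T
    GS, GG = gram_matrix(A_S), gram_matrix(A_G)
    return np.sum((GS - GG) ** 2) / (2 * n_H * n_W * n_C) ** 2
```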
Why not use standard network on sequence data?
- Inputs and outputs can be different in length.
- A standard network doesn't share features across positions of the text. E.g., if Harry at position 0 is a name, is Harry at another position also a name?
Notations

| Param | Description |
|---|---|
| $x^{(i)<t>}$ | the t-th element in training sequence i |
| $T_x^{(i)}$ | the length of training sequence i |
| $y^{(i)<t>}$ | the t-th element in output sequence i |
| $T_y^{(i)}$ | the length of output sequence i |
What is the formula of forward propagation?

$$a^{<t>} = g\left(W_{aa}\, a^{<t-1>} + W_{ax}\, x^{<t>} + b_a\right), \qquad \hat{y}^{<t>} = g\left(W_{ya}\, a^{<t>} + b_y\right)$$

Here in $W_{ax}$, the second index means $W_{ax}$ will be multiplied by some x-like quantity, to compute some a-like quantity.
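A minimal numpy sketch of one forward time-step implementing the formula above (a softmax output is assumed):

```python
import numpy as np

def rnn_cell_forward(xt, a_prev, Waa, Wax, Wya, ba, by):
    """xt: (n_x, m), a_prev: (n_a, m). Returns the next hidden state and prediction."""
    a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)                    # hidden state
    zt = Wya @ a_next + by
    yt_hat = np.exp(zt) / np.sum(np.exp(zt), axis=0, keepdims=True)   # softmax output
    return a_next, yt_hat
```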
List some examples of RNN architectures
- Many to one: sentiment classification
- One to many: music generation. Input: genre / first note; output: a sequence of notes
- Many to many (different length): machine translation.
Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks
Why does an RNN have vanishing gradient problems?

An output is mainly influenced by values close to its position.

It is difficult for an output to be strongly influenced by an input that is very early in the sequence, because it is hard to backpropagate all the way to the beginning of the sequence. E.g., "The cat, which already ate, ..., was full."
How to deal with exploding gradients?
Apply gradient clipping: re-scale the gradient vectors when they exceed some threshold.
Where and how do you apply clipping?
forward pass -> cost computation -> backward pass -> CLIPPING -> parameter update
```python
np.clip(gradient, -maxValue, maxValue, out=gradient)
```

```python
def optimize(X, Y, a_prev, parameters, learning_rate):
    loss, cache = rnn_forward(X, Y, a_prev, parameters)      # forward pass through time
    gradients, a = rnn_backward(X, Y, parameters, cache)     # backward pass through time
    gradients = clip(gradients, 5)                           # clip gradients to [-5, 5]
    parameters = update_parameters(parameters, gradients, learning_rate)
    return loss, gradients, a[len(X) - 1]
```
How to deal with vanishing gradients?

Vanishing gradient problems are harder to detect and solve. The Gated Recurrent Unit (GRU) is an effective solution.
What is the formula of the full Gated Recurrent Unit (GRU)?

$$\tilde{c}^{<t>} = \tanh\left(W_c\left[\Gamma_r * c^{<t-1>}, x^{<t>}\right] + b_c\right)$$

$$\Gamma_u = \sigma\left(W_u\left[c^{<t-1>}, x^{<t>}\right] + b_u\right), \qquad \Gamma_r = \sigma\left(W_r\left[c^{<t-1>}, x^{<t>}\right] + b_r\right)$$

$$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}, \qquad a^{<t>} = c^{<t>}$$

$a^{<t>} = c^{<t>}$: output activation. They are the same here, but different in LSTM.

$\Gamma_u$: update gate, in (0, 1), usually either close to 0 or close to 1.
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
How does LSTM differ from GRU?

Instead of having one update gate $\Gamma_u$ control both $\tilde{c}^{<t>}$ and $c^{<t-1>}$, LSTM has two separate gates $\Gamma_u$ and $\Gamma_f$ (forget gate):

$$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$$

while in GRU, it is

$$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$$
What are the disadvantages of Bidirectional RNN?

You need the entire sequence of data before you can make predictions anywhere (so it cannot be used in real-time applications).
True/False: In RNN, step t uses the probabilities output by the RNN to pick the highest probability word for that time-step. Then it passes the ground-truth word from the training set to the next time-step.
False. The probabilities output by the RNN are used to randomly sample a word (not to pick the highest-probability word), and the sampled word (not the ground-truth word from the training set) is the input to the next time-step.
You find your weights and activations are all taking on the value of NaN (“Not a Number”), what problem may cause it?
Gradient exploding. It happens when large error gradients accumulate and result in very large updates to the NN model weights during training. These weights can become too large and cause an overflow, identified as NaN.
Alice proposes to simplify the GRU by always removing the $\Gamma_u$, i.e., setting $\Gamma_u = 1$. Betty proposes to simplify the GRU by removing the $\Gamma_r$, i.e., setting $\Gamma_r = 1$ always. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?

- A. Alice's model (removing $\Gamma_u$), because if $\Gamma_r \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
- B. Alice's model (removing $\Gamma_u$), because if $\Gamma_r \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
- C. Betty's model (removing $\Gamma_r$), because if $\Gamma_u \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
- D. Betty's model (removing $\Gamma_r$), because if $\Gamma_u \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.

Click to see the answer

C.

For the signal to backpropagate without vanishing, we need $c^{<t>}$ to be highly dependent on $c^{<t-1>}$, meaning $\Gamma_u$ close to 0.

Note: this uses the simplified GRU version from the lecture.
What is the downside of skip-gram model?
The Softmax objective is expensive to compute because it needs to sum over the entire vocabulary.
What are the differences in problem objectives between the skip-gram model and negative sampling?

Skip-gram: given a context word, predict the probability of different target words.

Negative sampling: given a pair of words, is it a context-target pair? I.e., is it a positive or a negative sample?

Why is negative sampling's computation cost lower?

It converts one N-way softmax problem into N binary classification problems, and in each iteration only K + 1 of them (1 positive and K negative samples) are trained. K = 5 to 20 for smaller datasets, K = 2 to 5 for larger ones.
What is the learning objective of GloVe?

$$\text{minimize} \sum_{i=1}^{V}\sum_{j=1}^{V} f(X_{ij})\left(\theta_i^{T} e_j + b_i + b'_j - \log X_{ij}\right)^2$$

$X_{ij}$ = # of times $j$ appears in the context of $i$; $f(X_{ij})$ is a weighting term with $f(0) = 0$.

Depending on the definition of "context", $X_{ij}$ and $X_{ji}$ may be symmetric.
Debiasing word embeddings
Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29.
What are the steps in the debiasing word embeddings paper?
- Step 1: Identify gender subspace by SVD.
- Step 2a: Hard de-biasing (neutralize and equalize).
- Step 2b: Soft bias correction.
Determining gender-specific words: they first listed 218 words from a dictionary, then trained an SVM to classify the 3M words in w2vNEWS, resulting in 6,449 gender-specific words.
$E$ is an embedding matrix and $o_{4567}$ is a one-hot vector corresponding to word 4567. Can we call $E * o_{4567}$ in Python to get the embedding of word 4567?

Click to see answer

No. The element-wise multiplication is extremely inefficient; in practice we directly look up the corresponding column of $E$.
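A tiny sketch contrasting the matrix product with a direct lookup (sizes are hypothetical):

```python
import numpy as np

vocab_size, emb_dim = 10000, 50
E = np.random.randn(emb_dim, vocab_size)       # hypothetical embedding matrix
o = np.zeros((vocab_size, 1)); o[4567] = 1     # one-hot vector for word 4567

slow = E @ o              # matrix-vector product: O(emb_dim * vocab_size)
fast = E[:, [4567]]       # direct column lookup: O(emb_dim)
assert np.allclose(slow, fast)
```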
What are the four steps of sampling?

1. Forward propagate a "dummy" zero input $x^{<1>} = \vec{0}$ (with $a^{<0>} = \vec{0}$) to get the first output distribution $\hat{y}^{<1>}$.
2. Sample an index from $\hat{y}^{<t>}$ according to its probability distribution (e.g., with np.random.choice).
3. Overwrite the next input $x^{<t+1>}$ with the one-hot encoding of the sampled word.
4. Repeat until you sample the end-of-sequence token (e.g., a newline character) or reach a maximum length.
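A minimal sketch of steps 2 and 3 above, pretending the softmax output is already available (all names and sizes are illustrative):

```python
import numpy as np

vocab_size = 27                            # hypothetical character-level vocabulary
y = np.random.rand(vocab_size)
y = y / y.sum()                            # stand-in for the RNN's softmax output

idx = np.random.choice(vocab_size, p=y)    # sample according to the distribution
x_next = np.zeros((vocab_size, 1))
x_next[idx] = 1                            # one-hot input for the next time step
```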
Why not use greedy search?
Picking the best first word, then the best next word, and so on, does not maximize the conditional probability of the whole sentence. The translation may be a common English sentence but not the most succinct translation.
How to pick beam width?
- Large beam width: better result, slower
- Small beam width: worse result, faster
How to figure out whether it is the RNN or the beam search that fails the translation task?

Compare $P(y^*|x)$ (the human translation) with $P(\hat{y}|x)$ (the algorithm's output). If $P(y^*|x) > P(\hat{y}|x)$, beam search is at fault; if $P(y^*|x) \leq P(\hat{y}|x)$, the RNN model is at fault.
How does sentence normalization affect beam search result?
If we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.
What does $\alpha^{<t, t'>}$ denote in the attention model?

$\alpha^{<t, t'>}$ denotes, when computing the output word $t$, how much attention should be paid to the input word $t'$. E.g., $\alpha^{<1, 2>}$ is the attention the first output word pays to the second input word.
The attention model performs the same as the encoder-decoder model, no matter the sentence length. True/False?
Click to see answer
False.
Sentence length ↑ , encoder-decoder model performance ↓
The attention model has the greatest advantage when the input sequence length is large.
The network learns where to “pay attention” by learning the values $e^{<t, t'>}$, which are computed using a small neural network: we can replace $s^{<t-1>}$ with $s^{<t>}$ as an input to this neural network because $s^{<t>}$ is independent of $\alpha^{<t, t'>}$ and $e^{<t, t'>}$. True/False?
Click to see answer
False. We can't replace $s^{<t-1>}$ with $s^{<t>}$ because $s^{<t>}$ depends on $\alpha^{<t, t'>}$, which in turn depends on $e^{<t, t'>}$ and $s^{<t-1>}$; so at the time we need to evaluate this network, we haven't computed $s^{<t>}$ yet.

- $e^{<t, t'>}$: energy variable
- $s^{<t-1>}$: hidden state of the post-attention LSTM
- $a^{<t'>}$: hidden state of the pre-attention LSTM
- $s^{<t-1>}$ and $a^{<t'>}$ are fed into a simple neural network, which learns the function to output $e^{<t, t'>}$; the attention weights $\alpha^{<t, t'>}$ are then obtained by a softmax over $e^{<t, t'>}$.
What are the steps of implementing attention with Keras?

Click to see answer

In the NMT assignment, one attention step is built from shared Keras layers: RepeatVector copies $s^{<t-1>}$ $T_x$ times, Concatenate joins it with the pre-attention hidden states $a^{<t'>}$, two Dense layers compute the energies $e^{<t, t'>}$, a softmax over the $T_x$ axis gives the attention weights $\alpha^{<t, t'>}$, and Dot computes the context vector as the weighted sum of the $a^{<t'>}$.
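A sketch of that single attention step with Keras layers, assuming hypothetical sizes for `Tx`, `n_a`, and `n_s` (not the assignment's exact code):

```python
import tensorflow as tf
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Softmax, Dot

Tx, n_a, n_s = 30, 32, 64                     # hypothetical sequence length / state sizes

# Shared layers, defined once and reused at every decoder time step
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")
densor2 = Dense(1, activation="relu")
attention_softmax = Softmax(axis=1)           # normalize over the Tx axis
dotor = Dot(axes=1)

def one_step_attention(a, s_prev):
    """a: (batch, Tx, 2*n_a) pre-attention states, s_prev: (batch, n_s) post-attention state."""
    s_prev = repeator(s_prev)                 # (batch, Tx, n_s)
    concat = concatenator([a, s_prev])        # (batch, Tx, 2*n_a + n_s)
    energies = densor2(densor1(concat))       # (batch, Tx, 1)
    alphas = attention_softmax(energies)      # attention weights over the input positions
    context = dotor([alphas, a])              # (batch, 1, 2*n_a) weighted sum
    return context

context = one_step_attention(tf.random.normal((1, Tx, 2 * n_a)), tf.random.normal((1, n_s)))
```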
More Resources
- The Illustrated Transformer
- YouTube | Illustrated Guide to Transformers Neural Network: A step by step explanation
How does traditional attention and self-attention in Transformer differ?
Click to see answer
Traditional Attention was used in combination with RNNs to improve their performance. Self-attention is used INSTEAD OF RNNs and they do a much better job and are also much faster.
Why is self-attention used in the Transformer?
Click to see answer
RNNs process sequences word by word, with the state of the RNN changing as each new word is processed. This allows the RNN to carry forward information from previous words. However, this kind of mechanism faces challenges like vanishing gradients when the sequence gets longer.
Self-attention mechanism allows the model to consider all words at once and assess the interdependencies between them, regardless of their distance in the sequence. This makes Transformers particularly useful for many natural language processing tasks, as the meaning of a word in natural language can depend on other words in the sentence, no matter how far apart they may be.
What do Q, K, V denote?
Click to see answer
Q = interesting questions about the words in a sentence
K = specific representations of words given a Q
V = qualities of words given a Q
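A minimal numpy sketch of scaled dot-product self-attention, softmax(QKᵀ/√d_k)V, for a single head (shapes are assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n_tokens, d_k), V: (n_tokens, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of values
```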
What are the differences in how RNNs, Seq2Seq with attention, and Transformers handle long sequences?
(May not be correct 😅)
Model | Approach to Handle Long Sequences | Main Issues |
---|---|---|
Traditional RNN | Info is passed through hidden states between time steps. | Difficulty handling long-distance dependencies, prone to vanishing or exploding gradients. |
Seq2Seq (with Attention) | Utilizes attention mechanism to reference all parts of the input sequence during decoding. | Still reliant on the recursive structure of RNNs, potentially leading to vanishing/exploding gradients when handling long sequences. The attention computation is constrained between the input sequence and output sequence, limiting parallel processing capability. |
Transformer | Processes all elements of the sequence in parallel through the self-attention mechanism, each element can directly reference all other elements regardless of sequence length. | Computational requirements grow with the square of the sequence length, which may pose difficulties in handling ultra-long sequences. Also, the lack of explicit sequential information encoding could lead to performance degradation in some tasks. |
What are criteria for a good positional encoding algorithm?
It should output a unique encoding for each time-step (word's position in a sentence).
The distance between any two time-steps should be consistent across all sentence lengths.
The algorithm should be able to generalize to longer sentences.
What is the formula of positional encoding and what is its logic?

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

It encodes each token's position and is summed with the word embeddings, so the initial representation of each token is shifted a bit according to its position, moving it slightly towards the tokens that are close to it.
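A small numpy sketch of the sinusoidal encoding above (assumes an even `d_model`; names are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model / 2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

print(positional_encoding(50, 16).shape)  # (50, 16)
```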
Resources:
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Brian Pulfer. 2022. Vision Transformers from Scratch (PyTorch): A step-by-step guide. MLearning.ai.
- Francesco Zuppichini. 2021. Vision Transformers from Scratch (PyTorch): A step-by-step guide. Towards Data Science