- t-SNE: A non-linear dimensionality reduction technique
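For intuition, here is a minimal sketch of projecting word vectors down to 2-D with scikit-learn's t-SNE; the random 300-d `embeddings` array is a hypothetical stand-in for learned word vectors:

```python
# Minimal t-SNE sketch (assumes scikit-learn is available).
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(100, 300)        # 100 toy "word vectors", 300-d each
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(embeddings)   # shape (100, 2), suitable for plotting
```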
- Word embedding: When learning word embeddings, we create an artificial task of estimating P(target∣context). It is okay if we do poorly on this artificial prediction task; the more important by-product is that we learn a useful set of word embeddings.
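As a rough sketch of that artificial task, the training data is just (context, target) pairs harvested from nearby words; the window size and toy corpus below are assumptions for illustration:

```python
# Hypothetical sketch: build (context, target) pairs for the artificial
# P(target | context) prediction task from a tokenized corpus.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2                                       # assumed context window size

pairs = []
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((corpus[j], target))    # (context word, target word)

print(pairs[:5])
```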
- word2vec: In the word2vec algorithm, you estimate P(t∣c), where c and t are chosen to be nearby words. θt and ec are both trained with an optimization algorithm such as Adam or gradient descent.
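A minimal sketch of the softmax used to estimate P(t∣c) from the parameter vectors θt and the embeddings ec (toy random values here, not trained ones):

```python
import numpy as np

vocab_size, dim = 10, 8
theta = np.random.randn(vocab_size, dim)   # output parameters theta_t, one row per word
E = np.random.randn(vocab_size, dim)       # input embeddings e_c, one row per word

def p_t_given_c(c):
    """Softmax over theta_t . e_c for every candidate target word t."""
    logits = theta @ E[c]                  # shape (vocab_size,)
    exp = np.exp(logits - logits.max())    # subtract max for numerical stability
    return exp / exp.sum()

probs = p_t_given_c(c=3)                   # distribution over all target words t
```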
- GloVe: θi and ej should be initialized randomly at the beginning of training. Xij is the number of times word i appears in the context of word j. The weighting function f(.) must satisfy f(0)=0.
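A sketch of the GloVe objective written out directly, assuming the usual weighting f(x) = min((x/x_max)^α, 1), which satisfies f(0)=0; the co-occurrence counts are toy values:

```python
import numpy as np

V, dim = 5, 4
X = np.random.randint(0, 20, size=(V, V)).astype(float)  # toy co-occurrence counts X_ij
theta = np.random.randn(V, dim)          # theta_i, initialized randomly
e = np.random.randn(V, dim)              # e_j, initialized randomly
b, b_prime = np.zeros(V), np.zeros(V)    # bias terms

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function: f(0) = 0, capped at 1 for very frequent pairs."""
    return np.minimum((x / x_max) ** alpha, 1.0)

loss = 0.0
for i in range(V):
    for j in range(V):
        if X[i, j] > 0:                  # log X_ij is only defined for X_ij > 0
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            loss += f(X[i, j]) * diff ** 2
```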
- Beam Search: Increasing the beam width B does not make beam search converge in fewer steps; a larger width generally finds better translations but runs more slowly and uses more memory. We have to use length normalization, otherwise the algorithm will tend to output overly short translations.
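A compact sketch of beam search with length normalization (end-of-sentence handling omitted); `toy_score` and the vocabulary are hypothetical stand-ins for a real decoder's log-probabilities:

```python
def beam_search(score_fn, vocab, beam_width=3, max_len=10, alpha=0.7):
    """Keep the beam_width best partial sequences at each step; rank the
    finished ones by length-normalized log-probability so the search does
    not favor overly short outputs."""
    beams = [([], 0.0)]                              # (token sequence, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for w in vocab:
                candidates.append((seq + [w], logp + score_fn(seq, w)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]              # keep only the top beam_width
    # length normalization: divide the total log-prob by len(seq) ** alpha
    return max(beams, key=lambda c: c[1] / (len(c[0]) ** alpha))

# toy scorer: pretend log P(w | seq) depends only on the word itself
vocab = ["a", "b", "c", "<eos>"]
toy_score = lambda seq, w: -1.0 - 0.1 * vocab.index(w)
best_seq, best_logp = beam_search(toy_score, vocab)
```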
- attention model: α<t,t′> is generally larger for values of a<t′> that are highly relevant to the value the network should output for y<t>. ∑t′ α<t,t′> = 1. The network learns where to “pay attention” by learning the values e<t,t′>, which are computed using a small neural network. We can't replace s<t−1> with s<t> as an input to this network, because s<t> depends on α<t,t′>, which in turn depends on e<t,t′>; at the time we need to evaluate this network, we haven't yet computed s<t>.
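A minimal numpy sketch of one attention step: e<t,t′> is produced by a tiny one-layer network fed s<t−1> and a<t′>, and a softmax turns the e<t,t′> into α<t,t′> that sum to 1 (all sizes and weights below are toy assumptions):

```python
import numpy as np

Tx, n_a, n_s, n_hidden = 6, 8, 8, 10     # assumed sizes for illustration
a = np.random.randn(Tx, n_a)             # encoder activations a<t'>
s_prev = np.random.randn(n_s)            # previous decoder state s<t-1>

# small network that scores each a<t'> against s<t-1>  ->  e<t,t'>
W1 = np.random.randn(n_hidden, n_s + n_a)
w2 = np.random.randn(n_hidden)
e = np.array([w2 @ np.tanh(W1 @ np.concatenate([s_prev, a[tp]]))
              for tp in range(Tx)])

# softmax over t'  ->  alpha<t,t'>, guaranteed to sum to 1
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

context = alpha @ a                      # weighted sum fed to the decoder at step t
```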