The Llama model uses RoPE (Rotary Positional Embeddings) alongside the standard embedding layer to incorporate the influence of token positions within a sequence.
For decades, embeddings have been the most commonly used technique to represent words, concepts, or tokens in the NLP (Natural Language Processing) world. Typically, an embedding model is trained to learn which tokens are frequently used together. The tokens are placed at suitable positions within a multi-dimensional space, where distances reflect the similarity or difference between them.
There is a wide variety of methods to implement embeddings. Some of them take token positions into account.
Taking the token positions into account is important. Think of two sentences containing exactly the same words in different orders. If you don't take the positions into account, your system handles these two sentences (despite their different meanings) as the same thing.
Let's explain with an example. Think of a sequence of 5 tokens and an embedding layer with the shape of `{128256, 4096}`. So, we have:

- A token embedding sequence which was calculated using the embedding layer. Our input tensor will have the shape of `{5, 4096}`,
- `32` "attention heads" (according to `modelArgs.N_Heads = 32`). Each attention head will have a dimension of `modelArgs.Dim / modelArgs.N_Heads`. In our case, it is `4096 / 32 = 128`. So our positional embedding tensors will have `128` dimensions,
- An array of position indices of the tokens: `{1, 2, 3, 4, 5}`.
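To make these shapes concrete, here is a minimal Go sketch (independent of the project's `ml` package; the variable names are only for illustration) that derives the per-head dimension from the numbers above:

```go
package main

import "fmt"

func main() {
	const (
		vocabSize = 128256 // rows of the embedding layer
		dim       = 4096   // embedding dimension (modelArgs.Dim)
		nHeads    = 32     // modelArgs.N_Heads
		seqLen    = 5      // our example sequence length
	)
	headDim := dim / nHeads // 4096 / 32 = 128
	fmt.Printf("embedding layer shape: {%d, %d}\n", vocabSize, dim)
	fmt.Printf("input tensor shape:    {%d, %d}\n", seqLen, dim)
	fmt.Printf("per-head dimension:    %d\n", headDim)
}
```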
We have several alternatives to calculate the positional embeddings. Some of them are:

- Taking the positions as they are: `{0, 1, 2, 3, 4}`,
- Taking the positions as normalized values between `0` and `1`: `{0., 0.25, 0.50, 0.75, 1.}`,
- As suggested in the paper Attention Is All You Need, using sinusoidal functions. Each dimension of the positional encoding corresponds to a sinusoid:

    $$
    \begin{gathered}
    PE_{(pos,2i)} = \sin\left(pos/10000^{\frac{2i}{d_{model}}}\right) \\
    PE_{(pos,2i+1)} = \cos\left(pos/10000^{\frac{2i}{d_{model}}}\right)
    \end{gathered}
    $$

    This means that, because we have `128` dimensions for each position, `i` will loop from `0` to `64` (half of `128`).

    The original paper suggests using `10000` as the base theta value, and the Llama 2 model uses this value. But newer versions of Llama (3 and higher) started to use `500000` as the base theta value, so we will stick to using `500000`.

    Update with Llama 3.1: The Llama 3.1 version comes with a small adjustment on frequencies. The apply_scaling(...) method was added into the original Llama 3.1 implementation; it calculates wavelengths from these frequencies and applies some limitations on them. Implementation details will be discussed in the following subchapters. Currently we represent this operation with `scl(...)`.

    Our positional embedding array `PE` for the 3rd position will look like (a runnable sketch of this calculation follows this list):

    $$
    \begin{gathered}
    PE = \left\lbrace
    \begin{array}{l}
    \sin\left(scl\left(\frac{3}{500000^\frac{0}{128}}\right)\right), \cos\left(scl\left(\frac{3}{500000^\frac{0}{128}}\right)\right), \sin\left(scl\left(\frac{3}{500000^\frac{2}{128}}\right)\right), \cos\left(scl\left(\frac{3}{500000^\frac{2}{128}}\right)\right), \\
    \dots, \\
    \sin\left(scl\left(\frac{3}{500000^\frac{124}{128}}\right)\right), \cos\left(scl\left(\frac{3}{500000^\frac{124}{128}}\right)\right), \sin\left(scl\left(\frac{3}{500000^\frac{126}{128}}\right)\right), \cos\left(scl\left(\frac{3}{500000^\frac{126}{128}}\right)\right)
    \end{array}
    \right\rbrace
    \end{gathered}
    $$

- As suggested in the paper RoFormer, using sinusoidal functions in a slightly different way, as described in the following parts,
- Using the output of a custom function that takes the position indices as input.
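As referenced in the sinusoidal item above, here is a minimal, self-contained Go sketch of that calculation for the 3rd position with a base theta of 500000 and 128 dimensions. It is independent of the project's `ml` package, and the `scl(...)` step is left as an identity placeholder here, since its real implementation is discussed later:

```go
package main

import (
	"fmt"
	"math"
)

// scl is a placeholder for the Llama 3.1 frequency scaling discussed later;
// here it is the identity function, so the sketch matches the plain formula.
func scl(x float64) float64 { return x }

func main() {
	const (
		pos   = 3.0
		dim   = 128
		theta = 500000.0
	)
	pe := make([]float64, 0, dim)
	for i := 0; i < dim; i += 2 {
		angle := scl(pos / math.Pow(theta, float64(i)/float64(dim)))
		pe = append(pe, math.Sin(angle), math.Cos(angle))
	}
	fmt.Println(pe[:4]) // first two (sin, cos) pairs for position 3
}
```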
We have several alternatives to integrate the positional embeddings with the token embedding vectors. If our positional embedding vectors and the token embedding vectors have the same dimension, we can sum or multiply them:

- Summation: As also suggested in the paper Attention Is All You Need, we can simply add the two vectors, as `new_embeddings = token_embeddings + positional_embeddings`,
- Multiplication: As also suggested in the paper RoFormer, we can simply multiply the two vectors element-wise, as `new_embeddings = token_embeddings * positional_embeddings`.
Think of taking the positions as they are, `{0, 1, 2, 3, 4}`, and adding them to each dimension of our embedding vectors. It may work, but it doesn't sound very meaningful, right? So, we need to find a more meaningful method.
You can find an introduction to absolute position embedding and relative position embedding in the paper RoFormer, alongside RoPE (Rotary Positional Embeddings), the approach proposed by the same paper.
The main idea behind these approaches is to let the model effectively take token positions into account. The RoPE (Rotary Positional Embeddings) approach represents the positions of tokens in the polar coordinate system, which employs angles and complex numbers.
With this approach, we have a chance to combine multiple approaches, and we gain the following advantages:

- Our positional embeddings are distributed in the space within a limited range (`[-1, +1]`, because of the limits of the `cos` and `sin` functions),
- Other approaches prefer summation to integrate the positional embeddings with the token embedding vectors. However, summation corrupts the exact data that the vector carries. In this approach, we prefer multiplication instead. And since the value we multiply by is a sinusoid, we can think of it as only rotating the original embedding vector by an angle. So, theoretically, we don't corrupt the original data,
- The RoFormer approach that we use in this project takes the `128` dimensions of an attention head as `64` pairs. Then, it obtains a complex number from each pair by taking the first item of the pair as the real part and the second item of the pair as the imaginary part,
- Taking the items of an attention head as float pairs and representing them as complex numbers in the polar coordinate system suits the mathematical nature of the method. It also allows us to represent these complex numbers as a matrix and to perform matrix operations on them,
- With this approach, the influence of position on embeddings is high for lower dimensions (going from the first dimension towards the 128th dimension) and converges to zero for higher dimensions. This makes higher dimensions of embeddings less sensitive to positional data than lower dimensions, because the polar coordinates calculated for higher dimensions are nearly the same value.
Important note: I read this in a few sources, then I saw it with the 3D charts that I drew in the notebook 10.BONUS-PRECOMPUTING-FREQUENCY-TENSOR.ipynb, which can also be found at the bottom of this chapter.
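To make the pair-to-complex idea in the list above concrete, here is a minimal Go sketch (using only the standard library, with made-up sample values; it is not the project's actual code path) that groups a head's dimensions into pairs, interprets each pair as a complex number, and rotates it by a per-pair angle:

```go
package main

import (
	"fmt"
	"math/cmplx"
)

func main() {
	const headDim = 8 // in the real model this is 128; kept small for readability
	// Made-up embedding values for one token's attention head.
	head := []float64{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}
	// Made-up rotation angles, one per (real, imaginary) pair.
	angles := []float64{1.0, 0.8125, 0.66, 0.54}

	rotated := make([]float64, headDim)
	for p := 0; p < headDim/2; p++ {
		// Take the pair as a complex number: first item real, second item imaginary.
		x := complex(head[2*p], head[2*p+1])
		// Rotating by an angle = multiplying by the unit-magnitude complex number cis(angle).
		r := x * cmplx.Rect(1, angles[p])
		rotated[2*p], rotated[2*p+1] = real(r), imag(r)
	}
	fmt.Println(rotated)
}
```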
Note: In this approach, some concepts from computing, mathematics, geometry, and physics are combined. For example, we can think of:

- Our positions as points in a time series,
- Our positional encoding function as a function with respect to time, so we can say it is a signal,
- Our angles as the angular frequency (${\displaystyle \omega }$) of a sine wave, and our positions as its real independent variable (time).
In the following subchapters, we will see how these polar coordinates are precalculated. These values don't vary by input, so this calculation is made only once.
Sources:
- A Guide on Word Embeddings in NLP
- Word Embeddings in Natural Language Processing (NLP)
- RoFormer: Enhanced Transformer with Rotary Position Embedding Paper
- YouTube - RoPE (Rotary positional embeddings) explained: The positional workhorse of modern LLMs
- YouTube - Llama explained... - "Rotary Positional Embeddings" section
You can check out the following for more information:
- The Python codes that create the sample data and graphs used here with this Python Notebook: 10.BONUS-PRECOMPUTING-FREQUENCY-TENSOR.ipynb.
- A Gentle Introduction to Positional Encoding in Transformer Models, Part 1 article.
The precomputeFreqsCis(...) function takes four arguments: `dim`, `end`, `theta`, and `useScaled`.
The argument `dim` is computed as `int(modelArgs.Dim/modelArgs.N_Heads)`, `end` is computed as `modelArgs.MaxSequenceLength*2`, `theta` is `modelArgs.RopeTheta`, and `useScaled` is `modelArgs.UseScaledRope`. In our case, the function is called with `dim = 4096/32 = 128`, `end = 2048 * 2 = 4096`, `theta = 500000`, and `useScaled = true`.
from src/model/llamatransformer.go
func NewLlamaTransformer(model *Model) (*LlamaTransformer, error) {
result := &LlamaTransformer{}
...
if result.PrecomputedFreqsCis, err = precomputeFreqsCis(int(dim/modelArgs.N_Heads), modelArgs.MaxSequenceLength*2, modelArgs.RopeTheta, modelArgs.UseScaledRope); err != nil {
return nil, err
}
...
}
In the original Llama 3.1 Python repository of Meta, this Python code initializes the `freqs` array:
import torch
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, use_scaled: bool = False):
...
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
...
Important note: The `theta` variables in both the Go and Python code are not angles. They are explained as: "Scaling factor for frequency computation. Defaults to 10000.0", but in our case, this value comes as 500000.0 for Llama 3.1.
Instead, `freqs` is an array of angles: it corresponds to $\Theta$, and each item of the `freqs` array corresponds to a $\theta_i$ below.
Personally, at first sight, I was confused about why they called the scaling factor `theta`, a term that made me think it is an angle, but it isn't. The items of `freqs` are in an angle unit (radians), but in the end they are only base values (per-dimension frequencies) for the real angles!
The original equation in section "3.2.2 General form" of RoFormer: Enhanced Transformer with Rotary Position Embedding:
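The paper defines the set of rotation angles $\Theta$ as:

$$
\Theta = \left\lbrace \theta_i = 10000^{-\frac{2(i-1)}{d}},\ i \in \left[1, 2, \dots, \frac{d}{2}\right] \right\rbrace
$$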
If we expand it for `dim = 128` and use `500000` instead of `10000` in our case:
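A direct substitution into the paper's formula gives:

$$
\Theta = \left\lbrace \theta_i = 500000^{-\frac{2(i-1)}{128}},\ i \in \left[1, 2, \dots, 64\right] \right\rbrace
$$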
Update with Llama 3.1: The Llama 3.1 version comes with a small adjustment on frequencies. Implementation details will be discussed in the following subchapters. Currently we represent this operation with `scl(...)`.
If we express it with the variable names used in the code, with scaling applied:
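Following the `freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))` line above, this can be written as:

$$
freqs_i = scl\left(\frac{1}{theta^{\,val_i / dim}}\right),\qquad val_i \in \lbrace 0, 2, 4, \dots, 126 \rbrace,\ theta = 500000,\ dim = 128
$$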
You can find the original Python implementation of the apply_scaling(...) function, which is represented as `scl(...)` in the equations above, in Meta's Llama 3.1 repository. Our Go equivalent is:
from src/model/llamatransformer.go
func applyScaling(freqs *ml.Tensor) error {
// See Llama 3.1 Code: https://github.com/meta-llama/llama-models/blob/f45cdfd624b98b6655540f7101d8d9cb432e631c/models/llama3_1/reference_impl/model.py#L45
// Values obtained from grid search
scaleFactor := float32(8.0)
lowFreqFactor := float32(1.0)
highFreqFactor := float32(4.0)
oldContextLen := float32(8192) // original llama3 length
lowFreqWavelen := oldContextLen / lowFreqFactor
highFreqWavelen := oldContextLen / highFreqFactor
for i := 0; i < freqs.Size[0]; i++ {
freq, err := freqs.GetItem_AsFloat32([]int{i})
if err != nil {
return err
}
var newFreq float32
wavelen := 2 * math.Pi / freq
if wavelen < highFreqWavelen {
newFreq = freq
} else if wavelen > lowFreqWavelen {
newFreq = freq / scaleFactor
} else {
smooth := (oldContextLen/wavelen - lowFreqFactor) / (highFreqFactor - lowFreqFactor)
newFreq = (1-smooth)*freq/scaleFactor + smooth*freq
}
if err := freqs.SetItem_FromFloat32([]int{i}, newFreq); err != nil {
return err
}
}
return nil
}
func precomputeFreqsCis(dim int, end int, theta float64, useScaled bool) (*ml.Tensor, error) {
...
dimFloat := float32(dim)
freqs, err := ml.ARange(0, dim, 2, ml.DT_BF16)
...
err = freqs.Apply_AsFloat32(func(val float32) float32 {
return float32(1.0 / math.Pow(theta, float64(val/dimFloat)))
})
...
if useScaled {
err = applyScaling(freqs)
if err != nil {
return nil, err
}
}
...
}
Values of the `freqs` array in our Go project:
freqs: {0, 2, 4, 6, 8, 10, 12, ..., 124, 126} # has 64 items; at this point, the freqs array contains the "val" values that exist in the equations above.
# after running Apply_AsFloat32 and then applyScaling, freqs will be:
freqs: { # has 64 items, in radians.
1.0000e+00, 8.1250e-01, 6.6016e-01, 5.3906e-01, 4.3945e-01, 3.5742e-01,
2.9102e-01, 2.3730e-01, 1.9336e-01, 1.5723e-01, 1.2793e-01, 1.0449e-01,
8.4961e-02, 6.9336e-02, 5.6641e-02, 4.6143e-02, 3.7598e-02, 3.0518e-02,
2.4902e-02, 2.0264e-02, 1.6479e-02, 1.3489e-02, 1.0986e-02, 8.9111e-03,
7.2632e-03, 5.9204e-03, 4.8218e-03, 3.9368e-03, 3.2043e-03, 2.1515e-03,
1.3504e-03, 8.5068e-04, 5.1880e-04, 3.1090e-04, 1.7834e-04, 9.5367e-05,
7.7724e-05, 6.2943e-05, 5.1498e-05, 4.1962e-05, 3.4094e-05, 2.7895e-05,
2.2650e-05, 1.8477e-05, 1.5080e-05, 1.2279e-05, 1.0014e-05, 8.1062e-06,
6.6459e-06, 5.3942e-06, 4.4107e-06, 3.5912e-06, 2.9206e-06, 2.3842e-06,
1.9372e-06, 1.5795e-06, 1.2890e-06, 1.0431e-06, 8.5309e-07, 6.9663e-07,
5.6624e-07, 4.6194e-07, 3.7625e-07, 3.0547e-07
}
The Python + PyTorch environment produces nearly the same values; there are slight differences between the two outputs because of floating point precision differences.
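As a quick check of the first two items (my own arithmetic, not part of the project output):

$$
\frac{1}{500000^{0/128}} = 1.0,\qquad \frac{1}{500000^{2/128}} = e^{-\ln(500000)/64} \approx 0.8146
$$

The listed Go values are slightly coarser (for example `8.1250e-01` instead of `0.8146`) because the tensor is stored as BFloat16. Also, for these first items the wavelength `2π/freq` is below `highFreqWavelen`, so `applyScaling` leaves them unchanged; only the smaller frequencies towards the end of the array are modified by the scaling.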
In the table below, you can find approximate equivalents of radian angle values in degrees, with corresponding "val" indices:
| val | rad | deg | val | rad | deg | val | rad | deg | val | rad | deg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.00000000 | 57.29578 | 32 | 0.03760603 | 2.15467 | 64 | 0.00052485 | 0.03007 | 96 | 0.00000665 | 0.00038 |
| 2 | 0.81461722 | 46.67413 | 34 | 0.03063452 | 1.75523 | 66 | 0.00031269 | 0.01792 | 98 | 0.00000542 | 0.00031 |
| 4 | 0.66360128 | 38.02155 | 36 | 0.02495541 | 1.42984 | 68 | 0.00017851 | 0.01023 | 100 | 0.00000441 | 0.00025 |
| 6 | 0.54058099 | 30.97301 | 38 | 0.02032910 | 1.16477 | 70 | 0.00009556 | 0.00548 | 102 | 0.00000359 | 0.00021 |
| 8 | 0.44036663 | 25.23115 | 40 | 0.01656044 | 0.94884 | 72 | 0.00007785 | 0.00446 | 104 | 0.00000293 | 0.00017 |
| 10 | 0.35873023 | 20.55373 | 42 | 0.01349042 | 0.77294 | 74 | 0.00006342 | 0.00363 | 106 | 0.00000238 | 0.00014 |
| 12 | 0.29222783 | 16.74342 | 44 | 0.01098953 | 0.62965 | 76 | 0.00005166 | 0.00296 | 108 | 0.00000194 | 0.00011 |
| 14 | 0.23805381 | 13.63948 | 46 | 0.00895226 | 0.51293 | 78 | 0.00004208 | 0.00241 | 110 | 0.00000158 | 0.00009 |
| 16 | 0.19392276 | 11.11096 | 48 | 0.00729267 | 0.41784 | 80 | 0.00003428 | 0.00196 | 112 | 0.00000129 | 0.00007 |
| 18 | 0.15797281 | 9.05118 | 50 | 0.00594073 | 0.34038 | 82 | 0.00002793 | 0.00160 | 114 | 0.00000105 | 0.00006 |
| 20 | 0.12868738 | 7.37324 | 52 | 0.00483942 | 0.27728 | 84 | 0.00002275 | 0.00130 | 116 | 0.00000086 | 0.00005 |
| 22 | 0.10483095 | 6.00637 | 54 | 0.00394228 | 0.22588 | 86 | 0.00001853 | 0.00106 | 118 | 0.00000070 | 0.00004 |
| 24 | 0.08539710 | 4.89289 | 56 | 0.00321145 | 0.18400 | 88 | 0.00001510 | 0.00086 | 120 | 0.00000057 | 0.00003 |
| 26 | 0.06956595 | 3.98584 | 58 | 0.00216657 | 0.12414 | 90 | 0.00001230 | 0.00070 | 122 | 0.00000046 | 0.00003 |
| 28 | 0.05666962 | 3.24693 | 60 | 0.00137189 | 0.07860 | 92 | 0.00001002 | 0.00057 | 124 | 0.00000038 | 0.00002 |
| 30 | 0.04616405 | 2.64501 | 62 | 0.00085675 | 0.04909 | 94 | 0.00000816 | 0.00047 | 126 | 0.00000031 | 0.00002 |
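The degree column is just a unit conversion; a small helper like the following (my own illustration, not part of the project code) reproduces it:

```go
package main

import (
	"fmt"
	"math"
)

// radToDeg converts an angle in radians to degrees.
func radToDeg(rad float64) float64 {
	return rad * 180 / math.Pi
}

func main() {
	fmt.Printf("%.5f\n", radToDeg(0.81461722)) // ≈ 46.67413, matching the "val = 2" row
}
```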
Sources:
- RoFormer: Enhanced Transformer with Rotary Position Embedding: Paper | Papers with Code | LabML Annotated Implementation
- Llama 2: Open Foundation and Fine-Tuned Chat Models : Paper
- Llama: Open and Efficient Foundation Language Models : Paper
import torch
def precompute_freqs_cis(dim: int, end: int, theta: float = 500000.0):
...
t = torch.arange(end, device=freqs.device) # type: ignore
freqs = torch.outer(t, freqs).float() # type: ignore
...
from src/model/llamatransformer.go
func precomputeFreqsCis(dim int, end int, theta float64, useScaled bool) (*ml.Tensor, error) {
...
t, err := ml.ARange(0, end, 1, ml.DT_BF16)
if err != nil {
return nil, err
}
...
}
The `end` argument is computed as `modelArgs.MaxSequenceLength*2`. In our case, `end = 2048 * 2 = 4096`.
⚠️ Note for the weirdness here: In the original implementation, the `modelArgs.MaxSequenceLength` value is equal to `512`, which is the limitation for the input prompt token count. By multiplying it by 2 (instead of using a fixed large value), they aimed to avoid unnecessary calculations.
However, in our implementation, we specified `modelArgs.MaxSequenceLength` as `2048`, and when we multiply it by 2, we get `4096`, an unnecessarily high and not very meaningful value. But I left it as it is; it doesn't hurt correctness, it only causes some unused values to be calculated.
We will continue with 4096, but know that it is unnecessarily high.
On this issue, the original Python code has a comment (read this considering that the comment was taken from the Llama 2 code, not Llama 3.1):
# Note that self.params.max_seq_len is multiplied by 2 because the token limit for the Llama 2 generation of models is 4096.
# Adding this multiplier instead of using 4096 directly allows for dynamism of token lengths while training or fine-tuning.
At first, we generate a tensor named `t` with 4096 items: `{0, 1, 2, 3, ..., 4093, 4094, 4095}`. This tensor contains our position indices.
from src/model/llamatransformer.go
func precomputeFreqsCis(dim int, end int, theta float64, useScaled bool) (*ml.Tensor, error) {
...
freqs, err = ml.Outer(t, freqs)
if err != nil {
return nil, err
}
...
}
By calling the ml.Outer(t, freqs) function, we get a tensor with shape `{4096, 64}`, which is the outer product of the tensor t with shape `{4096}` and the tensor freqs with shape `{64}`.
This "outer product" function takes its first argument as row vectors and its second argument as column vectors.
In our case, we take the items of `t` as rows and the items of `freqs` as columns, then create a 2D tensor called `result` as follows:
| row | column | set the result as | python equivalent (rad) | python equivalent (deg) |
|---|---|---|---|---|
| `t[0] = 0` | `freqs[0] = 1.0000e+00` | `result[0][0] = 0 * 1.0000e+00 = 0` | 0.00000 | 0.00000 |
| | `freqs[1] = 8.1250e-01` | `result[0][1] = 0 * 8.1250e-01 = 0` | 0.00000 | 0.00000 |
| ... | | | | |
| `t[1] = 1` | `freqs[0] = 1.0000e+00` | `result[1][0] = 1 * 1.0000e+00 = 1.0000e+00` | 1.00000000 | 57.29578 |
| | `freqs[1] = 8.1250e-01` | `result[1][1] = 1 * 8.1250e-01 = 8.1250e-01` | 0.81461722 | 46.67413 |
| ... | | | | |
| | `freqs[62] = 3.7625e-07` | `result[1][62] = 1 * 3.7625e-07 = 3.7625e-07` | 0.00000038 | 0.00002 |
| | `freqs[63] = 3.0547e-07` | `result[1][63] = 1 * 3.0547e-07 = 3.0547e-07` | 0.00000031 | 0.00002 |
| ... | | | | |
| `t[2] = 2` | `freqs[0] = 1.0000e+00` | `result[2][0] = 2 * 1.0000e+00 = 2.0000e+00` | 2.00000000 | 114.59155 |
| | `freqs[1] = 8.1250e-01` | `result[2][1] = 2 * 8.1250e-01 = 1.6250e+00` | 1.62923443 | 93.34825 |
| ... | | | | |
| | `freqs[62] = 3.7625e-07` | `result[2][62] = 2 * 3.7625e-07 = 7.5251e-07` | 0.00000075 | 0.00004 |
| | `freqs[63] = 3.0547e-07` | `result[2][63] = 2 * 3.0547e-07 = 6.1095e-07` | 0.00000061 | 0.00004 |
| ... | | | | |
| `t[4094] = 4094` | `freqs[0] = 1.0000e+00` | `result[4094][0] = 4094 * 1.0000e+00 = 4.0800e+03` | 4094.00000000 | 234568.90625 (normalized: -151.09375) |
| | `freqs[1] = 8.1250e-01` | `result[4094][1] = 4094 * 8.1250e-01 = 3.3120e+03` | 3335.04296875 | 191083.87500 (normalized: -76.12500) |
| ... | | | | |
| | `freqs[62] = 3.7625e-07` | `result[4094][62] = 4094 * 3.7625e-07 = 1.5335e-03` | 0.00154234 | 0.08837 |
| | `freqs[63] = 3.0547e-07` | `result[4094][63] = 4094 * 3.0547e-07 = 1.2436e-03` | 0.00125642 | 0.07199 |
| ... | | | | |
| `t[4095] = 4095` | `freqs[0] = 1.0000e+00` | `result[4095][0] = 4095 * 1.0000e+00 = 4.0800e+03` | 4095.00000000 | 234626.20312 (normalized: -93.79688) |
| | `freqs[1] = 8.1250e-01` | `result[4095][1] = 4095 * 8.1250e-01 = 3.3120e+03` | 3335.85742188 | 191130.54688 (normalized: -29.45312) |
| ... | | | | |
| | `freqs[62] = 3.7625e-07` | `result[4095][62] = 4095 * 3.7625e-07 = 1.5335e-03` | 0.00154272 | 0.08839 |
| | `freqs[63] = 3.0547e-07` | `result[4095][63] = 4095 * 3.0547e-07 = 1.2436e-03` | 0.00125673 | 0.07201 |
| ... | | | | |
from src/ml/operations_impl.go
func Outer(vec1 *Tensor, vec2 *Tensor) (*Tensor, error) {
if err := processErrors(
checkIsVector(vec1),
checkIsVector(vec2),
checkSameDataType(vec1, vec2),
); err != nil {
return nil, err
}
itemSize := vec1.DataType.ItemSize
result := NewEmptyTensor([]int{vec1.Size[0], vec2.Size[0]}, vec1.DataType)
for i := 0; i < vec1.Size[0]; i++ {
rowValF32, err := vec1.GetItemByOffset_AsFloat32(i * itemSize)
if err != nil {
return nil, err
}
for j := 0; j < vec2.Size[0]; j++ {
colValF32, err := vec2.GetItemByOffset_AsFloat32(j * itemSize)
if err != nil {
return nil, err
}
valF32 := rowValF32 * colValF32
if err := result.SetItem_FromFloat32([]int{i, j}, valF32); err != nil {
return nil, err
}
}
}
return result, nil
}
cis is described at Wikipedia:
cis is a mathematical notation defined by cis x = cos x + i sin x, where cos is the cosine function, i is the imaginary unit, and sin is the sine function. x is the argument of the complex number (the angle between the line to the point and the x-axis in polar form).
With this notation, we can express a point's location in the Cartesian coordinate system with the cosine and sine of one angle, which is called polar coordinates.
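In other words, cis is the unit-magnitude case of Euler's formula, so a point with magnitude $r$ and angle $x$ can be written as:

$$
r \operatorname{cis} x = r e^{ix} = r\cos x + i\, r\sin x
$$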
We've calculated the angles of our polar coordinate points as `freqs` in the previous chapter.
import torch
def precompute_freqs_cis(dim: int, end: int, theta: float = 500000.0):
...
freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
return freqs_cis
from src/model/llamatransformer.go
func precomputeFreqsCis(dim int, end int, theta float64, useScaled bool) (*ml.Tensor, error) {
...
ones, err := ml.OnesLike(freqs)
if err != nil {
return nil, err
}
freqs_cis, err := ml.Polar(ones, freqs)
if err != nil {
return nil, err
}
return freqs_cis, nil
}
We create a tensor which contains all `1` values, with the same shape `{4096, 64}` and data type as the `freqs` tensor, via ml.OnesLike(...). These `1` values will be the magnitudes of our vectors in the polar coordinate system. We use `1` as the magnitude to get the unit vector for each angle.
By calling the ml.Polar(ones, freqs) function, we get a complex tensor with shape `{4096, 64}` whose items are the polar-form points defined by these magnitudes and the angles in `freqs`.
The Polar function is described in the Pytorch TORCH.POLAR documentation:
Constructs a complex tensor whose elements are Cartesian coordinates corresponding to the polar coordinates with absolute value abs and angle angle.
In our case, the `abs` argument is a tensor full of `1` values, and the `angle` argument is our `freqs` variable.
For each item in the `angle` tensor:

- The cosine of the angle is taken, multiplied by `absItemF64 = 1` (it is always 1 in our case), and set as the real part of the resulting complex number,
- The sine of the angle is taken, multiplied by `absItemF64 = 1` (it is always 1 in our case), and set as the imaginary part of the resulting complex number.

Then, the `dst` tensor with the `DT_COMPLEX` data type contains the cosine and sine values of our angles as complex numbers.
from src/ml/operations_impl.go
func Polar(abs *Tensor, angle *Tensor) (*Tensor, error) {
// See: (For formula) https://pytorch.org/docs/stable/generated/torch.polar.html
...
for readOffset := 0; readOffset < abs.GetBytesCount(); readOffset += absItemSize {
absItemF32, err := abs.GetItemByOffset_AsFloat32(readOffset)
if err != nil {
return nil, err
}
angleItemF32, err := angle.GetItemByOffset_AsFloat32(readOffset)
if err != nil {
return nil, err
}
absItemF64 := float64(absItemF32)
angleItemF64 := float64(angleItemF32)
realPart := absItemF64 * math.Cos(angleItemF64)
imagPart := absItemF64 * math.Sin(angleItemF64)
resultItem := complex64(complex(realPart, imagPart))
if err := dst.SetItemByOffset(writeOffset, resultItem); err != nil {
return nil, err
}
writeOffset += dstItemSize
}
return dst, nil
}
In our case, with `absItemF64` always being `1`:

| angleItemF64 | realPart | imagPart | resultItem |
|---|---|---|---|
| `freqs[0][0] = 0` | `cos(0) = 1` | `sin(0) = 0` | `(1 + 0i)` |
| `freqs[0][1] = 0` | `cos(0) = 1` | `sin(0) = 0` | `(1 + 0i)` |
| ... | | | |
| `freqs[1][0] = 1` | `cos(1) = 0.5403023` | `sin(1) = 0.84147096` | `(0.5403023 + 0.84147096i)` |
| ... | | | |
| `freqs[4095][63] = 0.0012436` | `cos(0.0012436) = 1` | `sin(0.0012436) = 0.0012436` | `(1 + 0.0012436i)` |
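These values can be reproduced with Go's standard library (my own check, not the project's code), since `cmplx.Rect(1, angle)` builds exactly cos(angle) + i·sin(angle):

```go
package main

import (
	"fmt"
	"math/cmplx"
)

func main() {
	// cmplx.Rect(r, θ) returns r*cos(θ) + i*r*sin(θ), i.e. the Cartesian form of a polar point.
	fmt.Println(cmplx.Rect(1, 0))         // (1+0i)
	fmt.Println(cmplx.Rect(1, 1))         // approximately (0.5403023+0.8414710i)
	fmt.Println(cmplx.Rect(1, 0.0012436)) // approximately (0.9999992+0.0012436i)
}
```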
Recap of what we have so far:

- An embedding layer with the shape of `{128256, 4096}`, which contains `128256` different token vectors with `4096` dimensions each,
- A token embedding sequence which was calculated using the embedding layer. Our input tensor will have the shape of `{SequenceLength, 4096}`,
- `32` "attention heads" (according to `modelArgs.N_Heads = 32`). Each attention head will have a dimension of `modelArgs.Dim / modelArgs.N_Heads`. In our case, it is `4096 / 32 = 128`. So our positional embedding tensors will have `128` dimensions,
- Because we have `32` attention heads and the dimension of each attention head is `128`, we will separate our `xq (queries)` matrix into 32 equal pieces; then, because we have `8` key/value heads, we will separate our `xk (keys)` matrix into 8 equal pieces, both ending up with `128` in one dimension. The integration of the positional embeddings with the token embeddings is then done on this `128` dimension,
- Think of 5 tokens to encode, so we have the position indices of the tokens: `{1, 2, 3, 4, 5}`.
After all of these processes, we will end up with a "positional encoding tensor" for a sequence having `5` positions, as follows:
Note: The $\LaTeX$ support of the Github web app is limited and gives non-explanatory errors when you have more than a certain number of superscript/subscript/fraction notations. So, it was a must to separate the biggest set notation into chunks.
The operation of the `applyScaling(...)` function is represented with `scl(...)`.
- Positional Encoding tensor for 5 positions, without converting to complex number:
- Positional Encoding tensor for 5 positions, after converting to complex number:
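As a standalone illustration (using only the Go standard library, not the project's `ml` package, and with the Llama 3.1 scaling constants copied from the applyScaling code above), the rows of this tensor for a few positions could be computed like this; because this sketch works in float64, the values are slightly more precise than the BFloat16 tensors of the Go project:

```go
package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

// scale mirrors the Llama 3.1 frequency adjustment shown in applyScaling above.
func scale(freq float64) float64 {
	const (
		scaleFactor    = 8.0
		lowFreqFactor  = 1.0
		highFreqFactor = 4.0
		oldContextLen  = 8192.0
	)
	lowFreqWavelen := oldContextLen / lowFreqFactor
	highFreqWavelen := oldContextLen / highFreqFactor
	wavelen := 2 * math.Pi / freq
	switch {
	case wavelen < highFreqWavelen:
		return freq
	case wavelen > lowFreqWavelen:
		return freq / scaleFactor
	default:
		smooth := (oldContextLen/wavelen - lowFreqFactor) / (highFreqFactor - lowFreqFactor)
		return (1-smooth)*freq/scaleFactor + smooth*freq
	}
}

func main() {
	const (
		dim   = 128
		theta = 500000.0
	)
	positions := 5
	freqsCis := make([][]complex128, positions)
	for pos := 0; pos < positions; pos++ {
		row := make([]complex128, dim/2)
		for i := 0; i < dim/2; i++ {
			freq := scale(1.0 / math.Pow(theta, float64(2*i)/float64(dim)))
			angle := float64(pos) * freq
			row[i] = cmplx.Rect(1, angle) // cos(angle) + i*sin(angle)
		}
		freqsCis[pos] = row
	}
	fmt.Println(freqsCis[1][:2]) // first two complex items for position 1
}
```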
The charts below aim to give you some insight into the values of the angles and the corresponding polar coordinates in the frequency tensor. The chart titles indicate which index ranges were taken as samples for each chart.
These charts are drawn to make it easy for you to compare changes between positions and dimensions.
You can check out the Python codes that create the sample data and charts used here with this Python Notebook: 10.BONUS-PRECOMPUTING-FREQUENCY-TENSOR.ipynb
3D charts of polar coordinates:
2D charts of angles: