In this chapter, we'll walk through the process of defining and implementing the Llama 3.1 Model architecture.
The Llama 2, Llama 3, and Llama 3.1 transformer model architectures are very similar, but the newer versions come with some improvements.
Inspired by the original Llama 3.1 Python repository of Meta and llama.cpp.
A Quick Reminder:
We've loaded 291 tensors from the model file into a map (PickleDict) of tensors keyed by tensor name via torchModelReader.Load().
Now, NewLlamaTransformer(...) is called to build the operation sequence graph of the Llama architecture.
from src/model/loader.go
func LoadModelEx(modelDir string, includeTensors bool, includeVocab bool) (*Model, error) {
    model := &Model{}
    if includeTensors {
        ...
        modelTensors, err := torchModelReader.Load()
        ...
        model.Tensors = modelTensors
        ...
    }
    ...
    if includeTensors {
        ...
        if model.Transformer, err = NewLlamaTransformer(model); err != nil {
            return nil, err
        }
    }
    return model, nil
}
The model.ModelArgs was loaded from the JSON file "params.json". However, some of its fields (N_Rep and HeadDim) are "fields that should be calculated", and some others may have the value -1, meaning "use the default value".
In our project, we use the Meta-Llama-3.1-8B-Instruct model. This model has the following parameters, which are loaded into model.ModelArgs:
Dim: 4096 //dim
N_Layers: 32 //n_layers
N_Heads: 32 //n_heads
N_KVHeads: 8 //n_kv_heads
VocabSize: 128256 //vocab_size
MultipleOf: 1024 //multiple_of
FFNDimMultiplier: 1.3 //ffn_dim_multiplier
NormEpsilon: 1e-5 //norm_eps
RopeTheta: 500000 //rope_theta
UseScaledRope: true //use_scaled_rope
MaxSequenceLength: //to be calculated
N_Rep: //to be calculated
HeadDim: //to be calculated
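For orientation, a minimal sketch of what such a ModelArgs struct could look like in Go is shown below. The field names follow the listing above, but the exact types, JSON tags, and field set in the project's source may differ.

// A minimal ModelArgs-like struct sketch, consistent with the listing above.
// The actual definition in the project may differ (types, tags, extra fields).
type ModelArgs struct {
    Dim              int     `json:"dim"`
    N_Layers         int     `json:"n_layers"`
    N_Heads          int     `json:"n_heads"`
    N_KVHeads        int     `json:"n_kv_heads"`
    VocabSize        int     `json:"vocab_size"`
    MultipleOf       int     `json:"multiple_of"`
    FFNDimMultiplier float64 `json:"ffn_dim_multiplier"`
    NormEpsilon      float32 `json:"norm_eps"`
    RopeTheta        float64 `json:"rope_theta"`
    UseScaledRope    bool    `json:"use_scaled_rope"`

    // Not present in params.json; calculated later while building the transformer.
    MaxSequenceLength int
    N_Rep             int
    HeadDim           int
}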
These preparations are done in this function:
- If modelArgs.VocabSize is -1 in the file, it indicates that we should set the default value: modelArgs wants to obey the "tokenizer.model" file. In our case, the tokenizer file contains 128,256 tokens.
- If modelArgs.N_KVHeads is not specified in the file, it indicates that we should set the default value, which is N_Heads. N_KVHeads is equal to 8 for the 8B/8B-Instruct Llama models.
- modelArgs.N_Rep is set to the integer value of N_Heads / N_KVHeads, the repetition count used by a following operation in the original Llama code and also in our implementation. In our case, it is 32 / 8 = 4. This means our keys and values have 8 heads while the other parts have 32 heads, so the 8 heads are repeated/copied 4 times to adapt to 32 heads (see the short sketch below).
- modelArgs.HeadDim is set to the integer value of modelArgs.Dim / modelArgs.N_Heads. In our case, it is 4096 / 32 = 128. This means we have 32 different attention heads and the dimension of each of these heads is 128.
Also, you can check out other sources about Grouped Multi-Query Attention, which isn't described in detail here.
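To make the N_Rep idea more concrete, here is a minimal, illustrative sketch (not the project's actual code; the function and parameter names are made up for this example) of how 8 key/value heads are repeated 4 times so that each of the 32 query heads gets a matching key/value head:

// Illustrative only: repeat each of the key/value heads nRep times,
// so 8 KV heads * 4 repetitions line up with 32 query heads.
func repeatKVHeads(kvHeads [][]float32, nRep int) [][]float32 {
    repeated := make([][]float32, 0, len(kvHeads)*nRep)
    for _, head := range kvHeads {
        for i := 0; i < nRep; i++ {
            repeated = append(repeated, head) // the same head data is reused nRep times
        }
    }
    return repeated // e.g. 8 heads in, 32 heads out when nRep = 4
}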
from src/model/llamatransformer.go
func NewLlamaTransformer(model *Model) (*LlamaTransformer, error) {
    result := &LlamaTransformer{}
    modelArgs := model.ModelArgs
    var err error
    // Compare (VocabSize, Dim) vs. "tok_embeddings.weight" tensor shape
    dim := modelArgs.Dim             // 4096
    vocabSize := modelArgs.VocabSize // 128256
    if modelArgs.N_KVHeads < 0 {
        modelArgs.N_KVHeads = modelArgs.N_Heads
    }
    modelArgs.N_Rep = int(modelArgs.N_Heads / modelArgs.N_KVHeads)
    // Calculate dimension of each head
    modelArgs.HeadDim = int(modelArgs.Dim / modelArgs.N_Heads) // 128
    ...
}
Yes! We are at the stage where we REALLY start building the model by laying the first brick!
The getTensor function fetches the weights tensor with the specified name from the Model.Tensors map, checks whether it really has the expected shape we specified, then returns the ml.Tensor object, or returns an "incorrect shape" error.
The weights tensor named "tok_embeddings.weight" is taken and set to result.tok_embd as the first brick. Our result.tok_embd weights tensor has the shape {vocabSize, dim} = {128256, 4096}.
from src/model/llamatransformer.go
func NewLlamaTransformer(model *Model) (*LlamaTransformer, error) {
    result := &LlamaTransformer{}
    ...
    if result.tok_embd, err = getTensor(model, "tok_embeddings.weight", []int{vocabSize, dim}); err != nil {
        return nil, err
    }
    ...
}
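Conceptually, getTensor does something like the sketch below. This is only an illustration of the behavior described above (using fmt and reflect); the accessor on the tensors map (Get) and the Size field are assumptions, and the real implementation in the project may differ.

// Illustrative sketch of getTensor's behavior (not the project's actual code):
// look the tensor up by name, verify its shape, return it or an error.
func getTensor(model *Model, name string, expectedShape []int) (*ml.Tensor, error) {
    tensor, ok := model.Tensors.Get(name) // hypothetical accessor on the tensors map
    if !ok {
        return nil, fmt.Errorf("tensor %s not found", name)
    }
    if !reflect.DeepEqual(tensor.Size, expectedShape) { // hypothetical shape field
        return nil, fmt.Errorf("tensor %s has incorrect shape: expected %v, got %v", name, expectedShape, tensor.Size)
    }
    return tensor, nil
}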
The most important part of transformer models, the part that makes their outputs accurate, is the attention mechanism. Each Llama "block" consists of a self-attention part and a feed-forward neural network part. The details will be explained later; note that we also call these "blocks" "layers".
The value of the modelArgs.N_Layers variable corresponds to the number of blocks we have. It is 32, so we will instantiate 32 different LlamaTransformerBlock objects via the NewLlamaTransformerBlock(...) function. To achieve this, we allocate the result.Layers array with 32 items, then set each item by instantiating a block.
from src/model/llamatransformer.go
func NewLlamaTransformer(model *Model) (*LlamaTransformer, error) {
    result := &LlamaTransformer{}
    ...
    result.Layers = make([]*LlamaTransformerBlock, modelArgs.N_Layers)
    for layerIdx := 0; layerIdx < modelArgs.N_Layers; layerIdx++ {
        var layer *LlamaTransformerBlock
        if layer, err = NewLlamaTransformerBlock(model, layerIdx); err != nil {
            return nil, err
        }
        result.Layers[layerIdx] = layer
    }
    ...
}
The LlamaTransformerBlock object consists of attn_norm (RMS normalization), attention (attention mechanism), ffn_norm (RMS normalization), and feedForward (feed-forward neural network) modules. These modules operate in this order.
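As a preview of how these four modules fit together, the sketch below shows the conceptual data flow of one block: pre-normalization followed by the module, with a residual (skip) connection around each pair. The add(...) helper and the forward(...) methods are placeholders for this illustration; the project's actual Forward logic is covered in later chapters.

// Conceptual data flow of one transformer block (illustrative, not the project's code):
// h   = x + attention(attn_norm(x))
// out = h + feedForward(ffn_norm(h))
func transformerBlockConcept(block *LlamaTransformerBlock, x *ml.Tensor) *ml.Tensor {
    h := add(x, block.attention.forward(block.attn_norm.forward(x)))     // self-attention part with residual
    out := add(h, block.feedForward.forward(block.ffn_norm.forward(h)))  // feed-forward part with residual
    return out
}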
Type definition:
from src/model/llamatransformer.go
type LlamaTransformerBlock struct {
    LayerIndex int
    attn_norm *RMSNorm // Weights Original: "layers.0.attention_norm.weight" | ggml: "blk.0.attn_norm.weight" | shape: [4096] -> [Dim]
    ffn_norm  *RMSNorm // Weights Original: "layers.0.ffn_norm.weight" | ggml: "blk.0.ffn_norm.weight" | shape: [4096] -> [Dim]
    attention   *LlamaAttention
    feedForward *LlamaFeedForward
}
In Llama models, these normalization modules are applied before their paired modules; e.g., attn_norm is applied before the attention module, and ffn_norm is applied before feedForward. This approach is called pre-normalization. Root Mean Square Layer Normalization (RMSNorm) is used as the normalization technique.
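As a quick preview of the technique itself: RMSNorm divides each element of a vector by the root mean square of the whole vector (plus a small epsilon for numerical stability) and then multiplies it by a learned weight. A minimal sketch over a plain float32 slice is shown below (assuming the standard math package); the project's tensor-based implementation is described in a later chapter.

// Minimal RMSNorm sketch over a plain float32 slice (illustrative, not the project's code):
// out[i] = x[i] / sqrt(mean(x^2) + epsilon) * weights[i]
func rmsNorm(x, weights []float32, epsilon float32) []float32 {
    var sumOfSquares float32
    for _, v := range x {
        sumOfSquares += v * v
    }
    meanOfSquares := sumOfSquares / float32(len(x))
    scale := 1.0 / float32(math.Sqrt(float64(meanOfSquares+epsilon)))
    out := make([]float32, len(x))
    for i, v := range x {
        out[i] = v * scale * weights[i]
    }
    return out
}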
We will dive deeper into the details in further chapters; at this stage, we should stay at a zoomed-out view.
At this stage, our steps are:
- Taking the weights tensor of the attention norm corresponding to the current layer index, "layers.%d.attention_norm.weight". In the Llama model, these weight tensors are named "layers.0.attention_norm.weight", "layers.1.attention_norm.weight", "layers.2.attention_norm.weight", ..., "layers.31.attention_norm.weight". This weights tensor has the shape {dim} = {4096},
- Instantiating an RMSNorm object with modelArgs.NormEpsilon (1e-5 as the epsilon value) and the attn_norm_weights tensor via NewRMSNorm(...). Then it is set to result.attn_norm,
- Instantiating a LlamaAttention object via NewLlamaAttention(...). Then it is set to result.attention,
- Taking the weights tensor of the feed-forward neural network norm corresponding to the current layer index, "layers.%d.ffn_norm.weight". In the Llama model, these weight tensors are named "layers.0.ffn_norm.weight", "layers.1.ffn_norm.weight", "layers.2.ffn_norm.weight", ..., "layers.31.ffn_norm.weight". This weights tensor has the shape {dim} = {4096},
- Instantiating an RMSNorm object with modelArgs.NormEpsilon (1e-5 as the epsilon value) and the ffn_norm_weights tensor via NewRMSNorm(...). Then it is set to result.ffn_norm,
- Instantiating a LlamaFeedForward object via NewLlamaFeedForward(...). Then it is set to result.feedForward.
from src/model/llamatransformer.go
func NewLlamaTransformerBlock(model *Model, layerIndex int) (*LlamaTransformerBlock, error) {
    result := &LlamaTransformerBlock{
        LayerIndex: layerIndex,
    }
    modelArgs := model.ModelArgs
    dim := modelArgs.Dim // 4096
    var err error
    // attention normalization
    attn_norm_weights, err := getLayerTensor(model, "layers.%d.attention_norm.weight", layerIndex, []int{dim})
    if err != nil {
        return nil, err
    }
    result.attn_norm = NewRMSNorm(modelArgs.NormEpsilon, attn_norm_weights)
    if result.attention, err = NewLlamaAttention(model, layerIndex); err != nil {
        return nil, err
    }
    // feed forward normalization
    ffn_norm_weights, err := getLayerTensor(model, "layers.%d.ffn_norm.weight", layerIndex, []int{dim})
    if err != nil {
        return nil, err
    }
    result.ffn_norm = NewRMSNorm(modelArgs.NormEpsilon, ffn_norm_weights)
    if result.feedForward, err = NewLlamaFeedForward(model, layerIndex); err != nil {
        return nil, err
    }
    return result, nil
}
The LlamaAttention object consists of:
- attn_wq: Attention query weights tensor with the shape {N_Heads * HeadDim, Dim} = {32 * 128, 4096} = {4096, 4096},
- attn_wk: Attention key weights tensor with the shape {N_KVHeads * HeadDim, Dim} = {8 * 128, 4096} = {1024, 4096},
- attn_wv: Attention value weights tensor with the shape {N_KVHeads * HeadDim, Dim} = {8 * 128, 4096} = {1024, 4096},
- attn_wo: Attention output weights tensor with the shape {N_Heads * HeadDim, Dim} = {32 * 128, 4096} = {4096, 4096}.
Type definition:
from src/model/llamatransformer.go
type LlamaAttention struct {
    LayerIndex int
    N_Heads   int
    N_KVHeads int
    N_Rep     int
    HeadDim   int
    attn_wq *ml.Tensor // Original: "layers.0.attention.wq.weight" | ggml: "blk.0.attn_q.weight" | [out_features, in_features] -> shape: [4096 4096] -> [N_Heads * HeadDim, Dim]
    attn_wk *ml.Tensor // Original: "layers.0.attention.wk.weight" | ggml: "blk.0.attn_k.weight" | [out_features, in_features] -> shape: [1024 4096] -> [N_KVHeads * HeadDim, Dim]
    attn_wv *ml.Tensor // Original: "layers.0.attention.wv.weight" | ggml: "blk.0.attn_v.weight" | [out_features, in_features] -> shape: [1024 4096] -> [N_KVHeads * HeadDim, Dim]
    attn_wo *ml.Tensor // Original: "layers.0.attention.wo.weight" | ggml: "blk.0.attn_output.weight" | [out_features, in_features] -> shape: [4096 4096] -> [N_Heads * HeadDim, Dim]
}
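To see where these shapes come from, the arithmetic below recomputes them from the 8B model's parameters; this is just a throwaway illustration, not project code.

// Plain arithmetic behind the attention weight shapes above (8B model numbers).
func attentionShapeArithmetic() {
    dim, nHeads, nKVHeads := 4096, 32, 8
    headDim := dim / nHeads             // 128
    qOutFeatures := nHeads * headDim    // 32 * 128 = 4096 -> attn_wq / attn_wo are [4096, 4096]
    kvOutFeatures := nKVHeads * headDim // 8 * 128 = 1024  -> attn_wk / attn_wv are [1024, 4096]
    fmt.Println(qOutFeatures, kvOutFeatures)
}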
NewLlamaAttention(...) is called to instantiate a new LlamaAttention object for the current layer.
from src/model/llamatransformer.go
func NewLlamaTransformerBlock(model *Model, layerIndex int) (*LlamaTransformerBlock, error) {
    result := &LlamaTransformerBlock{
        LayerIndex: layerIndex,
    }
    ...
    if result.attention, err = NewLlamaAttention(model, layerIndex); err != nil {
        return nil, err
    }
    ...
}
In NewLlamaAttention(...):
- Calculating the total dimension of the normal (query) heads and of the KV heads (key-value heads). In our case, these are 4096 and 1024, respectively,
- Taking the weights tensor of the attention query corresponding to the current layer index, "layers.%d.attention.wq.weight". Then it is set to result.attn_wq,
- Taking the weights tensor of the attention key corresponding to the current layer index, "layers.%d.attention.wk.weight". Then it is set to result.attn_wk,
- Taking the weights tensor of the attention value corresponding to the current layer index, "layers.%d.attention.wv.weight". Then it is set to result.attn_wv,
- Taking the weights tensor of the attention output corresponding to the current layer index, "layers.%d.attention.wo.weight". Then it is set to result.attn_wo.
from src/model/llamatransformer.go
func NewLlamaAttention(model *Model, layerIndex int) (*LlamaAttention, error) {
    result := &LlamaAttention{
        LayerIndex: layerIndex,
    }
    modelArgs := model.ModelArgs
    dim := modelArgs.Dim // 4096
    var err error
    result.N_Heads = modelArgs.N_Heads
    result.N_KVHeads = modelArgs.N_KVHeads
    result.N_Rep = modelArgs.N_Rep
    // Calculate dimension of each head
    result.HeadDim = modelArgs.HeadDim                        // 128
    normalHeadsTotalDim := modelArgs.N_Heads * result.HeadDim // 32 * 128 = 4096
    kvHeadsTotalDim := result.N_KVHeads * result.HeadDim      // 8 * 128 = 1024
    // attn_wq, attn_wk, attn_wv, attn_wo are Linear units, so weight shapes are ordered reversely as [out_features, in_features]
    if result.attn_wq, err = getLayerTensor(model, "layers.%d.attention.wq.weight", layerIndex, []int{normalHeadsTotalDim, dim}); err != nil {
        return nil, err
    }
    if result.attn_wk, err = getLayerTensor(model, "layers.%d.attention.wk.weight", layerIndex, []int{kvHeadsTotalDim, dim}); err != nil {
        return nil, err
    }
    if result.attn_wv, err = getLayerTensor(model, "layers.%d.attention.wv.weight", layerIndex, []int{kvHeadsTotalDim, dim}); err != nil {
        return nil, err
    }
    if result.attn_wo, err = getLayerTensor(model, "layers.%d.attention.wo.weight", layerIndex, []int{normalHeadsTotalDim, dim}); err != nil {
        return nil, err
    }
    return result, nil
}
The LlamaFeedForward object consists of:
- ffn_gate: Feed-forward gate weights tensor with the shape {FFNHiddenDim, Dim} = {14336, 4096},
- ffn_down: Feed-forward down weights tensor with the shape {Dim, FFNHiddenDim} = {4096, 14336},
- ffn_up: Feed-forward up weights tensor with the shape {FFNHiddenDim, Dim} = {14336, 4096}.
Note: The FFNHiddenDim value is calculated as 14336; we will see how it is calculated below.
Type definition:
from src/model/llamatransformer.go
type LlamaFeedForward struct {
    FFNHiddenDim int
    ffn_gate *ml.Tensor // Original: "layers.0.feed_forward.w1.weight" | ggml: "blk.0.ffn_gate.weight" | [out_features, in_features] -> shape: [14336 4096] -> [FFNHiddenDim, Dim] | w1
    ffn_down *ml.Tensor // Original: "layers.0.feed_forward.w2.weight" | ggml: "blk.0.ffn_down.weight" | [out_features, in_features] -> shape: [4096 14336] -> [Dim, FFNHiddenDim] | w2
    ffn_up   *ml.Tensor // Original: "layers.0.feed_forward.w3.weight" | ggml: "blk.0.ffn_up.weight" | [out_features, in_features] -> shape: [14336 4096] -> [FFNHiddenDim, Dim] | w3
}
NewLlamaFeedForward(...) is called to instantiate a new LlamaFeedForward object for the current layer.
from src/model/llamatransformer.go
func NewLlamaTransformerBlock(model *Model, layerIndex int) (*LlamaTransformerBlock, error) {
    result := &LlamaTransformerBlock{
        LayerIndex: layerIndex,
    }
    ...
    if result.feedForward, err = NewLlamaFeedForward(model, layerIndex); err != nil {
        return nil, err
    }
    ...
}
In NewLlamaFeedForward(...):
- Calculating the dimension of the feed-forward neural network's hidden layer, result.FFNHiddenDim. Honestly, I couldn't fully reason through this part; the calculation method was taken directly from here and here,
- Taking the weights tensor of the feed-forward gate corresponding to the current layer index, "layers.%d.feed_forward.w1.weight". Then it is set to result.ffn_gate,
- Taking the weights tensor of the feed-forward down projection corresponding to the current layer index, "layers.%d.feed_forward.w2.weight". Then it is set to result.ffn_down,
- Taking the weights tensor of the feed-forward up projection corresponding to the current layer index, "layers.%d.feed_forward.w3.weight". Then it is set to result.ffn_up.
Note: ffn_gate, ffn_down, and ffn_up are Linear units, so their weight shapes are ordered reversely as [out_features, in_features]. At first sight, this may be confusing.
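This ordering is how a PyTorch nn.Linear layer stores its weight matrix: the layer computes y = x · Wᵀ, so a weight with shape [out_features, in_features] maps an in_features-dimensional input to an out_features-dimensional output. The tiny helper below is only an illustration of that shape rule, not project code.

// Shape rule of a Linear layer whose weight is stored as [out_features, in_features]:
// an input of shape [rows, in_features] produces an output of shape [rows, out_features].
func linearOutputShape(inputShape, weightShape []int) []int {
    return []int{inputShape[0], weightShape[0]}
}

// Example: linearOutputShape([]int{1, 4096}, []int{14336, 4096}) returns [1, 14336],
// which is how ffn_gate maps a 4096-dim token vector to the 14336-dim hidden layer.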
from src/model/llamatransformer.go
func NewLlamaFeedForward(model *Model, layerIndex int) (*LlamaFeedForward, error) {
    result := &LlamaFeedForward{}
    modelArgs := model.ModelArgs
    dim := modelArgs.Dim // 4096
    var err error
    // See: https://github.com/meta-llama/llama-models/blob/f45cdfd624b98b6655540f7101d8d9cb432e631c/models/llama3_1/reference_impl/model.py#L256
    // Set it to 4 * dim at first
    result.FFNHiddenDim = 4 * modelArgs.Dim
    // See: https://github.com/meta-llama/llama-models/blob/f45cdfd624b98b6655540f7101d8d9cb432e631c/models/llama3_1/reference_impl/model.py#L227
    // Then, do this calculation below:
    result.FFNHiddenDim = int(2 * result.FFNHiddenDim / 3)
    if modelArgs.FFNDimMultiplier > -1 {
        result.FFNHiddenDim = int(modelArgs.FFNDimMultiplier * float64(result.FFNHiddenDim))
    }
    // Ensure ffnHiddenDim is multiple of modelArgs.MultipleOf value
    result.FFNHiddenDim = int(modelArgs.MultipleOf * ((result.FFNHiddenDim + modelArgs.MultipleOf - 1) / modelArgs.MultipleOf))
    // ffn_gate, ffn_down, ffn_up are Linear units, so weight shapes are ordered reversely as [out_features, in_features]
    if result.ffn_gate, err = getLayerTensor(model, "layers.%d.feed_forward.w1.weight", layerIndex, []int{result.FFNHiddenDim, dim}); err != nil {
        return nil, err
    }
    if result.ffn_down, err = getLayerTensor(model, "layers.%d.feed_forward.w2.weight", layerIndex, []int{dim, result.FFNHiddenDim}); err != nil {
        return nil, err
    }
    if result.ffn_up, err = getLayerTensor(model, "layers.%d.feed_forward.w3.weight", layerIndex, []int{result.FFNHiddenDim, dim}); err != nil {
        return nil, err
    }
    return result, nil
}
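To make the FFNHiddenDim calculation above concrete, here is the same arithmetic with the 8B model's values (Dim = 4096, FFNDimMultiplier = 1.3, MultipleOf = 1024) substituted in as literals:

// Worked example of the FFNHiddenDim calculation with the 8B model's values:
ffnHiddenDim := 4 * 4096                                 // 16384
ffnHiddenDim = 2 * ffnHiddenDim / 3                      // 10922 (integer division)
ffnHiddenDim = int(1.3 * float64(ffnHiddenDim))          // 14198
ffnHiddenDim = 1024 * ((ffnHiddenDim + 1024 - 1) / 1024) // rounded up to a multiple of 1024: 14336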
After completing this stage, our LlamaTransformerBlock object for the first layer has been built.
This part is repeated 32 times for the Llama 3.1 8B models.
A Quick Reminder:
We've done the following things so far:
- Built the embedding layer,
- Built 32 LlamaTransformerBlock objects, each containing an attention module and a feed-forward module with RMS prenormalization.
After executing these layers, we have a currentTensor object as the output of the preceding "transformer blocks". Then, we need to normalize this tensor and process it with the "output weights".
We continue with:
- Taking the weights tensor of the output norm, "norm.weight". This weights tensor has the shape {dim} = {4096},
- Instantiating an RMSNorm object with modelArgs.NormEpsilon (1e-5 as the epsilon value) and the output_norm_weights tensor via NewRMSNorm(...). Then it is set to result.output_norm,
- Taking the weights tensor of the output, "output.weight". This weights tensor has the shape {vocabSize, dim} = {128256, 4096}. Then it is set to result.output.
Note: The output is a Linear unit, so its weight shape is ordered reversely as [out_features, in_features]. At first sight, this may be confusing.
from src/model/llamatransformer.go
func NewLlamaTransformer(model *Model) (*LlamaTransformer, error) {
    result := &LlamaTransformer{}
    ...
    output_norm_weights, err := getTensor(model, "norm.weight", []int{dim})
    if err != nil {
        return nil, err
    }
    result.output_norm = NewRMSNorm(modelArgs.NormEpsilon, output_norm_weights)
    // output is a Linear unit, so weight shape is ordered reversely as [out_features, in_features]
    if result.output, err = getTensor(model, "output.weight", []int{vocabSize, dim}); err != nil {
        return nil, err
    }
    ...
}
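At inference time, the purpose of these two pieces is to turn the output of the last transformer block into token scores: the tensor of shape {sequenceLength, 4096} is normalized by output_norm and then projected by the output weights to logits of shape {sequenceLength, 128256}, one score per vocabulary token. The sketch below only illustrates these shapes; the helper names (forward, linearProjection) are placeholders, not the project's API.

// Shape-level sketch of the final projection (illustrative placeholders only):
// currentTensor {seqLen, 4096} -> normalized {seqLen, 4096} -> logits {seqLen, 128256}
normalized := result.output_norm.forward(currentTensor)
logits := linearProjection(normalized, result.output) // multiplies by output.weight transposed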
The following code comment (docstring) from the original Llama 2 Python code explains the frequency precomputation that comes next:
"""
Precompute the frequency tensor for complex exponentials (cis) with given dimensions.
This function calculates a frequency tensor with complex exponentials using the given dimension 'dim'
and the end index 'end'. The 'theta' parameter scales the frequencies.
The returned tensor contains complex values in complex64 data type.
Args:
dim (int): Dimension of the frequency tensor.
end (int): End index for precomputing frequencies.
theta (float, optional): Scaling factor for frequency computation. Defaults to 500000.0.
Returns:
torch.Tensor: Precomputed frequency tensor with complex exponentials.
"""
precomputeFreqsCis(...) is called to calculate the frequency tensor for complex exponentials (cis). These tensor values will be used by applyRotaryEmbeddings while applying Rotary Embeddings later.
from src/model/llamatransformer.go
func NewLlamaTransformer(model *Model) (*LlamaTransformer, error) {
    result := &LlamaTransformer{}
    ...
    if result.PrecomputedFreqsCis, err = precomputeFreqsCis(int(dim/modelArgs.N_Heads), modelArgs.MaxSequenceLength*2, modelArgs.RopeTheta, modelArgs.UseScaledRope); err != nil {
        return nil, err
    }
    return result, nil
}
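Conceptually, and deferring all details to the dedicated RoPE chapter, this precomputation builds one frequency per pair of head dimensions, 1 / theta^(2i/headDim), and then a unit complex number cos(angle) + i*sin(angle) for every (position, frequency) pair. The sketch below shows that core idea only; it ignores the use_scaled_rope frequency scaling and the project's ml.Tensor types, so treat it as an illustration rather than the actual implementation.

// Minimal conceptual sketch of precomputing RoPE frequencies as complex values.
// The project's precomputeFreqsCis additionally applies frequency scaling when
// useScaledRope is enabled, which is skipped here.
func precomputeFreqsCisSketch(headDim int, end int, theta float64) [][]complex128 {
    // One frequency per pair of dimensions: 1 / theta^(2i/headDim)
    freqs := make([]float64, headDim/2)
    for i := range freqs {
        freqs[i] = 1.0 / math.Pow(theta, float64(2*i)/float64(headDim))
    }
    // For each position, one unit complex number per frequency: cos(pos*freq) + i*sin(pos*freq)
    cis := make([][]complex128, end)
    for pos := 0; pos < end; pos++ {
        row := make([]complex128, len(freqs))
        for j, f := range freqs {
            angle := float64(pos) * f
            row[j] = complex(math.Cos(angle), math.Sin(angle))
        }
        cis[pos] = row
    }
    return cis
}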
The details of the precomputeFreqsCis(...) function are discussed in a dedicated chapter: 10. RoPE (ROTARY POSITIONAL EMBEDDINGS).
Now, we have a complete Model object that contains the model arguments, the tokenizer, and the LlamaTransformer object in its model.Transformer field, which holds the complete Llama 3.1 8B model architecture.