In this chapter, we'll walk through the process of reading a tensor stored in the "consolidated.00.pth" file, step by step.
A Quick Reminder:
Pickle uses a simple stack-based virtual machine that records the instructions used to reconstruct the object. The file is just a list of serialized opcodes, the first one being expected to be the protocol version and the last one a stop opcode. When the stop opcode is met, the current object on the stack is popped.
See: Wikipedia | Diving into the Python Pickle format

The Pickle format consists of pairs of opcodes and data bytes. Each pair starts with one byte that stands for an "opcode" and continues with data bytes that vary by opcode type.
PROTO
stands for the Pickle protocol version number. We start reading with it.
- Read key byte: 0x80, corresponding opcode: PROTO, function: load_proto
- Read one byte: 0x02, identifies Pickle Protocol v2.
pr.stack: {}
pr.metastack: {}
memo: {}
The file continues with the data of the first weight tensor. Because the Pickle format is generic, the file contains some instructions to construct generic data structures first.
- Read key byte: 0x7D, char: '}', corresponding opcode: EMPTY_DICT, function: load_empty_dictionary
- Call `NewPickleDict[interface{}]()` to create a new empty `pickle.PickleDict[interface {}]` object
- Push it onto `pr.stack`
pr.stack: {
PickleDict{}
}
pr.metastack: {}
memo: {}
- Read key byte: 0x71, char: 'q', corresponding opcode: BINPUT, function: load_binput
- Read one byte: 0x00 as the memo map key
- Take the last element of `pr.stack` and assign it as the 0th item of `pr.memo`
pr.stack: {
PickleDict{}
}
pr.metastack: {}
memo: {
0: PickleDict{}
}
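The BINPUT behavior can be sketched as follows. The `reader` struct here is a simplified stand-in for the project's pickle reader state, with the stack as a slice and the memo as a map:

```go
package main

import "fmt"

// reader is a simplified stand-in for a pickle reader's state.
type reader struct {
	stack []interface{}
	memo  map[int]interface{}
}

// loadBinput peeks at the top of the stack (without popping) and stores it
// in the memo map under the given one-byte key.
func (pr *reader) loadBinput(key byte) {
	pr.memo[int(key)] = pr.stack[len(pr.stack)-1]
}

func main() {
	pr := &reader{
		stack: []interface{}{map[string]interface{}{}}, // the dict pushed by EMPTY_DICT
		memo:  map[int]interface{}{},
	}
	pr.loadBinput(0x00) // memo[0] now references the dict
	fmt.Println(len(pr.memo)) // 1
}
```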
- Read key byte: 0x28, char: '(', corresponding opcode: MARK, function: load_mark
- Append the `pr.stack` array onto the `metastack` array (an array of arrays)
- Create a new empty `pr.stack` array
pr.stack: {}
pr.metastack: {
PickleDict{}
}
memo: {
0: PickleDict{}
}
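The MARK mechanics (a stack of stacks) can be sketched like this; again, a simplified model of the reader's state, not the project's exact code:

```go
package main

import "fmt"

// reader models only the two stacks that MARK manipulates.
type reader struct {
	stack     []interface{}
	metastack [][]interface{}
}

// loadMark pushes the current stack onto the metastack and starts a fresh,
// empty stack.
func (pr *reader) loadMark() {
	pr.metastack = append(pr.metastack, pr.stack)
	pr.stack = []interface{}{}
}

func main() {
	pr := &reader{stack: []interface{}{"PickleDict{}"}}
	pr.loadMark()
	fmt.Println(len(pr.stack), len(pr.metastack)) // 0 1
}
```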
The file continues with the name of the first tensor.
- Read key byte: 0x58, char: 'X', corresponding opcode: BINUNICODE, function: load_binunicode
- Read 4 bytes: [0x15, 0x00, 0x00, 0x00], convert it to int32 as little-endian: 21 (decimal). Identifies the length of the unicode string.
- Read 21 bytes, convert them to a string that identifies the name of the upcoming tensor: "tok_embeddings.weight".
- Push it onto `pr.stack`
pr.stack: {
"tok_embeddings.weight"
}
pr.metastack: {
{
PickleDict{}
}
}
memo: {
0: PickleDict{}
}
- Read key byte: 0x71, char: 'q', corresponding opcode: BINPUT, function: load_binput
- Read one byte: 0x01 as the memo map key
- Take the last element of `pr.stack` and assign it as the 1st item of `pr.memo`
pr.stack: {
"tok_embeddings.weight"
}
pr.metastack: {
{
PickleDict{}
}
}
memo: {
0: PickleDict{}
1: "tok_embeddings.weight"
}
The file continues with the `torch.rebuild_tensor_v2(...)` function. Then, we will read some instructions that construct a `torch.TorchStorage` object. This object will be one of the arguments of the `torch.rebuild_tensor_v2(...)` function.
- Read key byte: 0x63, char: 'c', corresponding opcode: GLOBAL, function: load_global
- Read a one-line string (until the '\n' byte): "torch._utils", identifies the Python module name
- Read a one-line string (until the '\n' byte): "_rebuild_tensor_v2", identifies the Python class/function name
- Call pr.findClass to get the Go object corresponding to this module and name: the `torch.rebuild_tensor_v2` function
- Push it onto `pr.stack`
pr.stack: {
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
pr.metastack: {
{
PickleDict{}
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight"
}
- Read key byte: 0x71, char: 'q', corresponding opcode: BINPUT, function: load_binput
- Read one byte: 0x02 as the memo map key
- Take the last element of `pr.stack` and assign it as the 2nd item of `pr.memo`
pr.stack: {
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
pr.metastack: {
{
PickleDict{}
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...)
}
- Read key byte: 0x28, char: '(', corresponding opcode: MARK, function: load_mark
- Append the `pr.stack` array onto the `metastack` array (an array of arrays)
- Create a new empty `pr.stack` array
pr.stack: {}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...)
}
- Read key byte: 0x28, char: '(', corresponding opcode: MARK, function: load_mark
- Append the `pr.stack` array onto the `metastack` array (an array of arrays)
- Create a new empty `pr.stack` array
pr.stack: {}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
},
{}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...)
}
- Read key byte: 0x58, char: 'X', corresponding opcode: BINUNICODE, function: load_binunicode
- Read 4 bytes: [0x07, 0x00, 0x00, 0x00], convert it to int32 as little-endian: 7 (decimal). Identifies the length of the unicode string.
- Read 7 bytes, convert them to the string "storage". This is the type tag of the upcoming persistent ID, not a tensor name.
- Push it onto `pr.stack`
pr.stack: {
"storage"
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
},
{}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...)
}
..... Some steps were taken
- Current state:
pr.stack: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
},
{}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu"
}
- Read key byte: 0x74, char: 't', corresponding opcode: TUPLE, function: load_tuple
- Call the pop_mark function to build a tuple from the topmost stack items:
    - Back up the current `pr.stack`
    - Pop the last item from `pr.metastack`; the result is an array
    - Assign that result to `pr.stack`
    - Return the backed-up stack array
- Push the returned stack array onto `pr.stack` as an array
pr.stack: {
{
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
pr.metastack: {
{
PickleDict{}
}
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu"
}
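The TUPLE/pop_mark steps can be sketched with the same simplified reader model; the slice `[]interface{}` stands in for a Python tuple:

```go
package main

import "fmt"

// reader models the two stacks involved in MARK/TUPLE.
type reader struct {
	stack     []interface{}
	metastack [][]interface{}
}

// popMark returns the items pushed since the last MARK and restores the
// stack that was saved on the metastack.
func (pr *reader) popMark() []interface{} {
	items := pr.stack
	pr.stack = pr.metastack[len(pr.metastack)-1]
	pr.metastack = pr.metastack[:len(pr.metastack)-1]
	return items
}

// loadTuple turns the items since the last MARK into a single tuple value
// and pushes it onto the restored stack.
func (pr *reader) loadTuple() {
	items := pr.popMark()
	pr.stack = append(pr.stack, items)
}

func main() {
	pr := &reader{
		stack:     []interface{}{"storage", "0", "cpu", 525336576},
		metastack: [][]interface{}{{}, {"tok_embeddings.weight"}},
	}
	pr.loadTuple()
	// The restored stack now holds the tensor name and the new tuple.
	fmt.Println(len(pr.stack), len(pr.metastack)) // 2 1
}
```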
- Read key byte: 0x71, char: 'q', corresponding opcode: BINPUT, function: load_binput
- Read one byte: 0x07 as the memo map key
- Take the last element of `pr.stack` and assign it as the 7th item of `pr.memo`
pr.stack: {
{
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
- Read key byte: 0x51, char: 'Q', corresponding opcode: BINPERSID, function: load_binpersid
- Pop the last item from `pr.stack`; the result is an array: `{"storage", StorageKind{ml.DT_BF16}, "0", "cpu", 525336576}`. It identifies the "pid" argument for the `pr.persistentLoad(...)` function
- The `pr.persistentLoad(...)` function calls the `pr.PersistentLoadFn(...)` custom function with the "pid" array argument
- The `TorchModelReader.persistentLoad(...)` function is called with the "pid" array argument
- This function parses the pid array `{"storage", StorageKind{ml.DT_BF16}, "0", "cpu", 525336576}`:
    - pid[0] = "storage", it must be the constant string "storage"
    - pid[1] = StorageKind{ml.DT_BF16}, the data type kind of the defined storage
    - pid[2] = "0", the filenameStem; the filename is defined as "consolidated/data/0"
    - pid[3] = "cpu", identifies the tensor device; we don't use this data
    - pid[4] = 525336576, identifies the element count of the tensor contained in the "consolidated/data/0" file
- Find the "consolidated/data/0" file entry in the ZIP file and get its storage offset: 34304 (the starting location of the tensor bytes)
- Create a TorchStorage object with the given data type and storage offset
- Calculate the byte locations (start location, end location) from the given storage offset and the given element count
- Do memory-mapping between `TorchStorage.rawData` and the bytes at the calculated locations. Now we have a `[]byte` array memory-mapped to the file for the bytes of the current tensor
- Return the TorchStorage object
- Push the TorchStorage object onto `pr.stack`
pr.stack: {
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
}
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
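The byte-range calculation mentioned in the steps above can be sketched as follows, assuming each BF16 element occupies 2 bytes (the function name is illustrative):

```go
package main

import "fmt"

// storageByteRange computes the [begin, end) byte locations of a tensor's
// raw data inside the ZIP, from the entry's storage offset, the element
// count, and the per-element size in bytes.
func storageByteRange(storageOffset, elementCount, itemSize int64) (begin, end int64) {
	begin = storageOffset
	end = storageOffset + elementCount*itemSize
	return begin, end
}

func main() {
	// The values from this walkthrough: offset 34304, 525336576 BF16 elements.
	begin, end := storageByteRange(34304, 525336576, 2)
	fmt.Println(begin, end) // 34304 1050707456
}
```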
Now, we have a `torch.TorchStorage` object. The file continues with the other arguments of the `torch.rebuild_tensor_v2(...)` function. We will read some instructions that construct these other arguments, then call the function with all of the gathered arguments.
- Read key byte: 0x4b, char: 'K', corresponding opcode: BININT1, function: load_binint1
- Push a 1-byte unsigned int
- Read one byte: 0x00
- Push it onto `pr.stack`
pr.stack: {
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
}
0
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
- Read key byte: 0x4a, char: 'J', corresponding opcode: BININT, function: load_binint
- Push a 4-byte signed int
- Read 4 bytes: [0x00, 0xF5, 0x01, 0x00], convert it to int32 as little-endian: 128256 (decimal). Values that don't fit in 2 bytes use the 4-byte BININT opcode instead of BININT2.
- Push it onto `pr.stack`
pr.stack: {
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
},
0,
128256
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
- Read key byte: 0x4d, char: 'M', corresponding opcode: BININT2, function: load_binint2
- Push a 2-byte unsigned int
- Read 2 bytes: [0x00, 0x10], convert it to uint16 as little-endian: 4096 (decimal).
- Push it onto `pr.stack`
pr.stack: {
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
},
0,
128256,
4096,
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
..... Some steps were taken
- Current state:
pr.stack: {
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
},
0,
{128256, 4096},
{4096, 1},
false,
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
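Why does the stride {4096, 1} pair with the shape {128256, 4096}? For a contiguous row-major tensor, each stride is the product of the dimension sizes that follow it; a small sketch:

```go
package main

import "fmt"

// contiguousStride returns the row-major strides (in elements) for a tensor
// of the given shape: each stride is the product of all later dimensions.
func contiguousStride(shape []int) []int {
	stride := make([]int, len(shape))
	acc := 1
	for i := len(shape) - 1; i >= 0; i-- {
		stride[i] = acc
		acc *= shape[i]
	}
	return stride
}

func main() {
	fmt.Println(contiguousStride([]int{128256, 4096})) // [4096 1]
}
```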
- Read key byte: 0x63, char: 'c', corresponding opcode: GLOBAL, function: load_global
- Read a one-line string (until the '\n' byte): "collections", identifies the Python module name
- Read a one-line string (until the '\n' byte): "OrderedDict", identifies the Python class/function name
- Call pr.findClass to get the Go object corresponding to this module and name: the `pickle.NewPickleDict[interface {}]()` function
- Push it onto `pr.stack`
pr.stack: {
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
},
0,
{128256, 4096},
{4096, 1},
false,
pickle.NewPickleDict[interface {}]()
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
}
}
..... Some steps were taken
- Current state:
pr.stack: {
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
},
0,
{128256, 4096},
{4096, 1},
false,
pickle.NewPickleDict[interface {}](),
{}
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
},
8: {128256, 4096},
9: {4096, 1},
10: pickle.NewPickleDict[interface {}]()
}
- Read key byte: 0x52, char: 'R', corresponding opcode: REDUCE, function: load_reduce
- Apply the callable to the argument tuple, both of which are on the stack
- Pop the last item from `pr.stack`; the result is an empty array. It identifies the rawArgsArr array that will be passed to the upcoming function
- Take the last item in `pr.stack` (this time we don't remove it with pop). The taken item is our function: `pickle.NewPickleDict[interface {}]()`
- Convert the `rawArgsArr` items to the expected data types of the reflected function object (an empty array)
- Call `pickle.NewPickleDict[interface {}]()` with the converted arguments, then replace the last item in `pr.stack` with the function's result: an empty `pickle.PickleDict[interface {}]{}`
pr.stack: {
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
},
0,
{128256, 4096},
{4096, 1},
false,
PickleDict{}
}
pr.metastack: {
{
PickleDict{}
},
{
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...)
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
},
8: {128256, 4096},
9: {4096, 1},
10: pickle.NewPickleDict[interface {}]()
}
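The REDUCE steps can be sketched with the simplified reader model below. Here a callable on the stack is modeled as a plain Go func rather than the reflection machinery the text describes:

```go
package main

import "fmt"

// pickleFunc models a callable stored on the pickle stack.
type pickleFunc func(args []interface{}) interface{}

// reader models only the stack that REDUCE manipulates.
type reader struct {
	stack []interface{}
}

// loadReduce pops the argument array, peeks at the callable beneath it, and
// replaces the callable with the result of calling it.
func (pr *reader) loadReduce() {
	args := pr.stack[len(pr.stack)-1].([]interface{})
	pr.stack = pr.stack[:len(pr.stack)-1]
	fn := pr.stack[len(pr.stack)-1].(pickleFunc)
	pr.stack[len(pr.stack)-1] = fn(args)
}

func main() {
	// Stand-in for pickle.NewPickleDict: ignores its (empty) args and
	// returns a fresh map.
	newDict := pickleFunc(func(args []interface{}) interface{} {
		return map[string]interface{}{}
	})
	pr := &reader{stack: []interface{}{newDict, []interface{}{}}}
	pr.loadReduce()
	fmt.Printf("%T\n", pr.stack[0]) // map[string]interface {}
}
```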
..... Some steps were taken
- Current state:
pr.stack: {
"tok_embeddings.weight",
torch.rebuild_tensor_v2(...),
{
torch.TorchStorage {
filename: "consolidated/data/0",
kind: torch.StorageKind{dataType: ml.DT_BF16},
storageOffset: 34304
},
0,
{128256, 4096},
{4096, 1},
false,
PickleDict{}
}
}
pr.metastack: {
{
PickleDict{}
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
},
8: {128256, 4096},
9: {4096, 1},
10: pickle.NewPickleDict[interface {}]()
}
Now, we have all of the required arguments to call the `torch.rebuild_tensor_v2(...)` function in our `pr.stack`.
- Read key byte: 0x52, char: 'R', corresponding opcode: REDUCE, function: load_reduce
- Apply the callable to the argument tuple, both of which are on the stack
- Pop the last item from `pr.stack`; the result is an array. It identifies the rawArgsArr array that will be passed to the upcoming function. The result array: `{torch.TorchStorage {...}, 0, {128256, 4096}, {4096, 1}, false, PickleDict{}}`
- Take the last item in `pr.stack` (this time we don't remove it with pop). The taken item is our function: `torch.rebuild_tensor_v2(...)`
- Convert the `rawArgsArr` items to the expected data types of the reflected function object
- Call `torch.rebuild_tensor_v2(...)` with the converted arguments, then replace the last item in `pr.stack` with the function's result: a `ml.Tensor` object with `Size=[128256,4096]`, `Stride=[4096,1]`, `DataType=ml.DT_BF16`, `RawData=(memory-mapped []byte)`
pr.stack: {
"tok_embeddings.weight"
ml.Tensor{Size=[128256,4096], ...}
}
pr.metastack: {
{
PickleDict{}
}
}
memo: {
0: PickleDict{},
1: "tok_embeddings.weight",
2: torch.rebuild_tensor_v2(...),
3: "storage",
4: StorageKind{ml.DT_BF16},
5: "0",
6: "cpu",
7: {
"storage",
StorageKind{ml.DT_BF16},
"0",
"cpu",
525336576
},
8: {128256, 4096},
9: {4096, 1},
10: pickle.NewPickleDict[interface {}](),
11: PickleDict{}
}
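A quick sanity check on the numbers in this walkthrough: the element count carried in the pid array (525336576) is exactly the product of the resulting tensor's dimensions:

```go
package main

import "fmt"

func main() {
	size := []int{128256, 4096} // the Size of tok_embeddings.weight
	count := 1
	for _, dim := range size {
		count *= dim
	}
	fmt.Println(count, count == 525336576) // 525336576 true
}
```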
..... Now,
- The first item of our `pr.stack` is `"tok_embeddings.weight"` (the name of the tensor)
- The second item of our `pr.stack` is `ml.Tensor{Size=[128256,4096], ...}` (the tensor itself)
- In the next steps, this flow will be repeated for the other tensors.
Now, we have all 291 tensors with their names in our `pr.stack`. A healthy Pickle file ends with a STOP instruction.
..... Some steps were taken
- Current state:
pr.stack: {
PickleDict {
a PickleDict (map) of 291 tensor names and corresponding ml.Tensor objects
}
}
pr.metastack: {}
memo: {an array of 2334 items}
- Read key byte: 0x2e, char: '.', corresponding opcode: STOP, function: load_stop
- Every pickle ends with STOP
- Pop the last item from `pr.stack`; the result is a `PickleDict` (map) of tensor names and corresponding ml.Tensor objects
- Create a StopSignal object with `Value` set to the resulting `PickleDict` (map) and return it

Finished.