MINOR: Clarify offsets etc are unsigned integers (#475)
* clarify offsets etc are unsigned integers

* Apply suggestions from code review

Co-authored-by: Gene Pang <[email protected]>

---------

Co-authored-by: Gene Pang <[email protected]>
emkornfield and gene-db authored Dec 6, 2024
1 parent c702813 commit 4a17d6b
Showing 1 changed file with 23 additions and 23 deletions.
46 changes: 23 additions & 23 deletions VariantEncoding.md
@@ -69,17 +69,17 @@ The entire metadata is encoded as the following diagram shows:
 metadata  |        header         |
           +-----------------------+
           |                       |
-          :    dictionary_size    :  <-- little-endian, `offset_size` bytes
+          :    dictionary_size    :  <-- unsigned little-endian, `offset_size` bytes
           |                       |
           +-----------------------+
           |                       |
-          :        offset         :  <-- little-endian, `offset_size` bytes
+          :        offset         :  <-- unsigned little-endian, `offset_size` bytes
           |                       |
           +-----------------------+
                       :
           +-----------------------+
           |                       |
-          :        offset         :  <-- little-endian, `offset_size` bytes
+          :        offset         :  <-- unsigned little-endian, `offset_size` bytes
           |                       |      (`dictionary_size + 1` offsets)
           +-----------------------+
           |                       |
@@ -88,9 +88,9 @@ metadata  |        header         |
           +-----------------------+
 ```

-The metadata is encoded first with the `header` byte, then `dictionary_size` which is a little-endian value of `offset_size` bytes, and represents the number of string values in the dictionary.
+The metadata is encoded first with the `header` byte, then `dictionary_size` which is an unsigned little-endian value of `offset_size` bytes, and represents the number of string values in the dictionary.
 Next, is an `offset` list, which contains `dictionary_size + 1` values.
-Each `offset` is a little-endian value of `offset_size` bytes, and represents the starting byte offset of the i-th string in `bytes`.
+Each `offset` is an unsigned little-endian value of `offset_size` bytes, and represents the starting byte offset of the i-th string in `bytes`.
 The first `offset` value will always be `0`, and the last `offset` value will always be the total length of `bytes`.
 The last part of the metadata is `bytes`, which stores all the string values in the dictionary.
 All string values must be UTF-8 encoded strings.
@@ -105,9 +105,9 @@ header: 1 byte (<version> | <sorted_strings> << 4 | (<offset_size_minus_one> <<
 version: a 4-bit version ID. Currently, must always contain the value 1
 sorted_strings: a 1-bit value indicating whether metadata strings are sorted
 offset_size_minus_one: 2-bit value providing the number of bytes per dictionary size and offset field.
-dictionary_size: `offset_size` bytes. little-endian value indicating the number of strings in the dictionary
+dictionary_size: `offset_size` bytes. unsigned little-endian value indicating the number of strings in the dictionary
 dictionary: <offset>* <bytes>
-offset: `offset_size` bytes. little-endian value indicating the starting position of the ith string in `bytes`. The list should contain `dictionary_size + 1` values, where the last value is the total length of `bytes`.
+offset: `offset_size` bytes. unsigned little-endian value indicating the starting position of the ith string in `bytes`. The list should contain `dictionary_size + 1` values, where the last value is the total length of `bytes`.
 bytes: UTF-8 encoded dictionary string values
 ```
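
As an illustration of why this clarification matters, the metadata layout in the hunks above could be decoded along the following lines. This is a minimal Python sketch, not code from the repository: `read_unsigned_le` and `decode_metadata` are hypothetical names, and because the `header` line is cut off in the hunk header above, the sketch assumes that `offset_size_minus_one` occupies the two highest bits of the header byte.

```python
def read_unsigned_le(buf: bytes, pos: int, size: int) -> int:
    """Read `size` bytes starting at `pos` as an unsigned little-endian integer."""
    return int.from_bytes(buf[pos:pos + size], byteorder="little", signed=False)


def decode_metadata(buf: bytes) -> list[str]:
    """Return the dictionary strings stored in a metadata buffer."""
    header = buf[0]
    version = header & 0x0F                  # low 4 bits
    offset_size = ((header >> 6) & 0x3) + 1  # ASSUMPTION: top 2 bits of the header
    if version != 1:
        raise ValueError(f"unsupported version: {version}")

    pos = 1
    dictionary_size = read_unsigned_le(buf, pos, offset_size)
    pos += offset_size

    # The offset list holds dictionary_size + 1 unsigned values.
    offsets = [
        read_unsigned_le(buf, pos + i * offset_size, offset_size)
        for i in range(dictionary_size + 1)
    ]
    pos += (dictionary_size + 1) * offset_size
    if offsets[0] != 0:
        raise ValueError("first offset must be 0")

    # The last offset is the total length of `bytes`.
    data = buf[pos:pos + offsets[-1]]
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(dictionary_size)
    ]


# One 200-byte string with 1-byte offsets: the end offset is stored as 0xC8.
example = bytes([0x01, 1, 0, 200]) + b"x" * 200
strings = decode_metadata(example)
```

With a 1-byte `offset_size`, the stored byte `0xC8` has to be read as 200. A signed reader would produce -56 and place the end of the string before its start.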

@@ -220,27 +220,27 @@ The encoding format is shown in the following diagram:
                    7                     0
                   +-----------------------+
 object value_data |                       |
-                  :     num_elements      :  <-- little-endian, 1 or 4 bytes
+                  :     num_elements      :  <-- unsigned little-endian, 1 or 4 bytes
                   |                       |
                   +-----------------------+
                   |                       |
-                  :       field_id        :  <-- little-endian, `field_id_size` bytes
+                  :       field_id        :  <-- unsigned little-endian, `field_id_size` bytes
                   |                       |
                   +-----------------------+
                               :
                   +-----------------------+
                   |                       |
-                  :       field_id        :  <-- little-endian, `field_id_size` bytes
+                  :       field_id        :  <-- unsigned little-endian, `field_id_size` bytes
                   |                       |      (`num_elements` field_ids)
                   +-----------------------+
                   |                       |
-                  :     field_offset      :  <-- little-endian, `field_offset_size` bytes
+                  :     field_offset      :  <-- unsigned little-endian, `field_offset_size` bytes
                   |                       |
                   +-----------------------+
                               :
                   +-----------------------+
                   |                       |
-                  :     field_offset      :  <-- little-endian, `field_offset_size` bytes
+                  :     field_offset      :  <-- unsigned little-endian, `field_offset_size` bytes
                   |                       |      (`num_elements + 1` field_offsets)
                   +-----------------------+
                   |                       |
@@ -254,13 +254,13 @@ object value_data |                       |
                   |                       |
                   +-----------------------+
 ```
-An object `value_data` begins with `num_elements`, a 1-byte or 4-byte little-endian value, representing the number of elements in the object.
+An object `value_data` begins with `num_elements`, a 1-byte or 4-byte unsigned little-endian value, representing the number of elements in the object.
 The size in bytes of `num_elements` is indicated by `is_large` in the `value_header`.
 Next, is a list of `field_id` values.
-There are `num_elements` number of entries and each `field_id` is a little-endian value of `field_id_size` bytes.
+There are `num_elements` number of entries and each `field_id` is an unsigned little-endian value of `field_id_size` bytes.
 A `field_id` is an index into the dictionary in the metadata.
 The `field_id` list is followed by a `field_offset` list.
-There are `num_elements + 1` number of entries and each `field_offset` is a little-endian value of `field_offset_size` bytes.
+There are `num_elements + 1` number of entries and each `field_offset` is an unsigned little-endian value of `field_offset_size` bytes.
 A `field_offset` represents the byte offset (relative to the first byte of the first `value`) where the i-th `value` starts.
 The last `field_offset` points to the byte after the end of the last `value`.
 The `field_offset` list is followed by the `value` list.
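
The object layout described above can be walked in the same way. The following Python sketch is an illustration only: `decode_object_layout` is a hypothetical helper, and the three size parameters are passed in directly because the `value_header` bit layout is not part of this diff. It assumes that a set `is_large` flag selects the 4-byte form of `num_elements`. It returns the raw value region without slicing individual values, since it relies only on the last `field_offset` marking the end of that region.

```python
def read_unsigned_le(buf: bytes, pos: int, size: int) -> int:
    """Read `size` bytes starting at `pos` as an unsigned little-endian integer."""
    return int.from_bytes(buf[pos:pos + size], byteorder="little", signed=False)


def decode_object_layout(value_data: bytes, is_large: bool,
                         field_id_size: int, field_offset_size: int):
    """Split an object `value_data` into field ids, field offsets and value bytes."""
    # ASSUMPTION: is_large selects the 4-byte num_elements, otherwise 1 byte.
    num_size = 4 if is_large else 1
    num_elements = read_unsigned_le(value_data, 0, num_size)
    pos = num_size

    # num_elements field ids, each an index into the metadata dictionary.
    field_ids = [
        read_unsigned_le(value_data, pos + i * field_id_size, field_id_size)
        for i in range(num_elements)
    ]
    pos += num_elements * field_id_size

    # num_elements + 1 offsets, relative to the first byte of the first value.
    field_offsets = [
        read_unsigned_le(value_data, pos + i * field_offset_size, field_offset_size)
        for i in range(num_elements + 1)
    ]
    pos += (num_elements + 1) * field_offset_size

    # The last offset points to the byte after the end of the last value.
    values = value_data[pos:pos + field_offsets[-1]]
    return field_ids, field_offsets, values


# Two fields with 1-byte ids and offsets; the value bytes are placeholders.
small = bytes([2, 0, 1, 0, 2, 5]) + b"AABBB"
small_layout = decode_object_layout(small, False, 1, 1)
```

Turning each offset into a decoded value would additionally require the value encoding, which this diff does not touch.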
@@ -289,17 +289,17 @@ When `basic_type` is `3`, `value_data` encodes an array. The encoding format is
                    7                     0
                   +-----------------------+
 array value_data  |                       |
-                  :     num_elements      :  <-- little-endian, 1 or 4 bytes
+                  :     num_elements      :  <-- unsigned little-endian, 1 or 4 bytes
                   |                       |
                   +-----------------------+
                   |                       |
-                  :     field_offset      :  <-- little-endian, `field_offset_size` bytes
+                  :     field_offset      :  <-- unsigned little-endian, `field_offset_size` bytes
                   |                       |
                   +-----------------------+
                               :
                   +-----------------------+
                   |                       |
-                  :     field_offset      :  <-- little-endian, `field_offset_size` bytes
+                  :     field_offset      :  <-- unsigned little-endian, `field_offset_size` bytes
                   |                       |      (`num_elements + 1` field_offsets)
                   +-----------------------+
                   |                       |
@@ -313,10 +313,10 @@ array value_data  |                       |
                   |                       |
                   +-----------------------+
 ```
-An array `value_data` begins with `num_elements`, a 1-byte or 4-byte little-endian value, representing the number of elements in the array.
+An array `value_data` begins with `num_elements`, a 1-byte or 4-byte unsigned little-endian value, representing the number of elements in the array.
 The size in bytes of `num_elements` is indicated by `is_large` in the `value_header`.
 Next, is a `field_offset` list.
-There are `num_elements + 1` number of entries and each `field_offset` is a little-endian value of `field_offset_size` bytes.
+There are `num_elements + 1` number of entries and each `field_offset` is an unsigned little-endian value of `field_offset_size` bytes.
 A `field_offset` represents the byte offset (relative to the first byte of the first `value`) where the i-th `value` starts.
 The last `field_offset` points to the byte after the last byte of the last `value`.
 The `field_offset` list is followed by the `value` list.
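
For arrays, the same reading pattern yields the raw bytes of each element. The Python sketch below is a hedged illustration: `split_array_values` is a hypothetical name, the sizes are passed in rather than parsed from the `value_header`, and it assumes that elements are stored back to back in index order, so that offset `i + 1` marks the end of element `i`. That ordering is not stated in this diff.

```python
def read_unsigned_le(buf: bytes, pos: int, size: int) -> int:
    """Read `size` bytes starting at `pos` as an unsigned little-endian integer."""
    return int.from_bytes(buf[pos:pos + size], byteorder="little", signed=False)


def split_array_values(value_data: bytes, is_large: bool,
                       field_offset_size: int) -> list[bytes]:
    """Return the raw bytes of every element in an array `value_data`."""
    # ASSUMPTION: is_large selects the 4-byte num_elements, otherwise 1 byte.
    num_size = 4 if is_large else 1
    num_elements = read_unsigned_le(value_data, 0, num_size)
    pos = num_size

    # num_elements + 1 offsets, relative to the first byte of the first value.
    offsets = [
        read_unsigned_le(value_data, pos + i * field_offset_size, field_offset_size)
        for i in range(num_elements + 1)
    ]
    values_start = pos + (num_elements + 1) * field_offset_size

    # ASSUMPTION: elements are contiguous and stored in index order.
    return [
        value_data[values_start + offsets[i]:values_start + offsets[i + 1]]
        for i in range(num_elements)
    ]


# Three placeholder elements, a 4-byte num_elements and 2-byte offsets.
array_data = (
    (3).to_bytes(4, "little")
    + b"".join(o.to_bytes(2, "little") for o in (0, 1, 3, 6))
    + b"abbccc"
)
elements = split_array_values(array_data, True, 2)
```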
@@ -341,9 +341,9 @@ primitive_val: see table for binary representation
 short_string_val: UTF-8 encoded bytes
 object_val: <num_elements> <field_id>* <field_offset>* <fields>
 array_val: <num_elements> <field_offset>* <fields>
-num_elements: a 1 or 4 byte little-endian value (depending on is_large in <object_header>/<array_header>)
-field_id: a 1, 2, 3 or 4 byte little-endian value (depending on field_id_size_minus_one in <object_header>), indexing into the dictionary
-field_offset: a 1, 2, 3 or 4 byte little-endian value (depending on field_offset_size_minus_one in <object_header>/<array_header>), providing the offset in bytes within fields
+num_elements: a 1 or 4 byte unsigned little-endian value (depending on is_large in <object_header>/<array_header>)
+field_id: a 1, 2, 3 or 4 byte unsigned little-endian value (depending on field_id_size_minus_one in <object_header>), indexing into the dictionary
+field_offset: a 1, 2, 3 or 4 byte unsigned little-endian value (depending on field_offset_size_minus_one in <object_header>/<array_header>), providing the offset in bytes within fields
 fields: <value>*
 ```
