-
Notifications
You must be signed in to change notification settings - Fork 6
Binary file format
The way you add events to ESDB has a drastic impact on how the events are stored, as well as how fast and in what order you can retrieve them. The go
function for adding an event is defined as:
writer.Add(blockId []byte, timestamp int, data []byte, grouping string, indexes string)
Events for each block
and grouping
combination are stored sequentially and ordered by timestamp. This makes it very fast to scan through events in this order.
Indexes allow you to retrieve a timestamp ordered list of a subset of the block
's events which may exist anywhere in the block
.
This inherently makes scanning indexes less performant than scanning the grouping (as we have to do a disk seek for each event we read), but still pretty decent as the benchmarks suggest.
Your grouping
should be used for the most heavily used access pattern to take advantage of the performance gain.
At the highest level, an .esdb
file is made up of just a few sections.
header
block
block
...
block
blockindex
- a small header which identifies encoding and any options specified when the file was written (such as gzip/snappy compression, etc)
- a series of
block
sections, each containing all events for a particular value ofblockId
specified when adding events to the file - an index for quickly finding the file offset of a particular
block
in the file
Note: Any newlines or line breaks you see in the document are purely for visual affect. Each area in the file is a contiguous set of bytes, which no separator in between.
Note on syntax: When describing byte sequences (for instance, down in event section below), I'll be using the following syntax:
Each meaningful group of bytes will be in encoded like:
interpretedType(fixedLength):description
fixedLength
refers to the fixed number of bytes in the group. If the group of bytes doesn't have a fixed length, the (fixedLength)
section will be absent. A few examples:
uint8(1):numIndexes
uvarint:eventLength
bytes(48):header
Let's dive into each second of the file and talk bytes.
TODO :(
A block has a similar structure to the overall file: a header, a series of grouping
sections (each containing all events for a particular value of grouping
specified when adding events to the file), and an index for quickly finding the file offset of a particular grouping or secondary index:
header
grouping
grouping
...
grouping
groupingindex
A magic number as a simple indicator this is the start of a block: byte(1):42
A grouping contains a list of all events added to the blockId
/grouping
sorted ascending by their timestamp. For this section, there is no header, or index, just the sorted events.
An event is encoded in the format:
uvarint:eventLength``bytes(eventLength):event
The event
portion has the format:
uvarint:valueLength``bytes(valueLength):value``uint8(1):indexCount
Followed by indexCount
sections in the format:
uvarint:key``uint64(4):previousOffset``uint64(4):nextOffset
This specifies file offsets for the previous and next events in each particular index the event is apart, allowing us to scan through an index forward and backwards through the file.
The index is a sorted list of key/value pairs in an sstable
format. The term (I think) was first described as part of Google's Bigtable, and similar formats are used in many open source projects such as LevelDb and Cassandra. It allows O(log n)
look up for a given key by performing a binary search on the key space.
Details of the underlying sstable format we use.
In this index, we map the named groupings and indexes to file offsets of the first and last event in the grouping sections above. An example of key/value pairs we're storing:
If you grouped events by a user's email address, and created an index based on a user's company the following key/value pairs would exists in our index:
[email protected]
: uvarint:key``uint64(4):firstEventOffset``uint64(4):lastEventOffset
[email protected]
: uvarint:key``uint64(4):firstEventOffset``uint64(4):lastEventOffset
icustomer.io
: uvarint:key``uint64(4):firstEventOffset``uint64(4):lastEventOffset
igoogle
: uvarint:key``uint64(4):firstEventOffset``uint64(4):lastEventOffset
The first and last offsets allow us to jump to the first and last event in the grouping or index. For groupings, the previous and next events are located directly before and after the event itself. For indexes, you can find the previous and next events in the index by consulting the index key's offset metadata which apart of each event.
This index is also an sstable using the same format as the groupingindex
inside of each block
.
In this index, we map each blockId
to it's starting file offset and total length in bytes.
1
: uint64(4):fileOffset``uint64(4):byteLength
2
: uint64(4):fileOffset``uint64(4):byteLength