Skip to content
John Allison edited this page Jun 25, 2014 · 4 revisions

Background

The way you add events to ESDB has a drastic impact on how the events are stored, as well as how fast and in what order you can retrieve them. The go function for adding an event is defined as:

writer.Add(blockId []byte, timestamp int, data []byte, grouping string, indexes string)

Events for each block and grouping combination are stored sequentially and ordered by timestamp. This makes it very fast to scan through events in this order.

Indexes allow you to retrieve a timestamp ordered list of a subset of the block's events which may exist anywhere in the block.

This inherently makes scanning indexes less performant than scanning the grouping (as we have to do a disk seek for each event we read), but still pretty decent as the benchmarks suggest.

Your grouping should be used for the most heavily used access pattern to take advantage of the performance gain.

Format

At the highest level, an .esdb file is made up of just a few sections.

header
block
block
...
block
blockindex
  1. a small header which identifies encoding and any options specified when the file was written (such as gzip/snappy compression, etc)
  2. a series of block sections, each containing all events for a particular value of blockId specified when adding events to the file
  3. an index for quickly finding the file offset of a particular block in the file

Note: Any newlines or line breaks you see in the document are purely for visual affect. Each area in the file is a contiguous set of bytes, which no separator in between.

Note on syntax: When describing byte sequences (for instance, down in event section below), I'll be using the following syntax:

Each meaningful group of bytes will be in encoded like:

interpretedType(fixedLength):description

fixedLength refers to the fixed number of bytes in the group. If the group of bytes doesn't have a fixed length, the (fixedLength) section will be absent. A few examples:

uint8(1):numIndexes

uvarint:eventLength

bytes(48):header

Let's dive into each second of the file and talk bytes.

Header

TODO :(

Block

A block has a similar structure to the overall file: a header, a series of grouping sections (each containing all events for a particular value of grouping specified when adding events to the file), and an index for quickly finding the file offset of a particular grouping or secondary index:

header
grouping
grouping
...
grouping
groupingindex
block header

A magic number as a simple indicator this is the start of a block: byte(1):42

grouping

A grouping contains a list of all events added to the blockId/grouping sorted ascending by their timestamp. For this section, there is no header, or index, just the sorted events.

event

An event is encoded in the format:

uvarint:eventLength``bytes(eventLength):event

The event portion has the format:

uvarint:valueLength``bytes(valueLength):value``uint8(1):indexCount

Followed by indexCount sections in the format:

uvarint:key``uint64(4):previousOffset``uint64(4):nextOffset

This specifies file offsets for the previous and next events in each particular index the event is apart, allowing us to scan through an index forward and backwards through the file.

groupingindex

The index is a sorted list of key/value pairs in an sstable format. The term (I think) was first described as part of Google's Bigtable, and similar formats are used in many open source projects such as LevelDb and Cassandra. It allows O(log n) look up for a given key by performing a binary search on the key space.

Details of the underlying sstable format we use.

In this index, we map the named groupings and indexes to file offsets of the first and last event in the grouping sections above. An example of key/value pairs we're storing:

If you grouped events by a user's email address, and created an index based on a user's company the following key/value pairs would exists in our index:

[email protected]: uvarint:key``uint64(4):firstEventOffset``uint64(4):lastEventOffset

[email protected]: uvarint:key``uint64(4):firstEventOffset``uint64(4):lastEventOffset

icustomer.io: uvarint:key``uint64(4):firstEventOffset``uint64(4):lastEventOffset

igoogle: uvarint:key``uint64(4):firstEventOffset``uint64(4):lastEventOffset

The first and last offsets allow us to jump to the first and last event in the grouping or index. For groupings, the previous and next events are located directly before and after the event itself. For indexes, you can find the previous and next events in the index by consulting the index key's offset metadata which apart of each event.

Blockindex

This index is also an sstable using the same format as the groupingindex inside of each block.

In this index, we map each blockId to it's starting file offset and total length in bytes.

1: uint64(4):fileOffset``uint64(4):byteLength

2: uint64(4):fileOffset``uint64(4):byteLength