
Commit 54eda63

doc: document TableConfig
1 parent 844168f commit 54eda63

File tree

10 files changed: +784 -268 lines changed

README.md

Lines changed: 196 additions & 41 deletions
@@ -104,37 +104,39 @@ The documentation provides two measures of complexity:

The complexities are described in terms of the following variables and
constants:

- The variable $`n`$ refers to the number of *physical* table entries. A
  *physical* table entry is any key–operation pair, e.g., `Insert k v`
  or `Delete k`, whereas a *logical* table entry is determined by all
  physical entries with the same key. If the variable $`n`$ is used to
  describe the complexity of an operation that involves multiple tables,
  it refers to the sum of all table entries.

- The variable $`o`$ refers to the number of open tables and cursors in
  the session.

- The variable $`s`$ refers to the number of snapshots in the session.

- The variable $`b`$ usually refers to the size of a batch of
  inputs/outputs. Its precise meaning is explained for each occurrence.

- The constant $`B`$ refers to the size of the write buffer, which is
  determined by the `TableConfig` parameter `confWriteBufferAlloc`.

- The constant $`T`$ refers to the size ratio of the table, which is
  determined by the `TableConfig` parameter `confSizeRatio`.

- The constant $`P`$ refers to the average number of key–value pairs
  that fit in a page of memory.

#### Disk I/O cost of operations <span id="performance_time" class="anchor"></span>

The following table summarises the worst-case cost of the operations on
LSM-trees measured in the number of disk I/O operations. If the cost
depends on the merge policy or merge schedule, then the table contains
one entry for each relevant combination. Otherwise, the merge policy
and/or merge schedule is listed as N/A. The merge policy and merge
schedule are determined by the `TableConfig` parameters
`confMergePolicy` and `confMergeSchedule`.

<table>
<thead>
@@ -143,7 +145,7 @@ schedule is listed as N/A.

<th>Operation</th>
<th>Merge policy</th>
<th>Merge schedule</th>
<th>Worst-case disk I/O complexity</th>
</tr>
</thead>
<tbody>
@@ -273,39 +275,37 @@ schedule is listed as N/A.

</tbody>
</table>

(\*The variable $`b`$ refers to the number of entries retrieved by the
range lookup.)

#### In-memory size of tables <span id="performance_size" class="anchor"></span>

The in-memory size of an LSM-tree is described in terms of the variable
$`n`$, which refers to the number of *physical* database entries. A
*physical* database entry is any key–operation pair, e.g., `Insert k v`
or `Delete k`, whereas a *logical* database entry is determined by all
physical entries with the same key.

The worst-case in-memory size of an LSM-tree is $`O(n)`$.

- The worst-case in-memory size of the write buffer is $`O(B)`$.

  The maximum size of the write buffer depends on the write buffer
  allocation strategy, which is determined by the `TableConfig`
  parameter `confWriteBufferAlloc`. Regardless of write buffer
  allocation strategy, the size of the write buffer may never exceed
  4GiB.

  `AllocNumEntries maxEntries`
  The maximum size of the write buffer is the maximum number of entries
  multiplied by the average size of a key–operation pair.

- The worst-case in-memory size of the Bloom filters is $`O(n)`$.

  The total in-memory size of all Bloom filters is the number of bits
  per physical entry multiplied by the number of physical entries. The
  required number of bits per physical entry is determined by the Bloom
  filter allocation strategy, which is determined by the `TableConfig`
  parameter `confBloomFilterAlloc`.

  `AllocFixed bitsPerPhysicalEntry`
  The number of bits per physical entry is specified as
@@ -318,20 +318,20 @@ The worst-case in-memory size of an LSM-tree is *O*(*n*).

  The false-positive rate scales exponentially with the number of bits
  per entry:

  | False-positive rate       | Bits per entry    |
  |---------------------------|-------------------|
  | $`1\text{ in }10`$        | $`\approx 4.77`$  |
  | $`1\text{ in }100`$       | $`\approx 9.85`$  |
  | $`1\text{ in }1{,}000`$   | $`\approx 15.79`$ |
  | $`1\text{ in }10{,}000`$  | $`\approx 22.58`$ |
  | $`1\text{ in }100{,}000`$ | $`\approx 30.22`$ |

- The worst-case in-memory size of the indexes is $`O(n)`$.

  The total in-memory size of all indexes depends on the index type,
  which is determined by the `TableConfig` parameter
  `confFencePointerIndex`. The in-memory size of the various indexes is
  described in reference to the size of the database in [*memory
  pages*](https://en.wikipedia.org/wiki/Page_%28computer_memory%29 "https://en.wikipedia.org/wiki/Page_%28computer_memory%29").

  `OrdinaryIndex`
@@ -346,11 +346,166 @@ The worst-case in-memory size of an LSM-tree is *O*(*n*).

  a negligible amount of memory for tie breakers. The total in-memory
  size of all indexes is approximately 66 bits per memory page.

The total size of an LSM-tree must not exceed $`2^{41}`$ physical
entries. Violation of this condition *is* checked and will throw a
`TableTooLargeError`.

#### Fine-tuning Table Configuration <span id="fine_tuning" class="anchor"></span>

##### Table Layout: Merge Policy, Merge Schedule, Size Ratio, and Write Buffer Size

The table configuration parameters `confMergePolicy`,
`confMergeSchedule`, `confSizeRatio`, and `confWriteBufferAlloc` affect
how the table organises its data. To understand what effect these
parameters have, one must have a basic understanding of how an LSM-tree
stores its data. An LSM-tree stores key–operation pairs, which pair a
key with an operation such as an `Insert` with a value or a `Delete`.
These key–operation pairs are organised into *runs*, which are sequences
of key–operation pairs sorted by their key. Runs are organised into
*levels*, which are unordered sequences of runs. Levels are organised
hierarchically. Level 0 is kept in memory, and is referred to as the
*write buffer*. All subsequent levels are stored on disk, with each run
stored in its own file. The following shows an example LSM-tree layout,
with each run as a boxed sequence of keys and each level as a row.

``` math
\begin{array}{l:l}
\text{Level} & \text{Data}
\\
0 & \fbox{\(\texttt{4}\,\_\)}
\\
1 & \fbox{\(\texttt{1}\,\texttt{3}\)} \quad \fbox{\(\texttt{2}\,\texttt{7}\)}
\\
2 & \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
\end{array}
```
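
To make the layout concrete, here is a toy Haskell model of the
structure just described; it mirrors the prose above and is *not* the
package's internal representation:

``` haskell
-- Toy model of the layout described above (not the package internals).
data Op v = Insert v | Delete

-- A run: key–operation pairs sorted by key and deduplicated.
newtype Run k v = Run [(k, Op v)]

-- A level: an unordered collection of runs.
type Level k v = [Run k v]

-- A tree: levels in order; the head is level 0, the write buffer.
type Tree k v = [Level k v]
```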

The data in an LSM-tree is *partially sorted*: only the key–operation
pairs within each run are sorted and deduplicated. As a rule of thumb,
keeping more of the data sorted means lookup operations are faster but
update operations are slower.

The configuration parameters `confMergePolicy`, `confSizeRatio`, and
`confWriteBufferAlloc` directly affect the table layout. Let $`B`$ refer
to the value of `confWriteBufferAlloc`. Let $`T`$ refer to the value of
`confSizeRatio`. The write buffer can contain at most $`B`$ entries.
The size ratio $`T`$ determines the ratio between the maximum number of
entries in each level. For instance, if $`B = 2`$ and $`T = 2`$, then

``` math
\begin{array}{l:l}
\text{Level} & \text{Maximum Size}
\\
0 & B \cdot T^0 = 2
\\
1 & B \cdot T^1 = 4
\\
2 & B \cdot T^2 = 8
\\
\ell & B \cdot T^\ell
\end{array}
```
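
The maximum-size column generalises to a one-line function; a minimal
sketch (the name `levelCapacity` is illustrative, not part of the
package API):

``` haskell
-- Maximum number of entries in level l, given write buffer capacity b
-- and size ratio t: b * t^l.
levelCapacity :: Integer -> Integer -> Integer -> Integer
levelCapacity b t l = b * t ^ l

-- >>> map (levelCapacity 2 2) [0, 1, 2]
-- [2,4,8]
```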

The merge policy `confMergePolicy` determines the number of runs per
level. In a *tiering* LSM-tree, each level contains up to $`T`$ runs. In
a *levelling* LSM-tree, each level contains a single run. The *lazy
levelling* policy uses levelling only for the last level and uses
tiering for all preceding levels. The previous example used lazy
levelling. The following examples illustrate the different merge
policies using the same data, assuming $`B = 2`$ and $`T = 2`$.

``` math
\begin{array}{l:l:l:l}
\text{Level} & \text{Tiering} & \text{Levelling} & \text{Lazy Levelling}
\\
0
& \fbox{\(\texttt{4}\,\_\)}
& \fbox{\(\texttt{4}\,\_\)}
& \fbox{\(\texttt{4}\,\_\)}
\\
1
& \fbox{\(\texttt{1}\,\texttt{3}\)} \quad \fbox{\(\texttt{2}\,\texttt{7}\)}
& \fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
& \fbox{\(\texttt{1}\,\texttt{3}\)} \quad \fbox{\(\texttt{2}\,\texttt{7}\)}
\\
2
& \fbox{\(\texttt{4}\,\texttt{5}\,\texttt{7}\,\texttt{8}\)} \quad \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
& \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
& \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
\end{array}
```

Tiering favours the performance of updates. Levelling favours the
performance of lookups. Lazy levelling strikes a middle ground between
tiering and levelling. It favours the performance of lookup operations
for the oldest data and enables more deduplication, without the impact
that full levelling has on update operations.
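
In code, these layout parameters are fields of `TableConfig`. Below is
a minimal sketch of overriding the write buffer allocation, starting
from `defaultTableConfig`; the module name `Database.LSMTree` and the
concrete figure of 20000 entries are assumptions for illustration:

``` haskell
import qualified Database.LSMTree as LSM

-- An update-heavy workload might trade memory for fewer flushes by
-- enlarging the write buffer, using the AllocNumEntries allocation
-- strategy described earlier.
largeBufferConfig :: LSM.TableConfig
largeBufferConfig = LSM.defaultTableConfig
  { LSM.confWriteBufferAlloc = LSM.AllocNumEntries 20000 }
```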

The configuration parameter `confMergeSchedule` affects the worst-case
performance of lookup and update operations and the structure of runs.
Regardless of the merge schedule, the amortised disk I/O complexity of
lookups and updates is *logarithmic* in the size of the table. When the
write buffer fills up, its contents are flushed to disk as a run and
added to level 1. When some level fills up, its contents are flushed
down to the next level. Eventually, as data is flushed down, runs must
be merged. This package supports two schedules for merging:

- Using the `OneShot` merge schedule, runs must always be kept fully
  sorted and deduplicated. However, flushing a run down to the next
  level may cause the next level to fill up, in which case it too must
  be flushed and merged further down. In the worst case, this can
  cascade down the entire table. Consequently, the worst-case disk I/O
  complexity of updates is *linear* in the size of the table, which
  makes this schedule unsuitable for real-time systems and other use
  cases where unresponsiveness is unacceptable.

- Using the `Incremental` merge schedule, runs can be *partially
  merged*, with the merging work spread out evenly across all update
  operations. This aligns the worst-case and average-case disk I/O
  complexity of updates—both are *logarithmic* in the size of the
  table. The cost is a small constant overhead for both lookup and
  update operations (see the configuration sketch after this list).
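
A latency-sensitive application would therefore pick the `Incremental`
schedule; a minimal sketch, reusing the assumed qualified import from
the earlier configuration example:

``` haskell
-- Incremental merging bounds the work done by any single update,
-- avoiding the cascading merges possible under OneShot.
realTimeConfig :: LSM.TableConfig
realTimeConfig = LSM.defaultTableConfig
  { LSM.confMergeSchedule = LSM.Incremental }
```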

The merge schedule does not affect the performance of table unions;
instead, there are separate operations for one-shot and incremental
unions. The amortised disk I/O complexity of one-shot unions is
*linear* in the size of the tables. For incremental unions, it is up to
the user to spread the merging work out evenly over time.
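
One way to picture the user's side of an incremental union is as a debt
of outstanding merge work that is paid off in small instalments. The
following toy model illustrates the idea only; the type and function
names are invented for this sketch and are not the package API:

``` haskell
-- Toy model: an incremental union starts with a debt of outstanding
-- merge work, and the application pays it off a little at a time
-- between its other operations, so no single call does all the work.
newtype UnionDebt = UnionDebt Int

supplyCredits :: Int -> UnionDebt -> UnionDebt
supplyCredits credits (UnionDebt debt) =
  UnionDebt (max 0 (debt - credits))

unionComplete :: UnionDebt -> Bool
unionComplete (UnionDebt debt) = debt == 0
```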

### References

The implementation of LSM-trees in this package draws inspiration from:

bench/macro/lsm-tree-bench-wp8.hs

Lines changed: 1 addition & 1 deletion

@@ -227,7 +227,7 @@ cmdP = O.subparser $ mconcat

``` diff
 setupOptsP :: O.Parser SetupOpts
 setupOptsP = pure SetupOpts
-    <*> O.option O.auto (O.long "bloom-filter-alloc" <> O.value LSM.defaultBloomFilterAlloc <> O.showDefault <> O.help "Bloom filter allocation method [AllocFixed n | AllocRequestFPR d]")
+    <*> O.option O.auto (O.long "bloom-filter-alloc" <> O.value (LSM.confBloomFilterAlloc LSM.defaultTableConfig) <> O.showDefault <> O.help "Bloom filter allocation method [AllocFixed n | AllocRequestFPR d]")

 runOptsP :: O.Parser RunOpts
 runOptsP = pure RunOpts
```
