The complexities are described in terms of the following variables and
constants:

- The variable $`n`$ refers to the number of *physical* table entries. A
  *physical* table entry is any key–operation pair, e.g., `Insert k v`
  or `Delete k`, whereas a *logical* table entry is determined by all
  physical entries with the same key. If the variable $`n`$ is used to
  describe the complexity of an operation that involves multiple tables,
  it refers to the sum of all table entries.
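
  For example, the physical entries `Insert k v` followed by `Delete k`
  are two physical table entries but only one logical table entry,
  namely one recording that the key `k` was most recently deleted.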

- The variable $`o`$ refers to the number of open tables and cursors in
  the session.

- The variable $`s`$ refers to the number of snapshots in the session.

- The variable $`b`$ usually refers to the size of a batch of
  inputs/outputs. Its precise meaning is explained for each occurrence.

- The constant $`B`$ refers to the size of the write buffer, which is
  determined by the `TableConfig` parameter `confWriteBufferAlloc`.

- The constant $`T`$ refers to the size ratio of the table, which is
  determined by the `TableConfig` parameter `confSizeRatio`.

- The constant $`P`$ refers to the average number of key–value pairs
  that fit in a page of memory.

#### Disk I/O cost of operations <span id="performance_time" class="anchor"></span>

The following table summarises the worst-case cost of the operations on
LSM-trees measured in the number of disk I/O operations. If the cost
depends on the merge policy or merge schedule, then the table contains
one entry for each relevant combination. Otherwise, the merge policy
and/or merge schedule is listed as N/A. The merge policy and merge
schedule are determined by the `TableConfig` parameters
`confMergePolicy` and `confMergeSchedule`.

<table>
<thead>
<tr>
<th>Operation</th>
<th>Merge policy</th>
<th>Merge schedule</th>
<th>Worst-case disk I/O complexity</th>
</tr>
</thead>
<tbody>
</tbody>
</table>

(\* The variable $`b`$ refers to the number of entries retrieved by the
range lookup.)

#### In-memory size of tables <span id="performance_size" class="anchor"></span>

The in-memory size of an LSM-tree is described in terms of the variable
$`n`$, which refers to the number of *physical* database entries. A
*physical* database entry is any key–operation pair, e.g., `Insert k v`
or `Delete k`, whereas a *logical* database entry is determined by all
physical entries with the same key.

The worst-case in-memory size of an LSM-tree is $`O(n)`$.

- The worst-case in-memory size of the write buffer is $`O(B)`$.

  The maximum size of the write buffer depends on the write buffer
  allocation strategy, which is determined by the `TableConfig`
  parameter `confWriteBufferAlloc`. Regardless of write buffer
  allocation strategy, the size of the write buffer may never exceed
  4 GiB.

  `AllocNumEntries maxEntries`
  The maximum size of the write buffer is the maximum number of entries
  multiplied by the average size of a key–operation pair.
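
  For illustration, with hypothetical numbers: given
  `AllocNumEntries 100000` and an average key–operation pair size of
  128 bytes, the write buffer may grow to at most

  ```math
  100{,}000 \times 128\,\text{bytes} = 12{,}800{,}000\,\text{bytes} \approx 12.2\,\text{MiB}.
  ```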

- The worst-case in-memory size of the Bloom filters is $`O(n)`$.

  The total in-memory size of all Bloom filters is the number of bits
  per physical entry multiplied by the number of physical entries. The
  required number of bits per physical entry is determined by the Bloom
  filter allocation strategy, which is determined by the `TableConfig`
  parameter `confBloomFilterAlloc`.

  `AllocFixed bitsPerPhysicalEntry`
  The number of bits per physical entry is specified as
  `bitsPerPhysicalEntry`.

  The false-positive rate scales exponentially with the number of bits
  per entry:

  | False-positive rate       | Bits per entry    |
  | ------------------------- | ----------------- |
  | $`1\text{ in }10`$        | $`\approx 4.77`$  |
  | $`1\text{ in }100`$       | $`\approx 9.85`$  |
  | $`1\text{ in }1{,}000`$   | $`\approx 15.79`$ |
  | $`1\text{ in }10{,}000`$  | $`\approx 22.58`$ |
  | $`1\text{ in }100{,}000`$ | $`\approx 30.22`$ |
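
  For illustration, with hypothetical numbers: at roughly 10 bits per
  entry, which gives a false-positive rate of about 1 in 100, a table
  with $`10^{8}`$ physical entries requires approximately

  ```math
  10^{8} \times 10\,\text{bits} = 10^{9}\,\text{bits} \approx 125\,\text{MB}
  ```

  of memory for its Bloom filters.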

- The worst-case in-memory size of the indexes is $`O(n)`$.

  The total in-memory size of all indexes depends on the index type,
  which is determined by the `TableConfig` parameter
  `confFencePointerIndex`. The in-memory size of the various indexes is
  described in reference to the size of the database in [*memory
  pages*](https://en.wikipedia.org/wiki/Page_%28computer_memory%29).

  `OrdinaryIndex`
  a negligible amount of memory for tie breakers. The total in-memory
  size of all indexes is approximately 66 bits per memory page.
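
  For illustration, with hypothetical numbers: at 66 bits per memory
  page, a 64 GiB database stored in 4 KiB pages occupies $`2^{24}`$
  pages, so the indexes require approximately

  ```math
  2^{24} \times 66\,\text{bits} \approx 1.1 \times 10^{9}\,\text{bits} \approx 132\,\text{MiB}
  ```

  of memory.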

The total size of an LSM-tree must not exceed $`2^{41}`$ physical
entries. Violation of this condition *is* checked and will throw a
`TableTooLargeError`.

#### Fine-tuning Table Configuration <span id="fine_tuning" class="anchor"></span>

##### Table Layout: Merge Policy, Merge Schedule, Size Ratio, and Write Buffer Size

The table configuration parameters `confMergePolicy`,
`confMergeSchedule`, `confSizeRatio`, and `confWriteBufferAlloc` affect
how the table organises its data. To understand what effect these
parameters have, one must have a basic understanding of how an LSM-tree
stores its data. An LSM-tree stores key–operation pairs, which pair a
key with an operation such as an `Insert` with a value or a `Delete`.
These key–operation pairs are organised into *runs*, which are sequences
of key–operation pairs sorted by their key. Runs are organised into
*levels*, which are unordered sequences of runs. Levels are organised
hierarchically. Level 0 is kept in memory, and is referred to as the
*write buffer*. All subsequent levels are stored on disk, with each run
stored in its own file. The following shows an example LSM-tree layout,
with each run as a boxed sequence of keys and each level as a row.

```math
\begin{array}{l:l}
\text{Level}
&
\text{Data}
\\
0
&
\fbox{\(\texttt{4}\,\_\)}
\\
1
&
\fbox{\(\texttt{1}\,\texttt{3}\)}
\quad
\fbox{\(\texttt{2}\,\texttt{7}\)}
\\
2
&
\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
\end{array}
```

The data in an LSM-tree is *partially sorted*: only the key–operation
pairs within each run are sorted and deduplicated. As a rule of thumb,
keeping more of the data sorted means lookup operations are faster but
update operations are slower.

The configuration parameters `confMergePolicy`, `confSizeRatio`, and
`confWriteBufferAlloc` directly affect the table layout. Let $`B`$ refer
to the value of `confWriteBufferAlloc`. Let $`T`$ refer to the value of
`confSizeRatio`. The write buffer can contain at most $`B`$ entries.
The size ratio $`T`$ determines the ratio between the maximum number of
entries in each level. For instance, if $`B = 2`$ and $`T = 2`$, then

```math
\begin{array}{l:l}
\text{Level} & \text{Maximum Size}
\\
0 & B \cdot T^0 = 2
\\
1 & B \cdot T^1 = 4
\\
2 & B \cdot T^2 = 8
\\
\ell & B \cdot T^\ell
\end{array}
```
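
As a minimal sketch of this size invariant (the function name is
illustrative, not part of the package API):

```haskell
-- Maximum number of physical entries in level l, given the write
-- buffer capacity b (confWriteBufferAlloc) and the size ratio t
-- (confSizeRatio), following the formula B * T^l shown above.
maxLevelSize :: Integer -> Integer -> Integer -> Integer
maxLevelSize b t l = b * t ^ l

-- For example, with B = 2 and T = 2:
-- map (maxLevelSize 2 2) [0..3] == [2,4,8,16]
```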

The merge policy `confMergePolicy` determines the number of runs per
level. In a *tiering* LSM-tree, each level contains $`T`$ runs. In a
*levelling* LSM-tree, each level contains one single run. The *lazy
levelling* policy uses levelling only for the last level and uses
tiering for all preceding levels. The previous example used lazy
levelling. The following examples illustrate the different merge
policies using the same data, assuming $`B = 2`$ and $`T = 2`$.

```math
\begin{array}{l:l:l:l}
\text{Level}
&
\text{Tiering}
&
\text{Levelling}
&
\text{Lazy Levelling}
\\
0
&
\fbox{\(\texttt{4}\,\_\)}
&
\fbox{\(\texttt{4}\,\_\)}
&
\fbox{\(\texttt{4}\,\_\)}
\\
1
&
\fbox{\(\texttt{1}\,\texttt{3}\)}
\quad
\fbox{\(\texttt{2}\,\texttt{7}\)}
&
\fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
&
\fbox{\(\texttt{1}\,\texttt{3}\)}
\quad
\fbox{\(\texttt{2}\,\texttt{7}\)}
\\
2
&
\fbox{\(\texttt{4}\,\texttt{5}\,\texttt{7}\,\texttt{8}\)}
\quad
\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
&
\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
&
\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
\end{array}
```

Tiering favours the performance of updates. Levelling favours the
performance of lookups. Lazy levelling strikes a middle ground between
tiering and levelling. It favours the performance of lookup operations
for the oldest data and enables more deduplication, without the impact
that full levelling has on update operations.

The configuration parameter `confMergeSchedule` affects the worst-case
performance of lookup and update operations and the structure of runs.
Regardless of the merge schedule, the amortised disk I/O complexity of
lookups and updates is *logarithmic* in the size of the table. When the
write buffer fills up, its contents are flushed to disk as a run and
added to level 1. When some level fills up, its contents are flushed
down to the next level. Eventually, as data is flushed down, runs must
be merged. This package supports two schedules for merging, illustrated
by the configuration sketch after this list:

- Using the `OneShot` merge schedule, runs must always be kept fully
  sorted and deduplicated. However, flushing a run down to the next
  level may cause the next level to fill up, in which case it too must
  be flushed and merged further down. In the worst case, this can
  cascade down the entire table. Consequently, the worst-case disk I/O
  complexity of updates is *linear* in the size of the table, which
  makes this schedule unsuitable for real-time systems and other use
  cases where long pauses are unacceptable.

- Using the `Incremental` merge schedule, runs can be *partially merged*,
  with the merging work spread out evenly across all update operations.
  This aligns the worst-case and average-case disk I/O complexity of
  updates: both are *logarithmic* in the size of the table. The cost is
  a small constant overhead for both lookup and update operations.
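
As a concrete (hypothetical) configuration sketch: the parameters
`confMergePolicy`, `confMergeSchedule`, `confSizeRatio`, and
`confWriteBufferAlloc` and the constructors `Incremental` and
`AllocNumEntries` appear above, but the constructor `LazyLevelling`,
the numeric size ratio, and the `defaultTableConfig` record being
updated are assumptions; consult the `TableConfig` documentation for
the exact types.

```haskell
-- A sketch of a table configuration tuned for update-heavy workloads
-- with predictable latency: lazy levelling favours updates while
-- keeping lookups of the oldest data fast, and incremental merging
-- avoids long merge pauses.
updateHeavyConfig :: TableConfig
updateHeavyConfig = defaultTableConfig
  { confMergePolicy      = LazyLevelling    -- tiering above, levelling at the last level
  , confMergeSchedule    = Incremental      -- spread merging work across updates
  , confSizeRatio        = 4                -- T: capacity ratio between levels
  , confWriteBufferAlloc = AllocNumEntries 10000  -- B: max entries held in memory
  }
```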

The merge schedule does not affect the performance of table unions;
instead, the package provides separate operations for one-shot and
incremental unions. The amortised disk I/O complexity of a one-shot
union is *linear* in the size of the tables. For incremental unions, it
is up to the user to spread the merging work out evenly over time.

### References

The implementation of LSM-trees in this package draws inspiration from: