Triedent Refactor #533

Open
wants to merge 22 commits into main

Conversation

bytemaster
Contributor

This pull request is to start the code review process and is not yet ready to be merged as it has not been tested with psibase integration.

The primary changes are in the following areas:

  1. removing the cache_allocator, object_db, ring_allocator, region_allocator, and gc
  2. adding a new block_allocator, id_allocator and seg_allocator

It maintains the existing database API, so it shouldn't require any major changes to the rest of psibase.

Motivation

The ring buffer system was a fixed-size cache that required a lot of pinned memory. Under heavy load, especially once data no longer fit in RAM, the old system would have the write thread waiting on the background thread, which in turn was waiting on the read threads. Transaction rates fell very low and the majority of the time was spent waiting on mutexes. There was no good way to know how to size the ring buffers, which meant that the region allocator did most of the heavy lifting.

The old system was fragile, requiring sessions to unlock on certain allocations and invalidating cached reads. Aside from the pinning of Hot/Warm, there was no good way to tell the OS how to page. To make matters worse, the hot rings were filled with mostly dead data caused by the churn of allocating and freeing. It took a long time for the ring allocator to get around to reusing that RAM, wasting scarce pinned pages.

Results

The code in this branch can sustain 2M random reads per second from 4 threads while doing 200k random writes per second on a database that is 272GB with 22GB of IDs holding 338M records. The vast majority of segments end up being 99.9% full and there was limited wasted space. At the end of the 272GB insertion there were only 6GB of segments ready to be reused, and a large part of that was each of the 6 threads' personal 128MB write segments. Overall less than 5% of space was wasted. Future updates could easily trim the database down in size if there were too many empty segments. This was on an M3 MacBook Pro with 128GB of RAM.

After creating that large database, I was able to perform 3.8M sequential inserts per second from a single thread, followed by 6M sequential queries per second. I could update sequential keys at 5M keys per second. Single-threaded random inserts achieved over 350k per second.

Block Allocator

Allocates data in chunks of 128MB (configurable at compile time)
Chunks have independent mmap address ranges so new chunks can be allocated without having to remap the entire file
Responsible for converting a "location" in a logical range into a segment/offset and resolving the pointer
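
A minimal sketch of this location-to-pointer translation, assuming a compile-time 128MB block size and a vector of independently mmap'ed base pointers (the class and member names here are illustrative, not the actual triedent API):

```cpp
// Hypothetical sketch: translate a logical "location" into a segment
// index plus offset and resolve it against independently mmap'ed blocks.
// Names and layout are illustrative only, not triedent's actual code.
#include <cstdint>
#include <vector>

class block_allocator_sketch
{
  public:
   static constexpr uint64_t block_size = 128ull * 1024 * 1024;  // 128MB, fixed at compile time

   // Each block gets its own mmap'ed range, so growing the file only adds
   // a new entry here and never remaps the existing blocks.
   void* get(uint64_t location) const
   {
      uint64_t segment = location / block_size;  // which 128MB chunk
      uint64_t offset  = location % block_size;  // byte offset inside it
      return static_cast<char*>(_blocks[segment]) + offset;
   }

  private:
   std::vector<void*> _blocks;  // one base address per mmap'ed chunk
};
```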

ID Allocator

Uses the block allocator to reserve space for a growing ID database
mlocks the blocks provided by the block allocator
Responsible for allocating new IDs in a thread-safe manner and recycling unused ids using a linked list similar to the old version's
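
A minimal sketch of thread-safe id allocation with a recycled free list, assuming each free slot stores the index of the next free slot and the list head is updated with compare_exchange (illustrative names; ABA protection and table growth are omitted for brevity):

```cpp
// Hypothetical sketch of lock-free id allocation with recycling: free ids
// form a linked list threaded through the slots themselves. Not triedent's
// actual id_allocator.
#include <atomic>
#include <cstdint>
#include <vector>

struct id_allocator_sketch
{
   std::atomic<uint64_t>              free_head{0};    // 0 means "no recycled ids"
   std::atomic<uint64_t>              next_unused{1};  // fresh ids start at 1
   std::vector<std::atomic<uint64_t>> slots;           // backed by mlock'ed blocks in the real design

   explicit id_allocator_sketch(size_t capacity) : slots(capacity) {}

   uint64_t alloc()
   {
      uint64_t head = free_head.load(std::memory_order_acquire);
      while (head != 0)
      {
         uint64_t next = slots[head].load(std::memory_order_relaxed);
         if (free_head.compare_exchange_weak(head, next, std::memory_order_acq_rel))
            return head;                  // reused a recycled id
      }
      return next_unused.fetch_add(1);    // free list empty, take a fresh id
   }

   void release(uint64_t id)
   {
      uint64_t head = free_head.load(std::memory_order_relaxed);
      do
      {
         slots[id].store(head, std::memory_order_relaxed);  // link id in front of the current head
      } while (!free_head.compare_exchange_weak(head, id, std::memory_order_release));
   }
};
```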

Seg Allocator

This is the workhorse that builds on the block allocator and id allocator to allocate large segments whenever a thread needs a new place to write. The segments are not mlocked; instead, madvise is used to tune paging based on whether a segment is being used for allocation or being compacted, and it can factor in other things such as object density.

The seg_allocator implements sessions, which allow a thread to request a read_lock to prevent the allocator from reusing a segment. Requests to access data can only be made via the read_lock, which returns an object_ref.
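
A minimal sketch of that access pattern, where the read_lock is an RAII guard and data can only be reached through it (class shapes, member names, and bodies here are assumptions, not the actual triedent interfaces):

```cpp
// Hypothetical sketch of "data access only through a read_lock": the guard
// pins the session while alive, and object_ref is the only path to the data.
#include <cstdint>

struct object_ref_sketch
{
   const void* data;  // resolved pointer, valid while the read_lock is held
   uint64_t    size;
};

class session_sketch
{
  public:
   class read_lock
   {
     public:
      explicit read_lock(session_sketch& s) : _s(s) { _s.enter(); }
      ~read_lock() { _s.leave(); }

      // The only way to resolve an object id, so it is impossible to touch
      // data without holding the lock that keeps its segment from being reused.
      object_ref_sketch get(uint64_t object_id) { return _s.resolve(object_id); }

     private:
      session_sketch& _s;
   };

   read_lock lock() { return read_lock(*this); }

  private:
   void enter() {}  // real code would publish the segment state readers may still see
   void leave() {}  // ...and release it so the compactor can recycle segments
   object_ref_sketch resolve(uint64_t) { return {nullptr, 0}; }  // id -> location -> pointer
};
```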

Testing

The code was mostly tested via programs/tdb.cpp and it was built with Thread Sanitizer to remove all detectable data races.

Design

[design diagram image]

Commit notes

1. use alignas() to prevent false sharing
2. use stack-allocated buffer for temp key6 during lookups (13% perf
   gain)
3. updated big test to support read only mode
4. updated big test to support reads
5. increase the ringbuffer space from 32M to 128M
6. added some comments for review
1. put temp base6 key on stack instead of heap
2. disable copy-to-hot
1. new block allocator doesn't require remapping the entire range to
   grow
2. new id allocator that *should be* thread safe for multiple writers by
   treating the ID space as a hash table and growing it when the collision
   rate starts to slow down alloc (this is to be changed in the future as
   it consumes 25% of the write thread)
3. new database API abstraction on top of database
4. replace global/generalized GC with one based upon seg manager
5. enforce that the session lock is in place by putting the necessary
   function calls on the "lock object" so it is impossible to use the
   API without maintaining the invariants.

   Currently maintains 6M reads/sec across 10 threads while writing
   185k items per second while 280M items are in the database and with no
   mlocking on the database except for the object id table.
Used Thread Sanitizer to remove all detected data races
Uses fetch_or/and for locking and fetch_add/sub for retaining/releasing (see the sketch after these commit notes)
updated triedent db (tdb.cpp) to have more options to configure how
aggressively data is synced, cached, etc.
- added % free to db dump
- fixed double-check lock on object id
Allocate object id before allocating space
Set the object_header before advancing the alloc_ptr
Change alloc_ptr to 32 bit
- fixing bugs in alloc
- making compact optional / manual call from main thread for
  deterministic testing
updated release() to not require a lock by having the compactor check
whether the object was released after it was moved.
- add release() background thread
- fixed bugs with compactor moving objects
git add include/triedent/xxhash.h
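
One of the commit notes above mentions fetch_or/fetch_and for locking and fetch_add/fetch_sub for retaining and releasing; below is a minimal sketch of that style, with a single atomic word combining a lock bit and a reference count (the bit layout and names are assumptions, not the actual object metadata):

```cpp
// Hypothetical sketch of the fetch_or/fetch_and lock bit plus
// fetch_add/fetch_sub refcount style mentioned in the commit notes.
// Bit positions and field layout are assumptions for illustration.
#include <atomic>
#include <cstdint>

struct object_meta_sketch
{
   static constexpr uint64_t lock_bit = 1ull << 63;

   std::atomic<uint64_t> word{0};  // high bit: lock/modify flag, low bits: reference count

   // fetch_or sets the bit and reports whether it was already taken
   bool try_lock() { return (word.fetch_or(lock_bit, std::memory_order_acquire) & lock_bit) == 0; }

   void unlock() { word.fetch_and(~lock_bit, std::memory_order_release); }

   void retain() { word.fetch_add(1, std::memory_order_relaxed); }

   // returns true when the last reference goes away
   bool release() { return (word.fetch_sub(1, std::memory_order_acq_rel) & ~lock_bit) == 1; }
};
```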