Triedent Refactor #533

Open
wants to merge 22 commits into main

Conversation

bytemaster
Contributor

This pull request is to start the code review process and is not yet ready to be merged as it has not been tested with psibase integration.

The primary changes are in the following areas:

  1. removing the cache_allocator, object_db, ring_allocator, region_allocator, and gc
  2. adding a new block_allocator, id_allocator and seg_allocator

It maintains the existing database API, so it shouldn't require any major changes to the rest of psibase.

Motivation

The ring buffer system was a fixed-size cache that required a lot of pinned memory. Under heavy load, especially once data no longer fit in RAM, the old system would have the write thread waiting on the background thread, which in turn was waiting on the read threads. Transaction rates fell very low and the majority of the time was spent waiting on mutexes. There was no good way to know how to size the ring buffers, which meant that the region allocator did most of the heavy lifting.

The old system was fragile, requiring sessions to unlock on certain allocations and invalidating cached reads. Aside from the pinning of Hot/Warm, there was no good way to tell the OS how to page. To make matters worse, the hot rings were filled with mostly dead data caused by the churn of allocating and freeing. It took a long time for the ring allocator to get around to reusing that RAM, wasting scarce pinned pages.

Results

The code in this branch can sustain 2M random reads per second from 4 threads while doing 200k random writes per second on a database that is 272GB with 22GB of IDs holding 338M records. The vast majority of segments end up being 99.9% full and there was limited wasted space. At the end of the 272GB insertion there were only 6GB of segments ready to be reused, and a large part of that was each of the 6 threads' personal 128MB write segments. Overall less than 5% of space was wasted. Future updates could easily trim the database down in size if there were too many empty segments. This was on an M3 MacBook Pro with 128GB of RAM.

After creating that large database, I was able to perform 3.8M sequential inserts per second from a single thread, followed by 6M sequential queries per second. I could update sequential keys at 5M keys per second. Single-threaded random inserts achieved over 350k per second.

Block Allocator

Allocates data in chunks of 128MB (configurable at compile time)
Chunks have independent mmap address ranges so new chunks can be allocated without having to remap the entire file
Responsible for converting a "location" in a logical range into a segment/offset and resolving the pointer
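
A minimal sketch of this location-to-pointer translation, assuming a compile-time 128MB block size and a vector of independently mmap'ed base pointers (the class and member names here are illustrative, not the actual triedent API):

```cpp
// Hypothetical sketch: translate a logical "location" into a segment
// index plus offset and resolve it against independently mmap'ed blocks.
// Names and layout are illustrative only, not triedent's actual code.
#include <cstdint>
#include <vector>

class block_allocator_sketch
{
  public:
   static constexpr uint64_t block_size = 128ull * 1024 * 1024;  // 128MB, fixed at compile time

   // Each block gets its own mmap'ed range, so growing the file only adds
   // a new entry here and never remaps the existing blocks.
   void* get(uint64_t location) const
   {
      uint64_t segment = location / block_size;  // which 128MB chunk
      uint64_t offset  = location % block_size;  // byte offset inside it
      return static_cast<char*>(_blocks[segment]) + offset;
   }

  private:
   std::vector<void*> _blocks;  // one base address per mmap'ed chunk
};
```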

ID Allocator

Uses the block allocator to reserve space for a growing ID database
mlocks the blocks provided by the block allocator
Responsible for allocating new IDs in a thread-safe manner and recycling unused ids using a linked list similar to the old version's
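
A minimal sketch of thread-safe id allocation with a recycled free list, assuming each free slot stores the index of the next free slot and the list head is updated with compare_exchange (illustrative names; ABA protection and table growth are omitted for brevity):

```cpp
// Hypothetical sketch of lock-free id allocation with recycling: free ids
// form a linked list threaded through the slots themselves. Not triedent's
// actual id_allocator.
#include <atomic>
#include <cstdint>
#include <vector>

struct id_allocator_sketch
{
   std::atomic<uint64_t>              free_head{0};    // 0 means "no recycled ids"
   std::atomic<uint64_t>              next_unused{1};  // fresh ids start at 1
   std::vector<std::atomic<uint64_t>> slots;           // backed by mlock'ed blocks in the real design

   explicit id_allocator_sketch(size_t capacity) : slots(capacity) {}

   uint64_t alloc()
   {
      uint64_t head = free_head.load(std::memory_order_acquire);
      while (head != 0)
      {
         uint64_t next = slots[head].load(std::memory_order_relaxed);
         if (free_head.compare_exchange_weak(head, next, std::memory_order_acq_rel))
            return head;                  // reused a recycled id
      }
      return next_unused.fetch_add(1);    // free list empty, take a fresh id
   }

   void release(uint64_t id)
   {
      uint64_t head = free_head.load(std::memory_order_relaxed);
      do
      {
         slots[id].store(head, std::memory_order_relaxed);  // link id in front of the current head
      } while (!free_head.compare_exchange_weak(head, id, std::memory_order_release));
   }
};
```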

Seg Allocator

This is the workhorse that builds on the block allocator and id allocator to allocate large segments whenever a thread needs a new place to write. The segments are not mlocked; instead, madvise is used to tune paging based on whether a segment is being used for allocation or being compacted, and it can factor in other things such as object density.

The seg_allocator implements sessions, which allow a thread to request a read_lock to prevent the allocator from reusing a segment. Requests to access data can only be made via the read_lock, which returns an object_ref.
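
A minimal sketch of that access pattern, where the read_lock is an RAII guard and data can only be reached through it (class shapes, member names, and bodies here are assumptions, not the actual triedent interfaces):

```cpp
// Hypothetical sketch of "data access only through a read_lock": the guard
// pins the session while alive, and object_ref is the only path to the data.
#include <cstdint>

struct object_ref_sketch
{
   const void* data;  // resolved pointer, valid while the read_lock is held
   uint64_t    size;
};

class session_sketch
{
  public:
   class read_lock
   {
     public:
      explicit read_lock(session_sketch& s) : _s(s) { _s.enter(); }
      ~read_lock() { _s.leave(); }

      // The only way to resolve an object id, so it is impossible to touch
      // data without holding the lock that keeps its segment from being reused.
      object_ref_sketch get(uint64_t object_id) { return _s.resolve(object_id); }

     private:
      session_sketch& _s;
   };

   read_lock lock() { return read_lock(*this); }

  private:
   void enter() {}  // real code would publish the segment state readers may still see
   void leave() {}  // ...and release it so the compactor can recycle segments
   object_ref_sketch resolve(uint64_t) { return {nullptr, 0}; }  // id -> location -> pointer
};
```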

Testing

The code was mostly tested via programs/tdb.cpp and it was built with Thread Sanitizer to remove all detectable data races.

Design

[design diagram image]

Commit notes

1. use alignas() to prevent false sharing
2. use stack-allocated buffer for temp key6 during lookups (13% perf
   gain)
3. updated big test to support read only mode
4. updated big test to support reads
5. increase the ringbuffer space from 32M to 128M
6. added some comments for review
1. put temp base6 key on stack instead of heap
2. disable copy-to-hot
1. new block allocator doesn't require remapping the entire range to
   grow
2. new id allocator that *should be* thread safe for multiple writers by
   treating the ID space as a hash table and growing it when the collision
   rate starts to slow down alloc (this is to be changed in the future as
   it consumes 25% of the write thread)
3. new database API abstraction on top of database
4. replace global/generalized GC with one based upon seg manager
5. enforce that the session lock is in place by putting the necessary
   function calls on the "lock object" so it is impossible to use the
   API without maintaining the invariants.

   Currently maintains 6M reads/sec across 10 threads while writing
   185k items per second while 280M items are in the database and with no
   mlocking on the database except for the object id table.
Used Thread Sanitizer to remove all detected data races
Uses fetch_or/and for locking and fetch_add/sub for retaining/releasing (see the sketch after these commit notes)
updated triedent db (tdb.cpp) to have more options to configure how
aggressively data is synced, cached, etc.
- added % free to db dump
- fixed double-check lock on object id
Allocate object id before allocating space
Set the object_header before advancing the alloc_ptr
Change alloc_ptr to 32 bit
- fixing bugs in alloc
- making compact optional / manual call from main thread for
  deterministic testing
updated release() to not require a lock by having the compactor check
whether the object was released after it was moved.
- add release() background thread
- fixed bugs with compactor moving objects
git add include/triedent/xxhash.h
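
One of the commit notes above mentions fetch_or/fetch_and for locking and fetch_add/fetch_sub for retaining and releasing; below is a minimal sketch of that style, with a single atomic word combining a lock bit and a reference count (the bit layout and names are assumptions, not the actual object metadata):

```cpp
// Hypothetical sketch of the fetch_or/fetch_and lock bit plus
// fetch_add/fetch_sub refcount style mentioned in the commit notes.
// Bit positions and field layout are assumptions for illustration.
#include <atomic>
#include <cstdint>

struct object_meta_sketch
{
   static constexpr uint64_t lock_bit = 1ull << 63;

   std::atomic<uint64_t> word{0};  // high bit: lock/modify flag, low bits: reference count

   // fetch_or sets the bit and reports whether it was already taken
   bool try_lock() { return (word.fetch_or(lock_bit, std::memory_order_acquire) & lock_bit) == 0; }

   void unlock() { word.fetch_and(~lock_bit, std::memory_order_release); }

   void retain() { word.fetch_add(1, std::memory_order_relaxed); }

   // returns true when the last reference goes away
   bool release() { return (word.fetch_sub(1, std::memory_order_acq_rel) & ~lock_bit) == 1; }
};
```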