WorkTable macro for generating type alias and row type.
worktable! (
name: Test,
columns: {
id: u64 primary_key,
test: i64
}
);
Expanded as:
#[derive(Debug, Clone)]
pub struct TestRow {
id: u64,
test: i64,
}
impl worktable::TableRow<u64> for TestRow { fn get_primary_key(&self) -> &u64 { &self.id } }
type TestWorkTable<I> = worktable::WorkTable<TestRow, u64, I>;
- Underlying structure as
Vec<Row>
with no sync algorithms,BTreeMap
for primary key.
We have big amount off nearly small objects, so we need to control allocations to optimize theirs creation. For
achieving this we can store table's data on Pages
.
Pages are byte arrays of some size (4Kb as minimal value, it's better to choose this value from disk storage page size). Data stored in these pages in some deserialized format (some kind of binary format. rkyv can be used for serialization and deserialization).
struct Page {
data: [u8; PAGE_SIZE],
info: PageInfo, // Some info about `Page` like it's index etc.
}
To navigate on pages link's can be used.
struct Link {
/// Id of the page where data is located.
page_id: u64,
/// Offset ona page (< PAGE_SIZE, so u16 will be enough up to 64Kb page)
offset: u16,
/// Length of the data. For rows this will be same, so maybe not used.
len: u16,
}
When we added some data and then deleted we will have a gap in bytes, and we need to control this gaps and fill them with data. If we have rows-based storage, all rows will have same length, so when old is deleted, we can easily replace it with new one. So we need some storage (stack) for this empty links.
Lock-free stack (using atomics) can be used here, because we don't want to lock on new row addition.
When count of empty link will massively grow, we will have empty pages or big gaps in pages data. So, if we need, we can have defragmentation algorithm that will tighten data and delete gaps.
As was said before we don't want to lock on addition, so we can use atomic for page's tail, and lock-free stack for empty links. On row addition we wil first check empty link registry, and use link popped from it if there is some. If there is no links, we will use current tails index (which will be stored in atomic) and use it for new data storage. We possibly can have locks only on new page allocation.
Updates and deletes will be always lock actions.
So, for lock control we can use map that will map row's primary key to it's RwLock
. On new data addition. For indexes
lock-free map can be used.
Upd. Maybe atomic pointers can be used here, but I'm no sure.
So, modifying algorithm will look like: we get RwLock
by row's primary key (or index key), then modify row, and
releasing lock.
Upd. No locks for updating/deleting. Delete as flag for row, update as delete + insert.
I think for filtering we will need to copy table, which is bad. I need to think about filtering more to make it more effective.
Foreign keys can be implemented as map from key-id to other's table row link. So join operation on row can be done by O(1).
We need to extend macro usage to minimize client boilerplate code. Example:
{
worktable!(
name: Price
columns: {
id: u64 primary key autoincrement,
exchange: u8,
level: u8,
asks_price: f64,
bids_price: f64,
asks_qty: f64,
bids_qty: f64,
timestamp: u64,
}
queries: {
select: {
// similar to SELECT bids_price, bids_qty FROM price where exchange=$1
BidsPriceQty("bids_price", "bids_qty") by "exchange" as bids_price_by_exchange, // name override
// similar to SELECT bids_price, bids_qty FROM price where bids_price>$1
BidsPriceQty("bids_price", "bids_qty") by "bids_price" > as bids_price_above,
// similar to SELECT bids_price, bids_qty FROM price where timestamp>$1 and timestamp<$2
BidsPriceQty("bids_price", "bids_qty") by "timestamp" > and "timestamp" < as bids_price_by_date,
}
}
);
let price_table = PriceWorkTable::new();
// Result is multiple rows.
// without override price_table.bids_price_qty_by_exchange()
let binance_orders: BidsPriceByExchange = price_table.bids_price_by_exchange(Exchange::BinanceSpot as u8);
// Result is multiple rows.
// without override price_table.bids_price_qty_by_bids_price_more()
let binance_orders: BidsPriceAbove = price_table.bids_price_above(1000.0);
// Result is still multiple rows.
// without override price_table.bids_price_qty_by_timestamp_more_and_timestamp_less()
let binance_orders: BidsPriceByDate = price_table.bids_price_by_date(123312341, 1234128345);
}
As inspiration for macro interfaces/design this crate can be used. It contains derive macro implementation, but it's still usable for our case.
Next step after in-memory storage we need to add persistence support.
For starting point we can use mmap-sync which has mapped files implementation and read/write interface. We will need pages reader/writer for our storage engine.
As starting point innodb format was chosen. We can use it for storing tables data (leaf pages of b-tree must have nearly same layout).
Original innodb format
General Page
layout:
+----------------------+---------+
| Offset (Page number) | 4 bytes |
+----------------------+---------+
| Previous page ID | 4 bytes |
+----------------------+---------+
| Next page ID | 4 bytes |
+----------------------+---------+
| Page type | 2 bytes |
+----------------------+---------+
| Space ID | 4 bytes |
+----------------------+---------+
| Page data | n bytes |
+----------------------+---------+
Total header length is 18 bytes
.
- Offset is current
Page
's ID, in code will be represented asu32
(4,294,967,295 available pages). - Previous page ID is ID of previous logical
Page
. - Next page ID is ID of next logical
Page
. These IDs are used to form doubly-linked list from pages in one file. - Page type describes type of this page (TODO: Describe pages types)
- Space ID is ID of file (Space) to which this
Page
is related to. - Page data is just pages internal data.
Comparison with original InnoDB:
- No checksum part in header (we don't care about this).
- No LSN page modification header
- No flush LSN header
- No
Page
trailer
Original InnoDB format
For each table separate file will be used. This files will be named as Space
's. Each space will have some general
structure.
General Space
layout:
+-------------------------------+---------+
| Space internals page | 1 Page |
+-------------------------------+---------+
| Space Pages | n Pages |
+-------------------------------+---------+
Space
internal page:
+-----------------------+-----------+
| General page header | 18 bytes |
+-----------------------+-----------+
| Space header | 12 bytes |
+-----------------------+-----------+
| Table schema | n bytes |
+-----------------------+-----------+
As each Space
is related to separate Table
, it must contain this Table
's schema to validate data and row structure.
Space
header:
+-------------------------------+---------+
| Space ID | 4 bytes |
+-------------------------------+---------+
| Highest used Page number | 4 bytes |
+-------------------------------+---------+
| Highest allocated Page number | 4 bytes |
+-------------------------------+---------+
Original InnoDB page types
#[repr(u16)]
enum PageType {
/// Newly allocated pages.
Free = 0,
/// Space header `Page` type.
SpaceHeader = 1,
/// Table data `Page` type.
Data = 2,
/// Index `Page` type.
Index = 3,
}
Strings must be stored as varchars, we don't need to have empty padding in rows to have same width. We can have different row width, because we will hae links for the rows which contain its length. So we don't need this empty tail.
Same logic can be used for not sized array data types.
For fastest way possible to read bytes we need multiple opened read/write threads for one file. Using this we can update different parts of file at one time. To use this io_uring can be used. For this approach we have multiple solutions:
- https://github.com/ringbahn/iou - Interface to Linux's io_uring interface
- https://github.com/bytedance/monoio - A thread-per-core Rust runtime with io_uring/epoll/kqueue. It's goal is to replace Tokio. They have benchmarks that proves that they are blazingly fast wow.
- https://github.com/tokio-rs/io-uring - tokios io-uring raw interface. Is unsafe.
- https://github.com/tokio-rs/tokio-uring - io-uring in Tokio runtime. Safe.
- https://github.com/compio-rs/compio - inspired by MonoIO crate. Don't think we rally need this because it's feature is that this crates supports Windows. Do we need Windows?
Discussion about difference between sync and async i/o (5 years ago before io_uring was added).
https://crates.io/crates/faster-hex - where do we need to use hex? I don't know............