Motivation
Moquette has always used a local key-value store for its needs, starting with MapDB and then switching to H2 MVStore. The things the broker stores are session state information and message queues.
This type of storage is essentially some form of B-tree, which is ideal for keeping things ordered with fast random access times.
While this characteristic is useful for session state information, it's sub-optimal for queues (FIFO in our case).
This discussion describes the design I have in mind for this particular aspect, shaped as a draft work-in-progress PR #638.
Requirements
The implementation should leverage OS features to speed up the I/O operations, permit concurrent access in read and write, and have a design that's easy to grasp, because its maintenance is part of the Moquette project.
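The OS feature in question is memory mapping, which in Java is exposed through `FileChannel.map`. Below is a minimal sketch (not the actual PR code; `PageMapper` is a hypothetical name) of mapping one 64 MB page file, following the `page.N` naming scheme described in the next section:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class PageMapper {
    static final int PAGE_SIZE = 64 * 1024 * 1024; // 64 MB per page file

    // Maps (creating if necessary) a page file into memory; after this the
    // OS page cache handles the actual I/O, which is the speed-up the
    // requirement above refers to. The mapping stays valid after the
    // channel is closed.
    static MappedByteBuffer mapPage(Path dir, int pageId) throws IOException {
        Path file = dir.resolve("page." + pageId);
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, PAGE_SIZE);
        }
    }
}
```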
High-level point of view
The idea is to use memory-mapped files to store parts of the queues, named `page`s. All the queues are contained in the same file structures; there isn't the concept of a file per queue, because that would mean an extra file descriptor for each client, and the clients could be many.
Each `page` (64 MB wide) is subdivided into fixed `segment`s (4 MB each); a `segment` is owned exclusively by a single queue. A queue is composed of multiple `segment`s, and an index file keeps the list of all segments for each queue.
Pages are named with incremental integer names, for example `page.0`, `page.1` and so on.
Each queue has a tail and a head pointer, each referring to a `(page, offset)` position inside the pages.
We have the invariant, for each queue, that the `tail` pointer never moves past the `head` pointer.
The write operation to a queue moves the `head` pointer forward; the same happens for the read against the `tail` pointer.
Write
The write can fall into 2 cases, depending on whether the payload being saved fits in the current segment:
In case 1, the payload fits in the current segment: the write moves the head pointer forward (a CAS operation) and then the bytes are effectively written to the mapped region. If the CAS operation on the head pointer doesn't succeed, the writer simply retries, in a form of spin loop.
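As a minimal sketch of this fast path, assuming the head offset lives in an `AtomicLong` and the segment is already mapped (class and method names are illustrative, not the PR's actual API):

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the case 1 write fast path.
final class InSegmentWriter {
    static final int SEGMENT_SIZE = 4 * 1024 * 1024;
    static final int LENGTH_HEADER = 4; // 4 length bytes prepend every payload

    private final AtomicLong head;          // head offset within the current segment
    private final MappedByteBuffer segment; // mapped region of the segment owned by this queue

    InSegmentWriter(AtomicLong head, MappedByteBuffer segment) {
        this.head = head;
        this.segment = segment;
    }

    // Returns true if the payload fit in the current segment (case 1),
    // false if the caller must fall back to the locked cross-segment path (case 2).
    boolean tryWrite(byte[] payload) {
        int needed = LENGTH_HEADER + payload.length;
        while (true) { // spin loop: retry the CAS until it succeeds
            long current = head.get();
            long next = current + needed;
            if (next > SEGMENT_SIZE) {
                return false; // would cross the segment boundary: case 2
            }
            if (head.compareAndSet(current, next)) {
                // The region [current, next) is now reserved for this writer:
                // write the 4-byte length header followed by the payload.
                ByteBuffer view = segment.duplicate();
                view.position((int) current);
                view.putInt(payload.length);
                view.put(payload);
                return true;
            }
            // CAS lost the race with another writer: simply retry.
        }
    }
}
```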
In case 2, the payload needs to be written across two segments, so this time, under a global lock (shared by all the queues), the writer gets a fresh new segment, writes the payload split across the two segments, and releases the lock (see the sketch after the next paragraph).
The motivation for this is that the majority of writes happen with just a CAS move of the head pointer. In particular, with the new session event loop concept, given that the queues are per session, we are sure that there is no contention on the CAS.
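A hedged sketch of the case 2 slow path could look like the following; the `SegmentAllocator` hook is hypothetical and stands in for the index file that hands out fresh segments:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the case 2 slow path: the payload crosses the
// segment boundary, so a global lock (shared by all queues) protects the
// allocation of a fresh segment.
final class CrossSegmentWriter {
    private static final ReentrantLock GLOBAL_LOCK = new ReentrantLock();

    interface SegmentAllocator {
        ByteBuffer nextFreeSegment(); // hands out a fresh, exclusively owned 4 MB segment
    }

    static void writeAcrossSegments(ByteBuffer tailOfCurrentSegment, byte[] payload,
                                    SegmentAllocator allocator) {
        // Frame the payload with its 4-byte length header.
        ByteBuffer framed = ByteBuffer.allocate(4 + payload.length);
        framed.putInt(payload.length).put(payload);
        framed.flip();

        GLOBAL_LOCK.lock();
        try {
            // The first part fills whatever room is left in the current segment.
            int room = tailOfCurrentSegment.remaining();
            ByteBuffer firstPart = framed.duplicate();
            firstPart.limit(room);
            tailOfCurrentSegment.put(firstPart);

            // The rest goes at the start of the freshly allocated segment.
            framed.position(room);
            allocator.nextFreeSegment().put(framed);
        } finally {
            GLOBAL_LOCK.unlock();
        }
    }
}
```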
Read
The read operation is symmetric to the write: the `tail` pointer points to the first byte to read, so conceptually it reads the 4 length bytes that prepend every payload and moves the `tail` pointer forward with a CAS operation. Like the write, there are 2 cases:
In the first case, the reader reads the 4 length bytes at the tail pointer and calculates the new position; if it falls in the same segment, it performs the CAS move operation, and if the CAS fails, it restarts.
In the second case, if the new position exceeds the tail segment's limit, the read operation enters the global lock, retrieves the next segment, reads the data inside the lock, and then releases it.
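Putting both read cases together, a minimal sketch (again with hypothetical names, and with the cross-segment hand-over elided) could look like this:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the read path: peek the 4 length bytes at the tail,
// compute the next tail position, and either CAS-advance (fast path) or take
// the global lock to move onto the next segment (slow path).
final class SegmentedQueueReader {
    static final int SEGMENT_SIZE = 4 * 1024 * 1024;

    private final AtomicLong tail;          // tail offset within the current segment
    private final ByteBuffer segment;       // mapped region of the tail segment
    private final ReentrantLock globalLock; // shared by all queues for segment hand-over

    SegmentedQueueReader(AtomicLong tail, ByteBuffer segment, ReentrantLock globalLock) {
        this.tail = tail;
        this.segment = segment;
        this.globalLock = globalLock;
    }

    byte[] read() {
        while (true) { // restart whenever the CAS fails
            long current = tail.get();
            ByteBuffer view = segment.duplicate();
            view.position((int) current);
            int length = view.getInt();       // the 4 length bytes prepending the payload
            long next = current + 4 + length;

            if (next > SEGMENT_SIZE) {
                return readAcrossSegments(current); // case 2: slow path under the global lock
            }
            if (tail.compareAndSet(current, next)) {
                byte[] payload = new byte[length];
                view.get(payload);
                return payload;
            }
            // Another reader won the CAS: restart from the new tail.
        }
    }

    private byte[] readAcrossSegments(long current) {
        globalLock.lock();
        try {
            // Retrieve the next segment from the index, read the bytes that
            // spill over into it, and advance the tail, all inside the lock.
            throw new UnsupportedOperationException("segment hand-over elided in this sketch");
        } finally {
            globalLock.unlock();
        }
    }
}
```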