-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Direction towards allocation-free operations #1184
Comments
👋 Thanks for opening this issue! Get help or engage by:
|
I have not reviewed it thoroughly yet, but there are a few design aspects that In general, a
It is not feasible to keep Entry objects until they are fully processed. If a
There is a problem here because
A log entry that is submitted is visible to
The follower's buffer may be truncated if the Leader switches. Such an approach |
I'm using Also,
Agreed. But, that's also not a problem, IMHO. The entries don't strictly need to be kept longer than they are today. Today we also keep entries around until they are replicated, unless the follower is lagging too much (at the latest, the log implementation does that, since otherwise it would be very inefficient). If the entries are gone they can still be loaded from the log. Anyway, barring offline followers, the entries are processed rather quickly. Another option is to simply delegate the entry buffer to the log implementation, basically what we have today. Then, the log implementation can force the slow down of the processing by simply blocking append to the log (and thus overflowing the bounded channel directing requests to the log, which would block further requests). I didn't think about exact design here, though. My point is, we don't need to collect batches of entries in vectors, we just need to update pointers to the log/entry buffer to instruct the state machine and/or replication what to do. And, if we just pass it the
Correct. That's a problem which needs to be solved. A trivial solution I outlined above is simply stopping the replication task (closing the channel) before writing further log after leader change. This guarantees that the However, this would require awaiting joining of the replication task in raft core, which is contraproductive. Your point above with the log entry buffer being the problem is taken. So if we use the log as we use it today, the replication task would try to read the log, but the log was already asked to truncate itself. So the user-specific log implementation needs to resolve this properly (as it is today, no change in log storage concepts).
Strictly speaking, it does not. But the problem arises if the process is killed after applying the entry to the state machine, but before flushing the log entry. Restarting the process will find applied log ID > flushed log ID, which might lead to issues. Isn't there also somewhere an assertion regarding this in
Why? If the leader dies, all followers either have the newest (uncommitted) entries or do not. The followers then hold an election and the one having the longest log will win and replicate it to others. So there is no truncation on the follower. When the original leader rejoins (as a follower) and it has yet another (uncommitted, but flushed) log entry, which has NOT been replicated yet (and thus is unknown to all former followers), then this node needs to truncate the log. This will be necessary no matter what, it must be done also today. Or am I missing something here? Thanks. Based on the discussion so far, I'd modify the approach outlined above by removing the explicit entry buffer and delegating it to the log implementation. I.e., the log implementation is responsible for any optimizations of reading log items and ideally keeping them in memory for sufficient time. Still, the communication between tasks would be modified from unbounded channels to simply communicating pointers/simple state over |
The problem is that after setting the
I'm talking about the current Openraft implementation. logs that are submitted to This is because Openraft aims to separate the logic from the IO operations. When input messages are received, it simply places the IO commands into the output queue instead of executing them immediately. This approach allows for the reordering or batching of IO commands, enabling more efficient processing. When an IO command is placed into the output queue, it is considered accepted(not visible to reader yet). When the command is executed, such as when commands are pushed to the |
Yes the current implementation is using
This is almost(exactly?) same as the way Openraft does it now. |
This is the way Openraft does. The problem is that joining then rebuilding replication streams introduces a small delay. One solution is to let |
In the current code base, it is possible for the applied log ID to be greater than the flushed log ID. For example, when a snapshot is installed, the state machine is entirely replaced, and the last log ID may lag behind the snapshot's last log ID. Openraft has already dealt with these cases. |
What I mean is that using a log entries buffer requires the log to be truncated at a very early stage. |
Oh, a lot of comments...
Sure, we can only inform the consumer after the entry is actually written into the log/log buffer. As I mentioned, I think the term I used is wrong (accepted/submitted).
That's right. And my intention is to optimize it :-).
and
Correct. But these queues are unbounded, so it's a problem to run in constant memory. Also, the batching is at the moment NOT at the storage/log level, but rather at the task driving the storage/log and the batches constructed by We have multiple use cases here. Let's look first at the very simplest, which is the storage implementation doing My suggestion is NOT to pass entries, but rather to only pass the log pointer (committed log ID) to the user and consume the log pointer (applied log ID) from the user. I.e., there are no "commands", but rather there is a simple Also, let the user code apply the entries read from the log (most likely from memory) and then just announce applied log ID to Similarly, for snapshots, the With that, on the one hand, the storage implementation has the full flexibility to implement any I/O optimizations, etc., and the communication between Raft core and the storage task is in constant memory. The replication tasks are the same in green, at least for append entries.
Yes, that's a possibility. But,
OK, as mentioned, we can simply go with log storage as the buffer (as it is today in Where I don't have a good idea yet is how to optimize it with constant memory is the communication from the client, through the raft core task to the log writer. Communication from the client to the raft core task can be done using bounded buffer, but that helps only partially. Writing to the log is still unbounded and can overflow and/or block the raft core task if we'd use a bounded command buffer from raft core to log writer (which is a no-go). The same holds for append entries on the follower. I was thinking about short-cutting this to have only one bounded queue from client write/append entries directly to the log write task and log write task informing the raft core about accepted entries. Since we anyway need to deal with election and truncating the log, the log write task could simply poll both the entry stream from the user (either client write or append entries) and the When the Again, this concept would allow running in constant memory. There are still some other things like voting, but those can be also handled in constant memory fairly easily. |
Aggree.
It's acceptable to pass a log pointer to the StateMachine instead of log entries. However, pushing commands makes it easier to manage multiple types of operations, such as applying log entries, building a snapshot, or retrieving a snapshot. Generally speaking, the watch-driven architecture could work, but there are quite a few order-related details to consider. I'm going to rewrite the part that SM responds apply result to
Good.
Good.
Good.
The
Good. This should be feasible when all of the
It is indeed a problem unless
Yes a exclusive single client would be solution.
Make sense. |
Right, pushing command does have it's advantages :-). But, it can likely be handled by a different means.
Well, instead of pushing individual commands, one can send the required state. This has also a huge advantage that it would "cancel" commands which do not make sense to be executed anymore. So in the end, it likely is a win-win scenario. We just need to ensure
Oh yes, I know. For one, the current architecture requires passing a Possibly, some similar concept could be used here. For example, do not send |
Ah, forgot to comment on this one:
Yes, but what is a |
The log id has the id of the Leader that proposed this log. Not the one that replicates this log. A log can be proposed by Leader-1 and replicated by Leader-2.
An IO pointer must be monotonic increasing, Thus a complete |
OK, thanks. The problem is, if we push the log to the writer via short circuit, the |
So, back to the topic...
I don't quite get it. Yes, the same
Yes, but how can the writing leader/ We can only truncate that portion of the log, which was written after we became leader (and managed to persist something in the log, which isn't committed by the quorum and will be truncated out when we become follower). Basically, a truncate request is always a new So I believe it's sufficient to work like this:
For the optimized leader election, replace Am I missing something? |
Let me take an example to explain what could happen to a certain log id: Scenario: LogId Truncated and Appended Multiple TimesConsider a scenario where a specific The following discussion is in Raft terminology, a
Given the initial cluster state built with the following steps:
As a result, N1's log will be truncated and appended multiple times:
This scenario can repeat, where a log entry with the same However, it's important to note that a specific Therefore, the pointer for an IO operation must be represented as |
Uff, quite a complex example :-). But I understand what you mean. Nevertheless, it should be fairly easy to implement it as well. Basically, we have two sources feeding entries into the log writer:
In both cases, at the time the entry would be put to the log buffer the The log write task would truncate the log to the Or am I still missing something? |
It looks like a working approach. But I can not say for sure for now. It involves too many modification to the codebase. Can you give some of the top level API signature, such as for Follower to accept AppendEntries. |
Sorry for the delay. First, let's look at replication. My suggestion would be on these lines: trait RaftNetworkFactory<C> {
/// Create replication stream streaming log entries to a remote node.
fn new_replication_stream(
&mut self,
target: C::NodeId,
node: C::Node
) -> impl RaftReplicationStream {
DefaultReplicationStream(self.new_client(target, node));
}
/// State of the replication for a replication stream.
trait RaftReplicationState {
/// Get the next target log ID to replicate up to, the current commit ID and the current Vote to replicate to the follower.
async fn next_request(&self) -> Result<(LogId, LogId, Vote), ReplicationClosed>;
/// Report progress of the replication to the leader.
fn report_progress(&self, flushed_up_to: LogId) -> Result<(), ReplicationClosed>;
}
/// Replication stream.
trait RaftReplicationStream {
/// Run the replication loop.
///
/// Read entries from the `log_reader` and send them to the `target`.
/// Use `state` object to get the new log ID to replicate to and to report progress.
async fn replicate(
self,
log_reader: RaftLogReader,
start_log_id: LogId,
state: impl RaftReplicationState,
) -> Result<(), ConflictOrHigherVoteOrNetworkError>;
} where The When some entries are replicated and logged remotely, the remote node sends a confirmation in application-specific way, which will result in either reporting progress or by reporting an error (conflict/higher vote). In case of error, there is no way for replication to continue, so the loop ends and the "right" error is returned. Upon returning, the driver of the replication in the replication task would inform the raft core and/or restart the replication as needed. Reporting the progress would just entail setting a new value for the No memory for actual data channels and no entry copying is involved here (aside from reading from the log, but the application should optimize that). |
Now, for trait RaftStateMachine {
fn new_apply_stream(&mut self) -> impl RaftApplyStream;
}
/// State for applying committed entries.
trait RaftApplyState: 'static {
/// Get the next committed log ID to apply up to.
async fn next_request(&self) -> Result<LogId, RaftStopped>;
/// Report apply progress.
fn report_progress(last_applied: LogId) -> Result<(), RaftStopped>;
}
// Entry apply stream.
trait RaftApplyStream: 'static {
/// Run the apply loop.
///
/// Read entries from the `log_reader` and apply them to the state machine.
/// Use `state` object to get the new log ID to apply up to to and to report progress.
async fn apply(
self,
log_reader: RaftLogReader,
start_log_id: LogId,
state: impl RaftApplyState,
) -> Result<(), StorageError>;
} The apply stream is started at Raft startup in a new task (or runs in the storage task interleaved with other processing as needed). It uses the In case of error, the apply loop terminates and the Raft stops. Same as for replication, there are no actual data moved between tasks, just the Same as for replication, it would be possible to write a default implementation which simply calls legacy What is different here, we need to find a way to send responses to the client. I didn't think this through yet, how to post response objects needed for reporting replies w/o the use of an extra channel. Probably it will require moving reporting the response to a callback on the |
What about snapshot? IMHO, there should be a wrapper like To me the major changes include two parts: replace channel based communication with There are a few related detail to be consider. I can not tell if it is totally a working pattern.
The data flow from |
One minor issues is that The |
Currently,
openraft
allocates at many places, which is a bit contraproductive, since any OOM will panic and tear down the state machine unnecessarily. We should look at possibilities how to operate in constant memory.Basically, we have following parts/task that cooperate:
openraft
) receiving logs from the leader and pushing them toopenraft
Now, all of these tasks basically operate on the same list of entries, which are either received from the client (on the leader) or replicated from the leader (on the follower).
The idea is to have one buffer which accepts logs, which then are pushed through the consensus (replication), log and state machine. This buffer could be also fixed-size, so it can accept all the
Entry
objects for the relevant logs and keep them until they are fully processed.I.e., on the leader:
accepted
pointer to changeaccepted
pointer and when it changes, starts writing the log and updatessubmitted
pointerflushed
pointer is updatedaccepted
pointer as well and send entries to the respective followerflushed
pointer and the pointers from followers and computes thecommitted
pointerto_apply
pointer to the minimum offlushed
andcommitted
pointer (we cannot apply locally until the local log was written)to_apply
pointer and similar to log and replication tasks, applies log entriesapplied
pointer is updated and async client callback in the entry is calledThe API for respective user code should provide entries as a pair of an async
Stream
and a notification object, where the stream internally awaits the pointer change and then feeds the entries from the buffer until this pointer, then awaiting the next change. When the needful is done, the notification object is informed about processed entries (not necessarily for each single entry). This notification object could be a wrapper overwatch::Sender
or similar for the first implementation.On the follower:
openraft
into the same buffer as the client write would put it (unless it's already there or known to be committed - when the leader retries after disconnect/reconnect)accepted
pointer will change, causing the log to be writtenflushed
pointer will change, which can be exposed for example as awatch::Receiver
to the task handling incoming entriesflushed
state back in an implementation-specific wayWhat about elections?
VoteRequest
upon a timeoutVoteRequest
is sent via a dedicated method, basically similar to what we have todayVoteRequest
received by the user-specific task is stored in peer's state and the raft core task is woken up (on aWatch
, for example)Vote
change and sends theVoteReply
back over anoneshot
channel or similar to the requestor user task, which sends it back over the networkWhat happens upon leader change?
watch::Sender
sending theaccepted
pointer, after which the task is reinitialized/restarted and first truncates the log)What about the snapshot?
snapshot_needed
flagSnapshotRequest
in theStream
feeding entries to the storage task)What about purging the log?
purged
pointer as usualpurged
andaccepted
in a common state object, so the log task can do the needfulStream
reading entries for the log can intersperse them withPurge
requestsSo we have following state to watch:
flushed
pointers of individual followers (also local one), including their votes, theapplied
pointer andsnapshot_log_id
accepted
andpurged
pointers (in one state object)to_apply
pointer andsnapshot_needed
flag (in one state object)accepted
pointerAll tasks read entries from the shared buffer up to
accepted
, only the raft core task adds entries to the buffer and updatesaccepted
afterwards, so the Rust AXM rules are followed. This will likely need a bit ofunsafe
code to implement the buffer optimally.The raft core task listens on multiple event sources:
Since the
watch
state for the core task is relatively large and frequently updated, I'd implement it a bit differently - push the most of the individual requests via atomic change to the state member (up to 16B can be updated in one go, which should be sufficient for most items) and a simple signalization semaphore to ensure the task is running or will wake up after the update. This is also true for other signalization - for example, signaling the log task needs two 8B values, so it can be also done atomically by a simple atomic write to the respective value followed by task wakeup.With this, we have no unbounded channels or other unbounded allocations in the entire state in the
openraft
itself (except for leader change, where we restart replication tasks - that needs to be addressed separately). It's up to the user to implement the handling in an OOM-safe way.I hope this all is somewhat understandable. We can start implementing it also piecewise - for example, first handle the local log writing and state machine update in this way (which can by default call legacy methods easily or alternatively we can provide a wrapper calling legacy interface to allow for a smooth transition). Then, continue with replication tasks.
Comments? Did I forget about something?
The text was updated successfully, but these errors were encountered: